[LU-13449] recovery-small test_140b: FAIL: recovery took too long 4362 20 Created: 13/Apr/20  Updated: 23/Oct/21  Resolved: 17/Feb/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.14.0, Lustre 2.12.6
Fix Version/s: Lustre 2.14.0

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Andreas Dilger
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for jianyu <yujian@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/f1c30a21-d730-4d65-a379-d78f2e5fa72a

test_140b failed with the following error:

recovery took too long 4362 20

<<Please provide additional information about the failure here>>

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
recovery-small test_140b - recovery took too long 4362 20



 Comments   
Comment by Jian Yu [ 13/Apr/20 ]

The failure occurred consistently on master branch:
https://testing.whamcloud.com/test_sets/f1c30a21-d730-4d65-a379-d78f2e5fa72a
https://testing.whamcloud.com/test_sets/f28a48fa-4f64-47d4-a58b-85b2a76e6037
https://testing.whamcloud.com/test_sets/3c3c41f6-1b79-4ccb-8ee8-9a39e079c7c7

Comment by Alexey Lyashkov [ 23/Apr/20 ]

It looks like a bug in environment or bash. $SECONDS return a something wrong.

Comment by Andreas Dilger [ 23/Apr/20 ]

Looking at one of the test results, the entire test took 58s to finish (as is also confirmed by the subtest banners):
https://testing.whamcloud.com/test_sets/f1c30a21-d730-4d65-a379-d78f2e5fa72a

== recovery-small test 140b: local mount is excluded from recovery ==== 10:41:23 (1586713283)
:
:
trevis-64vm1: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
trevis-64vm2: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
 recovery-small test_140b: @@@@@@ FAIL: recovery took too long 4362 20
FAIL 140b (58s)
== recovery-small test 141: do not lose locks on MGS restart ==== 10:42:21 (1586713341)

There does appear to be a typo in the error message (note 'bsfore'), but that does not affect the actual check:

@@ -2973,8 +2973,8 @@ test_140b() {
        local before=$SECONDS
        fail mds1
        local after=$SECONDS
        (( $after-$before < $TIMEOUT*2 )) ||
-               error "recovery took too long" $((after-bsfore)) $TIMEOUT
+               error "recovery took too long $after-$before > $TIMEOUT"
 }
 run_test 140b "local mount is excluded from recovery"

Maybe the timeout needs to be scaled by the MDT count? Some further investigation is needed to see where the time is being spent, and if this indicates a defect in the "localrecov" code, or is unrelated to the test failure.

Comment by Alexey Lyashkov [ 24/Apr/20 ]

Andreas,

console messages say import stay in active in 0s, if you look a console time stamps this test was executed much less 5300s reported in fail.
i think $before is zero or something similar, so my suggestion stop a use $SECONDS and replace with date -%s instead.

Comment by Andreas Dilger [ 25/Apr/20 ]

shadow the error message itself is incorrect. The "bsfore" variable is unset (note 's' instead of 'e'), since it should be "before", so "$((after-bsfore))" is just printed as "$after" (which is the "4362" printed in the error message). I don't think there is anything wrong with $SECONDS, just that the test took 58s to complete, but $TIMEOUT is only 20s, or even 40s that is being checked.

Comment by Gerrit Updater [ 25/Apr/20 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38366
Subject: LU-13449 tgt: fix recovery timer comparisons
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 6f041a32a487c7f31c80525175a27e2990cbbf05

Comment by Chris Horn [ 25/Apr/20 ]

+1 on master https://testing.whamcloud.com/test_sessions/edb20dfa-e703-4bb7-a756-2d3d49bd74a7

Comment by Chris Horn [ 06/May/20 ]

+1 on master https://testing.whamcloud.com/test_sessions/4c20c06a-f6a6-48e7-8cdf-34cc5e8524a4

Comment by Gerrit Updater [ 07/May/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38366/
Subject: LU-13449 tgt: fix recovery timer comparisons
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 814bb394843434883a94fe6432cd8c656035a3e1

Comment by Andreas Dilger [ 14/Sep/20 ]

+1 on master.
https://testing.whamcloud.com/test_sets/3cc6a545-5e58-426d-adf2-3f083d542c2f

This test has failed 21/188 runs on all branches in the past week. It looks like most of those failures are on older branches that have the old error messages, but there are 4/121 failures on master in the past week with the fixed error messages. Most of those are:

recovery took too long 42 > 40

which means that recovery did finish, just a few seconds later than expected. Should that be considered an error that needs to be investigated? Fixing the test to "pass" in that case is easy, but may be removing the entire value of the test itself?

There were a few other failures like:

recovery-small.sh: line 2977: ((: 08: value too great for base (error token is "08")
recovery-small test_140b: @@@@@@ FAIL: recovery took too long 08 > 40 
recovery-small.sh: line 2977: ((: 09: value too great for base (error token is "09")
recovery-small test_140b: @@@@@@ FAIL: recovery took too long 09 > 40

which looks like bash was unhappy with the "octal" format of "08" and "09", since the error message is printed in MM:SS format "Recovery over after 0:08, of 2 clients 2 recovered".

        local recovery=$(do_facet mds1 dmesg |
                         awk -F: '/Recovery over after/ { print $4 }' |
                         cut -d, -f1 | tail -1)
        (( $recovery < $TIMEOUT*2 )) ||
                error "recovery took too long $recovery > $((TIMEOUT * 2))"

Also, it seems that this check is ignoring the minutes part of the timeout, which is not ideal. It would be better to "unparse" the MM:SS format and handle it properly as seconds. Something like:

        # Lustre: tfs-MDT0000: Recovery over after 0:03, of 2 clients 2 rec...
        local recovery=$(do_facet mds1 dmesg |
                         awk '/Recovery over after/ { print $6 }' | tail -1 |
                         awk -F: '{ print $1 * 60 + $2 }')
        (( recovery < TIMEOUT * 2 )) ||
                error "recovery took too long $recovery > $((TIMEOUT * 2))"

appears to work, but it would be good to decide whether a "+5" should be added into the check or not?

Comment by Gerrit Updater [ 14/Sep/20 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/39909
Subject: LU-13449 tests: fix recovery-small test_140b check
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 1f932570fc9aec1e0f35f4d3f4c93a1af2dd1e37

Comment by Andreas Dilger [ 14/Sep/20 ]

Hmm, looking at the MDS console logs for one of the "42 > 40" failures I see:
https://testing.whamcloud.com/sub_tests/72a6a27d-09f4-44fa-b6e8-435416f8c536

[  216.334657] Lustre: lustre-MDT0000: Recovery over after 1:42, of 3 clients 2 recovered and 1 was evicted.

so it is clear that this is way over the expected recovery time and probably needs some larger fix.

Comment by Gerrit Updater [ 16/Dec/20 ]

Sebastien Buisson (sbuisson@ddn.com) uploaded a new patch: https://review.whamcloud.com/41000
Subject: LU-13449 tgt: fix recovery timer comparisons
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: ce247bcbe51e597fda2cc0c552ba55251467aa1b

Comment by Gerrit Updater [ 08/Feb/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39909/
Subject: LU-13449 tests: fix recovery-small test_140b check
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 463a70eb645df7519efeab49e558cbe6f4ecafa5

Generated at Sat Feb 10 03:01:22 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.