[LU-13449] recovery-small test_140b: FAIL: recovery took too long 4362 20 Created: 13/Apr/20 Updated: 23/Oct/21 Resolved: 17/Feb/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.14.0, Lustre 2.12.6 |
| Fix Version/s: | Lustre 2.14.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | Andreas Dilger |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for jianyu <yujian@whamcloud.com> This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/f1c30a21-d730-4d65-a379-d78f2e5fa72a test_140b failed with the following error: recovery took too long 4362 20 <<Please provide additional information about the failure here>> VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV |
| Comments |
| Comment by Jian Yu [ 13/Apr/20 ] |
|
The failure occurred consistently on master branch: |
| Comment by Alexey Lyashkov [ 23/Apr/20 ] |
|
It looks like a bug in environment or bash. $SECONDS return a something wrong. |
| Comment by Andreas Dilger [ 23/Apr/20 ] |
|
Looking at one of the test results, the entire test took 58s to finish (as is also confirmed by the subtest banners): == recovery-small test 140b: local mount is excluded from recovery ==== 10:41:23 (1586713283) : : trevis-64vm1: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec trevis-64vm2: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec recovery-small test_140b: @@@@@@ FAIL: recovery took too long 4362 20 FAIL 140b (58s) == recovery-small test 141: do not lose locks on MGS restart ==== 10:42:21 (1586713341) There does appear to be a typo in the error message (note 'bsfore'), but that does not affect the actual check:
@@ -2973,8 +2973,8 @@ test_140b() {
local before=$SECONDS
fail mds1
local after=$SECONDS
(( $after-$before < $TIMEOUT*2 )) ||
- error "recovery took too long" $((after-bsfore)) $TIMEOUT
+ error "recovery took too long $after-$before > $TIMEOUT"
}
run_test 140b "local mount is excluded from recovery"
Maybe the timeout needs to be scaled by the MDT count? Some further investigation is needed to see where the time is being spent, and if this indicates a defect in the "localrecov" code, or is unrelated to the test failure. |
| Comment by Alexey Lyashkov [ 24/Apr/20 ] |
|
Andreas, console messages say import stay in active in 0s, if you look a console time stamps this test was executed much less 5300s reported in fail. |
| Comment by Andreas Dilger [ 25/Apr/20 ] |
|
shadow the error message itself is incorrect. The "bsfore" variable is unset (note 's' instead of 'e'), since it should be "before", so "$((after-bsfore))" is just printed as "$after" (which is the "4362" printed in the error message). I don't think there is anything wrong with $SECONDS, just that the test took 58s to complete, but $TIMEOUT is only 20s, or even 40s that is being checked. |
| Comment by Gerrit Updater [ 25/Apr/20 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38366 |
| Comment by Chris Horn [ 25/Apr/20 ] |
|
+1 on master https://testing.whamcloud.com/test_sessions/edb20dfa-e703-4bb7-a756-2d3d49bd74a7 |
| Comment by Chris Horn [ 06/May/20 ] |
|
+1 on master https://testing.whamcloud.com/test_sessions/4c20c06a-f6a6-48e7-8cdf-34cc5e8524a4 |
| Comment by Gerrit Updater [ 07/May/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38366/ |
| Comment by Andreas Dilger [ 14/Sep/20 ] |
|
+1 on master. This test has failed 21/188 runs on all branches in the past week. It looks like most of those failures are on older branches that have the old error messages, but there are 4/121 failures on master in the past week with the fixed error messages. Most of those are: recovery took too long 42 > 40 which means that recovery did finish, just a few seconds later than expected. Should that be considered an error that needs to be investigated? Fixing the test to "pass" in that case is easy, but may be removing the entire value of the test itself? There were a few other failures like: recovery-small.sh: line 2977: ((: 08: value too great for base (error token is "08") recovery-small test_140b: @@@@@@ FAIL: recovery took too long 08 > 40 recovery-small.sh: line 2977: ((: 09: value too great for base (error token is "09") recovery-small test_140b: @@@@@@ FAIL: recovery took too long 09 > 40 which looks like bash was unhappy with the "octal" format of "08" and "09", since the error message is printed in MM:SS format "Recovery over after 0:08, of 2 clients 2 recovered".
local recovery=$(do_facet mds1 dmesg |
awk -F: '/Recovery over after/ { print $4 }' |
cut -d, -f1 | tail -1)
(( $recovery < $TIMEOUT*2 )) ||
error "recovery took too long $recovery > $((TIMEOUT * 2))"
Also, it seems that this check is ignoring the minutes part of the timeout, which is not ideal. It would be better to "unparse" the MM:SS format and handle it properly as seconds. Something like:
# Lustre: tfs-MDT0000: Recovery over after 0:03, of 2 clients 2 rec...
local recovery=$(do_facet mds1 dmesg |
awk '/Recovery over after/ { print $6 }' | tail -1 |
awk -F: '{ print $1 * 60 + $2 }')
(( recovery < TIMEOUT * 2 )) ||
error "recovery took too long $recovery > $((TIMEOUT * 2))"
appears to work, but it would be good to decide whether a "+5" should be added into the check or not? |
| Comment by Gerrit Updater [ 14/Sep/20 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/39909 |
| Comment by Andreas Dilger [ 14/Sep/20 ] |
|
Hmm, looking at the MDS console logs for one of the "42 > 40" failures I see: [ 216.334657] Lustre: lustre-MDT0000: Recovery over after 1:42, of 3 clients 2 recovered and 1 was evicted. so it is clear that this is way over the expected recovery time and probably needs some larger fix. |
| Comment by Gerrit Updater [ 16/Dec/20 ] |
|
Sebastien Buisson (sbuisson@ddn.com) uploaded a new patch: https://review.whamcloud.com/41000 |
| Comment by Gerrit Updater [ 08/Feb/21 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39909/ |