[LU-15553] replay-vbr test 12a fails with 'test_12a failed with 4' Created: 11/Feb/22 Updated: 19/Dec/23 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.15.0, Lustre 2.15.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Nunez (Inactive) | Assignee: | Lai Siyao |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Description |
|
replay-vbr test_12a started failing with 'test_12a failed with 4' on August 4, 2021 for Lustre 2.14.53.7, with logs at https://testing.whamcloud.com/test_sets/17efe0ba-7e4a-4e7f-b7f5-02383e1314c5. We've seen this test fail for both ZFS and ldiskfs but, so far, always with DNE configured.

Looking at a recent failure at https://testing.whamcloud.com/test_sets/014ce4c3-c654-47f9-9333-1c58ebf545c3, the suite_log shows:

    CMD: onyx-24vm7 e2label /dev/mapper/mds1_flakey 2>/dev/null
    Started lustre-MDT0000
    CMD: onyx-55vm7.onyx.whamcloud.com unlinkmany /mnt/lustre/f12a.replay-vbr- 25
     - unlinked 0 (time 1643080125 ; total 0 ; last 0)
    total: 25 unlinks in 0 seconds: inf unlinks/second
    CMD: onyx-55vm7.onyx.whamcloud.com unlinkmany /mnt/lustre/f12a.replay-vbr-3- 25
     - unlinked 0 (time 1643080125 ; total 0 ; last 0)
    total: 25 unlinks in 0 seconds: inf unlinks/second
    CMD: onyx-55vm7.onyx.whamcloud.com checkstat -v /mnt/lustre/d12a.replay-vbr/f12a.replay-vbr
    replay-vbr test_12a: @@@@@@ FAIL: test_12a failed with 4
    Trace dump:
    = /usr/lib64/lustre/tests/test-framework.sh:6391:error()
    = /usr/lib64/lustre/tests/test-framework.sh:6695:run_one()

Looking at the code for this test (from replay-vbr.sh):

    # All 50 files should have been replayed
    do_node $CLIENT1 unlinkmany $DIR/$tfile- 25 || return 2
    do_node $CLIENT1 unlinkmany $DIR/$tfile-3- 25 || return 3
    do_node $CLIENT1 $CHECKSTAT $DIR/$tdir/$tfile && return 4

    return 0
    }
    run_test 12a "lost data due to missed REMOTE client during replay"

The call to checkstat is what produces this error: because of the '&& return 4', the subtest fails when checkstat succeeds, i.e. when $DIR/$tdir/$tfile still exists after recovery even though the test expects it to have been lost.
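As a minimal sketch of that failing assertion (paths copied from the suite_log above; checkstat is the utility from lustre/tests, which exits 0 when the file exists and its attributes check out):

    # Sketch only: checkstat exiting 0 means the file is still present
    # after recovery, which is exactly what trips the 'return 4' branch.
    if checkstat -v /mnt/lustre/d12a.replay-vbr/f12a.replay-vbr; then
        echo "file survived replay - test_12a fails with 4"
    fi

If checkstat instead fails because the file is absent, the subtest falls through to 'return 0' and passes. |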
| Comments |
| Comment by Andreas Dilger [ 19/Mar/22 ] |
|
James, it looks like this subtest is failing fairly intermittently, and all of the other subtests are passing consistently. It would make sense to push a patch to add this one subtest to ALWAYS_EXCEPT and then move the subtest into an enforced review session, so that at least we get the benefit of the other subtests being run.

It would also be useful to generate a list of patches that landed on the day this subtest started failing to see if there are a few likely culprits, and to push trial reversion patches with

    Test-Parameters: fortestonly testlist=replay-vbr env=ONLY=12a,ONLY_REPEAT=40

to confirm or deny whether the revert allows the subtest to pass.
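A rough manual equivalent of that Test-Parameters line, assuming an already-configured Lustre test cluster and the standard installed tests location (ONLY and ONLY_REPEAT are the usual test-framework variables):

    # repeat only replay-vbr subtest 12a 40 times against a trial revert
    cd /usr/lib64/lustre/tests
    ONLY=12a ONLY_REPEAT=40 bash replay-vbr.sh

Forty consecutive passes on a revert would point at the reverted patch; continued failures would rule it out. |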
| Comment by Minh Diep [ 29/Dec/22 ] |
|
It started with https://testing.whamcloud.com/test_sessions/365a290f-d02b-4ca4-b2d6-64b90c815137 on 1/10/2022, on build https://build.whamcloud.com/job/lustre-master/4253 (2.14.56.67). |
| Comment by Gerrit Updater [ 13/Jul/23 ] |
|
"Lai Siyao <lai.siyao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51653 |
| Comment by Gerrit Updater [ 14/Jul/23 ] |
|
"Lai Siyao <lai.siyao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51665 |
| Comment by Gerrit Updater [ 14/Jul/23 ] |
|
"Lai Siyao <lai.siyao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51669 |
| Comment by Gerrit Updater [ 14/Jul/23 ] |
|
"Lai Siyao <lai.siyao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51671 |
| Comment by Gerrit Updater [ 27/Jul/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51669/ |
| Comment by Gerrit Updater [ 03/Aug/23 ] |
|
"Lai Siyao <lai.siyao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51861 |