[LU-15553] replay-vbr test 12a fails with 'test_12a failed with 4' Created: 11/Feb/22  Updated: 19/Dec/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.15.0, Lustre 2.15.4
Fix Version/s: None

Type: Bug
Priority: Minor
Reporter: James Nunez (Inactive)
Assignee: Lai Siyao
Resolution: Unresolved
Votes: 0
Labels: None

Issue Links:
Related
is related to LU-9096 sanity test_253: File creation failed... Open
Severity: 3

 Description   

replay-vbr test_12a started failing with 'test_12a failed with 4' on August 4, 2021, on Lustre 2.14.53.7; logs are at https://testing.whamcloud.com/test_sets/17efe0ba-7e4a-4e7f-b7f5-02383e1314c5. We’ve seen this test fail with both ZFS and ldiskfs, but so far only in DNE configurations.

Looking at a recent failure at https://testing.whamcloud.com/test_sets/014ce4c3-c654-47f9-9333-1c58ebf545c3, the suite_log shows

CMD: onyx-24vm7 e2label /dev/mapper/mds1_flakey 2>/dev/null
Started lustre-MDT0000
CMD: onyx-55vm7.onyx.whamcloud.com unlinkmany /mnt/lustre/f12a.replay-vbr- 25
 - unlinked 0 (time 1643080125 ; total 0 ; last 0)
total: 25 unlinks in 0 seconds: inf unlinks/second
CMD: onyx-55vm7.onyx.whamcloud.com unlinkmany /mnt/lustre/f12a.replay-vbr-3- 25
 - unlinked 0 (time 1643080125 ; total 0 ; last 0)
total: 25 unlinks in 0 seconds: inf unlinks/second
CMD: onyx-55vm7.onyx.whamcloud.com checkstat -v /mnt/lustre/d12a.replay-vbr/f12a.replay-vbr
 replay-vbr test_12a: @@@@@@ FAIL: test_12a failed with 4 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:6391:error()
  = /usr/lib64/lustre/tests/test-framework.sh:6695:run_one()

Looking at the code for this test,

    # All 50 files should have been replayed
    do_node $CLIENT1 unlinkmany $DIR/$tfile- 25 || return 2
    do_node $CLIENT1 unlinkmany $DIR/$tfile-3- 25 || return 3
    do_node $CLIENT1 $CHECKSTAT $DIR/$tdir/$tfile && return 4

    return 0
}
run_test 12a "lost data due to missed REMOTE client during replay"

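Return code 4 comes from the checkstat call: the test returns 4 when checkstat succeeds, so $DIR/$tdir/$tfile still exists at a point where the test expects it to be gone.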
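
The return-code logic can be reproduced outside Lustre. The following is a minimal sketch, not the test itself: final_check is a hypothetical helper, plain stat stands in for $CHECKSTAT, and a temporary directory stands in for $DIR/$tdir. It only illustrates that the function returns 4 when the stat succeeds and 0 when it fails.

```shell
#!/bin/bash
# Hypothetical stand-in for the last check in test_12a.
# Assumption: `stat` replaces $CHECKSTAT; no Lustre mount is involved.
final_check() {
	local path=$1

	# Same shape as the test: a successful stat is treated as a failure.
	stat "$path" >/dev/null 2>&1 && return 4
	return 0
}

tmpdir=$(mktemp -d)
touch "$tmpdir/f12a"

# File exists: the stat succeeds, so the function returns 4.
final_check "$tmpdir/f12a"
rc_present=$?

# File removed: the stat fails, so the function falls through to return 0.
rm -f "$tmpdir/f12a"
final_check "$tmpdir/f12a"
rc_absent=$?

rm -rf "$tmpdir"
echo "present: rc=$rc_present, absent: rc=$rc_absent"
```

In other words, a 'failed with 4' result means the file was found, not that checkstat itself malfunctioned.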



 Comments   
Comment by Andreas Dilger [ 19/Mar/22 ]

James, it looks like this subtest is failing fairly intermittently, and all of the other subtests are passing consistently. It would make sense to push a patch to add this one subtest to ALWAYS_EXCEPT and then move the subtest into an enforced review session, so that at least we get the benefit of the other subtests being run.

It would also be useful to generate a list of patches that landed on the day this subtest started failing to see if there are a few likely culprits, and push trial reversion patches with

Test-Parameters: fortestonly testlist=replay-vbr env=ONLY=12a,ONLY_REPEAT=40

to confirm/deny whether the revert allows the subtest to pass.
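
As a rough illustration of the ALWAYS_EXCEPT suggestion, the sketch below shows how a subtest might be added to the skip list. It assumes the usual convention of a per-suite REPLAY_VBR_EXCEPT variable feeding ALWAYS_EXCEPT near the top of replay-vbr.sh; is_excepted is a hypothetical helper written only for this example and is not the test framework's own matching code.

```shell
#!/bin/bash
# Assumed convention near the top of replay-vbr.sh (not copied from the tree):
# bug number for skipped test:           LU-15553
ALWAYS_EXCEPT="${REPLAY_VBR_EXCEPT:-} 12a"

# Hypothetical helper: succeeds if the given subtest is in the skip list.
is_excepted() {
	local subtest=$1
	local entry

	# ALWAYS_EXCEPT is a whitespace-separated list, so word splitting
	# on the unquoted variable is intentional here.
	for entry in $ALWAYS_EXCEPT; do
		[ "$entry" = "$subtest" ] && return 0
	done
	return 1
}

for t in 12a 12b; do
	if is_excepted "$t"; then
		echo "test_$t: skipped (LU-15553)"
	else
		echo "test_$t: would run"
	fi
done
```

With only 12a excepted, the remaining replay-vbr subtests would still run and report results.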

Comment by Minh Diep [ 29/Dec/22 ]

The failures started with https://testing.whamcloud.com/test_sessions/365a290f-d02b-4ca4-b2d6-64b90c815137 on 1/10/2022 on build https://build.whamcloud.com/job/lustre-master/4253 (2.14.56.67)

Comment by Gerrit Updater [ 13/Jul/23 ]

"Lai Siyao <lai.siyao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51653
Subject: LU-15553 test: mkdir_on_mdt0 in replay-vbr.sh
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: fe16cb12a2383440ecd9ca1076a78447a1ec13a2

Comment by Gerrit Updater [ 14/Jul/23 ]

"Lai Siyao <lai.siyao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51665
Subject: LU-15553 test: mkdir_on_mdt0 in conf-sanity.sh
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 7107b9b4bdfad41a8d4e1182bb698c23dcb2baa5

Comment by Gerrit Updater [ 14/Jul/23 ]

"Lai Siyao <lai.siyao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51669
Subject: LU-15553 test: mkdir_on_mdt0 in recovery-small.sh
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 3aa1bb3d82ec128aa3a6f330974be5be0d646b25

Comment by Gerrit Updater [ 14/Jul/23 ]

"Lai Siyao <lai.siyao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51671
Subject: LU-15553 test: replace mkdir with mkdir_on_mdt0
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 488f1d5e5933138470677ddfa49b0b3a9343bd50

Comment by Gerrit Updater [ 27/Jul/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51669/
Subject: LU-15553 test: mkdir_on_mdt0 in recovery-small.sh
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 3b0d2821845cf87ae7f03bf41ceae00237d94121

Comment by Gerrit Updater [ 03/Aug/23 ]

"Lai Siyao <lai.siyao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51861
Subject: LU-15553 test: mkdir_on_mdt0 in sanity
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 237934f85f428f23ab637da8408276c525e0fede
