[LU-14992] replay-vbr test 7a fails with 'Test 7a.2 failed' Created: 08/Sep/21  Updated: 20/Dec/23  Resolved: 24/Sep/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.15.0
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: Jian Yu
Resolution: Fixed Votes: 0
Labels: DNE
Environment:

DNE


Issue Links:
Related
is related to LU-15042 sanity test_133b: The counter for set... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

replay-vbr test_7a started failing with the error message 'Test 7a.2 failed' on 03 AUG 2021 and now fails 100% of the time in full-dne-part-2 test sessions. For Lustre 2.14.53 build #4206, this test does NOT fail; for Lustre 2.14.53.7 build #4207, it fails 100% of the time in DNE test sessions.

Looking at a recent failure at https://testing.whamcloud.com/test_sets/9b87da49-9024-48ba-91b9-e5d006b73d65, we see the following in the suite_log:

test_7a.2 first: createmany -o /mnt/lustre/d7a.replay-vbr/f7a.replay-vbr- 1
CMD: trevis-68vm5.trevis.whamcloud.com createmany -o /mnt/lustre/d7a.replay-vbr/f7a.replay-vbr- 1
total: 1 open/close in 0.00 seconds: 284.40 ops/second
test_7a.2 lost: rm /mnt/lustre2/d7a.replay-vbr/f7a.replay-vbr-0
CMD: trevis-68vm6 rm /mnt/lustre2/d7a.replay-vbr/f7a.replay-vbr-0
test_7a.2 last: mkdir /mnt/lustre/d7a.replay-vbr/f7a.replay-vbr-0
CMD: trevis-68vm5.trevis.whamcloud.com mkdir /mnt/lustre/d7a.replay-vbr/f7a.replay-vbr-0
CMD: trevis-68vm6 grep -c /mnt/lustre2' ' /proc/mounts
Stopping client trevis-68vm6 /mnt/lustre2 (opts:)
CMD: trevis-68vm6 lsof -t /mnt/lustre2
pdsh@trevis-68vm5: trevis-68vm6: ssh exited with exit code 1
CMD: trevis-68vm6 umount  /mnt/lustre2 2>&1
Failing mds1 on trevis-68vm8
CMD: trevis-68vm8 grep -c /mnt/lustre-mds1' ' /proc/mounts || true
Stopping /mnt/lustre-mds1 (opts:) on trevis-68vm8
CMD: trevis-68vm8 umount -d /mnt/lustre-mds1
CMD: trevis-68vm8 lsmod | grep lnet > /dev/null &&
lctl dl | grep ' ST ' || true
reboot facets: mds1
Failover mds1 to trevis-68vm8
CMD: trevis-68vm8 hostname
mount facets: mds1
CMD: trevis-68vm8 dmsetup status /dev/mapper/mds1_flakey >/dev/null 2>&1
CMD: trevis-68vm8 dmsetup status /dev/mapper/mds1_flakey 2>&1
CMD: trevis-68vm8 dmsetup table /dev/mapper/mds1_flakey
CMD: trevis-68vm8 dmsetup suspend --nolockfs --noflush /dev/mapper/mds1_flakey
CMD: trevis-68vm8 dmsetup load /dev/mapper/mds1_flakey --table \"0 4194304 linear 252:0 0\"
CMD: trevis-68vm8 dmsetup resume /dev/mapper/mds1_flakey
CMD: trevis-68vm8 test -b /dev/mapper/mds1_flakey
CMD: trevis-68vm8 e2label /dev/mapper/mds1_flakey
Starting mds1: -o localrecov  /dev/mapper/mds1_flakey /mnt/lustre-mds1
CMD: trevis-68vm8 mkdir -p /mnt/lustre-mds1; mount -t lustre -o localrecov  /dev/mapper/mds1_flakey /mnt/lustre-mds1
CMD: trevis-68vm8 /usr/sbin/lctl get_param -n health_check
CMD: trevis-68vm8 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/openmpi/bin:/usr/share/Modules/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/sbin:/sbin:/bin::/sbin:/bin:/usr/sbin: NAME=autotest_config bash rpc.sh set_default_debug \"vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck\" \"all\" 4 
trevis-68vm8: CMD: trevis-68vm8 /usr/sbin/lctl get_param -n version 2>/dev/null
trevis-68vm8: CMD: trevis-68vm8 /usr/sbin/lctl get_param -n version 2>/dev/null
trevis-68vm8: CMD: trevis-68vm7 /usr/sbin/lctl get_param -n version 2>/dev/null
trevis-68vm8: CMD: trevis-68vm8.trevis.whamcloud.com /usr/sbin/lctl get_param -n version 2>/dev/null
trevis-68vm8: trevis-68vm8.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
CMD: trevis-68vm8 e2label /dev/mapper/mds1_flakey 				2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}'
pdsh@trevis-68vm5: trevis-68vm8: ssh exited with exit code 1
CMD: trevis-68vm8 e2label /dev/mapper/mds1_flakey 2>/dev/null
Started lustre-MDT0000
CMD: trevis-68vm5.trevis.whamcloud.com lctl get_param -n at_max
affected facets: mds1
CMD: trevis-68vm8 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/openmpi/bin:/usr/share/Modules/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/sbin:/sbin:/bin::/sbin:/bin:/usr/sbin: NAME=autotest_config bash rpc.sh _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475 
trevis-68vm8: CMD: trevis-68vm8 /usr/sbin/lctl get_param -n version 2>/dev/null
trevis-68vm8: CMD: trevis-68vm8 /usr/sbin/lctl get_param -n version 2>/dev/null
trevis-68vm8: CMD: trevis-68vm7 /usr/sbin/lctl get_param -n version 2>/dev/null
trevis-68vm8: CMD: trevis-68vm8.trevis.whamcloud.com /usr/sbin/lctl get_param -n version 2>/dev/null
trevis-68vm8: trevis-68vm8.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
trevis-68vm8: *.lustre-MDT0000.recovery_status status: COMPLETE
Waiting for orphan cleanup...
CMD: trevis-68vm8 /usr/sbin/lctl list_param osp.*osc*.old_sync_processed 2> /dev/null
osp.lustre-OST0000-osc-MDT0000.old_sync_processed
osp.lustre-OST0000-osc-MDT0002.old_sync_processed
osp.lustre-OST0001-osc-MDT0000.old_sync_processed
osp.lustre-OST0001-osc-MDT0002.old_sync_processed
osp.lustre-OST0002-osc-MDT0000.old_sync_processed
osp.lustre-OST0002-osc-MDT0002.old_sync_processed
osp.lustre-OST0003-osc-MDT0000.old_sync_processed
osp.lustre-OST0003-osc-MDT0002.old_sync_processed
osp.lustre-OST0004-osc-MDT0000.old_sync_processed
osp.lustre-OST0004-osc-MDT0002.old_sync_processed
osp.lustre-OST0005-osc-MDT0000.old_sync_processed
osp.lustre-OST0005-osc-MDT0002.old_sync_processed
osp.lustre-OST0006-osc-MDT0000.old_sync_processed
osp.lustre-OST0006-osc-MDT0002.old_sync_processed
osp.lustre-OST0007-osc-MDT0000.old_sync_processed
osp.lustre-OST0007-osc-MDT0002.old_sync_processed
wait 40 secs maximumly for trevis-68vm8,trevis-68vm9 mds-ost sync done.
CMD: trevis-68vm8,trevis-68vm9 /usr/sbin/lctl get_param -n osp.*osc*.old_sync_processed
 replay-vbr test_7a: @@@@@@ FAIL: Test 7a.2 failed 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:6237:error()
  = /usr/lib64/lustre/tests/replay-vbr.sh:727:test_7a()
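
For orientation, the three metadata operations that test_7a.2 drives (the "first", "lost" and "last" commands in the suite_log above) are paraphrased below with comments; this is a summary of the logged commands, not the actual replay-vbr.sh code:

# "first": create one file from the primary client (trevis-68vm5)
createmany -o /mnt/lustre/d7a.replay-vbr/f7a.replay-vbr- 1
# "lost": remove that file from the second client (trevis-68vm6); that
# client is unmounted before mds1 is failed over, so this update is lost
# and cannot be replayed during recovery
rm /mnt/lustre2/d7a.replay-vbr/f7a.replay-vbr-0
# "last": recreate the same name as a directory from the primary client;
# the test then fails over mds1 and waits for recovery and orphan cleanup
mkdir /mnt/lustre/d7a.replay-vbr/f7a.replay-vbr-0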

This must be a result of patch https://review.whamcloud.com/38553 (commit 3e04b0fd6c3dd36372f33c54ea5f401c27485d60, "LU-13417 mdd: set default LMV on ROOT"). We may need to use the routine mkdir_on_mdt0() as a temporary fix; a sketch of that change follows.
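
As an illustration only, the change suggested above would look roughly like the following. The helper name mkdir_on_mdt0() comes from test-framework.sh, but the body shown here is a sketch that assumes it simply pins the new directory to MDT index 0 with "lfs mkdir -i 0"; the real helper may differ:

# Sketch only (not the real test-framework.sh code): create the test
# directory on MDT index 0 so that the per-MDS checks and the mds1
# failover in test_7a act on the MDT that actually holds the directory.
mkdir_on_mdt0() {
        $LFS mkdir -i 0 "$@"
}

# In replay-vbr.sh the plain mkdir of the test directory would then become
# something like (directory name taken from the suite_log above):
#     mkdir_on_mdt0 /mnt/lustre/d7a.replay-vbr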

Logs for more failures are at:
https://testing.whamcloud.com/test_sets/17efe0ba-7e4a-4e7f-b7f5-02383e1314c5
https://testing.whamcloud.com/test_sets/a00d3625-d4b0-48ef-88b1-e50707d75462
https://testing.whamcloud.com/test_sets/08c2b227-3285-438e-87b6-1d34e147a412



 Comments   
Comment by James Nunez (Inactive) [ 08/Sep/21 ]

I think there are a few other tests that hit this same issue: they check MDT0/MDS0 for stats or other information, but mkdir may have created the test directory on a different MDT (a sketch for confirming where a directory landed follows the log links below). We can open separate tickets for each failure if this turns out to be wrong.

sanity test 133a/b - error 'The counter for mkdir on mds1 was not incremented'
https://testing.whamcloud.com/test_sets/f7481079-7fbe-47af-abbb-d376b877700b
https://testing.whamcloud.com/test_sets/802ba343-7f6e-4d01-beba-ac1060836f5f
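
For reference, one way to confirm which MDT a test directory actually landed on (and therefore whether the mds1/MDT0000 counters can be expected to change) is sketched below. It assumes the standard lfs getdirstripe/getstripe options rather than anything from the sanity scripts, and the paths are only examples:

# Print the MDT index holding the directory; if it is not 0, stats
# collected from mds1 (MDT0000) will not reflect operations in it.
lfs getdirstripe -i /mnt/lustre/d133b.sanity

# The MDT index holding a regular file can be shown similarly:
lfs getstripe -m /mnt/lustre/d133b.sanity/f133b.sanity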

Comment by Gerrit Updater [ 13/Sep/21 ]

"James Nunez <jnunez@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/44902
Subject: LU-14992 tests: sanity/replay-vbr mkdir on MDT0
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 4e78399b3e3b8504ea811ae93d6cb61fb4317533

Comment by James Nunez (Inactive) [ 25/Oct/21 ]

From the test results for patch https://review.whamcloud.com/44902, it looks like the sanity 133a/133b failures are a separate issue. Ticket LU-15042 was opened to track the sanity test failures.

Comment by Colin Faber [ 07/Jun/22 ]

Hi yujian 

Can you take a look?

Thank you!

Comment by Gerrit Updater [ 24/Sep/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/44902/
Subject: LU-14992 tests: sanity/replay-vbr mkdir on MDT0
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: f0324c5c2f4390d6d7e93ed799e95d8eef4704f4

Comment by Peter Jones [ 24/Sep/22 ]

Landed for 2.16

Comment by Gerrit Updater [ 27/Nov/22 ]

"Neil Brown <neilb@suse.de>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49252
Subject: LU-14992 tests: add more mkdir_on_mdt0 calls
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 53979bf9a3a41c68df30a28c7555711f2eb4d20e

Comment by Gerrit Updater [ 13/Dec/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49252/
Subject: LU-14992 tests: add more mkdir_on_mdt0 calls
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: d56ea0c80a959ebd9b393f2da048cc179cb16127

Comment by Gerrit Updater [ 14/Dec/22 ]

"Minh Diep <mdiep@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49404
Subject: LU-14992 tests: sanity/replay-vbr mkdir on MDT0
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: 7e2edf8157fc715887fb8e10dff5c92fab81fefb

Comment by Gerrit Updater [ 24/Aug/23 ]

"xinliang <xinliang.liu@linaro.org>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52065
Subject: LU-14992 tests: add more mkdir_on_mdt0 calls
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: 93f16ff3f5a1744a8bd62e07eada5ff2a4dd7bca

Comment by Gerrit Updater [ 20/Dec/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/52065/
Subject: LU-14992 tests: add more mkdir_on_mdt0 calls
Project: fs/lustre-release
Branch: b2_15
Current Patch Set:
Commit: 04775bc14fe12d15784ccfec78d0ed1975cbc45f
