[LU-16456] Interop conf-sanity test_132: Can not take the layout lock Created: 09/Jan/23  Updated: 14/Dec/23  Resolved: 11/Apr/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.15.2
Fix Version/s: Lustre 2.16.0, Lustre 2.15.3

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Andreas Dilger
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-14598 Too many FIDs to precreate OST replac... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/ddbab4a8-cf2a-4e76-a840-43789b47ce46

test_132 failed with the following error:

conf-sanity test 132: hsm_actions processed after failover
:
Can not take the layout lock
[27701.164931] Lustre: DEBUG MARKER: dmesg
[27701.945320] Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds1' ' /proc/mounts || true
[27702.659163] Lustre: DEBUG MARKER: umount -d -f /mnt/lustre-mds1
[27709.401721] Lustre: 1441099:0:(client.c:2282:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1672867911/real 1672867911]  req@000000007bf97720 x1754124222843648/t0(0) o251->MGC10.240.29.71@tcp@0@lo:26/25 lens 224/224 e 0 to 1 dl 1672867917 ref 2 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'umount.0'
[27709.407170] Lustre: 1441099:0:(client.c:2282:ptlrpc_expire_one_request()) Skipped 9 previous similar messages
[27709.834788] Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
[27709.834788] lctl dl | grep ' ST ' || true
[27710.549943] Lustre: DEBUG MARKER: modprobe dm-flakey;
[27710.549943] 			 dmsetup targets | grep -q flakey
[27711.259838] Lustre: DEBUG MARKER: tunefs.lustre --param mdt.hsm_control=enabled /dev/mapper/mds1_flakey
[27711.633812] LDISKFS-fs (dm-3): mounted filesystem with ordered data mode. Opts: errors=remount-ro
[27712.085320] Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-mds1
[27712.808080] Lustre: DEBUG MARKER: modprobe dm-flakey;
[27712.808080] 			 dmsetup targets | grep -q flakey
[27713.574469] Lustre: DEBUG MARKER: dmsetup status /dev/mapper/mds1_flakey >/dev/null 2>&1
[27714.291452] Lustre: DEBUG MARKER: dmsetup status /dev/mapper/mds1_flakey 2>&1
[27715.016281] Lustre: DEBUG MARKER: test -b /dev/mapper/mds1_flakey
[27715.729223] Lustre: DEBUG MARKER: e2label /dev/mapper/mds1_flakey
[27716.803220] Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-mds1; mount -t lustre -o localrecov  /dev/mapper/mds1_flakey /mnt/lustre-mds1
[27717.198084] LDISKFS-fs (dm-3): mounted filesystem with ordered data mode. Opts: errors=remount-ro
[27717.281247] LDISKFS-fs (dm-3): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
[27717.336762] Lustre: Found index 0 for lustre-MDT0000, updating log
[27737.944800] Lustre: 1442465:0:(mdt_coordinator.c:1114:mdt_hsm_cdt_start()) lustre-MDT0000: trying to init HSM before MDD
[27737.947032] LustreError: 1442465:0:(mdt_coordinator.c:1125:mdt_hsm_cdt_start()) lustre-MDT0000: cannot take the layout locks needed for registered restore: -2
[27738.455316] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n health_check
[27739.172322] Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us
[27740.057501] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
[27740.646869] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
[27741.696006] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
[27742.140823] Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-113vm4.onyx.whamcloud.com: executing set_default_debug -1 all 4
[27742.141829] Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-113vm4.onyx.whamcloud.com: executing set_default_debug -1 all 4
[27742.593286] Lustre: DEBUG MARKER: onyx-113vm4.onyx.whamcloud.com: executing set_default_debug -1 all 4
[27742.594102] Lustre: DEBUG MARKER: onyx-113vm4.onyx.whamcloud.com: executing set_default_debug -1 all 4
[27743.025929] Lustre: DEBUG MARKER: e2label /dev/mapper/mds1_flakey 				2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}'
[27743.741089] Lustre: DEBUG MARKER: e2label /dev/mapper/mds1_flakey 2>/dev/null
[27744.505150] Lustre: DEBUG MARKER: dmesg
[27745.394611] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  conf-sanity test_132: @@@@@@ FAIL: Can not take the layout lock 
[27745.799850] Lustre: DEBUG MARKER: conf-sanity test_132: @@@@@@ FAIL: Can not take the layout lock
[27746.276110] Lustre: DEBUG MARKER: /usr/sbin/lctl dk > /autotest/autotest-1/2023-01-04/lustre-b2_15_full-part-3_47_63_cbc12095-043f-4981-9a9a-632861982003//conf-sanity.test_132.debug_log.$(hostname -s).1672867954.log;

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
conf-sanity test_132 - Can not take the layout lock



 Comments   
Comment by Andreas Dilger [ 10/Jan/23 ]

This has been failing since 2022-09-27. It may be related to test_122b (LU-14598) failing first, along with test_129: all three have failed on exactly the same days, and the same number of times, over the past 3 months. Alternatively, they may all simply be interop test bugs, and these are the only days on which interop testing was run.

I've pushed a patch, https://review.whamcloud.com/49583 "LU-14598 tests: skip conf-sanity test_122b in interop", to fix that issue; possibly it will fix the other fallout as well.

Comment by Gerrit Updater [ 11/Jan/23 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49601
Subject: LU-16456 tests: skip conf-sanity test_129/132 in interop
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: ae3c44aa38dd9ea77a2a501aa5086760ce62534e

Comment by Gerrit Updater [ 11/Jan/23 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49602
Subject: LU-16456 tests: skip conf-sanity test_129/132 in interop
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: 17f36bcc56255da9290d21ec71575eb56ea66f7c

Comment by Andreas Dilger [ 11/Jan/23 ]

The test_132 failure is also because the test was added in 2.14.57 and exercises new functionality that doesn't exist in 2.14.0. The same applies to test_129 (no bug was filed for that).

I've pushed a patch that will skip both tests.
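
For readers unfamiliar with how such an interop skip works, here is a minimal sketch of a version gate. It is not the content of the patch itself. The helpers version_code and server_supports below are hypothetical stand-ins written for this illustration, and the 2.14.57 and 2.14.0 figures are taken from the comment above.

```sh
#!/bin/sh
# Hypothetical helper (illustration only): pack "major.minor.patch" into a
# single integer so that versions can be compared numerically.
# The function body runs in a subshell, so the IFS change stays local.
version_code() (
	IFS=.
	set -- $1
	echo $(( (${1:-0} << 16) | (${2:-0} << 8) | ${3:-0} ))
)

# Hypothetical gate: succeeds when the server version ($1) is at least the
# version in which the test was introduced ($2).
server_supports() {
	[ "$(version_code "$1")" -ge "$(version_code "$2")" ]
}

# A 2.14.0 server lacks functionality that a test added in 2.14.57 relies
# on, so the test is skipped instead of being reported as a failure.
if server_supports 2.14.0 2.14.57; then
	echo "run test_132"
else
	echo "SKIP test_132: need MDS version at least 2.14.57"
fi
```

With the values shown, the script takes the skip branch. A server at 2.14.57 or later would run the test.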

Comment by Gerrit Updater [ 27/Jan/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49601/
Subject: LU-16456 tests: skip conf-sanity test_129/132 in interop
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 7e566c6a1f9d5324718ebc7149153f3272363b9c

Comment by Gerrit Updater [ 11/Apr/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49602/
Subject: LU-16456 tests: skip conf-sanity test_129/132 in interop
Project: fs/lustre-release
Branch: b2_15
Current Patch Set:
Commit: 151afb445080d9a3f81fa617371b20e56afb9759
