[LU-11336] replay-single test 80d hangs on MDT unmount Created: 05/Sep/18  Updated: 11/Mar/20

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: James Nunez (Inactive) Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: DNE, zfs
Environment:

DNE/ZFS


Issue Links:
Related
is related to LU-10740 replay-single test_2d: FAIL: checksta... Resolved
is related to LU-11418 hung threads on MDT and MDT won't umount Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

replay-single test_80d hangs on MDT unmount. Looking at a recent hang at https://testing.whamcloud.com/test_sets/ffb8585c-a706-11e8-80f7-52540065bddc, the last thing we see in the test log is

CMD: trevis-44vm10 /usr/sbin/lctl --device lustre-MDT0001 notransno
CMD: trevis-44vm10 /usr/sbin/lctl --device lustre-MDT0001 readonly
CMD: trevis-44vm10 /usr/sbin/lctl mark mds2 REPLAY BARRIER on lustre-MDT0001
Failing mds1 on trevis-44vm9
CMD: trevis-44vm9 grep -c /mnt/lustre-mds1' ' /proc/mounts || true
Stopping /mnt/lustre-mds1 (opts:) on trevis-44vm9
CMD: trevis-44vm9 umount -d /mnt/lustre-mds1
CMD: trevis-44vm9 lsmod | grep lnet > /dev/null &&
lctl dl | grep ' ST ' || true
CMD: trevis-44vm9 ! zpool list -H lustre-mdt1 >/dev/null 2>&1 ||
			grep -q ^lustre-mdt1/ /proc/mounts ||
			zpool export  lustre-mdt1
Failing mds2 on trevis-44vm10
CMD: trevis-44vm10 grep -c /mnt/lustre-mds2' ' /proc/mounts || true
Stopping /mnt/lustre-mds2 (opts:) on trevis-44vm10
CMD: trevis-44vm10 umount -d /mnt/lustre-mds2

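For context, the failover sequence visible in the log excerpt above corresponds roughly to the following test-framework calls. This is only a sketch in bash, assuming the standard replay_barrier and fail helpers from lustre/tests/test-framework.sh; the actual test body lives in lustre/tests/replay-single.sh (test_80d) and may differ in detail.

# sketch only, not the literal test_80d body
replay_barrier mds2    # issues the "notransno", "readonly" and "REPLAY BARRIER" commands seen above for lustre-MDT0001
fail mds1,mds2         # unmounts /mnt/lustre-mds1 and /mnt/lustre-mds2, remounts them and waits for recovery
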
Looking at the console log for the MDS (vm10), we see some errors and a call trace

[ 5775.673375] Lustre: DEBUG MARKER: /usr/sbin/lctl --device lustre-MDT0001 notransno
[ 5776.031453] Lustre: DEBUG MARKER: /usr/sbin/lctl --device lustre-MDT0001 readonly
[ 5776.390963] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds2 REPLAY BARRIER on lustre-MDT0001
[ 5776.569652] Lustre: DEBUG MARKER: mds2 REPLAY BARRIER on lustre-MDT0001
[ 5778.383410] Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds2' ' /proc/mounts || true
[ 5778.731629] Lustre: DEBUG MARKER: umount -d /mnt/lustre-mds2
[ 5778.917429] LustreError: 8724:0:(ldlm_resource.c:1146:ldlm_resource_complain()) lustre-MDT0000-osp-MDT0001: namespace resource [0x20000abf8:0xa:0x0].0x0 (ffff96bab691e6c0) refcount nonzero (1) after lock cleanup; forcing cleanup.
[ 5778.922402] LustreError: 8724:0:(ldlm_resource.c:1146:ldlm_resource_complain()) lustre-MDT0000-osp-MDT0001: namespace resource [0x20000abf8:0xa:0x0].0x0 (ffff96bab691e6c0) refcount nonzero (1) after lock cleanup; forcing cleanup.
[ 5783.926373] LustreError: 0-0: Forced cleanup waiting for lustre-MDT0000-osp-MDT0001 namespace with 1 resources in use, (rc=-110)
[ 5830.783393] LNet: Service thread pid 27370 was inactive for 62.11s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[ 5830.785144] Pid: 27370, comm: mdt00_000 3.10.0-862.9.1.el7_lustre.x86_64 #1 SMP Fri Aug 17 20:37:05 UTC 2018
[ 5830.786119] Call Trace:
[ 5830.786419]  [<ffffffffc0f14dca>] top_trans_wait_result+0xa6/0x155 [ptlrpc]
[ 5830.787317]  [<ffffffffc0ef6c0b>] top_trans_stop+0x42b/0x930 [ptlrpc]
[ 5830.788062]  [<ffffffffc131d80c>] lod_trans_stop+0x25c/0x340 [lod]
[ 5830.788958]  [<ffffffffc13ca9ae>] mdd_trans_stop+0x2e/0x174 [mdd]
[ 5830.789654]  [<ffffffffc13af627>] mdd_create+0x967/0x13f0 [mdd]
[ 5830.790307]  [<ffffffffc124cc80>] mdt_create+0x820/0xc40 [mdt]
[ 5830.791044]  [<ffffffffc124d20b>] mdt_reint_create+0x16b/0x350 [mdt]
[ 5830.791709]  [<ffffffffc124ece3>] mdt_reint_rec+0x83/0x210 [mdt]
[ 5830.792402]  [<ffffffffc122e1d2>] mdt_reint_internal+0x6b2/0xa80 [mdt]
[ 5830.793112]  [<ffffffffc12391e7>] mdt_reint+0x67/0x140 [mdt]
[ 5830.793705]  [<ffffffffc0ee627a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[ 5830.794488]  [<ffffffffc0e8940b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[ 5830.795329]  [<ffffffffc0e8cc44>] ptlrpc_main+0xb14/0x1fb0 [ptlrpc]
[ 5830.796151]  [<ffffffffb88bb621>] kthread+0xd1/0xe0
[ 5830.796694]  [<ffffffffb8f205f7>] ret_from_fork_nospec_end+0x0/0x39
[ 5830.797409]  [<ffffffffffffffff>] 0xffffffffffffffff
[ 5830.798027] LustreError: dumping log to /tmp/lustre-log.1535036577.27370
[ 9434.653754] SysRq : Changing Loglevel
[ 9434.654263] Loglevel set to 8

The stack trace above shows the mdt00_000 service thread blocked in top_trans_wait_result(), i.e. waiting for a distributed (cross-MDT) transaction to complete, which appears to be what keeps the MDT unmount from finishing. This test only hangs in this way in DNE/ZFS testing.
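
When this hang is hit on a live system, it may be worth capturing the blocked task stacks and the Lustre debug buffer on the hung MDS before it is rebooted. These are standard Linux sysrq and lctl commands; the output file name is only illustrative.

echo w > /proc/sysrq-trigger             # dump stacks of all blocked (uninterruptible) tasks to the console
lctl dk /tmp/lustre-debug.$(hostname)    # write the Lustre kernel debug log to a file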

It's not easy to pin down when this issue started because of problems matching on the kernel call trace, but this test did not fail or hang in July 2018, and the first sign of replay-single test_80d hanging is from August 6, 2018.
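
For reference, a hedged sketch of how to run only this subtest against a DNE/ZFS configuration with the standard test scripts, run from the lustre/tests directory; ONLY, MDSCOUNT and FSTYPE are existing test-framework variables, and the values below simply reflect the configuration this report was seen with.

MDSCOUNT=2 FSTYPE=zfs ONLY=80d bash replay-single.sh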

Here are links to logs for some hangs:
https://testing.whamcloud.com/test_sets/06129818-af39-11e8-bd05-52540065bddc
https://testing.whamcloud.com/test_sets/0339a0e2-ae2a-11e8-bfed-52540065bddc
https://testing.whamcloud.com/test_sets/021c99b8-ad2f-11e8-bfed-52540065bddc
https://testing.whamcloud.com/test_sets/fd1d1dd8-abe0-11e8-80f7-52540065bddc
https://testing.whamcloud.com/test_sets/057c01e4-9f6e-11e8-b0aa-52540065bddc

