Description
replay-single test_80d hangs on MDT unmount. Looking at a recent hang at https://testing.whamcloud.com/test_sets/ffb8585c-a706-11e8-80f7-52540065bddc, the last thing we see in the test log is:
CMD: trevis-44vm10 /usr/sbin/lctl --device lustre-MDT0001 notransno
CMD: trevis-44vm10 /usr/sbin/lctl --device lustre-MDT0001 readonly
CMD: trevis-44vm10 /usr/sbin/lctl mark mds2 REPLAY BARRIER on lustre-MDT0001
Failing mds1 on trevis-44vm9
CMD: trevis-44vm9 grep -c /mnt/lustre-mds1' ' /proc/mounts || true
Stopping /mnt/lustre-mds1 (opts:) on trevis-44vm9
CMD: trevis-44vm9 umount -d /mnt/lustre-mds1
CMD: trevis-44vm9 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST ' || true
CMD: trevis-44vm9 ! zpool list -H lustre-mdt1 >/dev/null 2>&1 || grep -q ^lustre-mdt1/ /proc/mounts || zpool export lustre-mdt1
Failing mds2 on trevis-44vm10
CMD: trevis-44vm10 grep -c /mnt/lustre-mds2' ' /proc/mounts || true
Stopping /mnt/lustre-mds2 (opts:) on trevis-44vm10
CMD: trevis-44vm10 umount -d /mnt/lustre-mds2
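For reference, the sequence the test framework is driving here amounts to roughly the following commands on the two MDS nodes (a minimal sketch assembled from the CMD: lines above; the hostnames, device names, and mount points are the ones from this particular run):

# On trevis-44vm10 (mds2): set a replay barrier on lustre-MDT0001
lctl --device lustre-MDT0001 notransno    # stop assigning new transaction numbers
lctl --device lustre-MDT0001 readonly     # make the target read-only
lctl mark "mds2 REPLAY BARRIER on lustre-MDT0001"

# On trevis-44vm9 (mds1): fail mds1 by unmounting it
umount -d /mnt/lustre-mds1
zpool export lustre-mdt1                  # ZFS case: export the backing pool if still imported

# On trevis-44vm10 (mds2): fail mds2 the same way; this is the umount that never returns
umount -d /mnt/lustre-mds2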
Looking at the console log for the MDS (vm10), we see some errors and a call trace:
[ 5775.673375] Lustre: DEBUG MARKER: /usr/sbin/lctl --device lustre-MDT0001 notransno
[ 5776.031453] Lustre: DEBUG MARKER: /usr/sbin/lctl --device lustre-MDT0001 readonly
[ 5776.390963] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds2 REPLAY BARRIER on lustre-MDT0001
[ 5776.569652] Lustre: DEBUG MARKER: mds2 REPLAY BARRIER on lustre-MDT0001
[ 5778.383410] Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds2' ' /proc/mounts || true
[ 5778.731629] Lustre: DEBUG MARKER: umount -d /mnt/lustre-mds2
[ 5778.917429] LustreError: 8724:0:(ldlm_resource.c:1146:ldlm_resource_complain()) lustre-MDT0000-osp-MDT0001: namespace resource [0x20000abf8:0xa:0x0].0x0 (ffff96bab691e6c0) refcount nonzero (1) after lock cleanup; forcing cleanup.
[ 5778.922402] LustreError: 8724:0:(ldlm_resource.c:1146:ldlm_resource_complain()) lustre-MDT0000-osp-MDT0001: namespace resource [0x20000abf8:0xa:0x0].0x0 (ffff96bab691e6c0) refcount nonzero (1) after lock cleanup; forcing cleanup.
[ 5783.926373] LustreError: 0-0: Forced cleanup waiting for lustre-MDT0000-osp-MDT0001 namespace with 1 resources in use, (rc=-110)
[ 5830.783393] LNet: Service thread pid 27370 was inactive for 62.11s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[ 5830.785144] Pid: 27370, comm: mdt00_000 3.10.0-862.9.1.el7_lustre.x86_64 #1 SMP Fri Aug 17 20:37:05 UTC 2018
[ 5830.786119] Call Trace:
[ 5830.786419] [<ffffffffc0f14dca>] top_trans_wait_result+0xa6/0x155 [ptlrpc]
[ 5830.787317] [<ffffffffc0ef6c0b>] top_trans_stop+0x42b/0x930 [ptlrpc]
[ 5830.788062] [<ffffffffc131d80c>] lod_trans_stop+0x25c/0x340 [lod]
[ 5830.788958] [<ffffffffc13ca9ae>] mdd_trans_stop+0x2e/0x174 [mdd]
[ 5830.789654] [<ffffffffc13af627>] mdd_create+0x967/0x13f0 [mdd]
[ 5830.790307] [<ffffffffc124cc80>] mdt_create+0x820/0xc40 [mdt]
[ 5830.791044] [<ffffffffc124d20b>] mdt_reint_create+0x16b/0x350 [mdt]
[ 5830.791709] [<ffffffffc124ece3>] mdt_reint_rec+0x83/0x210 [mdt]
[ 5830.792402] [<ffffffffc122e1d2>] mdt_reint_internal+0x6b2/0xa80 [mdt]
[ 5830.793112] [<ffffffffc12391e7>] mdt_reint+0x67/0x140 [mdt]
[ 5830.793705] [<ffffffffc0ee627a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[ 5830.794488] [<ffffffffc0e8940b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[ 5830.795329] [<ffffffffc0e8cc44>] ptlrpc_main+0xb14/0x1fb0 [ptlrpc]
[ 5830.796151] [<ffffffffb88bb621>] kthread+0xd1/0xe0
[ 5830.796694] [<ffffffffb8f205f7>] ret_from_fork_nospec_end+0x0/0x39
[ 5830.797409] [<ffffffffffffffff>] 0xffffffffffffffff
[ 5830.798027] LustreError: dumping log to /tmp/lustre-log.1535036577.27370
[ 9434.653754] SysRq : Changing Loglevel
[ 9434.654263] Loglevel set to 8
This test only hangs in this way for DNE/ZFS testing.
It's not easy to determine exactly when this issue started because of issues with the kernel call trace, but this test did not fail or hang in July 2018, and the first sign of replay-single test 80d hanging is on August 6, 2018.
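For anyone trying to reproduce this locally, the failing configuration should be selectable through the usual test-framework environment variables (a hedged sketch; the variable names are the standard ones from lustre/tests/test-framework.sh, and the exact invocation may need adjusting for a given setup):

# Run only replay-single test 80d with two MDTs (DNE) on a ZFS backend
cd lustre/tests
MDSCOUNT=2 FSTYPE=zfs ONLY=80d bash replay-single.sh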
Here are links to logs for some hangs:
https://testing.whamcloud.com/test_sets/06129818-af39-11e8-bd05-52540065bddc
https://testing.whamcloud.com/test_sets/0339a0e2-ae2a-11e8-bfed-52540065bddc
https://testing.whamcloud.com/test_sets/021c99b8-ad2f-11e8-bfed-52540065bddc
https://testing.whamcloud.com/test_sets/fd1d1dd8-abe0-11e8-80f7-52540065bddc
https://testing.whamcloud.com/test_sets/057c01e4-9f6e-11e8-b0aa-52540065bddc