[LU-11336] replay-single test 80d hangs on MDT unmount Created: 05/Sep/18 Updated: 11/Mar/20
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | James Nunez (Inactive) | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | DNE, zfs |
| Environment: | DNE/ZFS |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
replay-single test_80d hangs on MDT unmount. Looking at a recent hang, at https://testing.whamcloud.com/test_sets/ffb8585c-a706-11e8-80f7-52540065bddc, the last thing we see in the test log is:

CMD: trevis-44vm10 /usr/sbin/lctl --device lustre-MDT0001 notransno
CMD: trevis-44vm10 /usr/sbin/lctl --device lustre-MDT0001 readonly
CMD: trevis-44vm10 /usr/sbin/lctl mark mds2 REPLAY BARRIER on lustre-MDT0001
Failing mds1 on trevis-44vm9
CMD: trevis-44vm9 grep -c /mnt/lustre-mds1' ' /proc/mounts || true
Stopping /mnt/lustre-mds1 (opts:) on trevis-44vm9
CMD: trevis-44vm9 umount -d /mnt/lustre-mds1
CMD: trevis-44vm9 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST ' || true
CMD: trevis-44vm9 ! zpool list -H lustre-mdt1 >/dev/null 2>&1 || grep -q ^lustre-mdt1/ /proc/mounts || zpool export lustre-mdt1
Failing mds2 on trevis-44vm10
CMD: trevis-44vm10 grep -c /mnt/lustre-mds2' ' /proc/mounts || true
Stopping /mnt/lustre-mds2 (opts:) on trevis-44vm10
CMD: trevis-44vm10 umount -d /mnt/lustre-mds2

Looking at the console log for the MDS (vm10), we see some errors and a call trace:

[ 5775.673375] Lustre: DEBUG MARKER: /usr/sbin/lctl --device lustre-MDT0001 notransno
[ 5776.031453] Lustre: DEBUG MARKER: /usr/sbin/lctl --device lustre-MDT0001 readonly
[ 5776.390963] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds2 REPLAY BARRIER on lustre-MDT0001
[ 5776.569652] Lustre: DEBUG MARKER: mds2 REPLAY BARRIER on lustre-MDT0001
[ 5778.383410] Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds2' ' /proc/mounts || true
[ 5778.731629] Lustre: DEBUG MARKER: umount -d /mnt/lustre-mds2
[ 5778.917429] LustreError: 8724:0:(ldlm_resource.c:1146:ldlm_resource_complain()) lustre-MDT0000-osp-MDT0001: namespace resource [0x20000abf8:0xa:0x0].0x0 (ffff96bab691e6c0) refcount nonzero (1) after lock cleanup; forcing cleanup.
[ 5778.922402] LustreError: 8724:0:(ldlm_resource.c:1146:ldlm_resource_complain()) lustre-MDT0000-osp-MDT0001: namespace resource [0x20000abf8:0xa:0x0].0x0 (ffff96bab691e6c0) refcount nonzero (1) after lock cleanup; forcing cleanup.
[ 5783.926373] LustreError: 0-0: Forced cleanup waiting for lustre-MDT0000-osp-MDT0001 namespace with 1 resources in use, (rc=-110)
[ 5830.783393] LNet: Service thread pid 27370 was inactive for 62.11s. The thread might be hung, or it might only be slow and will resume later.
Dumping the stack trace for debugging purposes:
[ 5830.785144] Pid: 27370, comm: mdt00_000 3.10.0-862.9.1.el7_lustre.x86_64 #1 SMP Fri Aug 17 20:37:05 UTC 2018
[ 5830.786119] Call Trace:
[ 5830.786419] [<ffffffffc0f14dca>] top_trans_wait_result+0xa6/0x155 [ptlrpc]
[ 5830.787317] [<ffffffffc0ef6c0b>] top_trans_stop+0x42b/0x930 [ptlrpc]
[ 5830.788062] [<ffffffffc131d80c>] lod_trans_stop+0x25c/0x340 [lod]
[ 5830.788958] [<ffffffffc13ca9ae>] mdd_trans_stop+0x2e/0x174 [mdd]
[ 5830.789654] [<ffffffffc13af627>] mdd_create+0x967/0x13f0 [mdd]
[ 5830.790307] [<ffffffffc124cc80>] mdt_create+0x820/0xc40 [mdt]
[ 5830.791044] [<ffffffffc124d20b>] mdt_reint_create+0x16b/0x350 [mdt]
[ 5830.791709] [<ffffffffc124ece3>] mdt_reint_rec+0x83/0x210 [mdt]
[ 5830.792402] [<ffffffffc122e1d2>] mdt_reint_internal+0x6b2/0xa80 [mdt]
[ 5830.793112] [<ffffffffc12391e7>] mdt_reint+0x67/0x140 [mdt]
[ 5830.793705] [<ffffffffc0ee627a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[ 5830.794488] [<ffffffffc0e8940b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[ 5830.795329] [<ffffffffc0e8cc44>] ptlrpc_main+0xb14/0x1fb0 [ptlrpc]
[ 5830.796151] [<ffffffffb88bb621>] kthread+0xd1/0xe0
[ 5830.796694] [<ffffffffb8f205f7>] ret_from_fork_nospec_end+0x0/0x39
[ 5830.797409] [<ffffffffffffffff>] 0xffffffffffffffff
[ 5830.798027] LustreError: dumping log to /tmp/lustre-log.1535036577.27370
[ 9434.653754] SysRq : Changing Loglevel
[ 9434.654263] Loglevel set to 8
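For reference, a minimal sketch of how one might try to reproduce this outside autotest, assuming a standard lustre/tests developer setup (ONLY, MDSCOUNT, and FSTYPE are the usual test-framework variables; the exact invocation and the node/device configuration in cfg/local.sh will differ per site, so treat this as an illustration rather than the exact autotest invocation):

cd lustre/tests
MDSCOUNT=2 FSTYPE=zfs ONLY=80d bash replay-single.sh

MDSCOUNT=2 gives the DNE (multiple MDT) layout and FSTYPE=zfs selects the ZFS backend, matching the configuration in which the hang is seen.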
This test only hangs in this way for DNE/ZFS testing. It is not easy to determine when this issue started because of issues with the kernel call trace, but this test did not fail or hang in July 2018, and the first sign of replay-single test 80d hanging is on August 6, 2018. Here are links to logs for some hangs: