[LU-7992] HSM: mutual deadlock when killing a copytool Created: 06/Apr/16 Updated: 29/Mar/17 Resolved: 21/Apr/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.3, Lustre 2.8.0 |
| Fix Version/s: | Lustre 2.9.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Frank Zago (Inactive) | Assignee: | John Hammond |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
If something tries to kill a copytool while it is performing requests, the kill deadlocks. The copytool never finishes and remains unkillable, even with -KILL. The Lustre client has to be rebooted to recover.

The Lustre kernel thread:

[62266.422298] LNet: Service thread pid 30493 was inactive for 62.18s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[62266.424940] Pid: 30493, comm: ldlm_cb03_001
[62266.426214] Call Trace:
[62266.428573] [<ffffffff81661eb9>] schedule+0x29/0x70
[62266.429763] [<ffffffff811f3266>] pipe_wait+0x76/0xd0
[62266.430928] [<ffffffff810a8040>] ? autoremove_wake_function+0x0/0x40
[62266.432062] [<ffffffff811f357f>] pipe_write+0x25f/0x600
[62266.433174] [<ffffffff8101cd35>] ? native_sched_clock+0x35/0x80
[62266.434270] [<ffffffff8101cd89>] ? sched_clock+0x9/0x10
[62266.435351] [<ffffffff81310823>] ? number.isra.2+0x323/0x360
[62266.436418] [<ffffffff811e9d7d>] do_sync_write+0x8d/0xd0
[62266.437455] [<ffffffff811ea5d0>] vfs_write+0xc0/0x1f0
[62266.438570] [<ffffffffa07ba119>] libcfs_kkuc_msg_put+0xa9/0x250 [obdclass]
[62266.439598] [<ffffffffa07ba33e>] libcfs_kkuc_group_put+0x7e/0x190 [obdclass]
[62266.440604] [<ffffffffa0ba69bd>] mdc_set_info_async+0x7ad/0x820 [mdc]
[62266.441692] [<ffffffffa09d1c81>] ldlm_callback_handler.part.24+0x1781/0x2140 [ptlrpc]
[62266.442721] [<ffffffffa068d937>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[62266.443712] [<ffffffffa09d267a>] ldlm_callback_handler+0x3a/0xd0 [ptlrpc]
[62266.444699] [<ffffffffa09ff081>] ptlrpc_server_handle_request+0x231/0xad0 [ptlrpc]
[62266.445672] [<ffffffffa09fcba5>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
[62266.446599] [<ffffffff810e5b7f>] ? lock_release_holdtime.part.24+0xf/0x180
[62266.447533] [<ffffffffa0a032c0>] ptlrpc_main+0xab0/0x1e30 [ptlrpc]
[62266.448416] [<ffffffff810b2ba4>] ? finish_task_switch+0x44/0x180
[62266.449313] [<ffffffffa0a02810>] ? ptlrpc_main+0x0/0x1e30 [ptlrpc]
[62266.450138] [<ffffffff810a6e6d>] kthread+0xed/0x100
[62266.450950] [<ffffffff810a6d80>] ? kthread+0x0/0x100
[62266.451757] [<ffffffff8166d4d8>] ret_from_fork+0x58/0x90
[62266.452539] [<ffffffff810a6d80>] ? kthread+0x0/0x100

The copy tool:

[62338.455850] INFO: task tas_cmm:25225 blocked for more than 120 seconds.
[62338.457167] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[62338.458499] tas_cmm D ffff88062d8ec650 0 25225 1 0x00000084
[62338.459736] ffff8800a0463a78 0000000000000046 ffff88027685ddc0 ffff8800a0463fd8
[62338.460889] ffff8800a0463fd8 ffff8800a0463fd8 ffff88027685ddc0 ffff88027685ddc0
[62338.462227] ffffffffa0806640 ffffffffa0806648 ffffffff00000000 ffffffffa0806690
[62338.463510] Call Trace:
[62338.464693] [<ffffffff81661eb9>] schedule+0x29/0x70
[62338.465447] [<ffffffff81663555>] rwsem_down_write_failed+0x105/0x1c0
[62338.466116] [<ffffffff81312400>] ? pointer.isra.19+0x190/0x4c0
[62338.466954] [<ffffffff81314f43>] call_rwsem_down_write_failed+0x13/0x20
[62338.468232] [<ffffffff81661135>] ? down_write+0x85/0x9c
[62338.469526] [<ffffffffa07ba4c9>] ? libcfs_kkuc_group_rem+0x79/0x290 [obdclass]
[62338.470827] [<ffffffffa07ba4c9>] libcfs_kkuc_group_rem+0x79/0x290 [obdclass]
[62338.471585] [<ffffffffa094e895>] ? obd_iocontrol+0xd5/0x350 [lmv]
[62338.472276] [<ffffffffa0954bbe>] lmv_iocontrol+0x78e/0x2720 [lmv]
[62338.472973] [<ffffffff810bd815>] ? local_clock+0x25/0x30
[62338.473657] [<ffffffff810e78cd>] ? __lock_acquire.isra.31+0x2ad/0xba0
[62338.474318] [<ffffffff8101cd35>] ? native_sched_clock+0x35/0x80
[62338.475000] [<ffffffff8101cd89>] ? sched_clock+0x9/0x10
[62338.475678] [<ffffffff810e78cd>] ? __lock_acquire.isra.31+0x2ad/0xba0
[62338.476329] [<ffffffff811cd743>] ? kmem_cache_alloc_trace+0x1b3/0x1d0
[62338.477018] [<ffffffffa118b7cf>] ? copy_and_ioctl.constprop.22+0x2f/0x4f0 [lustre]
[62338.477714] [<ffffffffa118b98b>] copy_and_ioctl.constprop.22+0x1eb/0x4f0 [lustre]
[62338.478393] [<ffffffffa1191a88>] ll_dir_ioctl+0x2438/0x6420 [lustre]
[62338.479074] [<ffffffff81657e30>] ? __slab_free+0x10e/0x277
[62338.479771] [<ffffffff8101cd35>] ? native_sched_clock+0x35/0x80
[62338.480410] [<ffffffff8101cd89>] ? sched_clock+0x9/0x10
[62338.481077] [<ffffffff810bd815>] ? local_clock+0x25/0x30
[62338.481736] [<ffffffff810e78cd>] ? __lock_acquire.isra.31+0x2ad/0xba0
[62338.482361] [<ffffffff8101cd35>] ? native_sched_clock+0x35/0x80
[62338.483012] [<ffffffff8101cd89>] ? sched_clock+0x9/0x10
[62338.483662] [<ffffffff810bd815>] ? local_clock+0x25/0x30
[62338.484280] [<ffffffff810e5b7f>] ? lock_release_holdtime.part.24+0xf/0x180
[62338.484939] [<ffffffff810af97f>] ? lg_local_unlock+0x1f/0x60
[62338.485565] [<ffffffff8120b98e>] ? mntput_no_expire+0x3e/0x120
[62338.486220] [<ffffffff811fe795>] do_vfs_ioctl+0x2e5/0x4d0
[62338.486863] [<ffffffff811fea21>] SyS_ioctl+0xa1/0xc0
[62338.487451] [<ffffffff8166d589>] system_call_fastpath+0x16/0x1b

Lock debug reports:

[62362.788685] 1 lock held by ldlm_cb03_001/30493:
[62362.789459] #0: (kg_sem){......}, at: [<ffffffffa07ba307>] libcfs_kkuc_group_put+0x47/0x190 [obdclass]
[62362.800051] 1 lock held by tas_cmm/25225:
[62362.800886] #0: (kg_sem){......}, at: [<ffffffffa07ba4c9>] libcfs_kkuc_group_rem+0x79/0x290 [obdclass]
It appears that ldlm_cb03_001 is blocked in pipe_wait while holding the lock (kg_sem) that the copytool needs in order to exit. The copytool cannot read the message from the pipe because it is shutting down and is itself waiting for the lock that ldlm_cb03_001 holds. |
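For illustration, here is a minimal userspace sketch of the lock-ordering pattern involved. This is not the actual Lustre code: the kg_sem rwsem and the kernel-to-userspace pipe are modelled with a pthread rwlock and an ordinary pipe, and the reader/writer split (the service thread holding kg_sem shared while the unregistering copytool wants it exclusive) is an assumption inferred from the rwsem_down_write_failed frame above.

/*
 * Hypothetical userspace analogue of the deadlock (not Lustre code).
 * Thread "put" plays ldlm_cb03_001: it takes kg_sem and then blocks in a
 * write to a pipe that nobody drains.  Thread "rem" plays the exiting
 * copytool: it wants kg_sem before it would ever get back to the pipe.
 * Compile with: cc -pthread deadlock_sketch.c
 */
#include <pthread.h>
#include <string.h>
#include <unistd.h>

static pthread_rwlock_t kg_sem = PTHREAD_RWLOCK_INITIALIZER;
static int pipefd[2];

/* Stand-in for libcfs_kkuc_group_put() -> libcfs_kkuc_msg_put(). */
static void *kkuc_group_put(void *arg)
{
	char msg[4096];

	(void)arg;
	memset(msg, 0, sizeof(msg));
	pthread_rwlock_rdlock(&kg_sem);		/* "down_read(&kg_sem)" */
	/* The pipe fills up and the next write blocks forever,
	 * still holding kg_sem (this is the pipe_wait frame). */
	while (write(pipefd[1], msg, sizeof(msg)) > 0)
		;
	pthread_rwlock_unlock(&kg_sem);
	return NULL;
}

/* Stand-in for libcfs_kkuc_group_rem() on the copytool's exit path. */
static void *kkuc_group_rem(void *arg)
{
	(void)arg;
	/* Blocks in "down_write(&kg_sem)": the writer above never lets go. */
	pthread_rwlock_wrlock(&kg_sem);
	close(pipefd[0]);			/* never reached */
	pthread_rwlock_unlock(&kg_sem);
	return NULL;
}

int main(void)
{
	pthread_t put, rem;

	pipe(pipefd);				/* nothing ever reads from it */
	pthread_create(&put, NULL, kkuc_group_put, NULL);
	sleep(1);				/* let the writer fill the pipe */
	pthread_create(&rem, NULL, kkuc_group_rem, NULL);
	pthread_join(put, NULL);		/* never returns: mutual deadlock */
	pthread_join(rem, NULL);
	return 0;
}

Under these assumptions the program hangs exactly as the two stack traces show: one thread sits in the blocking write with the lock held, and the other sits waiting for the write lock. |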
| Comments |
| Comment by Frank Zago (Inactive) [ 06/Apr/16 ] |
|
This is easily reproducible: create a bunch of files, start the copytool, then kill it. |
| Comment by John Hammond [ 07/Apr/16 ] |
|
I have a patch for this which I'll push after LUG. |
| Comment by Frank Zago (Inactive) [ 07/Apr/16 ] |
|
You can fix bugs in less time than it takes me to write the report. |
| Comment by Gerrit Updater [ 11/Apr/16 ] |
|
John L. Hammond (john.hammond@intel.com) uploaded a new patch: http://review.whamcloud.com/19442 |
| Comment by Gerrit Updater [ 14/Apr/16 ] |
|
Frank Zago (fzago@cray.com) uploaded a new patch: http://review.whamcloud.com/19576 |
| Comment by Gerrit Updater [ 21/Apr/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/19442/ |
| Comment by Joseph Gmitter (Inactive) [ 21/Apr/16 ] |
|
Landed to master for 2.9.0 |