Details
-
Bug
-
Resolution: Fixed
-
Minor
-
Lustre 2.5.3, Lustre 2.8.0
-
None
-
3
-
9223372036854775807
Description
If something tries to kill a copytool is while it is performing some requests, it will deadlock. It won't finish ever, and remains unkillable even with -KILL. The Lustre client has to be rebooted to fix the situation.
The Lustre kernel thread:
[62266.422298] LNet: Service thread pid 30493 was inactive for 62.18s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: [62266.424940] Pid: 30493, comm: ldlm_cb03_001 [62266.426214] Call Trace: [62266.428573] [<ffffffff81661eb9>] schedule+0x29/0x70 [62266.429763] [<ffffffff811f3266>] pipe_wait+0x76/0xd0 [62266.430928] [<ffffffff810a8040>] ? autoremove_wake_function+0x0/0x40 [62266.432062] [<ffffffff811f357f>] pipe_write+0x25f/0x600 [62266.433174] [<ffffffff8101cd35>] ? native_sched_clock+0x35/0x80 [62266.434270] [<ffffffff8101cd89>] ? sched_clock+0x9/0x10 [62266.435351] [<ffffffff81310823>] ? number.isra.2+0x323/0x360 [62266.436418] [<ffffffff811e9d7d>] do_sync_write+0x8d/0xd0 [62266.437455] [<ffffffff811ea5d0>] vfs_write+0xc0/0x1f0 [62266.438570] [<ffffffffa07ba119>] libcfs_kkuc_msg_put+0xa9/0x250 [obdclass] [62266.439598] [<ffffffffa07ba33e>] libcfs_kkuc_group_put+0x7e/0x190 [obdclass] [62266.440604] [<ffffffffa0ba69bd>] mdc_set_info_async+0x7ad/0x820 [mdc] [62266.441692] [<ffffffffa09d1c81>] ldlm_callback_handler.part.24+0x1781/0x2140 [ptlrpc] [62266.442721] [<ffffffffa068d937>] ? libcfs_debug_msg+0x57/0x80 [libcfs] [62266.443712] [<ffffffffa09d267a>] ldlm_callback_handler+0x3a/0xd0 [ptlrpc] [62266.444699] [<ffffffffa09ff081>] ptlrpc_server_handle_request+0x231/0xad0 [ptlrpc] [62266.445672] [<ffffffffa09fcba5>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc] [62266.446599] [<ffffffff810e5b7f>] ? lock_release_holdtime.part.24+0xf/0x180 [62266.447533] [<ffffffffa0a032c0>] ptlrpc_main+0xab0/0x1e30 [ptlrpc] [62266.448416] [<ffffffff810b2ba4>] ? finish_task_switch+0x44/0x180 [62266.449313] [<ffffffffa0a02810>] ? ptlrpc_main+0x0/0x1e30 [ptlrpc] [62266.450138] [<ffffffff810a6e6d>] kthread+0xed/0x100 [62266.450950] [<ffffffff810a6d80>] ? kthread+0x0/0x100 [62266.451757] [<ffffffff8166d4d8>] ret_from_fork+0x58/0x90 [62266.452539] [<ffffffff810a6d80>] ? kthread+0x0/0x100
The copy tool:
[62338.455850] INFO: task tas_cmm:25225 blocked for more than 120 seconds. [62338.457167] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [62338.458499] tas_cmm D ffff88062d8ec650 0 25225 1 0x00000084 [62338.459736] ffff8800a0463a78 0000000000000046 ffff88027685ddc0 ffff8800a0463fd8 [62338.460889] ffff8800a0463fd8 ffff8800a0463fd8 ffff88027685ddc0 ffff88027685ddc0 [62338.462227] ffffffffa0806640 ffffffffa0806648 ffffffff00000000 ffffffffa0806690 [62338.463510] Call Trace: [62338.464693] [<ffffffff81661eb9>] schedule+0x29/0x70 [62338.465447] [<ffffffff81663555>] rwsem_down_write_failed+0x105/0x1c0 [62338.466116] [<ffffffff81312400>] ? pointer.isra.19+0x190/0x4c0 [62338.466954] [<ffffffff81314f43>] call_rwsem_down_write_failed+0x13/0x20 [62338.468232] [<ffffffff81661135>] ? down_write+0x85/0x9c [62338.469526] [<ffffffffa07ba4c9>] ? libcfs_kkuc_group_rem+0x79/0x290 [obdclass] [62338.470827] [<ffffffffa07ba4c9>] libcfs_kkuc_group_rem+0x79/0x290 [obdclass] [62338.471585] [<ffffffffa094e895>] ? obd_iocontrol+0xd5/0x350 [lmv] [62338.472276] [<ffffffffa0954bbe>] lmv_iocontrol+0x78e/0x2720 [lmv] [62338.472973] [<ffffffff810bd815>] ? local_clock+0x25/0x30 [62338.473657] [<ffffffff810e78cd>] ? __lock_acquire.isra.31+0x2ad/0xba0 [62338.474318] [<ffffffff8101cd35>] ? native_sched_clock+0x35/0x80 [62338.475000] [<ffffffff8101cd89>] ? sched_clock+0x9/0x10 [62338.475678] [<ffffffff810e78cd>] ? __lock_acquire.isra.31+0x2ad/0xba0 [62338.476329] [<ffffffff811cd743>] ? kmem_cache_alloc_trace+0x1b3/0x1d0 [62338.477018] [<ffffffffa118b7cf>] ? copy_and_ioctl.constprop.22+0x2f/0x4f0 [lustre] [62338.477714] [<ffffffffa118b98b>] copy_and_ioctl.constprop.22+0x1eb/0x4f0 [lustre] [62338.478393] [<ffffffffa1191a88>] ll_dir_ioctl+0x2438/0x6420 [lustre] [62338.479074] [<ffffffff81657e30>] ? __slab_free+0x10e/0x277 [62338.479771] [<ffffffff8101cd35>] ? native_sched_clock+0x35/0x80 [62338.480410] [<ffffffff8101cd89>] ? sched_clock+0x9/0x10 [62338.481077] [<ffffffff810bd815>] ? local_clock+0x25/0x30 [62338.481736] [<ffffffff810e78cd>] ? __lock_acquire.isra.31+0x2ad/0xba0 [62338.482361] [<ffffffff8101cd35>] ? native_sched_clock+0x35/0x80 [62338.483012] [<ffffffff8101cd89>] ? sched_clock+0x9/0x10 [62338.483662] [<ffffffff810bd815>] ? local_clock+0x25/0x30 [62338.484280] [<ffffffff810e5b7f>] ? lock_release_holdtime.part.24+0xf/0x180 [62338.484939] [<ffffffff810af97f>] ? lg_local_unlock+0x1f/0x60 [62338.485565] [<ffffffff8120b98e>] ? mntput_no_expire+0x3e/0x120 [62338.486220] [<ffffffff811fe795>] do_vfs_ioctl+0x2e5/0x4d0 [62338.486863] [<ffffffff811fea21>] SyS_ioctl+0xa1/0xc0 [62338.487451] [<ffffffff8166d589>] system_call_fastpath+0x16/0x1b
Lock debug reports:
[62362.788685] 1 lock held by ldlm_cb03_001/30493:
[62362.789459] #0: (kg_sem){......}, at: [<ffffffffa07ba307>] libcfs_kkuc_group_put+0x47/0x190 [obdclass]
[62362.800051] 1 lock held by tas_cmm/25225:
[62362.800886] #0: (kg_sem){......}, at: [<ffffffffa07ba4c9>] libcfs_kkuc_group_rem+0x79/0x290 [obdclass]
It appears that ldlm_cb03_001 block on the pipe_wait, while holding the lock that the copytool needs to exit. The copytool can't read the message because it is finishing and waiting for the lock that ldlm_cb03_001 has.