Description
Jobs that perform partial file system walks, such as the dsync jobs run for Lustre file system migrations, may read many directories and stat many files - for trees belonging to heavy users, easily many millions.
When the job finishes and the epilog drops caches, the node tries to release all of those locks as fast as possible.
When the lru_size tunable is set to 0 (LRU size managed by Lustre), the number of locks held by the client grows very large.
The resulting flood of cancel requests overwhelms the MDT, causing other requests to time out, producing hangs on other operations, and so on.
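For context, the lock buildup can be watched on a client before the epilog runs. The commands below are a rough sketch, assuming a standard Lustre client with lctl available and assuming the epilog uses the usual drop_caches mechanism (exact namespace names and the real epilog contents will differ):

    # per-namespace count of DLM locks currently held by this client
    lctl get_param ldlm.namespaces.*.lock_count
    # current LRU sizing; 0 means the LRU size is managed by Lustre
    lctl get_param ldlm.namespaces.*.lru_size
    # roughly what the epilog does, which triggers the mass lock cancellation
    echo 3 > /proc/sys/vm/drop_caches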
On the MDT console log we start to see many "Could not add any time...not sending early reply" messages, showing the server is clearly overwhelmed. Those are followed by "Dropping timed-out request" messages, then many "ldlm_canceld: This server is not able to keep up with request traffic" messages, and eventually watchdog stack traces.
Actual forward progress slows to a crawl. We tried simply leaving the MDT alone and ensuring there was no new traffic, and after an hour the situation had not changed.
Setting lru_size to something small, e.g. 1000, prevented this behavior.
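The workaround amounts to pinning the client-side lock LRU to a fixed size with lctl. This is a sketch, assuming the limit is applied uniformly to all namespaces on each client (lctl set_param -P or a conf_param could make it persistent):

    # cap the DLM lock LRU at 1000 locks per namespace instead of
    # letting Lustre manage the size dynamically (lru_size=0)
    lctl set_param ldlm.namespaces.*.lru_size=1000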
The specific patch stack is tagged 2.15.5_6.llnl (see https://github.com/LLNL/lustre/releases/tag/2.15.5_6.llnl)
We saw the following stack traces on MDTs while this was occurring:
CPU 8 RIP 0010:native_queued_spin_lock_slowpath+0x147/0x1c0
[2093055.320176] Call Trace:
[2093055.320177] <NMI>
[2093055.320177] ? nmi_cpu_backtrace.cold.8+0x36/0x4f
[2093055.320178] ? nmi_cpu_backtrace_handler+0xd/0x20
[2093055.320178] ? nmi_handle+0x6a/0x130
[2093055.320179] ? default_do_nmi+0x49/0x120
[2093055.320179] ? do_nmi+0x19c/0x210
[2093055.320180] ? end_repeat_nmi+0x16/0x69
[2093055.320180] ? native_queued_spin_lock_slowpath+0x147/0x1c0
[2093055.320180] ? native_queued_spin_lock_slowpath+0x147/0x1c0
[2093055.320181] ? native_queued_spin_lock_slowpath+0x147/0x1c0
[2093055.320182] </NMI>
[2093055.320182] _raw_spin_lock+0x25/0x30
[2093055.320183] ptlrpc_del_exp_list+0x36/0xc0 [ptlrpc]
[2093055.320183] ptlrpc_server_hpreq_fini+0x52/0x120 [ptlrpc]
[2093055.320184] ptlrpc_server_handle_req_in+0x3c0/0x8e0 [ptlrpc]
[2093055.320184] ptlrpc_main+0xbce/0x1580 [ptlrpc]
[2093055.320185] ? ptlrpc_wait_event+0x5d0/0x5d0 [ptlrpc]
[2093055.320185] kthread+0x14c/0x170
[2093055.320186] ? set_kthread_struct+0x50/0x50
[2093055.320187] ret_from_fork+0x35/0x40
and
CPU 12 RIP 0010:native_queued_spin_lock_slowpath+0x144/0x1c0
[2093055.320414] Call Trace:
[2093055.320414] <NMI>
[2093055.320415] ? nmi_cpu_backtrace.cold.8+0x36/0x4f
[2093055.320415] ? nmi_cpu_backtrace_handler+0xd/0x20
[2093055.320416] ? nmi_handle+0x6a/0x130
[2093055.320416] ? default_do_nmi+0x49/0x120
[2093055.320417] ? do_nmi+0x19c/0x210
[2093055.320417] ? end_repeat_nmi+0x16/0x69
[2093055.320418] ? native_queued_spin_lock_slowpath+0x144/0x1c0
[2093055.320419] ? native_queued_spin_lock_slowpath+0x144/0x1c0
[2093055.320419] ? native_queued_spin_lock_slowpath+0x144/0x1c0
[2093055.320420] </NMI>
[2093055.320420] _raw_spin_lock_bh+0x29/0x30
[2093055.320421] ptlrpc_server_request_add+0x26f/0xaa0 [ptlrpc]
[2093055.320422] ? tgt_ses_key_init+0x48/0x170 [ptlrpc]
[2093055.320422] ? keys_fill+0xc8/0x120 [obdclass]
[2093055.320423] ? ptlrpc_at_add_timed+0x115/0x280 [ptlrpc]
Issue Links
- is related to
  - LU-18072 Lock cancel resending overwhelms ldlm canceld thread (Open)
  - LU-17181 lu_sites_guard sem caused a page reclaim starvation (Resolved)
  - LU-18229 BLAST and CANCELLING lock still can be batched with others in one cancel RPC (Resolved)
  - LU-16285 Prolong the lock BL timeout (Resolved)
  - LU-18111 Don't drop expired cancel request (Resolved)