Description
Jobs that perform partial file system walks, such as the dsync jobs run for Lustre file system migrations, may read many directories and stat many files - for trees belonging to heavy users, easily many millions.
When the job finishes and the epilog drops caches, the node tries to release all of those locks as fast as possible.
When the lru_size tunable is set to 0 (LRU size managed by Lustre), the number of locks held by the client grows very large.
The resulting flood of cancel requests overwhelms the MDT, causing other requests to time out, producing hangs on other operations, and so on.
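For context, the lock buildup can be watched on a client before the epilog runs. The commands below are a rough sketch, assuming a standard Lustre client with lctl available and assuming the epilog uses the usual drop_caches mechanism (exact namespace names and the real epilog contents will differ):

    # per-namespace count of DLM locks currently held by this client
    lctl get_param ldlm.namespaces.*.lock_count
    # current LRU sizing; 0 means the LRU size is managed by Lustre
    lctl get_param ldlm.namespaces.*.lru_size
    # roughly what the epilog does, which triggers the mass lock cancellation
    echo 3 > /proc/sys/vm/drop_caches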
On the MDT console log we start to see many "Could not add any time...not sending early reply" messages, showing the server is clearly overwhelmed. Those are followed by "Dropping timed-out request" messages, then many "ldlm_canceld: This server is not able to keep up with request traffic" messages, and eventually watchdog stack traces.
Actual forward progress slows to a crawl. We tried simply leaving the MDT alone and ensuring there was no new traffic, and after an hour the situation had not changed.
Setting lru_size to something small, e.g. 1000, prevented this behavior.
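The workaround amounts to pinning the client-side lock LRU to a fixed size with lctl. This is a sketch, assuming the limit is applied uniformly to all namespaces on each client (lctl set_param -P or a conf_param could make it persistent):

    # cap the DLM lock LRU at 1000 locks per namespace instead of
    # letting Lustre manage the size dynamically (lru_size=0)
    lctl set_param ldlm.namespaces.*.lru_size=1000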
The specific patch stack is tagged 2.15.5_6.llnl (see https://github.com/LLNL/lustre/releases/tag/2.15.5_6.llnl)
We saw the following stack traces on MDTs while this was occurring:
CPU 8 RIP 0010:native_queued_spin_lock_slowpath+0x147/0x1c0
[2093055.320176] Call Trace:
[2093055.320177] <NMI>
[2093055.320177] ? nmi_cpu_backtrace.cold.8+0x36/0x4f
[2093055.320178] ? nmi_cpu_backtrace_handler+0xd/0x20
[2093055.320178] ? nmi_handle+0x6a/0x130
[2093055.320179] ? default_do_nmi+0x49/0x120
[2093055.320179] ? do_nmi+0x19c/0x210
[2093055.320180] ? end_repeat_nmi+0x16/0x69
[2093055.320180] ? native_queued_spin_lock_slowpath+0x147/0x1c0
[2093055.320180] ? native_queued_spin_lock_slowpath+0x147/0x1c0
[2093055.320181] ? native_queued_spin_lock_slowpath+0x147/0x1c0
[2093055.320182] </NMI>
[2093055.320182] _raw_spin_lock+0x25/0x30
[2093055.320183] ptlrpc_del_exp_list+0x36/0xc0 [ptlrpc]
[2093055.320183] ptlrpc_server_hpreq_fini+0x52/0x120 [ptlrpc]
[2093055.320184] ptlrpc_server_handle_req_in+0x3c0/0x8e0 [ptlrpc]
[2093055.320184] ptlrpc_main+0xbce/0x1580 [ptlrpc]
[2093055.320185] ? ptlrpc_wait_event+0x5d0/0x5d0 [ptlrpc]
[2093055.320185] kthread+0x14c/0x170
[2093055.320186] ? set_kthread_struct+0x50/0x50
[2093055.320187] ret_from_fork+0x35/0x40
and
CPU 12 RIP 0010:native_queued_spin_lock_slowpath+0x144/0x1c0
[2093055.320414] Call Trace:
[2093055.320414] <NMI>
[2093055.320415] ? nmi_cpu_backtrace.cold.8+0x36/0x4f
[2093055.320415] ? nmi_cpu_backtrace_handler+0xd/0x20
[2093055.320416] ? nmi_handle+0x6a/0x130
[2093055.320416] ? default_do_nmi+0x49/0x120
[2093055.320417] ? do_nmi+0x19c/0x210
[2093055.320417] ? end_repeat_nmi+0x16/0x69
[2093055.320418] ? native_queued_spin_lock_slowpath+0x144/0x1c0
[2093055.320419] ? native_queued_spin_lock_slowpath+0x144/0x1c0
[2093055.320419] ? native_queued_spin_lock_slowpath+0x144/0x1c0
[2093055.320420] </NMI>
[2093055.320420] _raw_spin_lock_bh+0x29/0x30
[2093055.320421] ptlrpc_server_request_add+0x26f/0xaa0 [ptlrpc]
[2093055.320422] ? tgt_ses_key_init+0x48/0x170 [ptlrpc]
[2093055.320422] ? keys_fill+0xc8/0x120 [obdclass]
[2093055.320423] ? ptlrpc_at_add_timed+0x115/0x280 [ptlrpc]
Issue Links
- is related to
  - LU-18072 Lock cancel resending overwhelms ldlm canceld thread (Open)
  - LU-17181 lu_sites_guard sem caused a page reclaim starvation (Resolved)
  - LU-18229 BLAST and CANCELLING lock still can be batched with others in one cancel RPC (Resolved)
  - LU-16285 Prolong the lock BL timeout (Resolved)
  - LU-18111 Don't drop expired cancel request (Resolved)