Details
- Type: Bug
- Resolution: Unresolved
- Priority: Major
- Labels: None
- Affects Version/s: Lustre 2.12.9, Lustre 2.15.5
- Severity: 3
Description
We observed a scenario where locks on the root directory become contended when a large portion of the client instances are terminated. We suspect the issue stems from these clients failing to shut down cleanly with a graceful umount: the MDT must wait for the ldlm callback timer to expire and evict the terminated clients before it can grant the lock to other clients.
00010000:00020000:3.0:1748902881.121737:1120:5114:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 917s: evicting client at <redacted>@tcp ns: mdt-<redacted>-MDT0000_UUID lock: ffff80066781af40/0x64ab58becd5a0966 lrc: 3/0,0 mode: PR/PR res: [0x200000007:0x1:0x0].0x0 bits 0x13/0x0 rrc: 109 type: IBT flags: 0x60200400000020 nid: <redacted>@tcp remote: 0xdabdae476163dae expref: 8 pid: 792 timeout: 0 lvb_type: 0
Looking further into the kernel logs, we saw that ldlm_bl_timeout() can grow to 900+ seconds under adaptive timeouts.
time64_t ldlm_bl_timeout(struct ldlm_lock *lock)
{
        time64_t timeout;

        if (AT_OFF)
                return obd_timeout / 2;

        /* Since these are non-updating timeouts, we should be conservative.
         * Take more than usually, 150%
         * It would be nice to have some kind of "early reply" mechanism for
         * lock callbacks too... */
        timeout = at_get(&lock->l_export->exp_bl_lock_at);
        return max(timeout + (timeout >> 1), (time64_t)ldlm_enqueue_min);
}
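For illustration (assuming the AT estimate for this export, at_get(&lock->l_export->exp_bl_lock_at), had climbed to roughly 611s): 611 + (611 >> 1) = 611 + 305 = 916s, which lines up with the 917s callback timer in the log above.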
ldlm_bl_timeout() does not appear to respect at_max. We would like to propose capping this timeout, either with at_max or with a new explicit ldlm_enqueue_max tunable.
e.g. min(at_max, max(ldlm_enqueue_min, timeout))
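A minimal sketch of the first option, assuming the global at_max tunable used by the AT code is the appropriate cap here (illustrative only, not a tested patch):

time64_t ldlm_bl_timeout(struct ldlm_lock *lock)
{
        time64_t timeout;

        if (AT_OFF)
                return obd_timeout / 2;

        timeout = at_get(&lock->l_export->exp_bl_lock_at);
        /* keep the conservative 150% and the ldlm_enqueue_min floor ... */
        timeout = max(timeout + (timeout >> 1), (time64_t)ldlm_enqueue_min);
        /* ... but never wait longer than at_max to evict an unresponsive
         * client (assumption: at_max is the right upper bound here) */
        return min(timeout, (time64_t)at_max);
}

With at_max at its default of 600s, the 917s wait in the log above would be capped at 600s.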