Lustre / LU-19090

ldlm_bl_timeout can potentially grow indefinitely

Details

    • Bug
    • Resolution: Unresolved
    • Major
    • None
    • Lustre 2.12.9, Lustre 2.15.5
    • 3
    • 9223372036854775807

    Description

      We observed a scenario in which locks on the root directory became contended after a large portion of the client instances were terminated. We suspect the issue stems from these clients failing to shut down cleanly with a graceful umount. The MDT has to wait for the ldlm callback timer to expire and evict the terminated clients before it can grant the lock to other clients.

      00010000:00020000:3.0:1748902881.121737:1120:5114:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 917s: evicting client at <redacted>@tcp ns: mdt-<redacted>-MDT0000_UUID lock: ffff80066781af40/0x64ab58becd5a0966 lrc: 3/0,0 mode: PR/PR res: [0x200000007:0x1:0x0].0x0 bits 0x13/0x0 rrc: 109 type: IBT flags: 0x60200400000020 nid: <redacted>@tcp remote: 0xdabdae476163dae expref: 8 pid: 792 timeout: 0 lvb_type: 0

      Looking further into the kernel logs, we saw that ldlm_bl_timeout can grow to 900+ seconds when adaptive timeouts are enabled.

      time64_t ldlm_bl_timeout(struct ldlm_lock *lock)
      {
              time64_t timeout;

              if (AT_OFF)
                      return obd_timeout / 2;

              /* Since these are non-updating timeouts, we should be conservative.
               * Take more than usually, 150%
               * It would be nice to have some kind of "early reply" mechanism for
               * lock callbacks too... */
              timeout = at_get(&lock->l_export->exp_bl_lock_at);
              return max(timeout + (timeout >> 1), (time64_t)ldlm_enqueue_min);
      }
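
To illustrate how the 150% margin behaves, here is a minimal userspace model of the arithmetic in ldlm_bl_timeout(). It is a sketch, not Lustre code: the name bl_timeout_model and its parameters are hypothetical, and the adaptive estimate that at_get() would return is passed in as a plain integer.

```c
#include <stdint.h>

/* Hypothetical userspace model of the arithmetic in ldlm_bl_timeout().
 * at_estimate stands in for at_get(&lock->l_export->exp_bl_lock_at) and
 * enqueue_min stands in for ldlm_enqueue_min; both are plain inputs here,
 * not values read from a running system. */
static int64_t bl_timeout_model(int64_t at_estimate, int64_t enqueue_min)
{
	/* 150% of the estimate: timeout + timeout / 2 */
	int64_t padded = at_estimate + (at_estimate >> 1);

	/* The only bound applied is a lower one. */
	return padded > enqueue_min ? padded : enqueue_min;
}
```

With an estimate of 600 seconds and a floor of 100 seconds (values chosen purely for illustration), bl_timeout_model(600, 100) returns 900, which is in the same range as the 917s reported in the log line. Nothing in this calculation compares the result against an upper limit.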

      ldlm_bl_timeout does not appear to respect at_max. We propose capping this timeout, either with at_max or with a new, explicit ldlm_enqueue_max.

      e.g. min(at_max, max(ldlm_enqueue_min, timeout))
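
The proposed expression can be sketched as a standalone userspace model. This is an assumption-laden illustration rather than a patch: bl_timeout_capped_model, min64, max64 and the cap parameter are hypothetical names, cap stands in for either at_max or the suggested ldlm_enqueue_max, and the 150% margin from the existing function is kept before the cap is applied.

```c
#include <stdint.h>

static int64_t min64(int64_t a, int64_t b) { return a < b ? a : b; }
static int64_t max64(int64_t a, int64_t b) { return a > b ? a : b; }

/* Hypothetical model of the proposed capped blocking-callback timeout.
 * at_estimate: stand-in for at_get(&lock->l_export->exp_bl_lock_at)
 * enqueue_min: stand-in for ldlm_enqueue_min (lower bound)
 * cap:         stand-in for at_max or a new ldlm_enqueue_max (upper bound) */
static int64_t bl_timeout_capped_model(int64_t at_estimate,
				       int64_t enqueue_min, int64_t cap)
{
	/* Same 150% margin as the existing code. */
	int64_t padded = at_estimate + (at_estimate >> 1);

	/* min(cap, max(enqueue_min, padded)): the cap is applied last,
	 * so it wins if it is configured below enqueue_min. */
	return min64(cap, max64(enqueue_min, padded));
}
```

With the same illustrative inputs as before, bl_timeout_capped_model(600, 100, 600) returns 600 instead of 900. Because the cap is the outermost term, a cap set lower than the floor overrides the floor; whether that ordering is desirable is a design choice the proposal leaves open.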

          People

            wc-triage WC Triage
            sichenx Sichen Xiao
            Votes: 0
            Watchers: 1