Lustre / LU-14582

nested LDLM locks cause evictions due to RPC-in-flight limit
Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version: Lustre 2.17.0
    • Severity: 3

    Description

      Clients get evicted from the MDS during racer runs quite often; this is caused by nested LDLM locks in a DNE setup. Say thread T is working on the client side:

      • lmv_intent_lock() tries to get LOOKUP|UPDATE for a given object
      • MDS1 returns just LOOKUP + FID (lock L1)
      • lmv_intent_remote() asks MDS2 for UPDATE and the attributes while still holding the lock from MDS1 (potentially lock L2)
      If the first object is contended, the following situation is possible:
      • all request slots (obd_get_request_slot()) are busy with LDLM_ENQUEUE RPCs against the first object
      • the corresponding LDLM resource on MDS1 cannot grant those locks, because some other request incompatible with the granted lock L1 is queued ahead of them
      • T cannot get a slot to enqueue L2, and therefore cannot release L1 (see the sketch after this list)
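
      To make the cycle concrete, here is a minimal standalone C model of the deadlock. This is NOT Lustre code: a counting semaphore stands in for the client's per-target RPC-slot limit enforced by obd_get_request_slot(), a mutex stands in for lock L1 on MDS1, and the thread, constant, and file names are purely illustrative.

      /*
       * Standalone model of the deadlock (NOT Lustre code).  A counting
       * semaphore plays the role of the RPC slot limit enforced by
       * obd_get_request_slot(); a mutex plays lock L1 on MDS1.
       * Build with: cc -pthread deadlock_model.c
       */
      #include <pthread.h>
      #include <semaphore.h>
      #include <stdio.h>
      #include <unistd.h>

      #define NR_SLOTS 2                       /* stand-in for the in-flight limit */

      static sem_t rpc_slots;                  /* limits concurrent enqueue RPCs */
      static pthread_mutex_t lock_L1 = PTHREAD_MUTEX_INITIALIZER;

      /* Contending threads: each takes a slot, then waits on L1. */
      static void *contender(void *arg)
      {
              sem_wait(&rpc_slots);            /* "obd_get_request_slot()" */
              pthread_mutex_lock(&lock_L1);    /* LDLM_ENQUEUE blocked: T holds L1 */
              pthread_mutex_unlock(&lock_L1);
              sem_post(&rpc_slots);
              return NULL;
      }

      /* Thread T: holds L1, now needs a slot to enqueue L2 on MDS2. */
      static void *thread_T(void *arg)
      {
              pthread_mutex_lock(&lock_L1);    /* LOOKUP lock granted by MDS1 */
              sleep(1);                        /* let contenders occupy every slot */
              printf("T: waiting for a slot while holding L1...\n");
              sem_wait(&rpc_slots);            /* blocks forever: all slots busy */
              /* would enqueue L2 here, then drop both locks */
              sem_post(&rpc_slots);
              pthread_mutex_unlock(&lock_L1);
              return NULL;
      }

      int main(void)
      {
              pthread_t t, c[NR_SLOTS];
              int i;

              sem_init(&rpc_slots, 0, NR_SLOTS);
              pthread_create(&t, NULL, thread_T, NULL);
              usleep(100000);                  /* make sure T grabs L1 first */
              for (i = 0; i < NR_SLOTS; i++)
                      pthread_create(&c[i], NULL, contender, NULL);
              pthread_join(t, NULL);           /* never returns: the cycle is closed */
              return 0;
      }

      Every party waits on something held by another: the contenders hold all the slots and wait on L1, while T holds L1 and waits for a slot. On the real system the MDS eventually fires the lock callback timer and evicts the client, as in the service-thread watchdog output and eviction message below: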
      Lustre: mdt00_027: service thread pid 13168 was inactive for 40.338 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Pid: 13168, comm: mdt00_027 4.18.0 #36 SMP Thu Mar 25 14:56:29 MSK 2021
      Call Trace:
      [<0>] ldlm_completion_ast+0x77c/0x8d0 [ptlrpc]
      [<0>] ldlm_cli_enqueue_fini+0x9fc/0xe90 [ptlrpc]
      [<0>] ldlm_cli_enqueue+0x4d9/0x990 [ptlrpc]
      [<0>] osp_md_object_lock+0x154/0x290 [osp]
      [<0>] lod_object_lock+0x11a/0x780 [lod]
      [<0>] mdt_remote_object_lock_try+0x140/0x370 [mdt]
      [<0>] mdt_remote_object_lock+0x1a/0x20 [mdt]
      [<0>] mdt_reint_unlink+0x70d/0x2060 [mdt]
      [<0>] mdt_reint_rec+0x117/0x240 [mdt]
      [<0>] mdt_reint_internal+0x90c/0xab0 [mdt]
      [<0>] mdt_reint+0x57/0x100 [mdt]
      [<0>] tgt_request_handle+0xbe0/0x1970 [ptlrpc]
      [<0>] ptlrpc_main+0x134f/0x30e0 [ptlrpc]
      [<0>] kthread+0x100/0x140
      [<0>] ret_from_fork+0x24/0x30
      [<0>] 0xffffffffffffffff
      Lustre: mdt00_001: service thread pid 7729 was inactive for 65.046 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Pid: 10908, comm: mdt00_037 4.18.0 #36 SMP Thu Mar 25 14:56:29 MSK 2021
      Lustre: Skipped 1 previous similar message
      Call Trace:
      [<0>] ldlm_completion_ast+0x77c/0x8d0 [ptlrpc]
      [<0>] ldlm_cli_enqueue_local+0x27d/0x7f0 [ptlrpc]
      [<0>] mdt_object_local_lock+0x479/0xad0 [mdt]
      [<0>] mdt_object_lock_internal+0x1d3/0x3f0 [mdt]
      [<0>] mdt_getattr_name_lock+0xdb5/0x1f80 [mdt]
      [<0>] mdt_intent_getattr+0x25b/0x420 [mdt]
      [<0>] mdt_intent_policy+0x659/0xee0 [mdt]
      [<0>] ldlm_lock_enqueue+0x418/0x9b0 [ptlrpc]
      [<0>] ldlm_handle_enqueue0+0x5d8/0x16c0 [ptlrpc]
      [<0>] tgt_enqueue+0x9f/0x200 [ptlrpc]
      [<0>] tgt_request_handle+0xbe0/0x1970 [ptlrpc]
      [<0>] ptlrpc_main+0x134f/0x30e0 [ptlrpc]
      [<0>] kthread+0x100/0x140
      [<0>] ret_from_fork+0x24/0x30
      [<0>] 0xffffffffffffffff
      Pid: 7729, comm: mdt00_001 4.18.0 #36 SMP Thu Mar 25 14:56:29 MSK 2021
      Call Trace:
      [<0>] ldlm_completion_ast+0x77c/0x8d0 [ptlrpc]
      [<0>] ldlm_cli_enqueue_local+0x27d/0x7f0 [ptlrpc]
      [<0>] mdt_object_local_lock+0x539/0xad0 [mdt]
      [<0>] mdt_object_lock_internal+0x1d3/0x3f0 [mdt]
      [<0>] mdt_getattr_name_lock+0x78c/0x1f80 [mdt]
      [<0>] mdt_intent_getattr+0x25b/0x420 [mdt]
      [<0>] mdt_intent_policy+0x659/0xee0 [mdt]
      [<0>] ldlm_lock_enqueue+0x418/0x9b0 [ptlrpc]
      [<0>] ldlm_handle_enqueue0+0x5d8/0x16c0 [ptlrpc]
      [<0>] tgt_enqueue+0x9f/0x200 [ptlrpc]
      [<0>] tgt_request_handle+0xbe0/0x1970 [ptlrpc]
      [<0>] ptlrpc_main+0x134f/0x30e0 [ptlrpc]
      [<0>] kthread+0x100/0x140
      [<0>] ret_from_fork+0x24/0x30
      [<0>] 0xffffffffffffffff
      Lustre: mdt00_018: service thread pid 10527 was inactive for 65.062 seconds. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
      LustreError: 7718:0:(ldlm_lockd.c:260:expired_lock_main()) ### lock callback timer expired after 101s: evicting client at 0@lo  ns: mdt-lustre-MDT0001_UUID lock: 00000000bd306e1a/0xfbfedfd6efc4a594 lrc: 3/0,0 mode: PR/PR res: [0x240000403:0x172f:0x0].0x0 bits 0x13/0x0 rrc: 3 type: IBT gid 0 flags: 0x60200400000020 nid: 0@lo remote: 0xfbfedfd6efc498ba expref: 705 pid: 11012 timeout: 739 lvb_type: 0
      

          People

            Assignee: WC Triage (wc-triage)
            Reporter: Alex Zhuravlev (bzzz)
            Votes: 0
            Watchers: 6
