LU-6603: lock enqueue deadlock for remote directory


Details

    • Type: Bug
    • Resolution: Won't Fix
    • Priority: Critical
    • Fix Version/s: Lustre 2.8.0
    • Affects Version/s: Lustre 2.7.0, Lustre 2.5.3, Lustre 2.8.0
    • Severity: 3

    Description

      During racer, I saw a few traces like:

      ls            D 0000000000000007     0 103841      1 0x00000080
       ffff8801f2e676d8 0000000000000086 ffff8801f2e67628 ffffffffa1723875
       ffff8801f2e676b8 ffffffffa174afa2 0000000000000000 0000000000000000
       ffffffffa1804880 ffff8801c660d000 ffff8801c8238638 ffff8801f2e67fd8
      Call Trace:
       [<ffffffffa1723875>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
       [<ffffffffa174afa2>] ? __req_capsule_get+0x162/0x6e0 [ptlrpc]
       [<ffffffffa1727da0>] ? lustre_swab_mdt_rec_reint+0x0/0xc0 [ptlrpc]
       [<ffffffff8152bba6>] __mutex_lock_slowpath+0x96/0x210
       [<ffffffffa197ef59>] ? mdc_open_pack+0x1b9/0x250 [mdc]
       [<ffffffff8152b6cb>] mutex_lock+0x2b/0x50
       [<ffffffffa1982802>] mdc_enqueue+0x222/0x1a40 [mdc]
       [<ffffffffa1984202>] mdc_intent_lock+0x1e2/0x593 [mdc]
       [<ffffffffa083b920>] ? ll_md_blocking_ast+0x0/0x7d0 [lustre]
       [<ffffffffa16f8460>] ? ldlm_completion_ast+0x0/0x9b0 [ptlrpc]
       [<ffffffffa194cd11>] ? lmv_fld_lookup+0xf1/0x440 [lmv]
       [<ffffffffa1949b57>] lmv_intent_remote+0x337/0xa90 [lmv]
       [<ffffffffa083b920>] ? ll_md_blocking_ast+0x0/0x7d0 [lustre]
       [<ffffffffa194cb43>] lmv_intent_lock+0x1a23/0x1b00 [lmv]
       [<ffffffff811749e3>] ? kmem_cache_alloc_trace+0x1a3/0x1b0
       [<ffffffffa0837c89>] ? ll_i2suppgid+0x19/0x30 [lustre]
       [<ffffffffa0849fa7>] ? ll_mdscapa_get+0x57/0x220 [lustre]
       [<ffffffffa081c2a6>] ? ll_prep_md_op_data+0x236/0x550 [lustre]
       [<ffffffffa083b920>] ? ll_md_blocking_ast+0x0/0x7d0 [lustre]
       [<ffffffffa083d629>] ll_lookup_it+0x249/0xdb0 [lustre]
       [<ffffffffa083e219>] ll_lookup_nd+0x89/0x5e0 [lustre]
       [<ffffffff8119e0f5>] do_lookup+0x1a5/0x230
       [<ffffffff8119ed84>] __link_path_walk+0x7a4/0x1000
       [<ffffffff8119f89a>] path_walk+0x6a/0xe0
       [<ffffffff8119faab>] filename_lookup+0x6b/0xc0
       [<ffffffff8122db26>] ? security_file_alloc+0x16/0x20
       [<ffffffff811a0f84>] do_filp_open+0x104/0xd20
       [<ffffffffa080b36c>] ? ll_file_release+0x2fc/0xb40 [lustre]
       [<ffffffff8129980a>] ? strncpy_from_user+0x4a/0x90
       [<ffffffff811ae432>] ? alloc_fd+0x92/0x160
       [<ffffffff8118b237>] do_sys_open+0x67/0x130
       [<ffffffff8100c675>] ? math_state_restore+0x45/0x60
       [<ffffffff8118b340>] sys_open+0x20/0x30
       [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      

      So for a cross-MDT (remote) directory, after the client gets the LOOKUP lock from the master MDT, it keeps holding that lock while it sends the enqueue request (for the UPDATE lock) to the slave MDT (child MDT). If it cannot get the client-side RPC lock there and is blocked, the LOOKUP lock stays held on the client side.

      In the meantime, if another thread holds the RPC lock but is enqueuing the LOOKUP lock on the MDT, it causes a deadlock: each thread holds the lock the other one is waiting for.
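
      To make the inversion concrete, here is a minimal stand-alone sketch in plain pthreads (a toy model, not Lustre code): lookup_lock stands in for the LDLM LOOKUP lock the client holds, and rpc_lock for the shared client-side RPC lock mutex. Thread A models the hung "ls" above (LOOKUP lock held, blocked on the RPC lock); thread B models the other thread (RPC lock held, blocked enqueuing the LOOKUP lock). Neither can make progress.

      /* Toy model of the deadlock, not Lustre code: lookup_lock stands in
       * for the LDLM LOOKUP lock held by the client, rpc_lock for the
       * shared client-side RPC lock mutex.  Build with: cc -pthread */
      #include <pthread.h>
      #include <stdio.h>
      #include <unistd.h>

      static pthread_mutex_t lookup_lock = PTHREAD_MUTEX_INITIALIZER;
      static pthread_mutex_t rpc_lock    = PTHREAD_MUTEX_INITIALIZER;

      /* Thread A: the hung "ls" path - LOOKUP lock already granted by the
       * master MDT, now needs rpc_lock before enqueuing on the slave MDT. */
      static void *thread_a(void *arg)
      {
              pthread_mutex_lock(&lookup_lock);  /* LOOKUP lock held          */
              sleep(1);                          /* widen the race window     */
              pthread_mutex_lock(&rpc_lock);     /* blocks: B already owns it */
              printf("A: never reached\n");
              pthread_mutex_unlock(&rpc_lock);
              pthread_mutex_unlock(&lookup_lock);
              return NULL;
      }

      /* Thread B: holds rpc_lock while its own request needs the LOOKUP
       * lock that A is still holding. */
      static void *thread_b(void *arg)
      {
              pthread_mutex_lock(&rpc_lock);     /* RPC lock held             */
              sleep(1);
              pthread_mutex_lock(&lookup_lock);  /* blocks: A already owns it */
              printf("B: never reached\n");
              pthread_mutex_unlock(&lookup_lock);
              pthread_mutex_unlock(&rpc_lock);
              return NULL;
      }

      int main(void)
      {
              pthread_t a, b;

              pthread_create(&a, NULL, thread_a, NULL);
              pthread_create(&b, NULL, thread_b, NULL);
              pthread_join(a, NULL);             /* never returns: A/B cycle  */
              pthread_join(b, NULL);
              return 0;
      }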

      So we should either use a different PORTAL or not take the rpc_lock for cross-ref RPCs.
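
      A minimal sketch of the second option, continuing the toy model above (again not Lustre code; is_remote and send_enqueue_rpc() are hypothetical stand-ins for "this enqueue targets the slave MDT of a remote directory" and for the blocking enqueue RPC): the cross-ref path simply skips the shared RPC lock, so it never sits in the A/B cycle while the LOOKUP lock is held.

      /* Sketch only, same toy model as above: skip the shared RPC lock for
       * the cross-ref (remote) enqueue path.  is_remote and the stub RPC
       * call are hypothetical, not the real Lustre interfaces. */
      #include <pthread.h>
      #include <stdio.h>

      static pthread_mutex_t rpc_lock = PTHREAD_MUTEX_INITIALIZER;

      static void send_enqueue_rpc(void)
      {
              printf("enqueue RPC sent\n");   /* stands in for the blocking enqueue */
      }

      static void enqueue_update_lock(int is_remote)
      {
              /* Only serialize local requests on rpc_lock; the remote
               * (cross-ref) enqueue no longer waits on it while the caller
               * still holds the LOOKUP lock. */
              if (!is_remote)
                      pthread_mutex_lock(&rpc_lock);

              send_enqueue_rpc();

              if (!is_remote)
                      pthread_mutex_unlock(&rpc_lock);
      }

      int main(void)
      {
              enqueue_update_lock(0);         /* local directory: unchanged       */
              enqueue_update_lock(1);         /* remote directory: no rpc_lock    */
              return 0;
      }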


          People

            Assignee: di.wang (Di Wang)
            Reporter: di.wang (Di Wang)
            Votes: 0
            Watchers: 1
