Details
Bug
Resolution: Won't Fix
Critical
Lustre 2.7.0, Lustre 2.5.3, Lustre 2.8.0
3
Description
During racer testing, I saw a few stack traces like the following:
ls D 0000000000000007 0 103841 1 0x00000080
ffff8801f2e676d8 0000000000000086 ffff8801f2e67628 ffffffffa1723875
ffff8801f2e676b8 ffffffffa174afa2 0000000000000000 0000000000000000
ffffffffa1804880 ffff8801c660d000 ffff8801c8238638 ffff8801f2e67fd8
Call Trace:
[<ffffffffa1723875>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
[<ffffffffa174afa2>] ? __req_capsule_get+0x162/0x6e0 [ptlrpc]
[<ffffffffa1727da0>] ? lustre_swab_mdt_rec_reint+0x0/0xc0 [ptlrpc]
[<ffffffff8152bba6>] __mutex_lock_slowpath+0x96/0x210
[<ffffffffa197ef59>] ? mdc_open_pack+0x1b9/0x250 [mdc]
[<ffffffff8152b6cb>] mutex_lock+0x2b/0x50
[<ffffffffa1982802>] mdc_enqueue+0x222/0x1a40 [mdc]
[<ffffffffa1984202>] mdc_intent_lock+0x1e2/0x593 [mdc]
[<ffffffffa083b920>] ? ll_md_blocking_ast+0x0/0x7d0 [lustre]
[<ffffffffa16f8460>] ? ldlm_completion_ast+0x0/0x9b0 [ptlrpc]
[<ffffffffa194cd11>] ? lmv_fld_lookup+0xf1/0x440 [lmv]
[<ffffffffa1949b57>] lmv_intent_remote+0x337/0xa90 [lmv]
[<ffffffffa083b920>] ? ll_md_blocking_ast+0x0/0x7d0 [lustre]
[<ffffffffa194cb43>] lmv_intent_lock+0x1a23/0x1b00 [lmv]
[<ffffffff811749e3>] ? kmem_cache_alloc_trace+0x1a3/0x1b0
[<ffffffffa0837c89>] ? ll_i2suppgid+0x19/0x30 [lustre]
[<ffffffffa0849fa7>] ? ll_mdscapa_get+0x57/0x220 [lustre]
[<ffffffffa081c2a6>] ? ll_prep_md_op_data+0x236/0x550 [lustre]
[<ffffffffa083b920>] ? ll_md_blocking_ast+0x0/0x7d0 [lustre]
[<ffffffffa083d629>] ll_lookup_it+0x249/0xdb0 [lustre]
[<ffffffffa083e219>] ll_lookup_nd+0x89/0x5e0 [lustre]
[<ffffffff8119e0f5>] do_lookup+0x1a5/0x230
[<ffffffff8119ed84>] __link_path_walk+0x7a4/0x1000
[<ffffffff8119f89a>] path_walk+0x6a/0xe0
[<ffffffff8119faab>] filename_lookup+0x6b/0xc0
[<ffffffff8122db26>] ? security_file_alloc+0x16/0x20
[<ffffffff811a0f84>] do_filp_open+0x104/0xd20
[<ffffffffa080b36c>] ? ll_file_release+0x2fc/0xb40 [lustre]
[<ffffffff8129980a>] ? strncpy_from_user+0x4a/0x90
[<ffffffff811ae432>] ? alloc_fd+0x92/0x160
[<ffffffff8118b237>] do_sys_open+0x67/0x130
[<ffffffff8100c675>] ? math_state_restore+0x45/0x60
[<ffffffff8118b340>] sys_open+0x20/0x30
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
So for a cross-MDT directory, after the client gets the LOOKUP lock from the master MDT, it holds that lock and then sends an enqueue request (for the UPDATE lock) to the slave (child) MDT. If the thread cannot get the client's RPC lock (the mutex_lock in mdc_enqueue in the trace above) and blocks, the LOOKUP lock stays held on the client side.
In the meantime, if another thread holds the RPC lock while it enqueues the LOOKUP lock on the MDT, the two threads deadlock: the first waits for the RPC lock while holding the LOOKUP lock, and the second waits for the LOOKUP lock while holding the RPC lock.
So we should either use a different portal or not take rpc_lock for cross-ref RPCs.
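The lock-order inversion described above can be sketched as a small wait-for model. This is only an illustration, not Lustre code: the names `thread_a`, `thread_b`, `lookup_lock`, `rpc_lock` and the helper `find_deadlock` are hypothetical, and the model assumes that the second thread's enqueue conflicts with the LOOKUP lock the first thread holds.

```python
# Toy wait-for model of the hang. All names are illustrative, not Lustre
# identifiers.

def find_deadlock(holds, waits):
    """Return the threads forming a wait-for cycle, or None if there is none.

    holds: dict mapping resource -> thread currently holding it
    waits: dict mapping thread -> resource that thread is blocked on
    """
    for start in waits:
        seen = []
        thread = start
        while thread in waits and thread not in seen:
            seen.append(thread)
            # Follow the edge: thread -> awaited resource -> its holder.
            thread = holds.get(waits[thread])
        if thread == start:
            return seen
    return None


# thread_a: holds the LOOKUP lock, blocked on the client RPC mutex.
# thread_b: holds the RPC mutex, its enqueue is blocked on the LOOKUP lock.
holds = {"lookup_lock": "thread_a", "rpc_lock": "thread_b"}
waits = {"thread_a": "rpc_lock", "thread_b": "lookup_lock"}
cycle = find_deadlock(holds, waits)

# Second proposal from the ticket (assumed effect): the cross-ref enqueue
# does not take rpc_lock, so thread_a never waits on it and can finish.
waits_fixed = {"thread_b": "lookup_lock"}
no_cycle = find_deadlock(holds, waits_fixed)

print(cycle)     # both threads are in the cycle
print(no_cycle)  # no cycle once thread_a skips rpc_lock
```

In this model, removing the rpc_lock wait for the cross-ref RPC breaks the cycle: thread_b still waits for the LOOKUP lock, but thread_a is no longer blocked and can eventually release it.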