[LU-6603] lock enqueue dead lock for remote directory Created: 15/May/15  Updated: 09/Sep/16  Resolved: 26/Aug/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0, Lustre 2.5.3, Lustre 2.8.0
Fix Version/s: Lustre 2.8.0

Type: Bug
Priority: Critical
Reporter: Di Wang
Assignee: Di Wang
Resolution: Won't Fix
Votes: 0
Labels: dne2

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

While running racer, I saw several stack traces like the following:

ls            D 0000000000000007     0 103841      1 0x00000080
 ffff8801f2e676d8 0000000000000086 ffff8801f2e67628 ffffffffa1723875
 ffff8801f2e676b8 ffffffffa174afa2 0000000000000000 0000000000000000
 ffffffffa1804880 ffff8801c660d000 ffff8801c8238638 ffff8801f2e67fd8
Call Trace:
 [<ffffffffa1723875>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
 [<ffffffffa174afa2>] ? __req_capsule_get+0x162/0x6e0 [ptlrpc]
 [<ffffffffa1727da0>] ? lustre_swab_mdt_rec_reint+0x0/0xc0 [ptlrpc]
 [<ffffffff8152bba6>] __mutex_lock_slowpath+0x96/0x210
 [<ffffffffa197ef59>] ? mdc_open_pack+0x1b9/0x250 [mdc]
 [<ffffffff8152b6cb>] mutex_lock+0x2b/0x50
 [<ffffffffa1982802>] mdc_enqueue+0x222/0x1a40 [mdc]
 [<ffffffffa1984202>] mdc_intent_lock+0x1e2/0x593 [mdc]
 [<ffffffffa083b920>] ? ll_md_blocking_ast+0x0/0x7d0 [lustre]
 [<ffffffffa16f8460>] ? ldlm_completion_ast+0x0/0x9b0 [ptlrpc]
 [<ffffffffa194cd11>] ? lmv_fld_lookup+0xf1/0x440 [lmv]
 [<ffffffffa1949b57>] lmv_intent_remote+0x337/0xa90 [lmv]
 [<ffffffffa083b920>] ? ll_md_blocking_ast+0x0/0x7d0 [lustre]
 [<ffffffffa194cb43>] lmv_intent_lock+0x1a23/0x1b00 [lmv]
 [<ffffffff811749e3>] ? kmem_cache_alloc_trace+0x1a3/0x1b0
 [<ffffffffa0837c89>] ? ll_i2suppgid+0x19/0x30 [lustre]
 [<ffffffffa0849fa7>] ? ll_mdscapa_get+0x57/0x220 [lustre]
 [<ffffffffa081c2a6>] ? ll_prep_md_op_data+0x236/0x550 [lustre]
 [<ffffffffa083b920>] ? ll_md_blocking_ast+0x0/0x7d0 [lustre]
 [<ffffffffa083d629>] ll_lookup_it+0x249/0xdb0 [lustre]
 [<ffffffffa083e219>] ll_lookup_nd+0x89/0x5e0 [lustre]
 [<ffffffff8119e0f5>] do_lookup+0x1a5/0x230
 [<ffffffff8119ed84>] __link_path_walk+0x7a4/0x1000
 [<ffffffff8119f89a>] path_walk+0x6a/0xe0
 [<ffffffff8119faab>] filename_lookup+0x6b/0xc0
 [<ffffffff8122db26>] ? security_file_alloc+0x16/0x20
 [<ffffffff811a0f84>] do_filp_open+0x104/0xd20
 [<ffffffffa080b36c>] ? ll_file_release+0x2fc/0xb40 [lustre]
 [<ffffffff8129980a>] ? strncpy_from_user+0x4a/0x90
 [<ffffffff811ae432>] ? alloc_fd+0x92/0x160
 [<ffffffff8118b237>] do_sys_open+0x67/0x130
 [<ffffffff8100c675>] ? math_state_restore+0x45/0x60
 [<ffffffff8118b340>] sys_open+0x20/0x30
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

For a cross-MDT directory, after the client gets the LOOKUP lock from the master MDT, it holds that lock while it sends an enqueue request (for the UPDATE lock) to the slave MDT (child MDT). If the client cannot acquire its RPC lock at that point and blocks, the LOOKUP lock stays held on the client side.

Meanwhile, if another thread holds the RPC lock and enqueues the LOOKUP lock on the MDT, the two threads deadlock.

So we should either use a different portal for cross-ref RPCs or not take rpc_lock for them.
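
The lock-order inversion described above can be sketched as a small wait-for graph. This is only an illustrative model, not Lustre code: the thread names ("ls", "other"), the helper functions, and the treatment of the LOOKUP lock as a simple exclusive resource are all assumptions made for the example. It shows the cycle when both threads go through the same rpc_lock, and that the cycle goes away if the cross-ref enqueue does not wait on rpc_lock.

```python
from typing import Dict, List, Optional, Set


def build_wait_graph(holds: Dict[str, Set[str]],
                     wants: Dict[str, Optional[str]]) -> Dict[str, str]:
    """Map each blocked thread to the thread that owns the resource it wants."""
    owner = {res: thread for thread, held in holds.items() for res in held}
    graph = {}
    for thread, res in wants.items():
        if res is not None and res in owner and owner[res] != thread:
            graph[thread] = owner[res]
    return graph


def find_cycle(waits_for: Dict[str, str]) -> Optional[List[str]]:
    """Return the threads forming a wait cycle, or None if there is none."""
    for start in waits_for:
        path: List[str] = []
        seen: Set[str] = set()
        node = start
        # Follow the single outgoing edge of each blocked thread.
        while node in waits_for and node not in seen:
            seen.add(node)
            path.append(node)
            node = waits_for[node]
        if node in seen:
            return path[path.index(node):]
    return None


# Hypothetical state matching the description: "ls" holds LOOKUP (granted by
# the master MDT), "other" holds the client rpc_lock.
holds = {"ls": {"LOOKUP"}, "other": {"rpc_lock"}}

# Case 1: the cross-ref enqueue takes rpc_lock, "other" enqueues LOOKUP.
shared = find_cycle(build_wait_graph(
    holds, {"ls": "rpc_lock", "other": "LOOKUP"}))

# Case 2 (assumed fix): the cross-ref enqueue does not wait on rpc_lock.
skipped = find_cycle(build_wait_graph(
    holds, {"ls": None, "other": "LOOKUP"}))

print("with rpc_lock on cross-ref RPC:", shared)
print("without rpc_lock on cross-ref RPC:", skipped)
```

In the second case "other" still waits for the LOOKUP lock, but "ls" can finish its enqueue and release it, so the wait is bounded rather than circular.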



 Comments   
Comment by Di Wang [ 26/Aug/15 ]

This will no longer be an issue once the multiple-slot patch has landed.
