[LU-6603] lock enqueue deadlock for remote directory Created: 15/May/15 Updated: 09/Sep/16 Resolved: 26/Aug/15
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.7.0, Lustre 2.5.3, Lustre 2.8.0 |
| Fix Version/s: | Lustre 2.8.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Di Wang | Assignee: | Di Wang |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | dne2 |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
During racer, I saw a few traces like:

 ls            D 0000000000000007     0 103841      1 0x00000080
  ffff8801f2e676d8 0000000000000086 ffff8801f2e67628 ffffffffa1723875
  ffff8801f2e676b8 ffffffffa174afa2 0000000000000000 0000000000000000
  ffffffffa1804880 ffff8801c660d000 ffff8801c8238638 ffff8801f2e67fd8
 Call Trace:
  [<ffffffffa1723875>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
  [<ffffffffa174afa2>] ? __req_capsule_get+0x162/0x6e0 [ptlrpc]
  [<ffffffffa1727da0>] ? lustre_swab_mdt_rec_reint+0x0/0xc0 [ptlrpc]
  [<ffffffff8152bba6>] __mutex_lock_slowpath+0x96/0x210
  [<ffffffffa197ef59>] ? mdc_open_pack+0x1b9/0x250 [mdc]
  [<ffffffff8152b6cb>] mutex_lock+0x2b/0x50
  [<ffffffffa1982802>] mdc_enqueue+0x222/0x1a40 [mdc]
  [<ffffffffa1984202>] mdc_intent_lock+0x1e2/0x593 [mdc]
  [<ffffffffa083b920>] ? ll_md_blocking_ast+0x0/0x7d0 [lustre]
  [<ffffffffa16f8460>] ? ldlm_completion_ast+0x0/0x9b0 [ptlrpc]
  [<ffffffffa194cd11>] ? lmv_fld_lookup+0xf1/0x440 [lmv]
  [<ffffffffa1949b57>] lmv_intent_remote+0x337/0xa90 [lmv]
  [<ffffffffa083b920>] ? ll_md_blocking_ast+0x0/0x7d0 [lustre]
  [<ffffffffa194cb43>] lmv_intent_lock+0x1a23/0x1b00 [lmv]
  [<ffffffff811749e3>] ? kmem_cache_alloc_trace+0x1a3/0x1b0
  [<ffffffffa0837c89>] ? ll_i2suppgid+0x19/0x30 [lustre]
  [<ffffffffa0849fa7>] ? ll_mdscapa_get+0x57/0x220 [lustre]
  [<ffffffffa081c2a6>] ? ll_prep_md_op_data+0x236/0x550 [lustre]
  [<ffffffffa083b920>] ? ll_md_blocking_ast+0x0/0x7d0 [lustre]
  [<ffffffffa083d629>] ll_lookup_it+0x249/0xdb0 [lustre]
  [<ffffffffa083e219>] ll_lookup_nd+0x89/0x5e0 [lustre]
  [<ffffffff8119e0f5>] do_lookup+0x1a5/0x230
  [<ffffffff8119ed84>] __link_path_walk+0x7a4/0x1000
  [<ffffffff8119f89a>] path_walk+0x6a/0xe0
  [<ffffffff8119faab>] filename_lookup+0x6b/0xc0
  [<ffffffff8122db26>] ? security_file_alloc+0x16/0x20
  [<ffffffff811a0f84>] do_filp_open+0x104/0xd20
  [<ffffffffa080b36c>] ? ll_file_release+0x2fc/0xb40 [lustre]
  [<ffffffff8129980a>] ? strncpy_from_user+0x4a/0x90
  [<ffffffff811ae432>] ? alloc_fd+0x92/0x160
  [<ffffffff8118b237>] do_sys_open+0x67/0x130
  [<ffffffff8100c675>] ? math_state_restore+0x45/0x60
  [<ffffffff8118b340>] sys_open+0x20/0x30
  [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

So for a cross-MDT directory, the client first obtains the LOOKUP lock from the master MDT and keeps holding it while it sends the enqueue request (for the UPDATE lock) to the slave (child) MDT. If that thread cannot take the client's RPC lock, it blocks with the LOOKUP lock still held on the client side. Meanwhile, if another thread already holds the RPC lock and its request is enqueueing the LOOKUP lock on the MDT, the two threads deadlock. So we should either use a different PORTAL or not take rpc_lock for cross-ref RPCs.
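To make the lock ordering concrete, here is a minimal, hypothetical C sketch (not Lustre code) that models the two resources as plain pthread mutexes: one standing in for the LOOKUP DLM lock the client holds, the other for the mdc rpc_lock mutex that mdc_enqueue is blocked on in the trace above. Thread names and the sleep-based interleaving are purely illustrative; the point is only that the two paths take the resources in opposite order, which is the ABBA pattern described in the description, so the program hangs by design when run.

```c
/*
 * Illustrative sketch only -- not Lustre code. Two pthread mutexes stand in
 * for (a) the LOOKUP DLM lock held on the client and (b) the mdc rpc_lock.
 * lookup_path() mimics the ls/lookup thread: it "holds" the LOOKUP lock and
 * then needs rpc_lock to send the UPDATE enqueue to the slave MDT.
 * modify_path() mimics a thread that already holds rpc_lock and whose
 * request needs the LOOKUP lock. Running this hangs forever (ABBA deadlock).
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lookup_lock = PTHREAD_MUTEX_INITIALIZER; /* LOOKUP DLM lock (client side) */
static pthread_mutex_t rpc_lock    = PTHREAD_MUTEX_INITIALIZER; /* mdc rpc_lock mutex            */

static void *lookup_path(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&lookup_lock);   /* LOOKUP lock granted by master MDT */
	sleep(1);                           /* let the other thread grab rpc_lock */
	printf("lookup path: need rpc_lock to enqueue UPDATE on slave MDT\n");
	pthread_mutex_lock(&rpc_lock);      /* blocks: the other thread holds it */
	pthread_mutex_unlock(&rpc_lock);
	pthread_mutex_unlock(&lookup_lock);
	return NULL;
}

static void *modify_path(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&rpc_lock);      /* RPC lock taken first */
	sleep(1);
	printf("modify path: enqueue needs the LOOKUP lock held by the other thread\n");
	pthread_mutex_lock(&lookup_lock);   /* blocks: the other thread holds it */
	pthread_mutex_unlock(&lookup_lock);
	pthread_mutex_unlock(&rpc_lock);
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, lookup_path, NULL);
	pthread_create(&b, NULL, modify_path, NULL);
	pthread_join(a, NULL);              /* never returns: ABBA deadlock */
	pthread_join(b, NULL);
	return 0;
}
```

Either of the fixes suggested above (a separate portal, or skipping rpc_lock for cross-ref RPCs) would break the cycle by removing one of the two hold-and-wait edges.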
| Comments |
| Comment by Di Wang [ 26/Aug/15 ] |
|
This will not be an issue once the multiple-slot patch is landed.