[LU-15491] rename deadlock Created: 28/Jan/22 Updated: 15/Aug/23

| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.15.0 |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Oleg Drokin | Assignee: | Oleg Drokin |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Severity: | 3 |
| Description |
thread 1: mv dir1/10 dir2/10

This is reproduced with a non-DNE racer run here:
http://testing.linuxhacker.ru:3333/lustre-reports/21289/testresults/racer-long-zfs-centos7_x86_64-centos7_x86_64-retry3/

The deadlock report shows:

Deadlock!
ungranted lock
-- Lock: 0xffff8800a93f5b00/0x539a6d335586e266 (pid: 6606)
Node: local, ns lustre-MDT0000
Resource: [0x200000401:0x9fd9:0x0].0
Req mode: EX, grant mode: --, read: 0, write: 1
Bits: 0x3
waits for granted lock
-- Lock: 0xffff8800af61e1c0/0x539a6d335586e235 (pid: 14160)
Node: local, ns lustre-MDT0000
Resource: [0x200000401:0x9fd9:0x0].0
Req mode: EX, grant mode: EX, read: 0, write: 1
Bits: 0x21
but holder is blocked waiting on another lock
-- Lock: 0xffff880114890900/0x539a6d335586e258 (pid: 14160)
Node: local, ns lustre-MDT0000
Resource: [0x200000401:0xa0b7:0x0].0
Req mode: EX, grant mode: --, read: 0, write: 1
Bits: 0x3
that is held by the first thread wanting the first lock in the chain
-- Lock: 0xffff8800a93f61c0/0x539a6d335586e23c (pid: 6606)
Node: local, ns lustre-MDT0000
Resource: [0x200000401:0xa0b7:0x0].0
Req mode: EX, grant mode: EX, read: 0, write: 1
Bits: 0x21
Deadlock!
ungranted lock
-- Lock: 0xffff880114890900/0x539a6d335586e258 (pid: 14160)
Node: local, ns lustre-MDT0000
Resource: [0x200000401:0xa0b7:0x0].0
Req mode: EX, grant mode: --, read: 0, write: 1
Bits: 0x3
waits for granted lock
-- Lock: 0xffff8800a93f61c0/0x539a6d335586e23c (pid: 6606)
Node: local, ns lustre-MDT0000
Resource: [0x200000401:0xa0b7:0x0].0
Req mode: EX, grant mode: EX, read: 0, write: 1
Bits: 0x21
but holder is blocked waiting on another lock
-- Lock: 0xffff8800a93f5b00/0x539a6d335586e266 (pid: 6606)
Node: local, ns lustre-MDT0000
Resource: [0x200000401:0x9fd9:0x0].0
Req mode: EX, grant mode: --, read: 0, write: 1
Bits: 0x3
that is held by the first thread wanting the first lock in the chain
-- Lock: 0xffff8800af61e1c0/0x539a6d335586e235 (pid: 14160)
Node: local, ns lustre-MDT0000
Resource: [0x200000401:0x9fd9:0x0].0
Req mode: EX, grant mode: EX, read: 0, write: 1
Bits: 0x21
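The two chains above are a textbook ABBA inversion: pid 6606 holds EX (bits 0x21, LOOKUP|XATTR) on [0x200000401:0xa0b7:0x0] while waiting for EX (bits 0x3, LOOKUP|UPDATE) on [0x200000401:0x9fd9:0x0], and pid 14160 holds 0x21 on 0x9fd9 while waiting for 0x3 on 0xa0b7. Below is a minimal userspace sketch of the same shape, with pthread mutexes standing in for the ibits locks; nothing here is Lustre code:

/*
 * ABBA sketch: thread A plays pid 6606 (locks obj_a0b7 then
 * obj_9fd9), thread B plays pid 14160 (locks obj_9fd9 then
 * obj_a0b7). Both threads hang, mirroring the two mirrored
 * "Deadlock!" chains in the report.
 */
#include <pthread.h>
#include <unistd.h>

static pthread_mutex_t obj_9fd9 = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t obj_a0b7 = PTHREAD_MUTEX_INITIALIZER;

static void *thread_6606(void *arg)
{
	pthread_mutex_lock(&obj_a0b7);	/* source lock, bits 0x21 */
	sleep(1);			/* widen the race window */
	pthread_mutex_lock(&obj_9fd9);	/* target lock, bits 0x3 - blocks */
	pthread_mutex_unlock(&obj_9fd9);
	pthread_mutex_unlock(&obj_a0b7);
	return NULL;
}

static void *thread_14160(void *arg)
{
	pthread_mutex_lock(&obj_9fd9);	/* source lock, bits 0x21 */
	sleep(1);
	pthread_mutex_lock(&obj_a0b7);	/* target lock, bits 0x3 - blocks */
	pthread_mutex_unlock(&obj_a0b7);
	pthread_mutex_unlock(&obj_9fd9);
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, thread_6606, NULL);
	pthread_create(&b, NULL, thread_14160, NULL);
	pthread_join(a, NULL);		/* never returns: ABBA deadlock */
	pthread_join(b, NULL);
	return 0;
}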
thread info for pid 6606 at 0xffff8800d3c6e000: mti_rr = {
rr_opcode = REINT_RENAME,
rr_open_handle = 0x0,
rr_lease_handle = 0x0,
rr_fid1 = 0xffff880116b15000,
rr_fid2 = 0xffff880116b15010,
rr_name = {
ln_name = 0xffff880116b15060 "10",
ln_namelen = 2
},
rr_tgt_name = {
ln_name = 0xffff880116b15068 "10",
ln_namelen = 2
},
crash> p *(struct ll_fid *)0xffff880116b15000
$3 = {
id = 8589935617,
generation = 1,
f_type = 0
}
crash> p *(struct ll_fid *)0xffff880116b15010
$4 = {
id = 8589935618,
generation = 41273,
f_type = 0
}
thread info for pid 14160 at 0xffff88010f554000: mti_rr = {
rr_opcode = REINT_RENAME,
rr_open_handle = 0x0,
rr_lease_handle = 0x0,
rr_fid1 = 0xffff880116b15c20,
rr_fid2 = 0xffff880116b15c30,
rr_name = {
ln_name = 0xffff880116b15c80 "11",
ln_namelen = 2
},
rr_tgt_name = {
ln_name = 0xffff880116b15c88 "2",
ln_namelen = 1
},
crash> p *(struct ll_fid *)0xffff880116b15c20
$6 = {
id = 8589935617,
generation = 1,
f_type = 0
}
crash> p *(struct ll_fid *)0xffff880116b15c30
$7 = {
id = 8589935617,
generation = 1,
f_type = 0
}
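For what it's worth, my decoding of the dumps above, assuming the rr_fid pointers really point at struct lu_fid (so ll_fid.id lines up with f_seq and .generation with f_oid): 8589935617 = 0x200000401 and 8589935618 = 0x200000402, i.e. pid 6606 is renaming "10" from parent [0x200000401:0x1:0x0] into parent [0x200000402:0xa139:0x0] (41273 = 0xa139), while pid 14160 renames "11" to "2" with [0x200000401:0x1:0x0] as both source and target parent.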
And indeed in mdt_reint_rename() we see:

lock_ibits = MDS_INODELOCK_LOOKUP | MDS_INODELOCK_XATTR;
rc = mdt_rename_source_lock(info, msrcdir, mold, lh_oldp, lh_rmt,
                            lock_ibits, cos_incompat);

i.e. lock bits 0x21 (LOOKUP|XATTR) on mold, followed by:
lh_newp = &info->mti_lh[MDT_LH_NEW];
mdt_lock_reg_init(lh_newp, LCK_EX);
lock_ibits = MDS_INODELOCK_LOOKUP | MDS_INODELOCK_UPDATE;
... // skip remote check
rc = mdt_reint_object_lock(info, mnew, lh_newp, lock_ibits,
cos_incompat);
i.e. lock bits 0x3 (LOOKUP|UPDATE) on mnew.

A vmcore (the server one) is available in the report if you want to marvel at the coincidence.
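A standard remedy for this class of ABBA deadlock is to acquire the two object locks in one canonical order, e.g. by comparing FIDs, so every thread locks mold and mnew in the same sequence. A sketch follows; the types and helpers are hypothetical stand-ins, this is not actual Lustre code, and I have not checked whether the patch on review takes this approach:

#include <pthread.h>

/* Illustrative stand-ins, not real Lustre types or interfaces. */
struct fid { unsigned long long seq; unsigned int oid; unsigned int ver; };
struct obj { struct fid fid; pthread_mutex_t lock; };

/* Total order on FIDs: compare seq, then oid, then ver. */
static int fid_cmp(const struct fid *a, const struct fid *b)
{
	if (a->seq != b->seq)
		return a->seq < b->seq ? -1 : 1;
	if (a->oid != b->oid)
		return a->oid < b->oid ? -1 : 1;
	if (a->ver != b->ver)
		return a->ver < b->ver ? -1 : 1;
	return 0;
}

/*
 * Lock the rename source and target in FID order so that both
 * racing rename threads acquire them in the same sequence and
 * the ABBA cycle above cannot form.
 */
static void lock_rename_pair(struct obj *mold, struct obj *mnew)
{
	struct obj *first = mold, *second = mnew;

	if (fid_cmp(&mnew->fid, &mold->fid) < 0) {
		first = mnew;
		second = mold;
	}
	pthread_mutex_lock(&first->lock);
	if (second != first)		/* same object: lock it only once */
		pthread_mutex_lock(&second->lock);
}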
| Comments |
| Comment by Gerrit Updater [ 28/Jan/22 ] |
"Oleg Drokin <green@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/46352
| Comment by Andreas Dilger [ 04/Nov/22 ] |
+1 on master: https://testing.whamcloud.com/test_sets/2f4ce97f-d375-4353-8c2c-7a8696c9f84b

[Fri Nov 4 17:45:37 2022] task:mv state:D stack: 0 pid:499292 ppid:235624 flags:0x00000080
[Fri Nov 4 17:45:37 2022] Call Trace:
[Fri Nov 4 17:45:37 2022] schedule+0x35/0xa0
[Fri Nov 4 17:45:37 2022] __mutex_lock.isra.7+0x310/0x420
[Fri Nov 4 17:45:37 2022] lock_rename+0x28/0xb0
[Fri Nov 4 17:45:37 2022] do_renameat2+0x277/0x4e0
[Fri Nov 4 17:45:37 2022] __x64_sys_rename+0x1c/0x20

[Fri Nov 4 17:45:37 2022] task:mv state:D stack: 0 pid:499317 ppid:235529 flags:0x00000080
[Fri Nov 4 17:45:37 2022] Call Trace:
[Fri Nov 4 17:45:37 2022] schedule+0x35/0xa0
[Fri Nov 4 17:45:37 2022] __mutex_lock.isra.7+0x310/0x420
[Fri Nov 4 17:45:37 2022] lock_rename+0x28/0xb0
[Fri Nov 4 17:45:37 2022] do_renameat2+0x277/0x4e0
[Fri Nov 4 17:45:37 2022] __x64_sys_rename+0x1c/0x20

[Fri Nov 4 17:43:34 2022] task:mv state:D stack: 0 pid:198990 ppid:184082 flags:0x00000080
[Fri Nov 4 17:43:34 2022] Call Trace:
[Fri Nov 4 17:43:34 2022] schedule+0x35/0xa0
[Fri Nov 4 17:43:34 2022] __mutex_lock.isra.7+0x310/0x420
[Fri Nov 4 17:43:34 2022] lock_rename+0x28/0xb0
[Fri Nov 4 17:43:34 2022] do_renameat2+0x277/0x4e0
[Fri Nov 4 17:43:34 2022] __x64_sys_renameat+0x1d/0x20

[Fri Nov 4 17:43:34 2022] task:mv state:D stack: 0 pid:198998 ppid:184046 flags:0x00004080
[Fri Nov 4 17:43:34 2022] Call Trace:
[Fri Nov 4 17:43:34 2022] schedule+0x35/0xa0
[Fri Nov 4 17:43:34 2022] __mutex_lock.isra.7+0x310/0x420
[Fri Nov 4 17:43:34 2022] lock_rename+0x28/0xb0
[Fri Nov 4 17:43:34 2022] do_renameat2+0x277/0x4e0
[Fri Nov 4 17:43:34 2022] __x64_sys_renameat2+0x20/0x30

and one MDS thread stuck in rename:

[Fri Nov 4 17:42:41 2022] Pid: 114790, comm: mdt00_023 4.18.0-372.26.1.el8_lustre.x86_64 #1 SMP Mon Oct 17 07:45:27 UTC 2022
[Fri Nov 4 17:42:41 2022] Call Trace TBD:
[Fri Nov 4 17:42:41 2022] [<0>] ldlm_completion_ast+0x7ac/0x900 [ptlrpc]
[Fri Nov 4 17:42:41 2022] [<0>] ldlm_cli_enqueue_fini+0xa4d/0xf40 [ptlrpc]
[Fri Nov 4 17:42:41 2022] [<0>] ldlm_cli_enqueue+0x61d/0xa70 [ptlrpc]
[Fri Nov 4 17:42:41 2022] [<0>] osp_md_object_lock+0x1cd/0x2c0 [osp]
[Fri Nov 4 17:42:41 2022] [<0>] lod_object_lock+0x132/0x7f0 [lod]
[Fri Nov 4 17:42:41 2022] [<0>] mdt_remote_object_lock_try+0x186/0x3c0 [mdt]
[Fri Nov 4 17:42:41 2022] [<0>] mdt_remote_object_lock+0x1f/0x30 [mdt]
[Fri Nov 4 17:42:41 2022] [<0>] mdt_rename_lock+0x26b/0x3a0 [mdt]
[Fri Nov 4 17:42:41 2022] [<0>] mdt_reint_migrate+0x5bd/0x1430 [mdt]
[Fri Nov 4 17:42:41 2022] [<0>] mdt_reint_rec+0x117/0x270 [mdt]
[Fri Nov 4 17:42:41 2022] [<0>] mdt_reint_internal+0x4bc/0x7d0 [mdt]
[Fri Nov 4 17:42:41 2022] [<0>] mdt_reint+0x5d/0x110 [mdt]
[Fri Nov 4 17:42:41 2022] [<0>] tgt_request_handle+0xc8c/0x19c0 [ptlrpc]
[Fri Nov 4 17:42:41 2022] [<0>] ptlrpc_server_handle_request+0x31d/0xbc0 [ptlrpc]
[Fri Nov 4 17:42:41 2022] [<0>] ptlrpc_main+0xc0b/0x1560 [ptlrpc]
| Comment by Andreas Dilger [ 15/Aug/23 ] |
+1 on master racer, with the same mdt_reint_migrate() + mdt_rename_lock() stack traces on the MDS and stuck mv processes on the client.