[LU-15285] same dir rename deadlock Created: 29/Nov/21 Updated: 11/Oct/23 Resolved: 31/Jan/22 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.15.0 |
| Fix Version/s: | Lustre 2.15.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Oleg Drokin | Assignee: | Oleg Drokin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Removal of the big rename lock uncovered a deadlock. When two threads race to perform opposite renames:

mv a b & mv b a

we obtain the PDO locks in source-target order, which opens up the deadlock (easily seen on master):

Deadlock!
ungranted lock -- Lock: 0xffff8800b3d24b40/0x45f9a916d27c17ce (pid: 14611)
  Resource: 8589935618:1:0:13873
  Req mode: PW, grant mode: --, read: 0, write: 1 Bits: 0x2
waits for granted lock -- Lock: 0xffff8800b4795200/0x45f9a916d27c17c0 (pid: 12773)
  Resource: 8589935618:1:0:13873
  Req mode: PW, grant mode: PW, read: 0, write: 1 Bits: 0x2
that is blocked waiting on another lock -- Lock: 0xffff8800b4797600/0x45f9a916d27c17d5 (pid: 12773)
  Resource: 8589935618:1:0:13361
  Req mode: PW, grant mode: --, read: 0, write: 1 Bits: 0x2
that is held by the first thread wanting the first lock in the chain -- Lock: 0xffff8800b4513840/0x45f9a916d27c17c7 (pid: 14611)
  Resource: 8589935618:1:0:13361
  Req mode: PW, grant mode: PW, read: 0, write: 1 Bits: 0x2

Deadlock!
ungranted lock -- Lock: 0xffff8800b4797600/0x45f9a916d27c17d5 (pid: 12773)
  Resource: 8589935618:1:0:13361
  Req mode: PW, grant mode: --, read: 0, write: 1 Bits: 0x2
waits for granted lock -- Lock: 0xffff8800b4513840/0x45f9a916d27c17c7 (pid: 14611)
  Resource: 8589935618:1:0:13361
  Req mode: PW, grant mode: PW, read: 0, write: 1 Bits: 0x2
that is blocked waiting on another lock -- Lock: 0xffff8800b3d24b40/0x45f9a916d27c17ce (pid: 14611)
  Resource: 8589935618:1:0:13873
  Req mode: PW, grant mode: --, read: 0, write: 1 Bits: 0x2
that is held by the first thread wanting the first lock in the chain -- Lock: 0xffff8800b4795200/0x45f9a916d27c17c0 (pid: 12773)
  Resource: 8589935618:1:0:13873
  Req mode: PW, grant mode: PW, read: 0, write: 1 Bits: 0x2

The two racing rename requests:

rr_opcode = REINT_RENAME,
rr_open_handle = 0x0,
rr_lease_handle = 0x0,
rr_fid1 = 0xffff8800a9d81b40,
rr_fid2 = 0xffff8800a9d81b50,
rr_name = {
ln_name = 0xffff8800a9d81ba0 "14",
ln_namelen = 2
},
rr_tgt_name = {
ln_name = 0xffff8800a9d81ba8 "16",
ln_namelen = 2
},
---
rr_opcode = REINT_RENAME,
rr_open_handle = 0x0,
rr_lease_handle = 0x0,
rr_fid1 = 0xffff8800a9d84b58,
rr_fid2 = 0xffff8800a9d84b68,
rr_name = {
ln_name = 0xffff8800a9d84bb8 "16",
ln_namelen = 2
},
rr_tgt_name = {
ln_name = 0xffff8800a9d84bc0 "14",
ln_namelen = 2
},
crash> bt 12773
PID: 12773  TASK: ffff8800b827c440  CPU: 3  COMMAND: "mdt00_005"
 #0 [ffff8800b7b37920] __schedule at ffffffff817e3e22
 #1 [ffff8800b7b37988] schedule at ffffffff817e4339
 #2 [ffff8800b7b37998] ldlm_completion_ast at ffffffffa05ec3dd [ptlrpc]
 #3 [ffff8800b7b37a38] ldlm_cli_enqueue_local at ffffffffa05ea219 [ptlrpc]
 #4 [ffff8800b7b37ad8] mdt_reint_rename at ffffffffa0d53948 [mdt]
 #5 [ffff8800b7b37bf0] mdt_reint_rec at ffffffffa0d5dfb7 [mdt]
 #6 [ffff8800b7b37c18] mdt_reint_internal at ffffffffa0d32acc [mdt]
 #7 [ffff8800b7b37c58] mdt_reint at ffffffffa0d3d647 [mdt]
 #8 [ffff8800b7b37c88] tgt_request_handle at ffffffffa06852be [ptlrpc]
 #9 [ffff8800b7b37d18] ptlrpc_server_handle_request at ffffffffa06309c0 [ptlrpc]
#10 [ffff8800b7b37dd0] ptlrpc_main at ffffffffa0632559 [ptlrpc]
#11 [ffff8800b7b37ea8] kthread at ffffffff810ba114
#12 [ffff8800b7b37f50] ret_from_fork_nospec_begin at ffffffff817f1e5d

crash> bt 14611
PID: 14611  TASK: ffff8800c930b330  CPU: 1  COMMAND: "mdt00_013"
 #0 [ffff8800b7efb920] __schedule at ffffffff817e3e22
 #1 [ffff8800b7efb988] schedule at ffffffff817e4339
 #2 [ffff8800b7efb998] ldlm_completion_ast at ffffffffa05ec3dd [ptlrpc]
 #3 [ffff8800b7efba38] ldlm_cli_enqueue_local at ffffffffa05ea219 [ptlrpc]
 #4 [ffff8800b7efbad8] mdt_reint_rename at ffffffffa0d53948 [mdt]
 #5 [ffff8800b7efbbf0] mdt_reint_rec at ffffffffa0d5dfb7 [mdt]
 #6 [ffff8800b7efbc18] mdt_reint_internal at ffffffffa0d32acc [mdt]
 #7 [ffff8800b7efbc58] mdt_reint at ffffffffa0d3d647 [mdt]
 #8 [ffff8800b7efbc88] tgt_request_handle at ffffffffa06852be [ptlrpc]
 #9 [ffff8800b7efbd18] ptlrpc_server_handle_request at ffffffffa06309c0 [ptlrpc]
#10 [ffff8800b7efbdd0] ptlrpc_main at ffffffffa0632559 [ptlrpc]
#11 [ffff8800b7efbea8] kthread at ffffffff810ba114
#12 [ffff8800b7efbf50] ret_from_fork_nospec_begin at ffffffff817f1e5d

Additionally, while looking at this code, it is not yet clear to me why we need mdt_pdir_hash_lock at all, since everything it does is also done by mdt_object_local_lock (which we reach via the mdt_object_lock_save call). |
| Comments |
| Comment by Gerrit Updater [ 29/Nov/21 ] |
|
"Oleg Drokin <green@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45676 |
| Comment by Andreas Dilger [ 30/Nov/21 ] |
|
Oleg, I was thinking about your mdt_pdir_hash_lock vs. mdt_object_local_lock comment. If they were taking exactly the same lock, wouldn't the thread deadlock on itself in that case? |
| Comment by Gerrit Updater [ 31/Jan/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45676/ |
| Comment by Peter Jones [ 31/Jan/22 ] |
|
Landed for 2.15 |