Details
Type: Bug
Resolution: Fixed
Priority: Major
Environment: Production Lustre 2.12.6 + patches
Severity: 3
Description
This issue was seen on a production filesystem.
An MDT became unresponsive (as seen from the clients) after a user mistakenly started a job on HSM-released files.
To resolve the issue, the administrator had to crash the MDT and the copytools and stop the problematic job.
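For context, here is a minimal user-space sketch of what such a job ends up doing (an assumption about the job's behaviour, not taken from the actual job): simply reading an HSM-released file makes the client send an implicit RESTORE request to the coordinator, and the read blocks until the copytool has staged the data back in.

/* Minimal sketch (assumption about the job, not the actual job code):
 * reading an HSM-released file triggers an implicit RESTORE and blocks
 * until the copytool has restored the data. Running this on many
 * released files in parallel produces many concurrent RESTORE requests. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	char buf[4096];
	ssize_t n;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <released file>\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Blocks here until the HSM copytool has restored the file data. */
	n = read(fd, buf, sizeof(buf));
	printf("read %zd bytes\n", n);
	close(fd);
	return 0;
}

Run from many clients at once, reads like this produce many simultaneous RESTORE requests on the coordinator, which is the situation analysed below.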
dmesg
LustreError: 2989:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1634368802, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-lustre1-MDT0001_UUID lock: ffff8a89b9d48fc0/0x936f3f94aa7a15ba lrc: 3/0,1 mode: --/EX res: [0xc40051b70:0xa:0x0].0x0 bits 0x8/0x0 rrc: 378 type: IBT flags: 0x40210400000020 nid: local remote: 0x0 expref: -99 pid: 2989 timeout: 0 lvb_type: 0
stack
[1371853.573298] LNet: Service thread pid 3160 was inactive for 200.47s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[1371853.590331] LNet: Skipped 3 previous similar messages
[1371853.595565] Pid: 3160, comm: mdt01_083 3.10.0-1062.18.1.el7.x86_64 #1 SMP Tue Mar 17 23:49:17 UTC 2020
[1371853.595567] Call Trace:
[1371853.595584] [<ffffffffc113a220>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
[1371853.602747] [<ffffffffc113b101>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
[1371853.610134] [<ffffffffc170b69b>] mdt_object_local_lock+0x50b/0xb20 [mdt]
[1371853.617154] [<ffffffffc170bd20>] mdt_object_lock_internal+0x70/0x360 [mdt]
[1371853.624325] [<ffffffffc170ebaa>] mdt_object_find_lock+0x6a/0x1a0 [mdt]
[1371853.631160] [<ffffffffc175d56f>] cdt_restore_handle_add+0xdf/0x540 [mdt]
[1371853.638167] [<ffffffffc175422b>] mdt_hsm_register_hal+0x59b/0xb10 [mdt]
[1371853.645082] [<ffffffffc1754d1c>] mdt_hsm_add_actions+0x57c/0x6c0 [mdt]
[1371853.651910] [<ffffffffc1744b8a>] mdt_hsm_request+0x4da/0xa30 [mdt]
[1371853.658402] [<ffffffffc11d8aaa>] tgt_request_handle+0xada/0x1570 [ptlrpc]
[1371853.665549] [<ffffffffc117d9cb>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[1371853.673461] [<ffffffffc1181334>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
[1371853.679982] [<ffffffff998c6321>] kthread+0xd1/0xe0
[1371853.685073] [<ffffffff99f8dd37>] ret_from_fork_nospec_end+0x0/0x39
[1371853.691545] [<ffffffffffffffff>] 0xffffffffffffffff
Analysis
I was able to find the ldlm_lock and the ldlm_resource in the crash dump.
The released file was big (over 430 GB), so the copytool took a long time to retrieve its data.
Here is the granted lock for the resource:
crash> ldlm_resource.lr_granted -o 0xffff8a8a352e3b00
struct ldlm_resource {
  [ffff8a8a352e3b20] struct list_head lr_granted;
}
crash> list -H ffff8a8a352e3b20 -s ldlm_lock ldlm_lock.l_res_link
ffff8a89806018c0
struct ldlm_lock {
  l_handle = { h_link = { next = 0xffff8a6c38bf5000, prev = 0xffff8a6c6aa204e0 }, h_cookie = 10623779953738126729, h_owner = 0x0, h_ops = 0xffffffffc124e730 <lock_handle_ops>, h_rcu = { next = 0x0, func = 0x0 }, h_lock = { { rlock = { raw_lock = { val = { counter = 0 } } } } }, h_size = 0, h_in = 1 },
  l_refc = { counter = 2 },
  l_lock = { { rlock = { raw_lock = { val = { counter = 0 } } } } },
  l_resource = 0xffff8a8a352e3b00,
  l_lru = { next = 0xffff8a8980601910, prev = 0xffff8a8980601910 },
  l_res_link = { next = 0xffff8a8a352e3b20, prev = 0xffff8a8a352e3b20 },
  { l_tree_node = 0xffff8a898ec77a80, l_ibits_node = 0xffff8a898ec77a80 },
  l_exp_hash = { next = 0x0, pprev = 0x0 },
  l_exp_flock_hash = { next = 0x0, pprev = 0x0 },
  l_req_mode = LCK_EX,
  l_granted_mode = LCK_EX,
  l_completion_ast = 0xffffffffc1139df0 <ldlm_completion_ast>,
  l_blocking_ast = 0xffffffffc16fb960 <mdt_blocking_ast>,
  l_glimpse_ast = 0x0,
  l_export = 0x0,
  l_conn_export = 0x0,
  l_remote_handle = { cookie = 0 },
  l_policy_data = { l_extent = { start = 8, end = 0, gid = 0 }, l_flock = { start = 8, end = 0, owner = 0, blocking_owner = 0, blocking_export = 0x0, blocking_refs = { counter = 0 }, pid = 0 }, l_inodebits = { bits = 8, { try_bits = 0, cancel_bits = 0 } } },
  l_flags = 18050699573067808,
  l_readers = 0,
  l_writers = 1,
  l_waitq = { lock = { { rlock = { raw_lock = { val = { counter = 0 } } } } }, task_list = { next = 0xffff8a89806019d8, prev = 0xffff8a89806019d8 } },
  l_last_used = { tv64 = 1371509486081554 },
  l_req_extent = { start = 0, end = 0, gid = 0 },
  l_lvb_type = LVB_T_NONE,
  l_lvb_len = 0,
  l_lvb_data = 0x0,
  l_ast_data = 0x0,
  { l_activity = 1634368802, l_blast_sent = 1634368802 },
  l_ost_lvb = { lvb_size = 0, lvb_mtime = 0, lvb_atime = 0, lvb_ctime = 0, lvb_blocks = 0, lvb_mtime_ns = 0, lvb_atime_ns = 0, lvb_ctime_ns = 0, lvb_padding = 0 },
  l_client_cookie = 10623779952545863461,
  l_pending_chain = { next = 0xffff8a8980601a68, prev = 0xffff8a8980601a68 },
  l_callback_timeout = 0,
  l_pid = 2930,
  l_bl_ast_run = 1,
  l_bl_ast = { next = 0xffff8a8980601a88, prev = 0xffff8a8980601a88 },
  l_cp_ast = { next = 0xffff8a8980601a98, prev = 0xffff8a8980601a98 },
  l_rk_ast = { next = 0xffff8a8980601aa8, prev = 0xffff8a8980601aa8 },
  l_blocking_lock = 0x0,
  l_sl_mode = { next = 0xffff8a8980601ac0, prev = 0xffff8a8980601ac0 },
  l_sl_policy = { next = 0xffff8a8980601ad0, prev = 0xffff8a8980601ad0 },
  l_reference = {<No data fields>},
  l_exp_list = { next = 0xffff8a8980601ae0, prev = 0xffff8a8980601ae0 }
}
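In this dump the granted mode is LCK_EX with l_inodebits.bits = 0x8, created by pid 2930 (l_pid). As a quick sanity check that 0x8 is the layout bit, here is a tiny decoder; the MDS_INODELOCK_* defines mirror my reading of Lustre's lustre_idl.h (an assumption for this 2.12 tree, not taken from the dump):

/* Tiny decoder for the ibits value seen in the lock dump above.
 * The MDS_INODELOCK_* values mirror Lustre's lustre_idl.h (assumed
 * unchanged in this 2.12 tree). */
#include <stdio.h>

#define MDS_INODELOCK_LOOKUP 0x01 /* name/dentry */
#define MDS_INODELOCK_UPDATE 0x02 /* attributes */
#define MDS_INODELOCK_OPEN   0x04 /* open handle */
#define MDS_INODELOCK_LAYOUT 0x08 /* file layout */

int main(void)
{
	unsigned long long bits = 0x8; /* l_inodebits.bits from the dump */

	printf("lookup:%d update:%d open:%d layout:%d\n",
	       !!(bits & MDS_INODELOCK_LOOKUP),
	       !!(bits & MDS_INODELOCK_UPDATE),
	       !!(bits & MDS_INODELOCK_OPEN),
	       !!(bits & MDS_INODELOCK_LAYOUT));
	return 0;
}

This prints "lookup:0 update:0 open:0 layout:1", matching the "bits 0x8" in the dmesg message above: the EX lock held (and waited for) on this resource is the layout lock.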
There are a lot of EX locks in the waiting list:
crash> ldlm_resource.lr_waiting -o 0xffff8a8a352e3b00
struct ldlm_resource {
  [ffff8a8a352e3b30] struct list_head lr_waiting;
}
crash> list -H ffff8a8a352e3b30 -s ldlm_lock.l_req_mode ldlm_lock.l_res_link | grep EX | wc -l
243
For each waiting lock, the creating PID (l_pid) corresponds to a hung thread. All of these threads are in cdt_restore_handle_add().
2021-10-16_12:48:44 [1384176467802037] [IN] PID: 3149  TASK: ffff8a6bd74aa0e0  CPU: 15  COMMAND: "mdt01_081"
2021-10-16_12:48:44 [1384176467755519] [IN] PID: 3031  TASK: ffff8a6c61f7b150  CPU: 14  COMMAND: "mdt01_049"
2021-10-16_12:48:44 [1384176460239208] [IN] PID: 3186  TASK: ffff8a8cbcae1070  CPU: 2   COMMAND: "mdt00_107"
2021-10-16_12:48:44 [1384176460192173] [IN] PID: 2983  TASK: ffff8a8c5f215230  CPU: 0   COMMAND: "mdt00_049"
2021-10-16_12:48:44 [1384176330809933] [IN] PID: 2930  TASK: ffff8a8cbb52e2a0  CPU: 5   COMMAND: "mdt00_031"
2021-10-16_09:25:01 [1371953151468039] [IN] PID: 3259  TASK: ffff8a8cbb613150  CPU: 11  COMMAND: "mdt01_121"
2021-10-16_09:25:01 [1371953151453579] [IN] PID: 3115  TASK: ffff8a8c60a841c0  CPU: 11  COMMAND: "mdt01_072"
...
...
2021-10-16_09:25:01 [1371953111405844] [IN] PID: 3005  TASK: ffff8a8c7638a0e0  CPU: 1   COMMAND: "mdt00_056"
2021-10-16_09:25:01 [1371953111403702] [IN] PID: 2989  TASK: ffff8a6cbcf28000  CPU: 13  COMMAND: "mdt01_037"
Only 5 of the 248 MDT service threads were still able to handle client requests (3 on CPT0 and 2 on CPT1).
For now, I think there is a window in which several clients can send HSM restore requests for a file whose EX layout lock has not yet been taken by the MDT.
The first MDT thread takes the EX lock in cdt_restore_handle_add(); the other threads then wait for that lock until the RESTORE is finished.
This can cause all the available service threads to hang, waiting for the first RESTORE request to finish.
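To make this failure mode concrete, here is a small user-space model (an analogy only, not Lustre code and not a proposed fix): a pool of "service threads" all receive a restore request for the same file; the first one takes an exclusive lock and holds it for the whole slow restore, and every other thread blocks on that lock, draining the pool.

/* User-space analogy of the hang (not Lustre code): THREADS "service
 * threads" all receive a restore request for the same file. The first
 * one takes an exclusive lock and holds it for the whole (slow) restore;
 * the others block on the lock, so no thread is left for other work.
 * Build with: cc -pthread model.c */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define THREADS 8

static pthread_mutex_t layout_lock = PTHREAD_MUTEX_INITIALIZER; /* EX layout lock analogue */
static int restored; /* protected by layout_lock */

static void *service_thread(void *arg)
{
	long id = (long)arg;

	printf("thread %ld: restore request, waiting for layout lock\n", id);
	pthread_mutex_lock(&layout_lock);   /* cdt_restore_handle_add() analogue */
	if (!restored) {
		printf("thread %ld: restoring (slow copytool)...\n", id);
		sleep(10);                  /* stands in for staging a huge file */
		restored = 1;
	}
	pthread_mutex_unlock(&layout_lock);
	printf("thread %ld: done\n", id);
	return NULL;
}

int main(void)
{
	pthread_t tid[THREADS];
	long i;

	/* All service threads get a restore request for the same file:
	 * only the first makes progress, the rest are stuck on the lock
	 * and cannot serve any other client request in the meantime. */
	for (i = 0; i < THREADS; i++)
		pthread_create(&tid[i], NULL, service_thread, (void *)i);
	for (i = 0; i < THREADS; i++)
		pthread_join(tid[i], NULL);
	return 0;
}

With THREADS scaled up to the size of the MDT service-thread pool, this reproduces the observed state: one thread doing the restore, everything else parked behind a single exclusive lock for the duration of the copytool transfer.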