Details
Type: Bug
Resolution: Fixed
Priority: Major
Environment: Production Lustre 2.12.6 + patches
Severity: 3
Description
This issue was seen on a production filesystem.
An MDT became unresponsive to clients after a user mistakenly started a job on HSM-released files.
To resolve the issue, the administrator had to crash the MDT and the copytools and stop the problematic job.
dmesg
LustreError: 2989:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1634368802, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-lustre1-MDT0001_UUID lock: ffff8a89b9d48fc0/0x936f3f94aa7a15ba lrc: 3/0,1 mode: --/EX res: [0xc40051b70:0xa:0x0].0x0 bits 0x8/0x0 rrc: 378 type: IBT flags: 0x40210400000020 nid: local remote: 0x0 expref: -99 pid: 2989 timeout: 0 lvb_type: 0
stack
[1371853.573298] LNet: Service thread pid 3160 was inactive for 200.47s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[1371853.590331] LNet: Skipped 3 previous similar messages
[1371853.595565] Pid: 3160, comm: mdt01_083 3.10.0-1062.18.1.el7.x86_64 #1 SMP Tue Mar 17 23:49:17 UTC 2020
[1371853.595567] Call Trace:
[1371853.595584] [<ffffffffc113a220>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
[1371853.602747] [<ffffffffc113b101>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
[1371853.610134] [<ffffffffc170b69b>] mdt_object_local_lock+0x50b/0xb20 [mdt]
[1371853.617154] [<ffffffffc170bd20>] mdt_object_lock_internal+0x70/0x360 [mdt]
[1371853.624325] [<ffffffffc170ebaa>] mdt_object_find_lock+0x6a/0x1a0 [mdt]
[1371853.631160] [<ffffffffc175d56f>] cdt_restore_handle_add+0xdf/0x540 [mdt]
[1371853.638167] [<ffffffffc175422b>] mdt_hsm_register_hal+0x59b/0xb10 [mdt]
[1371853.645082] [<ffffffffc1754d1c>] mdt_hsm_add_actions+0x57c/0x6c0 [mdt]
[1371853.651910] [<ffffffffc1744b8a>] mdt_hsm_request+0x4da/0xa30 [mdt]
[1371853.658402] [<ffffffffc11d8aaa>] tgt_request_handle+0xada/0x1570 [ptlrpc]
[1371853.665549] [<ffffffffc117d9cb>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[1371853.673461] [<ffffffffc1181334>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
[1371853.679982] [<ffffffff998c6321>] kthread+0xd1/0xe0
[1371853.685073] [<ffffffff99f8dd37>] ret_from_fork_nospec_end+0x0/0x39
[1371853.691545] [<ffffffffffffffff>] 0xffffffffffffffff
Analysis
I was able to find the ldlm_lock and ldlm_resource in the crash dump.
The released file was big (+430GB), so it took the copytool a long time to retrieve the data.
Here is the granted lock for the resource:
crash> ldlm_resource.lr_granted -o 0xffff8a8a352e3b00
struct ldlm_resource {
[ffff8a8a352e3b20] struct list_head lr_granted;
}
crash> list -H ffff8a8a352e3b20 -s ldlm_lock ldlm_lock.l_res_link
ffff8a89806018c0
struct ldlm_lock {
l_handle = {
h_link = {
next = 0xffff8a6c38bf5000,
prev = 0xffff8a6c6aa204e0
},
h_cookie = 10623779953738126729,
h_owner = 0x0,
h_ops = 0xffffffffc124e730 <lock_handle_ops>,
h_rcu = {
next = 0x0,
func = 0x0
},
h_lock = {
{
rlock = {
raw_lock = {
val = {
counter = 0
}
}
}
}
},
h_size = 0,
h_in = 1
},
l_refc = {
counter = 2
},
l_lock = {
{
rlock = {
raw_lock = {
val = {
counter = 0
}
}
}
}
},
l_resource = 0xffff8a8a352e3b00,
l_lru = {
next = 0xffff8a8980601910,
prev = 0xffff8a8980601910
},
l_res_link = {
next = 0xffff8a8a352e3b20,
prev = 0xffff8a8a352e3b20
},
{
l_tree_node = 0xffff8a898ec77a80,
l_ibits_node = 0xffff8a898ec77a80
},
l_exp_hash = {
next = 0x0,
pprev = 0x0
},
l_exp_flock_hash = {
next = 0x0,
pprev = 0x0
},
l_req_mode = LCK_EX,
l_granted_mode = LCK_EX,
l_completion_ast = 0xffffffffc1139df0 <ldlm_completion_ast>,
l_blocking_ast = 0xffffffffc16fb960 <mdt_blocking_ast>,
l_glimpse_ast = 0x0,
l_export = 0x0,
l_conn_export = 0x0,
l_remote_handle = {
cookie = 0
},
l_policy_data = {
l_extent = {
start = 8,
end = 0,
gid = 0
},
l_flock = {
start = 8,
end = 0,
owner = 0,
blocking_owner = 0,
blocking_export = 0x0,
blocking_refs = {
counter = 0
},
pid = 0
},
l_inodebits = {
bits = 8,
{
try_bits = 0,
cancel_bits = 0
}
}
},
l_flags = 18050699573067808,
l_readers = 0,
l_writers = 1,
l_waitq = {
lock = {
{
rlock = {
raw_lock = {
val = {
counter = 0
}
}
}
}
},
task_list = {
next = 0xffff8a89806019d8,
prev = 0xffff8a89806019d8
}
},
l_last_used = {
tv64 = 1371509486081554
},
l_req_extent = {
start = 0,
end = 0,
gid = 0
},
l_lvb_type = LVB_T_NONE,
l_lvb_len = 0,
l_lvb_data = 0x0,
l_ast_data = 0x0,
{
l_activity = 1634368802,
l_blast_sent = 1634368802
},
l_ost_lvb = {
lvb_size = 0,
lvb_mtime = 0,
lvb_atime = 0,
lvb_ctime = 0,
lvb_blocks = 0,
lvb_mtime_ns = 0,
lvb_atime_ns = 0,
lvb_ctime_ns = 0,
lvb_padding = 0
},
l_client_cookie = 10623779952545863461,
l_pending_chain = {
next = 0xffff8a8980601a68,
prev = 0xffff8a8980601a68
},
l_callback_timeout = 0,
l_pid = 2930,
l_bl_ast_run = 1,
l_bl_ast = {
next = 0xffff8a8980601a88,
prev = 0xffff8a8980601a88
},
l_cp_ast = {
next = 0xffff8a8980601a98,
prev = 0xffff8a8980601a98
},
l_rk_ast = {
next = 0xffff8a8980601aa8,
prev = 0xffff8a8980601aa8
},
l_blocking_lock = 0x0,
l_sl_mode = {
next = 0xffff8a8980601ac0,
prev = 0xffff8a8980601ac0
},
l_sl_policy = {
next = 0xffff8a8980601ad0,
prev = 0xffff8a8980601ad0
},
l_reference = {<No data fields>},
l_exp_list = {
next = 0xffff8a8980601ae0,
prev = 0xffff8a8980601ae0
}
}
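The granted lock is an EX lock with l_inodebits.bits = 0x8, which matches the "bits 0x8" in the dmesg above and, if I read the MDS_INODELOCK_* values correctly, is the LAYOUT bit, i.e. the EX layout lock held for the restore. A minimal standalone decoder, with the bit values copied from my reading of lustre_idl.h (to be double-checked against the exact 2.12.6 tree):

#include <stdio.h>

/* MDS inodebits values as I read them in lustre_idl.h
 * (double-check against the 2.12.6 sources). */
#define MDS_INODELOCK_LOOKUP 0x000001
#define MDS_INODELOCK_UPDATE 0x000002
#define MDS_INODELOCK_OPEN   0x000004
#define MDS_INODELOCK_LAYOUT 0x000008
#define MDS_INODELOCK_PERM   0x000010
#define MDS_INODELOCK_XATTR  0x000020

int main(void)
{
	unsigned long bits = 0x8;	/* l_inodebits.bits from the dump above */

	printf("bits 0x%lx:%s%s%s%s%s%s\n", bits,
	       bits & MDS_INODELOCK_LOOKUP ? " LOOKUP" : "",
	       bits & MDS_INODELOCK_UPDATE ? " UPDATE" : "",
	       bits & MDS_INODELOCK_OPEN   ? " OPEN"   : "",
	       bits & MDS_INODELOCK_LAYOUT ? " LAYOUT" : "",
	       bits & MDS_INODELOCK_PERM   ? " PERM"   : "",
	       bits & MDS_INODELOCK_XATTR  ? " XATTR"  : "");
	return 0;
}

This prints "bits 0x8: LAYOUT" for the value seen in the dump.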
There are a lot of EX locks in the waiting list:
crash> ldlm_resource.lr_waiting -o 0xffff8a8a352e3b00
struct ldlm_resource {
[ffff8a8a352e3b30] struct list_head lr_waiting;
}
crash> list -H ffff8a8a352e3b30 -s ldlm_lock.l_req_mode ldlm_lock.l_res_link | grep EX | wc -l
243
For each waiting lock, the creating PID (l_pid) corresponds to a hung thread. All the corresponding threads are in cdt_restore_handle_add().
2021-10-16_12:48:44 [1384176467802037] [IN] PID: 3149 TASK: ffff8a6bd74aa0e0 CPU: 15 COMMAND: "mdt01_081"
2021-10-16_12:48:44 [1384176467755519] [IN] PID: 3031 TASK: ffff8a6c61f7b150 CPU: 14 COMMAND: "mdt01_049"
2021-10-16_12:48:44 [1384176460239208] [IN] PID: 3186 TASK: ffff8a8cbcae1070 CPU: 2 COMMAND: "mdt00_107"
2021-10-16_12:48:44 [1384176460192173] [IN] PID: 2983 TASK: ffff8a8c5f215230 CPU: 0 COMMAND: "mdt00_049"
2021-10-16_12:48:44 [1384176330809933] [IN] PID: 2930 TASK: ffff8a8cbb52e2a0 CPU: 5 COMMAND: "mdt00_031"
2021-10-16_09:25:01 [1371953151468039] [IN] PID: 3259 TASK: ffff8a8cbb613150 CPU: 11 COMMAND: "mdt01_121"
2021-10-16_09:25:01 [1371953151453579] [IN] PID: 3115 TASK: ffff8a8c60a841c0 CPU: 11 COMMAND: "mdt01_072"
...
...
2021-10-16_09:25:01 [1371953111405844] [IN] PID: 3005 TASK: ffff8a8c7638a0e0 CPU: 1 COMMAND: "mdt00_056"
2021-10-16_09:25:01 [1371953111403702] [IN] PID: 2989 TASK: ffff8a6cbcf28000 CPU: 13 COMMAND: "mdt01_037"
Only 5 of the 248 MDT service threads were still able to handle client requests (3 on CPT0 and 2 on CPT1).
For now, I think there is a window where several clients can send HSM restore requests while the EX layout lock has not yet been taken by the MDT.
The first MDT thread takes the EX lock via cdt_restore_handle_add(); the other threads wait for that lock until the RESTORE is finished.
This can cause all the available service threads to hang (waiting for the first RESTORE request to finish).
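To make the pile-up concrete, here is a toy, self-contained analogy (hypothetical names, not Lustre code): each worker stands in for an mdt service thread handling a RESTORE on the same FID, and a single mutex stands in for the EX layout lock that cdt_restore_handle_add() takes synchronously. Once the first worker holds it for the duration of a long restore, every other worker is parked and the pool drains.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* Toy model: "layout_lock" stands in for the EX layout ibits lock,
 * restore_worker() for an mdt thread handling an HSM RESTORE on the
 * same FID.  All names here are made up for illustration. */
static pthread_mutex_t layout_lock = PTHREAD_MUTEX_INITIALIZER;

static void *restore_worker(void *arg)
{
	long id = (long)arg;

	printf("thread %ld: enqueue EX layout lock\n", id);
	pthread_mutex_lock(&layout_lock);	/* analogue of cdt_restore_handle_add() */
	printf("thread %ld: granted, restore in progress\n", id);
	sleep(id == 0 ? 10 : 0);		/* first holder: long copytool restore */
	pthread_mutex_unlock(&layout_lock);
	return NULL;
}

int main(void)
{
	pthread_t tid[8];			/* stands in for the mdt service thread pool */
	long i;

	for (i = 0; i < 8; i++)
		pthread_create(&tid[i], NULL, restore_worker, (void *)i);
	/* While thread 0 "restores", threads 1..7 are all parked on the
	 * lock: no worker is left to serve unrelated client requests. */
	for (i = 0; i < 8; i++)
		pthread_join(tid[i], NULL);
	return 0;
}

In the real system the restore of the +430GB file plays the role of the sleep(), which is why 243 EX waiters ended up queued behind a single granted EX lock and nearly all service threads hung.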