Details
- Type: Bug
- Resolution: Unresolved
- Priority: Minor
- Fix Version/s: None
- Affects Version/s: Lustre 2.15.5
- Labels: None
- Severity: 3
Description
Hi,
We've seen a few occurrences of this list_del corruption issue on the MDS, which sometimes causes the MDS to crash. The stack trace looks like this:
[ 61.794583] list_del corruption. prev->next should be ffff000475622e80, but was ffff000475622e00
[ 61.795928] WARNING: CPU: 3 PID: 6749 at lib/list_debug.c:59
[ 61.822503] Call trace:
[ 61.822895]  __list_del_entry_valid+0xd4/0x100
[ 61.823636]  tgt_free_reply_data+0x78/0x344 [ptlrpc]
[ 61.824451]  tgt_release_reply_data+0x90/0x218 [ptlrpc]
[ 61.825304]  tgt_handle_received_xid+0x84/0xe0 [ptlrpc]
[ 61.826161]  process_req_last_xid+0x190/0x4b0 [ptlrpc]
[ 61.826999]  tgt_request_handle+0x1e0/0xaac [ptlrpc]
[ 61.827811]  ptlrpc_server_handle_request.isra.0+0x464/0xee8 [ptlrpc]
[ 61.828841]  ptlrpc_main+0xd24/0x15bc [ptlrpc]
[ 61.829536]  kthread+0x118/0x120
[ 61.830040] ---[ end trace 19946910bdb9cb2b ]---
The trd_list list is protected by the ted_lcd_lock mutex. There is an LASSERT right before the list_del call in tgt_free_reply_data, but it only checks that ted_lcd_lock is locked, not that the calling thread is the one holding it.
Scanning the code, all the spots where trd_list is modified appear to be protected by the ted_lcd_lock mutex. However, in process_req_last_xid(), taking and releasing ted_lcd_lock hinges on the bool need_lock, while the call to tgt_handle_received_xid() is gated by a second, independent tgt_is_multimodrpcs_client(exp) check:
static int process_req_last_xid(struct ptlrpc_request *req)
{
	__u64 last_xid;
	int rc = 0;
	struct obd_export *exp = req->rq_export;
	struct tg_export_data *ted = &exp->exp_target_data;
	bool need_lock = tgt_is_multimodrpcs_client(exp);
	ENTRY;

	if (need_lock)
		mutex_lock(&ted->ted_lcd_lock);

	[ truncated some logic here ]

	/* try to release in-memory reply data */
	if (tgt_is_multimodrpcs_client(exp)) {
		tgt_handle_received_xid(exp, last_xid);
		rc = tgt_handle_tag(req);
	}

out:
	if (need_lock)
		mutex_unlock(&ted->ted_lcd_lock);

	RETURN(rc);
}
Because this method doesn't lock the export itself, there is a potential inconsistency: if the two checks disagree, a thread can skip taking ted_lcd_lock (need_lock was false) yet still call tgt_handle_received_xid (the later check returned true) while another thread is deleting from trd_list. It would probably be better to use the need_lock boolean consistently, or to fork the method into something like process_req_last_xid_lock without these if-statements.
It would be helpful to get others' opinions on whether this is the right code path to be looking at, or whether I've overlooked something here. There are a lot of places in the code that check these connect flags, so I'm not sure whether it makes sense that the flags could get unassigned and cause this racy behavior. Thanks!