LU-18990: list_del corruption in tgt_free_reply_data

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Affects Version: Lustre 2.15.5
    Description

      Hi,

      We've seen a few occurrences of this list_del corruption issue on the MDS, which sometimes causes the MDS to crash. The stack trace looks like this:

      [   61.794583] list_del corruption. prev->next should be ffff000475622e80, but was ffff000475622e00
      [   61.795928] WARNING: CPU: 3 PID: 6749 at lib/list_debug.c:59 
      [   61.822503] Call trace:
      [   61.822895]  __list_del_entry_valid+0xd4/0x100
      [   61.823636]  tgt_free_reply_data+0x78/0x344 [ptlrpc]
      [   61.824451]  tgt_release_reply_data+0x90/0x218 [ptlrpc]
      [   61.825304]  tgt_handle_received_xid+0x84/0xe0 [ptlrpc]
      [   61.826161]  process_req_last_xid+0x190/0x4b0 [ptlrpc]
      [   61.826999]  tgt_request_handle+0x1e0/0xaac [ptlrpc]
      [   61.827811]  ptlrpc_server_handle_request.isra.0+0x464/0xee8 [ptlrpc]
      [   61.828841]  ptlrpc_main+0xd24/0x15bc [ptlrpc]
      [   61.829536]  kthread+0x118/0x120
      [   61.830040] ---[ end trace 19946910bdb9cb2b ]---
      

      The trd_list list is protected by the ted_lcd_lock mutex. There is an LASSERT right before the list_del call in tgt_free_reply_data that checks that ted_lcd_lock is locked, but that only tells us the mutex is held by someone - it does not prove that the calling thread is the one holding it.
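
      To make the concern concrete, here is a hypothetical kernel-style sketch (free_reply_data_sketch is made up; the real function is tgt_free_reply_data): mutex_is_locked() only reports that some thread holds the mutex, so the assertion also passes for a caller that never took the lock while another thread happens to hold it.

      /* hypothetical sketch, not the real tgt_free_reply_data */
      static void free_reply_data_sketch(struct tg_export_data *ted,
      				   struct tgt_reply_data *trd)
      {
      	/* passes as long as *some* thread holds ted_lcd_lock; it may be
      	 * another thread, in which case this list_del() races with that
      	 * thread's own list updates and corrupts trd_list */
      	LASSERT(mutex_is_locked(&ted->ted_lcd_lock));
      	list_del(&trd->trd_list);
      }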

      Scanning the code, every spot that modifies trd_list appears to be protected by this ted_lcd_lock mutex. However, in the process_req_last_xid function, the locking/unlocking of ted_lcd_lock hinges on the bool need_lock, while the decision to call tgt_handle_received_xid repeats the tgt_is_multimodrpcs_client(exp) check separately (see the note after the listing below).

      static int process_req_last_xid(struct ptlrpc_request *req)
      {
      	__u64	last_xid;
      	int rc = 0;
      	struct obd_export *exp = req->rq_export;
      	struct tg_export_data *ted = &exp->exp_target_data;
      	bool need_lock = tgt_is_multimodrpcs_client(exp);
      	ENTRY;
      
      	if (need_lock)
      		mutex_lock(&ted->ted_lcd_lock);
      
      [ truncated some logic here ]
      
      	/* try to release in-memory reply data */
      	if (tgt_is_multimodrpcs_client(exp)) {
      		tgt_handle_received_xid(exp, last_xid);
      		rc = tgt_handle_tag(req);
      	}
      
      out:
      	if (need_lock)
      		mutex_unlock(&ted->ted_lcd_lock);
      
      	RETURN(rc);
      }
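
      For reference, tgt_is_multimodrpcs_client() is, as far as I can tell, just a connect-flag test along the lines of the paraphrase below (check lustre/include/lu_target.h for the exact definition), so for need_lock and the later check to disagree, exp_connect_flags(exp) would have to lose OBD_CONNECT_MULTIMODRPCS between the two evaluations:

      /* paraphrased from memory, not copied from the tree */
      static inline bool tgt_is_multimodrpcs_client(struct obd_export *exp)
      {
      	return exp_connect_flags(exp) & OBD_CONNECT_MULTIMODRPCS;
      }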
      

      Because this method doesn't lock the export itself, there could be a potential inconsistency where one thread calls tgt_handle_received_xid without holding ted_lcd_lock while another thread is already deleting from trd_list. It's probably better to use the need_lock boolean consistently, or to fork this method into something like process_req_last_xid_lock that drops these if-statements.
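
      As an illustration of the first option, the untested sketch below changes only the check that guards the reply-data release in the function quoted above; reusing need_lock means the lock decision and the trd_list access can no longer disagree:

      	/* untested sketch: reuse the need_lock value computed before the
      	 * mutex was (possibly) taken */
      	/* try to release in-memory reply data */
      	if (need_lock) {	/* was: tgt_is_multimodrpcs_client(exp) */
      		tgt_handle_received_xid(exp, last_xid);
      		rc = tgt_handle_tag(req);
      	}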

      It would be helpful to get others' opinions on whether this is the right code path to be looking at, or whether I've overlooked something here. I can see there are a lot of places in the code that check these connect flags, so I'm not sure whether it makes sense that the flags could get unassigned and cause this racy behavior. Thanks!

      People

        Assignee: wc-triage (WC Triage)
        Reporter: kthoang (Kaitlin Hoang)