[LU-5630] mdt_getattr_name_lock()) ASSERTION( lock != NULL ) Created: 16/Sep/14  Updated: 01/Feb/22  Resolved: 14/Dec/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.2
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Christopher Morrone Assignee: Oleg Drokin
Resolution: Cannot Reproduce Votes: 0
Labels: llnl
Environment:

Lustre 2.4.2-14chaos (see github.com/chaos/lustre)


Issue Links:
Related
is related to LU-5579 MDS crashed by "mdt_check_resent_lock... Resolved
Severity: 3
Rank (Obsolete): 15745

 Description   
2014-09-11 21:10:30 LustreError: 0:0:(ldlm_lockd.c:402:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at 192.168.120.199@o2ib7  ns: mdt-lsd-MDT0000_UUID lock: ffff880321a4a480/0x6bd4680b789ee41f lrc: 4/0,0 mode: PR/PR res: [0x2000112f3:0xf:0x0].0 bits 0x13 rrc: 4 type: IBT flags: 0x200000000020 nid: 192.168.120.199@o2ib7 remote: 0xf350c14aff003b28 expref: 30 pid: 17248 timeout: 6838410913 lvb_type: 0 used 0
2014-09-11 21:10:30 LustreError: 15075:0:(mdt_handler.c:1423:mdt_getattr_name_lock()) ASSERTION( lock != NULL ) failed: Invalid lock handle 0x6bd4680b789ee41f
2014-09-11 21:10:30 LustreError: 15075:0:(mdt_handler.c:1423:mdt_getattr_name_lock()) LBUG
2014-09-11 21:10:30 Pid: 15075, comm: mdt00_069

The backtrace is:

PID: 15075  TASK: ffff880d7001f540  CPU: 2   COMMAND: "mdt00_069"
 #0 [ffff880d70021938] machine_kexec+0x18b at ffffffff810391ab
 #1 [ffff880d70021998] crash_kexec+0x72 at ffffffff810c5ee2
 #2 [ffff880d70021a68] panic+0xae at ffffffff8152b247
 #3 [ffff880d70021ae8] lbug_with_loc+0x9b at ffffffffa0601f4b [libcfs]
 #4 [ffff880d70021b08] mdt_getattr_name_lock+0x18d0 at ffffffffa0e99900 [mdt]
 #5 [ffff880d70021bc8] mdt_intent_getattr+0x29d at ffffffffa0e99c5d [mdt]
 #6 [ffff880d70021c28] mdt_intent_policy+0x39e at ffffffffa0e86fde [mdt]
 #7 [ffff880d70021c68] ldlm_lock_enqueue+0x361 at ffffffffa08b8911 [ptlrpc]
 #8 [ffff880d70021cc8] ldlm_handle_enqueue0+0x4ef at ffffffffa08e1a7f [ptlrpc]
 #9 [ffff880d70021d38] mdt_enqueue+0x46 at ffffffffa0e87466 [mdt]
#10 [ffff880d70021d58] mdt_handle_common+0x647 at ffffffffa0e8c0d7 [mdt]
#11 [ffff880d70021da8] mds_regular_handle+0x15 at ffffffffa0ec7c75 [mdt]
#12 [ffff880d70021db8] ptlrpc_server_handle_request+0x398 at ffffffffa0912188 [ptlrpc]
#13 [ffff880d70021eb8] ptlrpc_main+0xace at ffffffffa091351e [ptlrpc]
#14 [ffff880d70021f48] child_rip+0xa at ffffffff8100c24a

This looks like the same assertion as LU-5579, but that was presumably hit on Lustre 2.6 or later.



 Comments   
Comment by Liang Zhen (Inactive) [ 16/Sep/14 ]

I think this is an issue we also hit on master; Vitaly has already posted a patch on LU-5579.

Comment by Peter Jones [ 16/Sep/14 ]

Oleg

Can you confirm whether this is a duplicate of LU-5579?

Thanks

Peter

Comment by Oleg Drokin [ 16/Sep/14 ]

Yes, I think the bug is the same.
A quickfix for b2_4 would be to just replace the assert with return -ESTALE;

This is not the final solution. I am starting to have my doubts that we should return ESTALE on resend, as the client is not really at fault here, and reprocessing the entire request might be a better idea.
I am going to discuss this idea with Vitaly, but at least this will fix the crash for now.

Comment by Christopher Morrone [ 16/Sep/14 ]

How will the client behave when it gets ESTALE?

Comment by Oleg Drokin [ 17/Sep/14 ]

I suspect ESTALE would propagate all the way up to userspace.

On the other hand, if it's due to eviction of that same client, it does not matter because of the EIO and other errors that client will get anyway.
In the case of the race Vitaly described, where a resend happens in parallel with delayed delivery of the RPC for which the resend happened, ESTALE is just going to be dropped because the client will not be waiting for this duplicate reply.

Generated at Sat Feb 10 01:53:09 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.