[LU-4309] mds_intent_policy ASSERTION(new_lock != NULL) failed: op 0x8 lockh 0x0 Created: 25/Nov/13  Updated: 10/Feb/14  Resolved: 10/Feb/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.9
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Blake Caldwell Assignee: Lai Siyao
Resolution: Fixed Votes: 0
Labels: None
Environment:

RHEL 5.9/distro IB


Attachments: Text File 0001-LU-4309-debug-debug-mds_intent_policy-assert.patch     Text File widow-mds2_lbug.log    
Severity: 2
Rank (Obsolete): 11801

 Description   

An MDT thread hit an assertion in mds_intent_policy in what otherwise appeared to be normal operation.

I'm attaching the kernel log messages after the LBUG. These are from the console. We have a crash dump from the node, but no lustre log files.

Lustre build:
Nov 18 12:46:55 widow-mds2 kernel: [ 387.597792] Lustre: Build Version: v1_8_9_WC1--CHANGED-2.6.18-348.3.1.el5.widow



 Comments   
Comment by Johann Lombardi (Inactive) [ 25/Nov/13 ]
Nov 18 11:46:10 widow-mds2 kernel: [1726316.746981] LustreError: dumping log to /tmp/lustre-log.1384793170.9088

Any chance to access /tmp/lustre-log.1384793170.9088?

Comment by Peter Jones [ 25/Nov/13 ]

Hongchao

Could you please advise on this issue?

Thanks

Peter

Comment by Blake Caldwell [ 25/Nov/13 ]

Unfortunately, no logs from /tmp/ are left (on ramdisk).

Comment by Peter Jones [ 29/Nov/13 ]

Lai

I have realized that Hongchao is on vacation so could you please handle this one instead?

Thanks

Peter

Comment by Lai Siyao [ 02/Dec/13 ]

Do you know which client's getattr caused this ASSERT? If so, can you check the backtrace of the process doing the getattr on that client?

Comment by Blake Caldwell [ 02/Dec/13 ]

Without the Lustre logs in /tmp, I won't be able to track down the client. Even if the client could be identified from the crash dump, there would still be the problem of identifying what it was doing at the time.

I see that the dmesg output is not very helpful, but that's all I have other than a crash dump.

So that we are better prepared for these cases in the future, what information can be collected on the server side besides /tmp/lustre-log.*? Collecting client debug logs is very difficult due to the number of clients. Would an ldlm_namespace_dump be helpful? If the LBUG has already occurred, are there any debug flags for /proc/sys/lnet/debug that would provide useful information? Since the offending request has already been made, does capturing +net +dlmtrace +rpctrace do any good?
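For future reference, something along these lines could be staged on the MDS ahead of time. This is only a sketch assuming the 1.8-style /proc/sys/lnet interface; the flag names are the standard libcfs debug masks and the buffer size is just an example:

# Enlarge the in-kernel debug buffer so the history survives until it is dumped
# (value in MB; only if the debug_mb tunable exists on this build).
echo 512 > /proc/sys/lnet/debug_mb

# Add lock/RPC tracing on top of whatever is already enabled, then verify.
echo "+net +dlmtrace +rpctrace" > /proc/sys/lnet/debug
cat /proc/sys/lnet/debug

# An LBUG dumps the buffer to /tmp/lustre-log.<time>.<pid> automatically,
# but it can also be pulled manually at any point:
lctl dk /tmp/lustre-debug.$(hostname).$(date +%s)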

Comment by Lai Siyao [ 03/Dec/13 ]

Hmm, there is not much we can do in this case IMO, since an MDS crash hangs the whole system and it is hard to trace back to the client. I'll review the related code further to understand this assert.

Comment by Lai Siyao [ 04/Dec/13 ]

I was not able to find the problem in the code, so I composed a debug patch that dumps the request before this assert. Could you apply it so we can get more information if this failure happens again?
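Roughly, applying it should look like the following. This is only a sketch of the usual build-from-source flow for the 1.8 tree; the directory and kernel-source path below are placeholders for your local widow build:

# Hypothetical local paths -- substitute the real source tree and kernel headers.
cd lustre-1.8.9-wc1
patch -p1 < 0001-LU-4309-debug-debug-mds_intent_policy-assert.patch
sh autogen.sh                # only needed when building from a git checkout
./configure --with-linux=/usr/src/kernels/2.6.18-348.3.1.el5.widow
make rpms                    # then roll the resulting Lustre RPMs onto the MDS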

Comment by James Nunez (Inactive) [ 04/Jan/14 ]

Blake,

Are you still seeing this assertion on your systems? If so, were you able to apply the patch to collect more information?

Thanks,
James

Comment by Blake Caldwell [ 06/Jan/14 ]

I haven't been able to apply the debug patch yet. The system has been stable, and as a result we haven't had an unscheduled outage in which to apply it. So nothing to report at this time. I will try applying the debug patch to another system that we can take down sooner.

Comment by James Nunez (Inactive) [ 06/Jan/14 ]

Blake, Thanks for the update.

Comment by Jason Hill (Inactive) [ 10/Feb/14 ]

So this filesystem is out of production (in a hold state before decommissioning). My assertion is that we should go ahead and close this issue: even if we integrated the patch and ran the storage system with it for a while, it would never see any client access and likely would not exercise the code path the patch targets. Any objections?


-Jason

Comment by Blake Caldwell [ 10/Feb/14 ]

Let's close it.

Comment by James Nunez (Inactive) [ 10/Feb/14 ]

Thank you for the update. I will close this ticket.
