[LU-6189] LustreError: (mdt_handler.c:4078:mdt_intent_reint()) ASSERTION( rc == 0 ) failed: Error occurred but lock handle is still in use, rc = -116 Created: 01/Feb/15  Updated: 04/Jan/16  Resolved: 02/Apr/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.3
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Philip B Curtis Assignee: Peter Jones
Resolution: Duplicate Votes: 0
Labels: None

Issue Links:
Related
is related to LU-5934 mdt_intent_reint()) ASSERTION( rc == ... Resolved
Severity: 2
Rank (Obsolete): 17312

 Description   

This morning we hit this LBUG twice within a few hours of each other, and it crashed the MDS both times. After the first reboot we had to abort recovery to get Lustre back. We have a crash dump from the MDS.

Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7271.805235] LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 375s: evicting client at 4966@gni100 ns: mdt-atlas1-MDT0000_UUID lock: ffff881ec6e16c80/0xfc6e8aed747d1af2 lrc: 4/0,0 mode: CR/CR res: [0x2001a597a:0x85:0x0].0 bits 0x2 rrc: 4 type: IBT flags: 0x60200000000020 nid: 4966@gni100 remote: 0x20ee476ee499c158 expref: 132 pid: 16827 timeout: 4301930544 lvb_type: 0
Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7271.858358] LustreError: 16827:0:(mdt_handler.c:4078:mdt_intent_reint()) ASSERTION( rc == 0 ) failed: Error occurred but lock handle is still in use, rc = -116
Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7271.874660] LustreError: 16827:0:(mdt_handler.c:4078:mdt_intent_reint()) LBUG
Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7271.882757] Pid: 16827, comm: mdt00_224
Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7271.887151]
Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7271.887152] Call Trace:
Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7271.891770] [<ffffffffa0407895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7271.899670] [<ffffffffa0407e97>] lbug_with_loc+0x47/0xb0 [libcfs]
Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7271.906710] [<ffffffffa0d4379a>] mdt_intent_reint+0x51a/0x520 [mdt]
Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7271.913933] [<ffffffffa0d40c4e>] mdt_intent_policy+0x3ae/0x770 [mdt]
Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7271.921281] [<ffffffffa06de2e5>] ldlm_lock_enqueue+0x135/0x980 [ptlrpc]
Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7271.928910] [<ffffffffa0707d0b>] ldlm_handle_enqueue0+0x51b/0x10c0 [ptlrpc]
Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7271.936903] [<ffffffff81069f75>] ? enqueue_entity+0x125/0x450
Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7271.943544] [<ffffffffa0d41116>] mdt_enqueue+0x46/0xe0 [mdt]
Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7271.950094] [<ffffffffa0d4602a>] mdt_handle_common+0x52a/0x1470 [mdt]
Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7271.957515] [<ffffffffa0d833e5>] mds_regular_handle+0x15/0x20 [mdt]
Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7271.964770] [<ffffffffa0737fe5>] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7271.973547] [<ffffffffa04084ce>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7271.980677] [<ffffffffa04193cf>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7271.988407] [<ffffffffa072f6c9>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7271.996116] [<ffffffff810546b9>] ? __wake_up_common+0x59/0x90
Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7272.002774] [<ffffffffa073934d>] ptlrpc_main+0xaed/0x1760 [ptlrpc]
Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7272.009920] [<ffffffffa0738860>] ? ptlrpc_main+0x0/0x1760 [ptlrpc]
Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7272.017040] [<ffffffff8109ab56>] kthread+0x96/0xa0
Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7272.022607] [<ffffffff8100c20a>] child_rip+0xa/0x20
Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7272.028267] [<ffffffff8109aac0>] ? kthread+0x0/0xa0
Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7272.033930] [<ffffffff8100c200>] ? child_rip+0x0/0x20
Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7272.039782]



 Comments   
Comment by Peter Jones [ 01/Feb/15 ]

Philip

You have entered this ticket as a Severity 1 which means that the filesystem is down. Is this the case? From the description it sounds like service has been restored but you want to treat this as a high priority to prevent further such crashes.

Peter

Comment by Philip B Curtis [ 01/Feb/15 ]

Peter

No, the first time this occurred Lustre was restarted. I haven't brought Lustre back up this time, since this followed so closely on the first occurrence. I wanted to get Intel involved before I attempted another start.

Philip

Comment by Peter Jones [ 01/Feb/15 ]

ok. I think that it is best to start uploading the crash dump to our ftp site in case that is useful. Do you have the instructions on how to do that? Also, is the code being run exactly in sync with the tip of your b2_5 branch on GitHub? https://github.com/ORNL-TechInt/lustre/commits/b2_5

Comment by Alex Zhuravlev [ 01/Feb/15 ]

I'm quite sure this is fixed with http://review.whamcloud.com/#/c/12828/

Comment by Peter Jones [ 01/Feb/15 ]

Philip

This is a patch that needs to be applied to the MDS only. Is there anything else that you need from us at this point before attempting to bring the filesystem back up?

Peter

Comment by Philip B Curtis [ 01/Feb/15 ]

No, I do not have instructions for the ftp site. That is correct, we are at the tip of the code there.

Comment by James A Simmons [ 01/Feb/15 ]

We are running what is in the ORNL GitHub. We attempted an upgrade but it failed after a few days. I generally don't upgrade the ORNL branch for a few weeks after an upgrade, just in case something goes wrong.

Comment by Philip B Curtis [ 01/Feb/15 ]

Nope. I will get you those crashdumps once I have those instructions and I will see about getting this patched version in place and we will go from there.

Comment by Philip B Curtis [ 01/Feb/15 ]

We have rebooted into the new RPMs with the patch. Lustre has started and I will continue to monitor. Thank you for your help.

Philip

Comment by Peter Jones [ 01/Feb/15 ]

Good news. Thanks for the update. I will drop the severity to S2 and will continue to monitor in case there are any further complications.

Comment by Peter Jones [ 02/Apr/15 ]

As per ORNL, ok to close as the fix from LU-5934 has landed.

Generated at Sat Feb 10 01:58:02 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.