[LU-6189] LustreError: (mdt_handler.c:4078:mdt_intent_reint()) ASSERTION( rc == 0 ) failed: Error occurred but lock handle is still in use, rc = -116 Created: 01/Feb/15 Updated: 04/Jan/16 Resolved: 02/Apr/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Philip B Curtis | Assignee: | Peter Jones |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 2 |
| Rank (Obsolete): | 17312 |
| Description |
|
This morning we hit this LBUG twice within a few hours, and each time it caused the MDS to crash. After the first reboot we had to abort recovery to get Lustre back. We have a crashdump from the MDS.
Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7271.805235] LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 375s: evicting client at 4966@gni100 ns: mdt- |
| Comments |
| Comment by Peter Jones [ 01/Feb/15 ] |
|
Philip You have entered this ticket as a Severity 1 which means that the filesystem is down. Is this the case? From the description it sounds like service has been restored but you want to treat this as a high priority to prevent further such crashes. Peter |
| Comment by Philip B Curtis [ 01/Feb/15 ] |
|
Peter No, the first time this occurred Lustre was restarted. I haven't brought Lustre back up this time, since this crash followed so closely after the first. I wanted to get Intel involved before I attempted another start. Philip |
| Comment by Peter Jones [ 01/Feb/15 ] |
|
OK. I think that it is best to start uploading the crash dump to our FTP site in case it is useful. Do you have the instructions on how to do that? Also, is the code being run exactly in sync with the tip of your b2_5 branch on GitHub? https://github.com/ORNL-TechInt/lustre/commits/b2_5 |
| Comment by Alex Zhuravlev [ 01/Feb/15 ] |
|
I'm quite sure this is fixed with http://review.whamcloud.com/#/c/12828/ |
| Comment by Peter Jones [ 01/Feb/15 ] |
|
Philip This is a patch that needs to be applied to the MDS only. Is there anything else that you need from us at this point before attempting to bring the filesystem back up? Peter |
| Comment by Philip B Curtis [ 01/Feb/15 ] |
|
No, I do not have instructions for the ftp site. That is correct, we are at the tip of the code there. |
| Comment by James A Simmons [ 01/Feb/15 ] |
|
We are running what is in the ORNL GitHub. We attempted an upgrade but it failed after a few days. I generally don't upgrade the ORNL branch for a few weeks after an upgrade, just in case something goes wrong. |
| Comment by Philip B Curtis [ 01/Feb/15 ] |
|
Nope. I will get you those crashdumps once I have those instructions and I will see about getting this patched version in place and we will go from there. |
| Comment by Philip B Curtis [ 01/Feb/15 ] |
|
We have rebooted into the new RPMs with the patch. Lustre has started and I will continue to monitor. Thank you for your help. Philip |
| Comment by Peter Jones [ 01/Feb/15 ] |
|
Good news. Thanks for the update. I will drop the severity to S2 and will continue to monitor in case there are any further complications. |
| Comment by Peter Jones [ 02/Apr/15 ] |
|
As per ORNL ok to close as |