[LU-500] MDS threads hang ldlm_expired_completion_wait+ Created: 12/Jul/11 Updated: 29/Mar/12 Resolved: 29/Mar/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 1.8.6 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Steven Woods | Assignee: | Oleg Drokin |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Environment: |
CentOS 5.3 |
||
| Attachments: |
|
| Severity: | 3 |
| Bugzilla ID: | 24450 |
| Rank (Obsolete): | 6583 |
| Description |
|
At a key customer site we were, and still are, experiencing MDS thread hangs. They were first seen under 1.8.4; when the MDS dumped the threads, the only way to recover was to reboot the MDS. The site upgraded to 1.8.6, which includes an at_min patch from bug 23352 that was thought might help. However, they are still seeing the thread hangs. They can now usually recover without an MDS reboot, but it remains a serious problem. Call Trace: |
| Comments |
| Comment by Peter Jones [ 12/Jul/11 ] |
|
Steve, could you please clarify which site(s) is/are affected? I am finding the history in the Oracle bz ticket confusing. Thanks, Peter |
| Comment by Steven Woods [ 12/Jul/11 ] |
|
Sorry - ORNL spider |
| Comment by Peter Jones [ 12/Jul/11 ] |
|
Oleg Could you please look at this one? Thanks Peter |
| Comment by Peter Jones [ 14/Jul/11 ] |
|
Oleg, You mentioned that you were going to talk to ORNL about this ticket. Could you please provide a status update? Thanks Peter |
| Comment by Oleg Drokin [ 14/Jul/11 ] |
|
So how do you get out of the thread hangs now without a reboot? Can you please share the logs from 1.8.6? |
| Comment by Steven Woods [ 14/Jul/11 ] |
|
Attached 1.8.6 traces |
| Comment by Cory Spitz [ 19/Jul/11 ] |
|
Oleg, Steve can answer questions about the most recent dumps. We've seen similar behavior on internal Cray systems. In those cases an unresponsive client does eventually get evicted and the MDS recovers, but the service threads that are waiting for the blocking callbacks to complete never seem to unwedge. Also in these cases, the client initially fails for something completely unrelated, e.g. a kernel null dereference. I am not sure that these are the same symptoms witnessed in the Spider dumps. |
| Comment by Steven Woods [ 19/Jul/11 ] |
|
AFAIK, before the patch from 23352, things never cleared up until a reboot. After that they can keep running, but I do not believe the thread ever recovers. |
| Comment by Oleg Drokin [ 24/Jul/11 ] |
|
If the thread never recovers, I am interested in seeing a backtrace dump from such a thread. I am not sure how rpctrace is going to help, though. |
| Comment by Lukasz Flis [ 25/Jul/11 ] |
|
It seems that the issue is common to 1.8.6 and 2.1. I found our MDS (2.1) hanging this morning with plenty of the same errors that Steven reported. In order to recover from it we remounted the MDT resource, and recovery began. Unfortunately we have never seen We can provide Lustre log files and kernel stack traces if needed. |
| Comment by Oleg Drokin [ 02/Aug/11 ] |
|
Lukasz, can you share the exact backtrace? Also, how long were you waiting to relieve the situation? Additionally, if you are having constant recovery problems, can you please file a separate ticket for it? Thanks. |
| Comment by Cory Spitz [ 02/Sep/11 ] |
|
I believe that the 1.8.6 instances of this bug are caused by incorrect lock ordering introduced by the patch from bug 24437 and are being pursued under |
| Comment by James A Simmons [ 06/Mar/12 ] |
|
Can this bug be closed now? |
| Comment by Cory Spitz [ 23/Mar/12 ] |
|
On 2/Sep/11 I mentioned |
| Comment by James A Simmons [ 26/Mar/12 ] |
|
Which patches from bz 24450? |
| Comment by Cory Spitz [ 26/Mar/12 ] |
|
> Which patches from bz 24450? |
| Comment by James A Simmons [ 29/Mar/12 ] |
|
|
| Comment by Cory Spitz [ 29/Mar/12 ] |
|
James, that sounds right to me. This should now be closed as a dup of |
| Comment by Peter Jones [ 29/Mar/12 ] |
|
duplicate of |