[LU-308] Hang and eventual ASSERT after mdc_enqueue()) ldlm_cli_enqueue: -4 Created: 11/May/11 Updated: 28/Jun/11 Resolved: 13/Jun/11 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 1.8.6 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Christopher Morrone | Assignee: | Lai Siyao |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
RHEL5.5ish (CHAOS4.4-2), lustre 1.8.5.0-3chaos |
||
| Severity: | 3 |
| Rank (Obsolete): | 10319 |
| Description |
|
On a production lustre client node, we hit an ASSERT. The first sign of trouble on the console is this:
2011-05-11 08:55:44 LustreError: ... (mdc_locks.c:648:mdc_enqueue())
I believe that is under an emacs process. Ten seconds later we start getting "soft lockup" "stuck for 10s" warnings
:mdc:mdc_enter_request
Later a different process throws these errors:
2011-05-11 09:06:07 Lustre: ... Request mdc_close sent 106s ago has failed due
And then three seconds later the original stuck thread does:
2011-05-11 09:06:10 ldlm_lock.c:189:ldlm_lock_remove_from_lru_nolock ASSERT(ns->ns_nr_unused > 0) failed
Backtrace looks like:
ldlm_lock_remove_from_lru_nolock
Apologies for any typos. That all had to be hand copied.
Since this all appears to have started with an EINTR in mdc_enqueue(), it may be that this bug is related:
https://bugzilla.lustre.org/show_bug.cgi?id=18213
We are running 1.8.5+, so we should have the fix that was applied to 1.8.5 in bug 18213. |
| Comments |
| Comment by Peter Jones [ 12/May/11 ] |
|
Lai, could you please look into this one? Thanks, Peter |
| Comment by Lai Siyao [ 16/May/11 ] |
|
Johann, could you take a look at this? I can't find a use case that will trigger the ldlm_lock_remove_from_lru_nolock ASSERT(ns->ns_nr_unused > 0). |
| Comment by Johann Lombardi (Inactive) [ 17/May/11 ] |
|
The LASSERT might just be a side effect of the initial soft lockup. |
| Comment by Lai Siyao [ 17/May/11 ] |
|
Thank you, Johann. This looks reasonable. |
| Comment by Christopher Morrone [ 17/May/11 ] |
|
Correct, we do not have that patch. And our code is not "test" code; we saw this in production. |
| Comment by Peter Jones [ 18/May/11 ] |
|
Chris, will you be trying this patch in production, or are some additional steps required first? Peter |
| Comment by Christopher Morrone [ 19/May/11 ] |
|
I'll pull it into our 1.8.5-llnl branch. As for when it goes in production...our local testing infrastructure has almost completely moved to RHEL6 and lustre 2.1. We have a 1.8 server cluster left over for testing, but no 1.8 clients. The first window we have to get 1.8 clients and test a release is probably mid June, with a target for installation in late June if there are no surprises. |
| Comment by Peter Jones [ 13/Jun/11 ] |
|
Let's close this ticket for now and reopen if the issue reoccurs with the patch applied.