[LU-308] Hang and eventual ASSERT after mdc_enqueue()) ldlm_cli_enqueue: -4 Created: 11/May/11  Updated: 28/Jun/11  Resolved: 13/Jun/11

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.6
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Christopher Morrone Assignee: Lai Siyao
Resolution: Fixed Votes: 0
Labels: None
Environment:

RHEL5.5ish (CHAOS4.4-2), lustre 1.8.5.0-3chaos


Severity: 3
Rank (Obsolete): 10319

 Description   

On a production Lustre client node, we hit an ASSERT. The first sign of trouble on the console is this:

2011-05-11 08:55:44 LustreError: ... (mdc_locks.c:648:mdc_enqueue()) ldlm_cli_enqueue: -4

I believe that is under an emacs process.

Ten seconds later we start getting "soft lockup" / "stuck for 10s" warnings
about the same process. The messages pop up every 10s until we finally hit an
assertion. Backtrace looks like:

:mdc:mdc_enter_request
:ptlrpc:ldlm_lock_addref_internal_nolock
:mdc:mdc_enqueue
dequeue_task
thread_return
:ptlrpc:ldlm_lock_add_to_lru_nolock
:mdc:mdc_intent_lock
:ptlrpc:ldlm_lock_decref
:mdc:mdc_set_lock_data
:lustre:ll_mdc_blocking_ast
:ptlrpc:ldlm_completion_ast
:lustre:ll_prepare_mdc_op_data
:lustre:ll_lookup_it
:lustre:ll_mdc_blocking_ast
:lov:lov_fini_enqueue_set
:lustre:ll_lookup_nd
list_add
d_alloc
do_lookup
__link_path_walk
link_path_walk
do_path_lookup
__user_walk_fd
vfs_stat_fd
sys_rt_sigreturn
sys_rt_sigreturn
sys_newstat
sys_setitimer
stub_rt_sigreturn
system_call

Later a different process throws these errors:

2011-05-11 09:06:07 Lustre: ... Request mdc_close sent 106s ago has failed due to network error (limit 106s)
2011-05-11 09:06:07 LustreError: ... ll_close_inode_openhandle()) inode X mdc close failed: -4
2011-05-11 09:06:07 Skipped 4 previous messages

And then three seconds later the original stuck thread does:

2011-05-11 09:06:10 ldlm_lock.c:189:ldlm_lock_remove_from_lru_nolock ASSERT(ns->ns_nr_unused > 0) failed

Backtrace looks like:

ldlm_lock_remove_from_lru_nolock
ldlm_lock_remove_from_lru
ldlm_lock_addref_internal_nolock
search_queue
ldlm_lock_match
ldlm_resource_get
mdc_revalidate_lock
ldlm_lock_addref_internal_nolock
mdc_intent_lock
ll_i2gids
ll_prepare_mdc_op_data
__ll_inode_revalidate_it
ll_mdc_blocking_ast
ll_inode_permission
dput
permission
vfs_permission
__link_path_walk
link_path_walk
do_path_lookup
__path_lookup_intent_open
path_lookup_open
open_namei
do_filp_open
get_unused_fd
do_sys_open
sys_open

Apologies for any typos. That all had to be hand copied.
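
For reference, ns_nr_unused is the namespace's count of locks sitting on its unused-lock LRU, so the assertion is saying that a lock is being removed from an LRU that is accounted as empty. Below is a minimal userspace sketch of that bookkeeping, with simplified names and types rather than the actual ldlm code; the point is that the check can only fire if the LRU list and the counter have somehow fallen out of step.

/*
 * Minimal userspace sketch (simplified names/types, not the Lustre code)
 * of the invariant behind ASSERT(ns->ns_nr_unused > 0): every lock on the
 * namespace LRU contributes exactly one to ns_nr_unused, so the counter
 * must be positive whenever a removal is attempted.  If the list and the
 * counter ever get out of step, the check fires.
 */
#include <assert.h>
#include <stdio.h>

struct lock_entry {
        struct lock_entry *prev, *next; /* LRU list hooks */
        int on_lru;
};

struct namespace {
        struct lock_entry lru;          /* circular list head */
        int nr_unused;                  /* must always equal the LRU length */
};

static void ns_init(struct namespace *ns)
{
        ns->lru.prev = ns->lru.next = &ns->lru;
        ns->nr_unused = 0;
}

static void lru_add(struct namespace *ns, struct lock_entry *lk)
{
        lk->next = &ns->lru;
        lk->prev = ns->lru.prev;
        ns->lru.prev->next = lk;
        ns->lru.prev = lk;
        lk->on_lru = 1;
        ns->nr_unused++;                /* keep the counter in step */
}

static void lru_remove(struct namespace *ns, struct lock_entry *lk)
{
        if (!lk->on_lru)
                return;                 /* not on the LRU, counter untouched */
        assert(ns->nr_unused > 0);      /* the invariant the LASSERT checks */
        lk->prev->next = lk->next;
        lk->next->prev = lk->prev;
        lk->on_lru = 0;
        ns->nr_unused--;
}

int main(void)
{
        struct namespace ns;
        struct lock_entry a = { 0 }, b = { 0 };

        ns_init(&ns);
        lru_add(&ns, &a);
        lru_add(&ns, &b);
        lru_remove(&ns, &a);
        lru_remove(&ns, &b);
        printf("nr_unused = %d\n", ns.nr_unused);       /* prints 0 */
        return 0;
}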

Since this all appears to have started with an EINTR (-4) in mdc_enqueue(), these may be related:

https://bugzilla.lustre.org/show_bug.cgi?id=18213
http://jira.whamcloud.com/browse/LU-234

We are running 1.8.5+, so we should have the fix that was applied to 1.8.5 in bug 18213.



 Comments   
Comment by Peter Jones [ 12/May/11 ]

Lai

Could you please look into this one?

Thanks

Peter

Comment by Lai Siyao [ 16/May/11 ]

Johann, could you take a look at this? I can't find a use case that would trigger the ldlm_lock_remove_from_lru_nolock ASSERT(ns->ns_nr_unused > 0).

Comment by Johann Lombardi (Inactive) [ 17/May/11 ]

The LASSERT might just be a side effect of the initial soft lockup.
We actually found & fixed a problem in 1.8.5 with mdc_enter_request().
See https://bugzilla.lustre.org/show_bug.cgi?id=24508#c1
A patch was landed to Whamcloud's b1_8 as part of LU-286; see http://review.whamcloud.com/506
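
To illustrate the failure mode for anyone following along: below is a rough userspace sketch, not the actual mdc_enter_request()/mdc_exit_request() code and not the LU-286 patch, with all names invented. Roughly, the client bounds the number of MDC RPCs in flight and queues excess callers on a FIFO, and a finishing request can hand its slot straight to the oldest waiter. If a waiter that has already been granted a slot is interrupted (EINTR, the -4 in the logs above) and bails out without releasing it, that slot is leaked; leak enough of them and every later caller waits forever, which would show up as the kind of soft lockup reported in the description.

/*
 * Illustrative userspace sketch only; not the real mdc code and not the
 * LU-286 patch.  It mimics the general shape of a bounded "RPCs in
 * flight" limit with a FIFO of waiters, where a finishing request hands
 * its slot straight to the oldest waiter.  All names are made up.
 */
#include <stdbool.h>
#include <stdio.h>

#define MAX_IN_FLIGHT 8
#define MAX_WAITERS   16

struct waiter {
        bool queued;    /* still sitting on the waiter FIFO */
        bool granted;   /* a releaser handed us its slot */
};

static int in_flight;
static struct waiter *fifo[MAX_WAITERS];
static int fifo_len;

/* Take a free slot, or queue up and report that the caller must wait. */
static bool enter_request(struct waiter *w)
{
        if (in_flight < MAX_IN_FLIGHT) {
                in_flight++;
                return true;
        }
        w->queued = true;
        w->granted = false;
        fifo[fifo_len++] = w;
        return false;
}

/* Release a slot; if someone is waiting, hand the slot straight over. */
static void exit_request(void)
{
        if (fifo_len > 0) {
                struct waiter *w = fifo[0];

                for (int i = 1; i < fifo_len; i++)
                        fifo[i - 1] = fifo[i];
                fifo_len--;
                w->queued = false;
                w->granted = true;      /* the slot stays accounted for */
        } else {
                in_flight--;
        }
}

int main(void)
{
        struct waiter w[MAX_IN_FLIGHT + 1] = { { false, false } };

        for (int i = 0; i < MAX_IN_FLIGHT; i++) /* saturate the limit */
                enter_request(&w[i]);
        enter_request(&w[MAX_IN_FLIGHT]);       /* one more caller queues */
        exit_request();                         /* its slot is handed over */

        /*
         * Buggy path: the queued caller is interrupted (-EINTR, the -4 in
         * the logs) and bails out here without noticing that .granted is
         * set, so it never calls exit_request() for the slot it was given.
         */

        for (int i = 1; i < MAX_IN_FLIGHT; i++) /* everyone else finishes */
                exit_request();

        /* Nothing is running any more, yet one slot is still "in flight". */
        printf("in_flight = %d with no requests running\n", in_flight);
        return 0;
}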

Comment by Lai Siyao [ 17/May/11 ]

Thank you, Johann. This looks reasonable.
Chris, could you verify that the patch for LU-286 is not included in your test code?

Comment by Christopher Morrone [ 17/May/11 ]

Correct, we do not have that patch.

And our code is not "test" code; we saw this in production.

Comment by Peter Jones [ 18/May/11 ]

Chris

Will you be trying this patch in production or are some additional steps required first?

Peter

Comment by Christopher Morrone [ 19/May/11 ]

I'll pull it into our 1.8.5-llnl branch.

As for when it goes into production... our local testing infrastructure has almost completely moved to RHEL6 and Lustre 2.1. We have a 1.8 server cluster left over for testing, but no 1.8 clients. The first window we have to get 1.8 clients and test a release is probably mid-June, with a target for installation in late June if there are no surprises.

Comment by Peter Jones [ 13/Jun/11 ]

Let's close this ticket for now and reopen it if the issue recurs with the patch applied.
