Details
- Bug
- Resolution: Fixed
- Minor
- None
- Lustre 1.8.6
- None
- RHEL5.5ish (CHAOS4.4-2), lustre 1.8.5.0-3chaos
- 3
- 10319
Description
On a production Lustre client node, we hit an ASSERT. The first sign of trouble on the console is this:
2011-05-11 08:55:44 LustreError: ... (mdc_locks.c:648:mdc_enqueue()) ldlm_cli_enqueue: -4
I believe that message was logged in the context of an emacs process.
Ten seconds later we start getting "soft lockup" "stuck for 10s" warnings about the same process. The messages pop up every 10s until we finally get an assertion later on. The backtrace looks like:
:mdc:mdc_enter_request
:ptlrpc:ldlm_lock_addref_internal_nolock
:mdc:mdc_enqueue
dequeue_task
thread_return
:ptlrpc:ldlm_lock_add_to_lru_nolock
:mdc:mdc_intent_lock
:ptlrpc:ldlm_lock_decref
:mdc:mdc_set_lock_data
:lustre:ll_mdc_blocking_ast
:ptlrpc:ldlm_completion_ast
:lustre:ll_prepare_mdc_op_data
:lustre:ll_lookup_it
:lustre:ll_mdc_blocking_ast
:lov:lov_fini_enqueue_set
:lustre:ll_lookup_nd
list_add
d_alloc
do_lookup
__link_path_walk
link_path_walk
do_path_lookup
__user_walk_fd
vfs_stat_fd
sys_rt_sigreturn
sys_rt_sigreturn
sys_newstat
sys_setitimer
stub_rt_sigreturn
system_call
Later a different process throws these errors:
2011-05-11 09:06:07 Lustre: ... Request mdc_close sent 106s ago has failed due to network error (limit 106s)
2011-05-11 09:06:07 LustreError: ... ll_close_inode_openhandle()) inode X mdc close failed: -4
2011-05-11 09:06:07 Skipped 4 previous messages
And then three seconds later the original stuck thread does:
2011-05-11 09:06:10 ldlm_lock.c:189:ldlm_lock_remove_from_lru_nolock ASSERT(ns->ns_nr_unused > 0) failed
Backtrace looks like:
ldlm_lock_remove_from_lru_nolock
ldlm_lock_remove_from_lru
ldlm_lock_addref_internal_nolock
search_queue
ldlm_lock_match
ldlm_resource_get
mdc_revalidate_lock
ldlm_lock_addref_internal_nolock
mdc_intent_lock
ll_i2gids
ll_prepare_mdc_op_data
__ll_inode_revalidate_it
ll_mdc_blocking_ast
ll_inode_permission
dput
permission
vfs_permission
__link_path_walk
link_path_walk
do_path_lookup
__path_lookup_intent_open
path_lookup_open
open_namei
do_filp_open
get_unused_fd
do_sys_open
sys_open
Apologies for any typos. That all had to be hand copied.
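The assertion quoted above appears to guard a counter of unused locks that is kept alongside the LRU list. The toy model below is not the Lustre code and makes no claim about the root cause; it only illustrates, under the assumption that list membership and the counter can fall out of step, how removing an entry trips a check of that shape. All class and function names are hypothetical, apart from `ns_nr_unused`, which is borrowed from the log message.

```python
class ToyNamespace:
    """Toy stand-in for a lock namespace; not the Lustre data structure."""

    def __init__(self):
        self.lru = []          # locks currently considered unused
        self.ns_nr_unused = 0  # counter that should equal len(self.lru)

    def add_to_lru(self, lock):
        self.lru.append(lock)
        self.ns_nr_unused += 1

    def remove_from_lru(self, lock):
        if lock in self.lru:
            self.lru.remove(lock)
            # Same shape as the failed check in the report.
            assert self.ns_nr_unused > 0, "ns_nr_unused > 0 failed"
            self.ns_nr_unused -= 1


def simulate_desync():
    """Return True if the assertion trips once counter and list disagree."""
    ns = ToyNamespace()
    ns.add_to_lru("lock-A")
    # Assumed fault: some other path drops the counter without unlinking
    # the lock, so the list still holds an entry the counter has forgotten.
    ns.ns_nr_unused -= 1
    try:
        ns.remove_from_lru("lock-A")
    except AssertionError:
        return True
    return False
```

Whether the real failure followed this pattern is not established by the report; the sketch only shows what the invariant protects.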
Since this all appears to have started with an EINTR in mdc_enqueue(), it may be that this bug is related:
https://bugzilla.lustre.org/show_bug.cgi?id=18213
http://jira.whamcloud.com/browse/LU-234
We are running 1.8.5+, so we should have the fix that was applied to 1.8.5 in bug 18213.
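For readers unfamiliar with the convention, the `-4` in the `ldlm_cli_enqueue: -4` and `close failed: -4` messages is a negated Linux errno value, which is why it is read as EINTR here. A minimal sketch for decoding such codes, assuming it is run on a Linux host (the helper name `decode_kernel_rc` is hypothetical and not part of Lustre):

```python
import errno
import os


def decode_kernel_rc(rc):
    """Map a negated errno return code (e.g. -4) to (name, description).

    Hypothetical helper for reading console logs; not part of Lustre.
    """
    if rc >= 0:
        raise ValueError("expected a negative return code")
    code = -rc
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


if __name__ == "__main__":
    name, text = decode_kernel_rc(-4)
    print(f"-4 -> {name}: {text}")
```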
Issue Links
- Trackbacks
-
Lustre 1.8.x known issues tracker
While testing against Lustre b18 branch, we would hit known bugs which were already reported in Lustre Bugzilla https://bugzilla.lustre.org/. In order to move away from relying on Bugzilla, we would create a JIRA