[LU-5467] process stuck in cl_locks_prune()

Details

Type: Bug
Resolution: Unresolved
Priority: Major
Fix Version/s: None
Affects Version/s: Lustre 2.4.2
Labels: llnl
Environment: https://github.com/chaos/lustre/commits/2.4.2-13chaos
Severity: 3
Rank (Obsolete): 15233

Description

User processes are stuck in cl_locks_prune(). The system is classified, so files from it can't be uploaded. We currently have two Lustre clients in this state.

Stack trace from a stuck process:

cfs_waitq_wait
cl_locks_prune
lov_delete_raid0
lov_object_delete
lu_object_free
lu_object_put
cl_object_put
cl_inode_fini
ll_clear_inode
clear_inode
ll_delete_inode
generic_delete_inode
generic_drop_inode
...
sys_unlink

The stuck processes are waiting for the lock's user count (lock->cll_users) to drop to 0:

again:
        cl_lock_mutex_get(env, lock);
        if (lock->cll_state < CLS_FREEING) {
                LASSERT(lock->cll_users <= 1);
                if (unlikely(lock->cll_users == 1)) {
                        struct l_wait_info lwi = { 0 };

                        cl_lock_mutex_put(env, lock);
                        l_wait_event(lock->cll_wq,
                                     lock->cll_users == 0,
                                     &lwi);
                        goto again;
                }
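
For readers unfamiliar with this wait pattern, here is a minimal userspace sketch of it using POSIX threads. It is not Lustre code: struct demo_lock, demo_lock_init(), demo_lock_unuse() and demo_lock_wait_unused() are hypothetical names, and the timeout parameter is an addition so that the sketch cannot hang. The excerpt above passes a zero-initialized l_wait_info, which as far as I know means the kernel wait has no timeout, so a user count that never reaches 0 would leave the caller blocked indefinitely.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <time.h>

/* Hypothetical stand-in for struct cl_lock; only what the wait needs. */
struct demo_lock {
	pthread_mutex_t mutex;	/* role of cl_lock_mutex_get()/put() */
	pthread_cond_t	wq;	/* role of lock->cll_wq */
	int		users;	/* role of lock->cll_users */
};

static void demo_lock_init(struct demo_lock *lock, int users)
{
	pthread_mutex_init(&lock->mutex, NULL);
	pthread_cond_init(&lock->wq, NULL);
	lock->users = users;
}

/* Drop one user and wake any waiter once the count reaches 0. */
static void demo_lock_unuse(struct demo_lock *lock)
{
	pthread_mutex_lock(&lock->mutex);
	assert(lock->users > 0);
	if (--lock->users == 0)
		pthread_cond_broadcast(&lock->wq);
	pthread_mutex_unlock(&lock->mutex);
}

/*
 * Wait until users == 0. Returns 0 on success or ETIMEDOUT if the
 * count is still non-zero after timeout_ms. The timeout is an
 * assumption of this sketch; the kernel excerpt waits without one.
 */
static int demo_lock_wait_unused(struct demo_lock *lock, long timeout_ms)
{
	struct timespec deadline;
	int rc = 0;

	clock_gettime(CLOCK_REALTIME, &deadline);
	deadline.tv_sec += timeout_ms / 1000;
	deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
	if (deadline.tv_nsec >= 1000000000L) {
		deadline.tv_sec++;
		deadline.tv_nsec -= 1000000000L;
	}

	pthread_mutex_lock(&lock->mutex);
	assert(lock->users <= 1);	/* mirrors the LASSERT in the excerpt */
	/* Re-check the condition after every wakeup, like "goto again". */
	while (lock->users != 0 && rc == 0)
		rc = pthread_cond_timedwait(&lock->wq, &lock->mutex, &deadline);
	if (lock->users == 0)
		rc = 0;
	pthread_mutex_unlock(&lock->mutex);
	return rc;
}
```

With a lock initialized to one user, demo_lock_wait_unused() times out until some other path calls demo_lock_unuse(); after that it returns 0 immediately. The stuck clients look like the first case, with nothing ever dropping the last user.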

On one node I also found a user process stuck in osc_io_setattr_end() line 500:

489 static void osc_io_setattr_end(const struct lu_env *env,
490                                const struct cl_io_slice *slice)
491 { 
492         struct cl_io     *io  = slice->cis_io;
493         struct osc_io    *oio = cl2osc_io(env, slice);
494         struct cl_object *obj = slice->cis_obj;
495         struct osc_async_cbargs *cbargs = &oio->oi_cbarg;
496         int result = 0;
497
498         if (cbargs->opc_rpc_sent) {
499                 wait_for_completion(&cbargs->opc_sync);
500                 result = io->ci_result = cbargs->opc_rc;
501         }

On both stuck nodes, I also noticed the ptlrpcd_rcv thread blocked with this backtrace:

sync_page
__lock_page
vvp_page_own
cl_page_own0
cl_page_own
check_and_discard_cb
cl_page_gang_lookup
cl_lock_discard_pages
osc_lock_flush
osc_lock_cancel
cl_lock_cancel0
cl_lock_cancel
osc_ldlm_blocking_ast
ldlm_cancel_callback
ldlm_lock_cancel
ldlm_cli_cancel_list_local
ldlm_cancel_lru_local
ldlm_replay_locks
ptlrpc_import_recov_state_machine
ptlrpc_connect_interpret
ptlrpc_check_set
ptlrpcd_check
ptlrpcd

I haven't checked anything on the server side yet. Please let us know ASAP if you want any more debug data from the clients before we reboot them.

People

Assignee: Zhenyu Xu

Reporter: Ned Bass (Inactive)

Votes: 0

Watchers: 4

Dates

Created: 08/Aug/14 9:46 PM

Updated: 07/Jun/16 3:38 PM