Details
Bug | Resolution: Unresolved | Major | None | Lustre 2.4.2 | 3 | 15233
Description
User processes are stuck in cl_locks_prune(). The system is classified, so files from it cannot be uploaded. We currently have two Lustre clients in this state.
Stack trace from stuck process:
    cfs_waitq_wait
    cl_locks_prune
    lov_delete_raid0
    lov_object_delete
    lu_object_free
    lu_object_put
    cl_object_put
    cl_inode_fini
    ll_clear_inode
    clear_inode
    ll_delete_inode
    generic_delete_inode
    generic_drop_inode
    ...
    sys_unlink
They are waiting for lock user count to drop to 0:
    2063 again:
    2064         cl_lock_mutex_get(env, lock);
    2065         if (lock->cll_state < CLS_FREEING) {
    2066                 LASSERT(lock->cll_users <= 1);
    2067                 if (unlikely(lock->cll_users == 1)) {
    2068                         struct l_wait_info lwi = { 0 };
    2069
    2070                         cl_lock_mutex_put(env, lock);
    2071                         l_wait_event(lock->cll_wq,
    2072                                      lock->cll_users == 0,
    2073                                      &lwi);
    2074                         goto again;
    2075                 }
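For anyone reading along who is less familiar with this wait pattern, here is a minimal user-space sketch of it. It is not Lustre code: struct sketch_lock and the sketch_* functions are hypothetical names, and a POSIX mutex and condition variable stand in for the cl_lock mutex and cll_wq. Unlike the excerpt above, pthread_cond_wait() drops and retakes the mutex itself, so there is no explicit "goto again". The point is only that the pruning side returns when the user count reaches 0 and the waiter is woken. If the last user never drops its reference, the wait never returns, which matches what the stuck processes look like.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-in for struct cl_lock; only what the excerpt touches. */
struct sketch_lock {
    pthread_mutex_t mutex; /* plays the role of the cl_lock mutex */
    pthread_cond_t  wq;    /* plays the role of cll_wq */
    int             users; /* plays the role of cll_users */
};

/* Holder side: drop the last user reference and wake any waiter. */
static void *sketch_holder(void *arg)
{
    struct sketch_lock *lock = arg;

    pthread_mutex_lock(&lock->mutex);
    lock->users--;
    pthread_cond_broadcast(&lock->wq);
    pthread_mutex_unlock(&lock->mutex);
    return NULL;
}

/* Pruning side: block until the user count has dropped to 0. */
static void sketch_prune_wait(struct sketch_lock *lock)
{
    pthread_mutex_lock(&lock->mutex);
    while (lock->users != 0)
        pthread_cond_wait(&lock->wq, &lock->mutex);
    pthread_mutex_unlock(&lock->mutex);
}

/* Run one holder and one pruner; return the user count seen after the wait,
 * or -1 if the holder thread could not be created. */
static int sketch_run(void)
{
    struct sketch_lock lock;
    pthread_t holder;
    int users;

    pthread_mutex_init(&lock.mutex, NULL);
    pthread_cond_init(&lock.wq, NULL);
    lock.users = 1; /* one outstanding user, as in the cll_users == 1 branch */

    if (pthread_create(&holder, NULL, sketch_holder, &lock) != 0)
        return -1;

    sketch_prune_wait(&lock); /* would hang here if the holder never released */
    pthread_join(holder, NULL);

    users = lock.users;
    pthread_cond_destroy(&lock.wq);
    pthread_mutex_destroy(&lock.mutex);
    return users;
}
```

Calling sketch_run() from a small main() (built with -pthread) returns once the holder thread has released its reference. Removing the decrement in sketch_holder() makes sketch_prune_wait() block indefinitely.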
On one node I also found a user process stuck in osc_io_setattr_end() line 500:
    489 static void osc_io_setattr_end(const struct lu_env *env,
    490                                const struct cl_io_slice *slice)
    491 {
    492         struct cl_io *io = slice->cis_io;
    493         struct osc_io *oio = cl2osc_io(env, slice);
    494         struct cl_object *obj = slice->cis_obj;
    495         struct osc_async_cbargs *cbargs = &oio->oi_cbarg;
    496         int result = 0;
    497
    498         if (cbargs->opc_rpc_sent) {
    499                 wait_for_completion(&cbargs->opc_sync);
    500                 result = io->ci_result = cbargs->opc_rc;
    501         }
On both stuck nodes, I also noticed that the ptlrpcd_rcv thread is blocked with this backtrace:
    sync_page
    __lock_page
    vvp_page_own
    cl_page_own0
    cl_page_own
    check_and_discard_cb
    cl_page_gang_lookup
    cl_lock_discard_pages
    osc_lock_flush
    osc_lock_cancel
    cl_lock_cancel0
    cl_lock_cancel
    osc_ldlm_blocking_ast
    ldlm_cancel_callback
    ldlm_lock_cancel
    ldlm_cli_cancel_list_local
    ldlm_cancel_lru_local
    ldlm_replay_locks
    ptlrpc_import_recov_state_machine
    ptlrpc_connect_interpret
    ptlrpc_check_set
    ptlrpcd_check
    ptlrpcd
I haven't checked anything on the server side yet. Please let us know ASAP if you want any more debug data from the clients before we reboot them.