Details
- Type: Bug
- Resolution: Unresolved
- Priority: Major
- Fix Version/s: None
- Affects Version/s: Lustre 2.4.2
- Severity: 3
- 15233
Description
User processes are stuck in cl_locks_prune(). The system is classified, so files from the system can't be uploaded. We currently have two Lustre clients in this state.
Stack trace from a stuck process:
cfs_waitq_wait
cl_locks_prune
lov_delete_raid0
lov_object_delete
lu_object_free
lu_object_put
cl_object_put
cl_inode_fini
ll_clear_inode
clear_inode
ll_delete_inode
generic_delete_inode
generic_drop_inode
...
sys_unlink
They are waiting for the lock user count to drop to 0:
again:
        cl_lock_mutex_get(env, lock);
        if (lock->cll_state < CLS_FREEING) {
                LASSERT(lock->cll_users <= 1);
                if (unlikely(lock->cll_users == 1)) {
                        struct l_wait_info lwi = { 0 };

                        cl_lock_mutex_put(env, lock);
                        l_wait_event(lock->cll_wq,
                                     lock->cll_users == 0,
                                     &lwi);
                        goto again;
                }
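For context on the wait itself: cl_locks_prune() re-takes the lock mutex and, if another thread still holds a user reference, drops the mutex and sleeps on cll_wq until cll_users reaches 0. A minimal userspace analogue of that wait pattern, written with pthreads purely for illustration (none of this is Lustre code):

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* Userspace analogue of the wait in cl_locks_prune(): sleep on a wait
 * queue until the user count drops to zero.  Illustration only.  If
 * the thread that should drop the last reference is itself blocked,
 * the waiter below never wakes up -- the symptom in the stacks above. */
struct fake_lock {
        pthread_mutex_t mutex;   /* stands in for the lock mutex */
        pthread_cond_t  wq;      /* stands in for cll_wq */
        int             users;   /* stands in for cll_users */
};

static struct fake_lock lk = {
        .mutex = PTHREAD_MUTEX_INITIALIZER,
        .wq    = PTHREAD_COND_INITIALIZER,
        .users = 1,
};

static void *last_user(void *arg)
{
        sleep(1);                        /* simulate finishing some work */
        pthread_mutex_lock(&lk.mutex);
        if (--lk.users == 0)             /* drop the last user reference */
                pthread_cond_broadcast(&lk.wq);
        pthread_mutex_unlock(&lk.mutex);
        return NULL;
}

int main(void)
{
        pthread_t t;

        pthread_create(&t, NULL, last_user, NULL);

        /* analogue of l_wait_event(lock->cll_wq, lock->cll_users == 0) */
        pthread_mutex_lock(&lk.mutex);
        while (lk.users != 0)
                pthread_cond_wait(&lk.wq, &lk.mutex);
        pthread_mutex_unlock(&lk.mutex);

        pthread_join(t, NULL);
        printf("users dropped to 0, prune can proceed\n");
        return 0;
}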
On one node I also found a user process stuck in osc_io_setattr_end() at line 500:
489 static void osc_io_setattr_end(const struct lu_env *env,
490                                const struct cl_io_slice *slice)
491 {
492         struct cl_io *io = slice->cis_io;
493         struct osc_io *oio = cl2osc_io(env, slice);
494         struct cl_object *obj = slice->cis_obj;
495         struct osc_async_cbargs *cbargs = &oio->oi_cbarg;
496         int result = 0;
497
498         if (cbargs->opc_rpc_sent) {
499                 wait_for_completion(&cbargs->opc_sync);
500                 result = io->ci_result = cbargs->opc_rc;
501         }
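The thread above has sent the setattr RPC and is parked in wait_for_completion() at line 499 until the reply side fills in opc_rc and signals opc_sync. A minimal userspace analogue of that handshake, using a POSIX semaphore purely for illustration (not Lustre code; the Lustre-side signalling path is not shown in the excerpt):

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>
#include <unistd.h>

/* Userspace analogue of the wait_for_completion() call: the I/O thread
 * parks until the reply handler publishes the result and posts the
 * completion.  If the reply is never processed (e.g. the import is
 * stuck in recovery), sem_wait() below never returns. */
struct fake_cbargs {
        sem_t opc_sync;   /* stands in for the opc_sync completion */
        int   opc_rc;     /* RPC result, filled in by the reply side */
};

static struct fake_cbargs cbargs;

static void *reply_side(void *arg)
{
        sleep(1);                     /* simulate the RPC round trip */
        cbargs.opc_rc = 0;            /* publish the result first... */
        sem_post(&cbargs.opc_sync);   /* ...then wake the waiter */
        return NULL;
}

int main(void)
{
        pthread_t t;
        int result;

        sem_init(&cbargs.opc_sync, 0, 0);
        pthread_create(&t, NULL, reply_side, NULL);

        sem_wait(&cbargs.opc_sync);   /* wait_for_completion() analogue */
        result = cbargs.opc_rc;

        pthread_join(t, NULL);
        printf("setattr result: %d\n", result);
        return 0;
}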
On both stuck nodes, I also notice the ptlrpcd_rcv thread blocked with this backtrace:
sync_page
__lock_page
vvp_page_own
cl_page_own0
cl_page_own
check_and_discard_cb
cl_page_gang_lookup
cl_lock_discard_pages
osc_lock_flush
osc_lock_cancel
cl_lock_cancel0
cl_lock_cancel
osc_ldlm_blocking_ast
ldlm_cancel_callback
ldlm_lock_cancel
ldlm_cli_cancel_list_local
ldlm_cancel_lru_local
ldlm_replay_locks
ptlrpc_import_recov_state_machine
ptlrpc_connect_interpret
ptlrpc_check_set
ptlrpcd_check
ptlrpcd
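If the page that ptlrpcd_rcv is trying to own in vvp_page_own() is held by a user thread that is itself waiting on RPC completion, the stacks would form a circular wait. That is only a guess from the backtraces, not confirmed behavior, but the shape would look like this toy pthread program (not Lustre code), which reliably hangs:

#include <pthread.h>
#include <unistd.h>

/* Toy illustration of a two-resource circular wait: the "user" thread
 * owns a page and waits for reply processing, while the lone "ptlrpcd"
 * thread owns reply processing and waits to own the page.  Neither can
 * proceed; both hang forever. */
static pthread_mutex_t page_lock  = PTHREAD_MUTEX_INITIALIZER; /* "page" */
static pthread_mutex_t reply_lock = PTHREAD_MUTEX_INITIALIZER; /* "RPC reply" */

static void *user_thread(void *arg)
{
        pthread_mutex_lock(&page_lock);   /* owns the page... */
        sleep(1);                         /* let ptlrpcd take reply_lock */
        pthread_mutex_lock(&reply_lock);  /* ...and waits for the reply */
        pthread_mutex_unlock(&reply_lock);
        pthread_mutex_unlock(&page_lock);
        return NULL;
}

static void *ptlrpcd_thread(void *arg)
{
        pthread_mutex_lock(&reply_lock);  /* owns reply processing... */
        sleep(1);                         /* let the user take page_lock */
        pthread_mutex_lock(&page_lock);   /* ...and waits to own the page */
        pthread_mutex_unlock(&page_lock);
        pthread_mutex_unlock(&reply_lock);
        return NULL;
}

int main(void)
{
        pthread_t a, b;

        pthread_create(&a, NULL, user_thread, NULL);
        pthread_create(&b, NULL, ptlrpcd_thread, NULL);
        pthread_join(a, NULL);            /* never returns */
        pthread_join(b, NULL);
        return 0;
}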
I haven't checked anything on the server side yet. Please let us know ASAP if you want any more debug data from the clients before we reboot them.
It looks like we hit a similar problem on a BGQ I/O node (Lustre client). The backtrace for the ptlrpcd_rcv thread is identical to the backtrace that Ned listed above. There are two OSCs stuck in the REPLAY_LOCKS state, as Ned reported in the earlier instance on x86_64.
There is no thread in cl_locks_prune() this time.
The OSTs appear to be fine. Other nodes can use them.
Many other threads are stuck waiting under an open():
One thread had a nearly identical stack to the open() ones, but got there through fstat():
Finally, a couple of threads were in this backtrace:
Do you still think that http://review.whamcloud.com/11418 will address this problem? We have not yet pulled in that patch.