Lustre LU-5467: process stuck in cl_locks_prune()

    Description

      User processes are stuck in cl_locks_prune(). The system is classified, so files from it can't be uploaded. We currently have two Lustre clients in this state.

      Stack trace from stuck process:

      cfs_waitq_wait
      cl_locks_prune
      lov_delete_raid0
      lov_object_delete
      lu_object_free
      lu_object_put
      cl_object_put
      cl_inode_fini
      ll_clear_inode
      clear_inode
      ll_delete_inode
      generic_delete_inode
      generic_drop_inode
      ...
      sys_unlink
      

      The stuck processes are waiting for the lock's user count (lock->cll_users) to drop to 0:

      2063 again:
      2064                 cl_lock_mutex_get(env, lock);
      2065                 if (lock->cll_state < CLS_FREEING) {
      2066                         LASSERT(lock->cll_users <= 1);
      2067                         if (unlikely(lock->cll_users == 1)) {
      2068                                 struct l_wait_info lwi = { 0 };
      2069                                                                                 
      2070                                 cl_lock_mutex_put(env, lock);
      2071                                 l_wait_event(lock->cll_wq,
      2072                                              lock->cll_users == 0, 
      2073                                              &lwi);
      2074                                 goto again; 
      2075                         }
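
The wait pattern in the excerpt above can be illustrated outside the kernel. The following is a minimal userspace sketch, not Lustre code: `struct demo_lock`, `demo_lock_unuse()` and `demo_lock_wait_unused()` are hypothetical names, and a pthread mutex and condition variable stand in for the cl_lock mutex and `cll_wq`. The zero-initialized `lwi` in the excerpt appears to mean the kernel wait has no timeout, so the waiter sleeps until some other thread drops the last user. The sketch adds a timeout purely so the "user never goes away" case can be observed instead of hanging; it is not a proposed fix.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical stand-in for struct cl_lock: a user count plus the
 * mutex/condvar pair that plays the role of cll_wq. */
struct demo_lock {
    pthread_mutex_t mtx;
    pthread_cond_t  wq;
    int             users;
};

static void demo_lock_init(struct demo_lock *l, int users)
{
    pthread_mutex_init(&l->mtx, NULL);
    pthread_cond_init(&l->wq, NULL);
    l->users = users;
}

/* Drop one user and wake any waiter once the count reaches zero. */
static void demo_lock_unuse(struct demo_lock *l)
{
    pthread_mutex_lock(&l->mtx);
    assert(l->users > 0);
    if (--l->users == 0)
        pthread_cond_broadcast(&l->wq);
    pthread_mutex_unlock(&l->mtx);
}

/* Wait until users == 0 or until timeout_ms elapses.
 * Returns true if the count reached zero, false on timeout.
 * The timeout is an addition for demonstration only. */
static bool demo_lock_wait_unused(struct demo_lock *l, long timeout_ms)
{
    struct timespec deadline;
    bool unused;
    int rc = 0;

    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec  += timeout_ms / 1000;
    deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec++;
        deadline.tv_nsec -= 1000000000L;
    }

    pthread_mutex_lock(&l->mtx);
    /* Re-check the condition after every wakeup, as the "goto again"
     * loop in the excerpt does. */
    while (l->users != 0 && rc == 0)
        rc = pthread_cond_timedwait(&l->wq, &l->mtx, &deadline);
    unused = (l->users == 0);
    pthread_mutex_unlock(&l->mtx);
    return unused;
}
```

With `users` initialized to 1 and nobody calling `demo_lock_unuse()`, `demo_lock_wait_unused()` returns false after the timeout; in the untimed kernel version the same situation would leave the caller asleep indefinitely, which matches the symptom reported here.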
      

      On one node I also found a user process stuck in osc_io_setattr_end() line 500:

      489 static void osc_io_setattr_end(const struct lu_env *env,
      490                                const struct cl_io_slice *slice)
      491 { 
      492         struct cl_io     *io  = slice->cis_io;
      493         struct osc_io    *oio = cl2osc_io(env, slice);
      494         struct cl_object *obj = slice->cis_obj;
      495         struct osc_async_cbargs *cbargs = &oio->oi_cbarg;
      496         int result = 0;
      497
      498         if (cbargs->opc_rpc_sent) {
      499                 wait_for_completion(&cbargs->opc_sync);
      500                 result = io->ci_result = cbargs->opc_rc;
      501         } 
      

      On both stuck nodes, I also noticed the ptlrpcd_rcv thread blocked with this backtrace:

      sync_page
      __lock_page
      vvp_page_own
      cl_page_own0
      cl_page_own
      check_and_discard_cb
      cl_page_gang_lookup
      cl_lock_discard_pages
      osc_lock_flush
      osc_lock_cancel
      cl_lock_cancel0
      cl_lock_cancel
      osc_ldlm_blocking_ast
      ldlm_cancel_callback
      ldlm_lock_cancel
      ldlm_cli_cancel_list_local
      ldlm_cancel_lru_local
      ldlm_replay_locks
      ptlrpc_import_recov_state_machine
      ptlrpc_connect_interpret
      ptlrpc_check_set
      ptlrpcd_check
      ptlrpcd
      

      I haven't checked anything on the server side yet. Please let us know ASAP if you want any more debug data from the clients before we reboot them.

          People

            bobijam Zhenyu Xu
            nedbass Ned Bass (Inactive)