[LU-5467] process stuck in cl_locks_prune()

Details

    Description

      User processes are stuck in cl_locks_prune(). The system is classified, so files from it cannot be uploaded. Two Lustre clients are currently in this state.

      Stack trace from stuck process:

      cfs_waitq_wait
      cl_locks_prune
      lov_delete_raid0
      lov_object_delete
      lu_object_free
      lu_object_put
      cl_object_put
      cl_inode_fini
      ll_clear_inode
      clear_inode
      ll_delete_inode
      generic_delete_inode
      generic_drop_inode
      ...
      sys_unlink
      

      They are waiting for the lock's user count (cll_users) to drop to 0:

      again:
              cl_lock_mutex_get(env, lock);
              if (lock->cll_state < CLS_FREEING) {
                      LASSERT(lock->cll_users <= 1);
                      if (unlikely(lock->cll_users == 1)) {
                              struct l_wait_info lwi = { 0 };

                              cl_lock_mutex_put(env, lock);
                              l_wait_event(lock->cll_wq,
                                           lock->cll_users == 0,
                                           &lwi);
                              goto again;
                      }
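
      To make the failure mode concrete, the wait in the excerpt above can be modeled in user space with a mutex and a condition variable. This is only a sketch: struct toy_lock and every toy_* function are hypothetical names, not Lustre code, and it assumes the zero-initialized l_wait_info means the wait has no timeout. The pruner returns only when some other thread drops the user count to 0 and wakes the wait queue. If that thread never runs, the pruner sleeps forever, which matches the symptom reported here.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-in for struct cl_lock: only the fields needed here. */
struct toy_lock {
	pthread_mutex_t mutex;  /* plays the role of the cl_lock mutex */
	pthread_cond_t  wq;     /* plays the role of cll_wq */
	int             users;  /* plays the role of cll_users */
};

static void toy_lock_init(struct toy_lock *lk, int users)
{
	pthread_mutex_init(&lk->mutex, NULL);
	pthread_cond_init(&lk->wq, NULL);
	lk->users = users;
}

/* Holder side: drop one user reference and wake any waiter. */
static void toy_lock_unuse(struct toy_lock *lk)
{
	pthread_mutex_lock(&lk->mutex);
	lk->users--;
	pthread_cond_broadcast(&lk->wq);
	pthread_mutex_unlock(&lk->mutex);
}

/* Pruner side: block, with no timeout, until users == 0. */
static void toy_lock_prune_wait(struct toy_lock *lk)
{
	pthread_mutex_lock(&lk->mutex);
	while (lk->users != 0)
		pthread_cond_wait(&lk->wq, &lk->mutex);
	pthread_mutex_unlock(&lk->mutex);
}

/* Thread body for the lock holder: releases its single reference. */
static void *toy_holder(void *arg)
{
	toy_lock_unuse(arg);
	return NULL;
}

/* Runs one holder and one pruner; returns the user count after the wait. */
static int toy_demo(void)
{
	struct toy_lock lk;
	pthread_t holder;

	toy_lock_init(&lk, 1);
	pthread_create(&holder, NULL, toy_holder, &lk);
	toy_lock_prune_wait(&lk);       /* hangs forever if the holder never runs */
	pthread_join(holder, NULL);
	return lk.users;
}
```

      Calling toy_demo() returns once the holder thread has released its reference. Removing the pthread_create() call reproduces the hang in miniature.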
      

      On one node I also found a user process stuck in osc_io_setattr_end() line 500:

      489 static void osc_io_setattr_end(const struct lu_env *env,
      490                                const struct cl_io_slice *slice)
      491 { 
      492         struct cl_io     *io  = slice->cis_io;
      493         struct osc_io    *oio = cl2osc_io(env, slice);
      494         struct cl_object *obj = slice->cis_obj;
      495         struct osc_async_cbargs *cbargs = &oio->oi_cbarg;
      496         int result = 0;
      497
      498         if (cbargs->opc_rpc_sent) {
      499                 wait_for_completion(&cbargs->opc_sync);
      500                 result = io->ci_result = cbargs->opc_rc;
      501         } 
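
      The blocking step in this excerpt is wait_for_completion(), which sleeps until another context signals opc_sync. The sketch below models that handshake in user space. The toy_* names are hypothetical, and it is an assumption (not confirmed in this report) that the completion is signaled from an RPC reply callback. Under that assumption, a reply-handling thread that is itself blocked would leave the setattr caller waiting indefinitely.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical user-space analogue of a kernel struct completion. */
struct toy_completion {
	pthread_mutex_t mutex;
	pthread_cond_t  cond;
	int             done;
};

/* Hypothetical analogue of struct osc_async_cbargs. */
struct toy_cbargs {
	int                   rpc_sent;  /* cf. opc_rpc_sent */
	int                   rc;        /* cf. opc_rc */
	struct toy_completion sync;      /* cf. opc_sync */
};

static void toy_init_completion(struct toy_completion *c)
{
	pthread_mutex_init(&c->mutex, NULL);
	pthread_cond_init(&c->cond, NULL);
	c->done = 0;
}

/* Signal side: mark the completion done and wake the waiter. */
static void toy_complete(struct toy_completion *c)
{
	pthread_mutex_lock(&c->mutex);
	c->done = 1;
	pthread_cond_signal(&c->cond);
	pthread_mutex_unlock(&c->mutex);
}

/* Wait side: sleep, with no timeout, until toy_complete() has been called. */
static void toy_wait_for_completion(struct toy_completion *c)
{
	pthread_mutex_lock(&c->mutex);
	while (!c->done)
		pthread_cond_wait(&c->cond, &c->mutex);
	pthread_mutex_unlock(&c->mutex);
}

/* Reply-handling thread: record the result, then signal the waiter. */
static void *toy_reply_callback(void *arg)
{
	struct toy_cbargs *cb = arg;

	cb->rc = -5;                     /* arbitrary illustrative result code */
	toy_complete(&cb->sync);
	return NULL;
}

/* Mirrors the shape of the excerpt: wait only if an RPC was sent. */
static int toy_setattr_end(struct toy_cbargs *cb)
{
	int result = 0;

	if (cb->rpc_sent) {
		toy_wait_for_completion(&cb->sync);
		result = cb->rc;
	}
	return result;
}

/* Runs the callback in a second thread and returns the delivered result. */
static int toy_setattr_demo(void)
{
	struct toy_cbargs cb = { .rpc_sent = 1 };
	pthread_t reply_thread;
	int rc;

	toy_init_completion(&cb.sync);
	pthread_create(&reply_thread, NULL, toy_reply_callback, &cb);
	rc = toy_setattr_end(&cb);
	pthread_join(reply_thread, NULL);
	return rc;
}
```

      toy_setattr_demo() returns the value stored by the callback thread. If the callback thread were never scheduled, toy_setattr_end() would not return.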
      

      On both stuck nodes, I also noticed the ptlrpcd_rcv thread blocked with this backtrace:

      sync_page
      __lock_page
      vvp_page_own
      cl_page_own0
      cl_page_own
      check_and_discard_cb
      cl_page_gang_lookup
      cl_lock_discard_pages
      osc_lock_flush
      osc_lock_cancel
      cl_lock_cancel0
      cl_lock_cancel
      osc_ldlm_blocking_ast
      ldlm_cancel_callback
      ldlm_lock_cancel
      ldlm_cli_cancel_list_local
      ldlm_cancel_lru_local
      ldlm_replay_locks
      ptlrpc_import_recov_state_machine
      ptlrpc_connect_interpret
      ptlrpc_check_set
      ptlrpcd_check
      ptlrpcd
      

      I haven't checked anything on the server side yet. Please let us know ASAP if you want any more debug data from the clients before we reboot them.

        Activity

          pjones Peter Jones made changes -
          End date New: 09/Sep/14
          Start date New: 08/Aug/14
          morrone Christopher Morrone (Inactive) made changes -
          Labels New: llnl
          pjones Peter Jones made changes -
          Assignee Original: WC Triage [ wc-triage ] New: Zhenyu Xu [ bobijam ]
          nedbass Ned Bass (Inactive) created issue -

          People

            Assignee: Zhenyu Xu (bobijam)
            Reporter: Ned Bass (nedbass, Inactive)