Lustre / LU-308

Hang and eventual ASSERT after mdc_enqueue()) ldlm_cli_enqueue: -4


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Affects Version/s: None
    • Fix Version/s: Lustre 1.8.6
    • Labels: None
    • Environment: RHEL5.5ish (CHAOS4.4-2), lustre 1.8.5.0-3chaos
    • Severity: 3

    Description

      On a production Lustre client node we hit an ASSERT. The first sign of trouble on the console is this:

      2011-05-11 08:55:44 LustreError: ... (mdc_locks.c:648:mdc_enqueue()) ldlm_cli_enqueue: -4

      I believe that occurred under an emacs process.
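
      (For reference, -4 is -EINTR: the enqueue RPC was interrupted by a
      signal, which is plausible under an interactive process like emacs.
      A trivial userspace check of the errno value:)

      #include <errno.h>
      #include <stdio.h>
      #include <string.h>

      /* Print the errno behind the "-4" in the console message above. */
      int main(void)
      {
              printf("EINTR = %d (%s)\n", EINTR, strerror(EINTR));
              return 0;
      }

      On Linux/glibc this prints "EINTR = 4 (Interrupted system call)".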

      Ten seconds later we start getting "soft lockup" / "stuck for 10s"
      warnings about the same process. The messages repeat every 10s until
      the assertion finally fires (a sketch of what the watchdog is
      flagging follows the trace). The backtrace looks like:

      :mdc:mdc_enter_request
      :ptlrpc:ldlm_lock_addref_internal_nolock
      :mdc:mdc_enqueue
      dequeue_task
      thread_return
      :ptlrpc:ldlm_lock_add_to_lru_nolock
      :mdc:mdc_intent_lock
      :ptlrpc:ldlm_lock_decref
      :mdc:mdc_set_lock_data
      :lustre:ll_mdc_blocking_ast
      :ptlrpc:ldlm_completion_ast
      :lustre:ll_prepare_mdc_op_data
      :lustre:ll_lookup_it
      :lustre:ll_mdc_blocking_ast
      :lov:lov_fini_enqueue_set
      :lustre:ll_lookup_nd
      list_add
      d_alloc
      do_lookup
      __link_path_walk
      link_path_walk
      do_path_lookup
      __user_walk_fd
      vfs_stat_fd
      sys_rt_sigreturn
      sys_rt_sigreturn
      sys_newstat
      sys_setitimer
      stub_rt_sigreturn
      system_call
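
      As promised above, here is a minimal illustration (not Lustre code,
      just the generic shape) of what the softlockup watchdog flags: a
      kernel-side loop that holds the CPU without scheduling for 10s or
      more.

      /* Illustration only, with a hypothetical flag: if a kernel thread
       * holds the CPU like this, never calling schedule() or returning
       * to user space, the watchdog logs "soft lockup ... stuck for 10s!"
       * on every check interval -- the repeating cadence seen above. */
      static volatile int condition;          /* hypothetical completion flag */

      static void stuck_wait(void)
      {
              while (!condition)
                      cpu_relax();            /* spins; never sleeps */
      }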

      Later a different process throws these errors:

      2011-05-11 09:06:07 Lustre: ... Request mdc_close sent 106s ago has failed due to network error (limit 106s)
      2011-05-11 09:06:07 LustreError: ... ll_close_inode_openhandle()) inode X mdc close failed: -4
      2011-05-11 09:06:07 Skipped 4 previous messages

      And then three seconds later the original stuck thread asserts:

      2011-05-11 09:06:10 ldlm_lock.c:189:ldlm_lock_remove_from_lru_nolock ASSERT(ns->ns_nr_unused > 0) failed
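
      That assertion guards the invariant that ns_nr_unused matches the
      length of the namespace's unused-lock (LRU) list. A simplified model
      of the bookkeeping (paraphrased; not the actual ldlm_lock.c source):

      /* Simplified model, not copied from ldlm_lock.c: every lock on the
       * namespace's unused list is counted in ns_nr_unused, so a removal
       * must always find the counter positive. Hitting 0 here means some
       * earlier path updated the list and the counter out of step.
       * (list_head helpers are from linux/list.h; LASSERT is Lustre's
       * kernel assert.) */
      struct ns_model {
              struct list_head ns_unused_list;   /* LRU of unused locks */
              int              ns_nr_unused;     /* must track list length */
      };

      static void lru_remove_model(struct ns_model *ns, struct list_head *l_lru)
      {
              if (!list_empty(l_lru)) {
                      LASSERT(ns->ns_nr_unused > 0);  /* the check that fired */
                      list_del_init(l_lru);
                      ns->ns_nr_unused--;
              }
      }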

      The backtrace at the assertion looks like:

      ldlm_lock_remove_from_lru_nolock
      ldlm_lock_remove_from_lru
      ldlm_lock_addref_internal_nolock
      search_queue
      ldlm_lock_match
      ldlm_resource_get
      mdc_revalidate_lock
      ldlm_lock_addref_internal_nolock
      mdc_intent_lock
      ll_i2gids
      ll_prepare_mdc_op_data
      __ll_inode_revalidate_it
      ll_mdc_blocking_ast
      ll_inode_permission
      dput
      permission
      vfs_permission
      __link_path_walk
      link_path_walk
      do_path_lookup
      __path_lookup_intent_open
      path_lookup_open
      open_namei
      do_filp_open
      get_unused_fd
      do_sys_open
      sys_open

      Apologies for any typos; all of that had to be hand-copied.

      Since this all appears to have started with an EINTR in mdc_enqueue(), these earlier reports may be related:

      https://bugzilla.lustre.org/show_bug.cgi?id=18213
      http://jira.whamcloud.com/browse/LU-234

      We are running 1.8.5+, so we should have the fix that was applied to 1.8.5 in bug 18213.
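
      If the EINTR theory is right, the suspect would be an
      interrupted-enqueue cleanup path that updates one half of that
      list/counter pair but not the other. Purely a speculative sketch,
      reusing the model above with a hypothetical helper, not a claim
      about the actual code path:

      /* Speculative sketch; do_enqueue_rpc() and ns_model are
       * hypothetical, not from the Lustre tree. If the -EINTR unwind
       * drops the counter without the matching list operation, a later
       * ldlm_lock_remove_from_lru_nolock() finds a lock still on the
       * list while ns_nr_unused is already 0, and the LASSERT fires. */
      static int enqueue_model(struct ns_model *ns, struct list_head *l_lru)
      {
              int rc = do_enqueue_rpc(l_lru);      /* hypothetical helper */

              if (rc == -EINTR) {
                      ns->ns_nr_unused--;          /* counter dropped ... */
                      /* ... but list_del_init(l_lru) is skipped, leaving
                       * the list and the counter out of sync */
              }
              return rc;
      }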

People

    Assignee: Lai Siyao (laisiyao)
    Reporter: Christopher Morrone (morrone)
