MDS threads all stuck in jbd2_journal_start

Details

    Description

      The MDS on a classified production Lustre 2.1 cluster hung today. The symptoms were a very high load average (800+) with very little CPU usage.

      Almost all of the Lustre threads were stuck in jbd2_journal_start, while the jbd2/sda thread was stuck in
      jbd2_journal_commit_transaction. There was zero I/O going to disk.

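      One thread stands out as a suspect: it is not in jbd2_journal_start but appears to be handling an unlink. Perhaps it got stuck waiting on a read/write semaphore in llog_cat_current_log while holding an open jbd2 transaction handle, which would keep the commit from completing and leave every other thread waiting to start a new handle. Its stack trace looks like this: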

      COMMAND: "mdt_152"
      schedule
      rwsem_down_failed_common
      rwsem_down_read_failed
      call_rwsem_down_read_failed
      llog_cat_current_log.clone.0
      llog_cat_add_rec
      llog_obd_origin_add
      llog_add
      lov_llog_origin_add
      llog_add
      mds_llog_origin_add
      llog_add
      mds_llog_add_unlink
      mds_log_op_unlink
      mdd_unlink_log
      mdd_object_kill
      mdd_finish_unlink
      mdd_unlink
      cml_unlink
      mdt_reint_unlink
      mdt_reint_rec
      mdt_reint_internal
      mdt_reint
      mdt_handle_common
      mdt_regular_handle
      ptlrpc_main
      kernel_thread
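
The same triage (setting aside every thread that is in jbd2_journal_start or jbd2_journal_commit_transaction and looking at what remains) can be scripted. The following is only a minimal sketch: it assumes each stack is laid out like the trace above, with a COMMAND line followed by one frame per line. The real layout of bt-all.merged.txt may differ, and the sample input, `parse_stacks`, and `find_suspects` are hypothetical names and data introduced here for illustration.

```python
import re

# Illustrative input only: three made-up threads in the same layout as the
# trace above (a COMMAND line followed by one frame per line).
SAMPLE = '''
COMMAND: "mdt_01"
schedule
jbd2_journal_start
ptlrpc_main

COMMAND: "jbd2/sda"
schedule
jbd2_journal_commit_transaction

COMMAND: "mdt_152"
schedule
rwsem_down_failed_common
llog_cat_current_log.clone.0
mdt_reint_unlink
ptlrpc_main
'''

COMMAND_RE = re.compile(r'COMMAND:\s*"([^"]+)"')


def parse_stacks(text):
    """Return a list of (command, frames) pairs parsed from the text."""
    stacks = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = COMMAND_RE.match(line)
        if match:
            # A COMMAND line starts a new stack.
            stacks.append((match.group(1), []))
        elif stacks:
            # Any other non-blank line is a frame of the current stack.
            stacks[-1][1].append(line)
    return stacks


def find_suspects(stacks, common=("jbd2_journal_start",
                                  "jbd2_journal_commit_transaction")):
    """Keep only the stacks that contain none of the common frames."""
    return [(cmd, frames) for cmd, frames in stacks
            if not any(frame in common for frame in frames)]


if __name__ == "__main__":
    for command, frames in find_suspects(parse_stacks(SAMPLE)):
        # Show the thread name and its three innermost frames.
        print(command, "->", " < ".join(frames[:3]))
```

Matching on exact frame names keeps the filter simple. If the dump decorates frames with addresses or offsets, the comparison would need to be loosened accordingly.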

      Attachments

        1. bt-all.merged.txt
          230 kB
        2. dmesg.txt
          125 kB

        Activity

          [LU-1276] MDS threads all stuck in jbd2_journal_start
          jgmitter Joseph Gmitter (Inactive) made changes -
          Link Original: This issue is related to DDN-283 [ DDN-283 ]
          jgmitter Joseph Gmitter (Inactive) made changes -
          Link New: This issue is duplicated by DDN-283 [ DDN-283 ]
          jgmitter Joseph Gmitter (Inactive) made changes -
          Link New: This issue is related to DDN-283 [ DDN-283 ]
          patrick.valentin Patrick Valentin (Inactive) made changes -
          Attachment New: dmesg.txt [ 14189 ]
          Attachment New: bt-all.merged.txt [ 14190 ]
          pjones Peter Jones made changes -
          Resolution New: Cannot Reproduce [ 5 ]
          Status Original: Reopened [ 4 ] New: Resolved [ 5 ]
          pjones Peter Jones made changes -
          Resolution Original: Duplicate [ 3 ]
          Status Original: Resolved [ 5 ] New: Reopened [ 4 ]
          pjones Peter Jones made changes -
          Resolution New: Duplicate [ 3 ]
          Status Original: Open [ 1 ] New: Resolved [ 5 ]
          pjones Peter Jones made changes -
          Assignee Original: WC Triage [ wc-triage ] New: Oleg Drokin [ green ]
          nedbass Ned Bass (Inactive) created issue -

          People

            Assignee: green Oleg Drokin
            Reporter: nedbass Ned Bass (Inactive)
            Votes: 0
            Watchers: 5

            Dates

              Created:
              Updated:
              Resolved: