  1. Lustre
  2. LU-1276

MDS threads all stuck in jbd2_journal_start

Details

    Description

      The MDS on a classified production Lustre 2.1 cluster hung today. The symptoms were a very high load average (800+) but very little CPU usage.

      Almost all of the Lustre threads were stuck in jbd2_journal_start, while the jbd2/sda thread was stuck in jbd2_journal_commit_transaction. No I/O was going to disk.

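      One thread stands out as a suspect: it is not in jbd2_journal_start but appears to be handling an unlink. Perhaps it is blocked waiting on the read side of a read-write semaphore while holding an open jbd2 transaction handle. If so, the running transaction could not commit until that handle is released, which would explain why the other threads are all waiting in jbd2_journal_start. Its stack trace looks like this: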

      COMMAND: "mdt_152"
      schedule
      rwsem_down_failed_common
      rwsem_down_read_failed
      call_rwsem_down_read_failed
      llog_cat_current_log.clone.0
      llog_cat_add_rec
      llog_obd_origin_add
      llog_add
      lov_llog_origin_add
      llog_add
      mds_llog_origin_add
      llog_add
      mds_llog_add_unlink
      mds_log_op_unlink
      mdd_unlink_log
      mdd_object_kill
      mdd_finish_unlink
      mdd_unlink
      cml_unlink
      mdt_reint_unlink
      mdt_reint_rec
      mdt_reint_internal
      mdt_reint
      mdt_handle_common
      mdt_regular_handle
      ptlrpc_main
      kernel_thread
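
      The outlier above was picked out by checking which threads are not waiting in jbd2_journal_start. A minimal sketch of that triage step follows, assuming a simplified trace format (a COMMAND: "name" line followed by one function name per line, as quoted in this ticket). The real layout of bt-all.merged.txt may differ, the helper names parse_traces and split_by_frame are hypothetical, and the mdt_001 entry in the sample is made up for illustration only.

```python
import re

# Illustrative input only. The mdt_152 frames are abbreviated from this
# ticket; the mdt_001 entry is a made-up example, not taken from the dump.
SAMPLE = '''
COMMAND: "mdt_001"
schedule
jbd2_journal_start
ptlrpc_main

COMMAND: "mdt_152"
schedule
rwsem_down_failed_common
rwsem_down_read_failed
llog_cat_current_log.clone.0
llog_cat_add_rec
mdd_unlink
ptlrpc_main
'''


def parse_traces(text):
    """Return a list of (command, frames) pairs.

    Assumes each trace starts with a line like COMMAND: "name" and that
    every following non-empty line is a single function name.
    """
    traces = []
    frames = None
    for raw in text.splitlines():
        line = raw.strip()
        match = re.match(r'COMMAND:\s*"([^"]+)"', line)
        if match:
            frames = []
            traces.append((match.group(1), frames))
        elif line and frames is not None:
            frames.append(line)
    return traces


def split_by_frame(traces, frame):
    """Split thread names by whether `frame` appears anywhere in the stack."""
    with_frame = [cmd for cmd, frames in traces if frame in frames]
    without_frame = [cmd for cmd, frames in traces if frame not in frames]
    return with_frame, without_frame


if __name__ == "__main__":
    waiting, others = split_by_frame(parse_traces(SAMPLE), "jbd2_journal_start")
    print("waiting in jbd2_journal_start:", waiting)
    print("candidates to inspect:", others)
```

      Threads in the second list are the ones worth reading by hand, since any of them might be holding the transaction handle that the commit thread is waiting for.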

      Attachments

        1. bt-all.merged.txt (230 kB)
        2. dmesg.txt (125 kB)

          People

            green Oleg Drokin
            nedbass Ned Bass (Inactive)