Lustre / LU-1276

MDS threads all stuck in jbd2_journal_start

Details


    Description

      The MDS on a classified production Lustre 2.1 cluster got stuck today. The symptoms were high load (800+) but very little CPU usage.

      Almost all of the Lustre threads were stuck in jbd2_journal_start, while the jbd2/sda thread was stuck in
      jbd2_journal_commit_transaction. There was zero I/O going to disk.

      One thread stands out as a suspect: it is not in jbd2_journal_start but appears to be handling an unlink. Perhaps it got stuck waiting on a semaphore while holding an open jbd2 transaction. Its stack trace looks like this (a rough sketch of the suspected cycle follows the trace):

      COMMAND: "mdt_152"
      schedule
      rwsem_down_failed_common
      rwsem_down_read_failed
      call_rwsem_down_read_failed
      llog_cat_current_log.clone.0
      llog_cat_add_rec
      llog_obd_origin_add
      llog_add
      lov_llog_origin_add
      llog_add
      mds_llog_origin_add
      llog_add
      mds_llog_add_unlink
      mds_log_op_unlink
      mdd_unlink_log
      mdd_object_kill
      mdd_finish_unlink
      mdd_unlink
      cml_unlink
      mdt_reint_unlink
      mdt_reint_rec
      mdt_reint_internal
      mdt_reint
      mdt_handle_common
      mdt_regular_handle
      ptlrpc_main
      kernel_thread
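
      If that theory is correct, the dependency cycle would be: the unlink thread opens a jbd2 handle and then sleeps on the llog catalog rwsem, the jbd2 commit thread waits for that handle to be stopped before it can commit, and every other MDS thread parks in jbd2_journal_start waiting for the commit to finish. Below is a minimal userspace sketch of that suspected cycle using pthreads. It is purely illustrative: the journal_start/journal_stop/llog_sem names and the sleep-based ordering are made up for the sketch, not Lustre or jbd2 code, and running it simply hangs, which is the point.

      /* Hypothetical illustration of the suspected cycle, not Lustre code:
       *   - unlink_thread opens a "handle", then blocks on a rwsem
       *   - commit_thread waits for all handles to close before committing
       *   - other_mdt_thread waits for the commit before starting a handle
       */
      #include <pthread.h>
      #include <stdio.h>
      #include <unistd.h>

      static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
      static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
      static int open_handles;            /* analogue of the transaction's t_updates */
      static int commit_in_progress;      /* analogue of the journal being locked    */
      static pthread_rwlock_t llog_sem = PTHREAD_RWLOCK_INITIALIZER;  /* the rwsem   */

      static void journal_start(void)     /* stands in for jbd2_journal_start()      */
      {
              pthread_mutex_lock(&lock);
              while (commit_in_progress)  /* new handles wait for the commit         */
                      pthread_cond_wait(&cond, &lock);
              open_handles++;
              pthread_mutex_unlock(&lock);
      }

      static void journal_stop(void)      /* stands in for jbd2_journal_stop()       */
      {
              pthread_mutex_lock(&lock);
              open_handles--;
              pthread_cond_broadcast(&cond);
              pthread_mutex_unlock(&lock);
      }

      static void *unlink_thread(void *arg)   /* the suspect "mdt_152"               */
      {
              journal_start();                     /* open a handle ...              */
              pthread_rwlock_rdlock(&llog_sem);    /* ... then sleep on the rwsem    */
              pthread_rwlock_unlock(&llog_sem);
              journal_stop();
              return NULL;
      }

      static void *commit_thread(void *arg)   /* the jbd2/sda (kjournald2) analogue  */
      {
              pthread_mutex_lock(&lock);
              commit_in_progress = 1;
              while (open_handles > 0)    /* waits forever for mdt_152's handle      */
                      pthread_cond_wait(&cond, &lock);
              commit_in_progress = 0;
              pthread_cond_broadcast(&cond);
              pthread_mutex_unlock(&lock);
              return NULL;
      }

      static void *other_mdt_thread(void *arg) /* the hundreds of stuck mdt_* threads */
      {
              journal_start();            /* parks here, like start_this_handle()    */
              journal_stop();
              return NULL;
      }

      int main(void)
      {
              pthread_t t[4];
              int i;

              pthread_rwlock_wrlock(&llog_sem);  /* whoever owns the rwsem never lets go */
              pthread_create(&t[0], NULL, unlink_thread, NULL);
              sleep(1);                          /* let it open its handle first         */
              pthread_create(&t[1], NULL, commit_thread, NULL);
              sleep(1);
              pthread_create(&t[2], NULL, other_mdt_thread, NULL);
              pthread_create(&t[3], NULL, other_mdt_thread, NULL);
              for (i = 0; i < 4; i++)
                      pthread_join(t[i], NULL);  /* never returns: the "MDS" hangs       */
              printf("no deadlock\n");
              return 0;
      }

      In the real system the owner of the rwsem is presumably itself blocked somewhere behind the journal, but the net effect is the same: no handle ever closes, the commit never completes, and jbd2_journal_start never returns for anyone else.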

      Attachments

        1. bt-all.merged.txt
          230 kB
          Patrick Valentin
        2. dmesg.txt
          125 kB
          Patrick Valentin

        Activity

          jgmitter Joseph Gmitter (Inactive) made changes -
          Link Original: This issue is related to DDN-283 [ DDN-283 ]
          jgmitter Joseph Gmitter (Inactive) made changes -
          Link New: This issue is duplicated by DDN-283 [ DDN-283 ]
          jgmitter Joseph Gmitter (Inactive) made changes -
          Link New: This issue is related to DDN-283 [ DDN-283 ]
          patrick.valentin Patrick Valentin (Inactive) made changes -
          Attachment New: dmesg.txt [ 14189 ]
          Attachment New: bt-all.merged.txt [ 14190 ]

          patrick.valentin Patrick Valentin (Inactive) added a comment - edited

          One of the Bull customers (TGCC) has hit the same deadlock twice during the past six months: one thread is stuck in jbd2_journal_commit_transaction() and many other threads are stuck in jbd2_journal_start().

          PID: 29225 TASK: ffff88107c3bb040 CPU: 15 COMMAND: "jbd2/dm-2-8"
           #0 [ffff88107a343c60] schedule at ffffffff81485765
           #1 [ffff88107a343d28] jbd2_journal_commit_transaction at ffffffffa006a94f [jbd2]
           #2 [ffff88107a343e68] kjournald2 at ffffffffa0070c08 [jbd2]
           #3 [ffff88107a343ee8] kthread at ffffffff8107b5f6
           #4 [ffff88107a343f48] kernel_thread at ffffffff8100412a
          

          and most of the threads:

          PID: 15585 TASK: ffff88063062a790 CPU: 0 COMMAND: "mdt_503"
          PID: 15586 TASK: ffff88063062a040 CPU: 23 COMMAND: "mdt_504"
          PID: 15587 TASK: ffff88020f3ad7d0 CPU: 30 COMMAND: "mdt_505"
          PID: 29286 TASK: ffff88087505e790 CPU: 25 COMMAND: "mdt_01"
          ...
          #0 [ffff881949c078f0] schedule at ffffffff81485765
           #1 [ffff881949c079b8] start_this_handle at ffffffffa006908a [jbd2]
           #2 [ffff881949c07a78] jbd2_journal_start at ffffffffa0069500 [jbd2]
           #3 [ffff881949c07ac8] ldiskfs_journal_start_sb at ffffffffa0451ca8 [ldiskfs]
           #4 [ffff881949c07ad8] osd_trans_start at ffffffffa0d4a324 [osd_ldiskfs]
           #5 [ffff881949c07b18] mdd_trans_start at ffffffffa0c4c4e3 [mdd]
           #6 [ffff881949c07b38] mdd_unlink at ffffffffa0c401eb [mdd]
           #7 [ffff881949c07bf8] cml_unlink at ffffffffa0d82e07 [cmm]
           #8 [ffff881949c07c38] mdt_reint_unlink at ffffffffa0cba0f4 [mdt]
           #9 [ffff881949c07cb8] mdt_reint_rec at ffffffffa0cb7cb1 [mdt]
          #10 [ffff881949c07cd8] mdt_reint_internal at ffffffffa0caeed4 [mdt]
          #11 [ffff881949c07d28] mdt_reint at ffffffffa0caf2b4 [mdt]
          #12 [ffff881949c07d48] mdt_handle_common at ffffffffa0ca3762 [mdt]
          #13 [ffff881949c07d98] mdt_regular_handle at ffffffffa0ca4655 [mdt]
          #14 [ffff881949c07da8] ptlrpc_main at ffffffffa071f4f6 [ptlrpc]
          #15 [ffff881949c07f48] kernel_thread at ffffffff8100412a
          

          They are running Lustre 2.1.6, which contains http://review.whamcloud.com/4743 from LU-1648.

          I have attached two files containing the dmesg output and the crash backtraces of all threads.
          Could you reopen this ticket, as it was closed with "Cannot Reproduce"?

          pjones Peter Jones made changes -
          Resolution New: Cannot Reproduce [ 5 ]
          Status Original: Reopened [ 4 ] New: Resolved [ 5 ]
          pjones Peter Jones added a comment -

          ok thanks Chris


          morrone Christopher Morrone (Inactive) added a comment -

          It looks like the LU-1648 fix landed before 2.1.4 in change 4743. I think we can close this until we see it again.

          morrone Christopher Morrone (Inactive) added a comment -

          It sounds like the patch from LU-1648 is needed on b2_1.
          green Oleg Drokin added a comment -

          I guess the other candidate for this issue is LU-1648; can you add a patch from it as well, please?


          People

            Assignee: green Oleg Drokin
            Reporter: nedbass Ned Bass (Inactive)
            Votes: 0
            Watchers: 5
