  1. Lustre
  2. LU-4794

MDS threads all stuck in jbd2_journal_start

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Minor
    • None
    • Lustre 2.1.6
    • None
    • 3
    • 13194

    Description

      This seems to be a duplicate of LU-1276, which was closed with "Cannot Reproduce", and in which I initially added the following note.

      One of the Bull customers (TGCC) has hit the same deadlock as described in LU-1276 twice during the past six months: one thread is stuck in jbd2_journal_commit_transaction() and many other threads are stuck in jbd2_journal_start().

      PID: 29225 TASK: ffff88107c3bb040 CPU: 15 COMMAND: "jbd2/dm-2-8"
       #0 [ffff88107a343c60] schedule at ffffffff81485765
       #1 [ffff88107a343d28] jbd2_journal_commit_transaction at ffffffffa006a94f [jbd2]
       #2 [ffff88107a343e68] kjournald2 at ffffffffa0070c08 [jbd2]
       #3 [ffff88107a343ee8] kthread at ffffffff8107b5f6
       #4 [ffff88107a343f48] kernel_thread at ffffffff8100412a
      

      and most of the threads:

      PID: 15585 TASK: ffff88063062a790 CPU: 0 COMMAND: "mdt_503"
      PID: 15586 TASK: ffff88063062a040 CPU: 23 COMMAND: "mdt_504"
      PID: 15587 TASK: ffff88020f3ad7d0 CPU: 30 COMMAND: "mdt_505"
      PID: 29286 TASK: ffff88087505e790 CPU: 25 COMMAND: "mdt_01"
      ...
      #0 [ffff881949c078f0] schedule at ffffffff81485765
      #1 [ffff881949c079b8] start_this_handle at ffffffffa006908a [jbd2]
      #2 [ffff881949c07a78] jbd2_journal_start at ffffffffa0069500 [jbd2]
      #3 [ffff881949c07ac8] ldiskfs_journal_start_sb at ffffffffa0451ca8 [ldiskfs]
      #4 [ffff881949c07ad8] osd_trans_start at ffffffffa0d4a324 [osd_ldiskfs]
      #5 [ffff881949c07b18] mdd_trans_start at ffffffffa0c4c4e3 [mdd]
      #6 [ffff881949c07b38] mdd_unlink at ffffffffa0c401eb [mdd]
      #7 [ffff881949c07bf8] cml_unlink at ffffffffa0d82e07 [cmm]
      #8 [ffff881949c07c38] mdt_reint_unlink at ffffffffa0cba0f4 [mdt]
      #9 [ffff881949c07cb8] mdt_reint_rec at ffffffffa0cb7cb1 [mdt]
      #10 [ffff881949c07cd8] mdt_reint_internal at ffffffffa0caeed4 [mdt]
      #11 [ffff881949c07d28] mdt_reint at ffffffffa0caf2b4 [mdt]
      #12 [ffff881949c07d48] mdt_handle_common at ffffffffa0ca3762 [mdt]
      #13 [ffff881949c07d98] mdt_regular_handle at ffffffffa0ca4655 [mdt]
      #14 [ffff881949c07da8] ptlrpc_main at ffffffffa071f4f6 [ptlrpc]
      #15 [ffff881949c07f48] kernel_thread at ffffffff8100412a
      

      They are running Lustre 2.1.6, which contains http://review.whamcloud.com/4743 from LU-1648.

      I have attached two files containing the dmesg output and the crash backtrace of all threads.
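
      For triage, a merged backtrace dump of this kind can be summarized by counting how many threads have a given function on their stack. The following is only a minimal sketch, assuming the dump uses the usual crash "bt" layout (a PID: header line, then frames of the form "#N [address] symbol at address") and that consecutive PID: headers share the stack that follows them. The names parse_stacks, count_threads_in and SAMPLE are hypothetical helpers written for this illustration; they are not part of crash or Lustre, and the real attachment may need adjusted patterns.

```python
import re

# Patterns for crash "bt" output (assumed layout, see the note above).
PID_RE = re.compile(r'^\s*PID:\s*(\d+)\b.*COMMAND:\s*"([^"]+)"')
FRAME_RE = re.compile(r'^\s*#\d+\s+\[[0-9a-f]+\]\s+(\S+)\s+at\s+[0-9a-f]+')


def parse_stacks(text):
    """Yield (threads, symbols) for each stack in the dump.

    threads is a list of (pid, command) tuples; consecutive PID headers
    are treated as sharing the stack that follows them (merged format).
    """
    threads, symbols = [], []
    for line in text.splitlines():
        header = PID_RE.match(line)
        if header:
            if symbols:
                # A header after some frames starts a new group.
                yield threads, symbols
                threads, symbols = [], []
            threads.append((int(header.group(1)), header.group(2)))
            continue
        frame = FRAME_RE.match(line)
        if frame and threads:
            symbols.append(frame.group(1))
    if threads and symbols:
        yield threads, symbols


def count_threads_in(text, symbol):
    """Count threads whose stack contains the given symbol."""
    return sum(len(threads)
               for threads, symbols in parse_stacks(text)
               if symbol in symbols)


# Small excerpt modelled on the traces in this report.
SAMPLE = '''\
PID: 29225  TASK: ffff88107c3bb040  CPU: 15  COMMAND: "jbd2/dm-2-8"
 #0 [ffff88107a343c60] schedule at ffffffff81485765
 #1 [ffff88107a343d28] jbd2_journal_commit_transaction at ffffffffa006a94f [jbd2]
PID: 15585  TASK: ffff88063062a790  CPU: 0  COMMAND: "mdt_503"
PID: 15586  TASK: ffff88063062a040  CPU: 23  COMMAND: "mdt_504"
 #0 [ffff881949c078f0] schedule at ffffffff81485765
 #1 [ffff881949c079b8] start_this_handle at ffffffffa006908a [jbd2]
 #2 [ffff881949c07a78] jbd2_journal_start at ffffffffa0069500 [jbd2]
'''

for sym in ("jbd2_journal_commit_transaction", "jbd2_journal_start"):
    print(sym, count_threads_in(SAMPLE, sym))
```

      Run against the full dump, a count of one thread in jbd2_journal_commit_transaction and a large count in jbd2_journal_start would match the pattern described above.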

      Attachments

        1. bt-all.merged.txt
          230 kB
          Lustre Bull
        2. dmesg.txt
          125 kB
          Lustre Bull

            People

              Assignee: bobijam Zhenyu Xu
              Reporter: patrick.valentin Patrick Valentin (Inactive)