[LU-4794] MDS threads all stuck in jbd2_journal_start

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Minor
    • None
    • Lustre 2.1.6
    • None
    • 3
    • 13194

    Description

      This seems to be a duplicate of LU-1276, which was closed with "Cannot Reproduce", and in which I initially added the following note.

      One of the Bull customers (TGCC) hit the same deadlock as described in LU-1276 twice during the past six months: one thread is stuck in jbd2_journal_commit_transaction() and many other threads are stuck in jbd2_journal_start().

      PID: 29225 TASK: ffff88107c3bb040 CPU: 15 COMMAND: "jbd2/dm-2-8"
       #0 [ffff88107a343c60] schedule at ffffffff81485765
       #1 [ffff88107a343d28] jbd2_journal_commit_transaction at ffffffffa006a94f [jbd2]
       #2 [ffff88107a343e68] kjournald2 at ffffffffa0070c08 [jbd2]
       #3 [ffff88107a343ee8] kthread at ffffffff8107b5f6
       #4 [ffff88107a343f48] kernel_thread at ffffffff8100412a
      

      and most of the threads:

      PID: 15585 TASK: ffff88063062a790 CPU: 0 COMMAND: "mdt_503"
      PID: 15586 TASK: ffff88063062a040 CPU: 23 COMMAND: "mdt_504"
      PID: 15587 TASK: ffff88020f3ad7d0 CPU: 30 COMMAND: "mdt_505"
      PID: 29286 TASK: ffff88087505e790 CPU: 25 COMMAND: "mdt_01"
      ...
       #0 [ffff881949c078f0] schedule at ffffffff81485765
       #1 [ffff881949c079b8] start_this_handle at ffffffffa006908a [jbd2]
       #2 [ffff881949c07a78] jbd2_journal_start at ffffffffa0069500 [jbd2]
       #3 [ffff881949c07ac8] ldiskfs_journal_start_sb at ffffffffa0451ca8 [ldiskfs]
       #4 [ffff881949c07ad8] osd_trans_start at ffffffffa0d4a324 [osd_ldiskfs]
       #5 [ffff881949c07b18] mdd_trans_start at ffffffffa0c4c4e3 [mdd]
       #6 [ffff881949c07b38] mdd_unlink at ffffffffa0c401eb [mdd]
       #7 [ffff881949c07bf8] cml_unlink at ffffffffa0d82e07 [cmm]
       #8 [ffff881949c07c38] mdt_reint_unlink at ffffffffa0cba0f4 [mdt]
       #9 [ffff881949c07cb8] mdt_reint_rec at ffffffffa0cb7cb1 [mdt]
      #10 [ffff881949c07cd8] mdt_reint_internal at ffffffffa0caeed4 [mdt]
      #11 [ffff881949c07d28] mdt_reint at ffffffffa0caf2b4 [mdt]
      #12 [ffff881949c07d48] mdt_handle_common at ffffffffa0ca3762 [mdt]
      #13 [ffff881949c07d98] mdt_regular_handle at ffffffffa0ca4655 [mdt]
      #14 [ffff881949c07da8] ptlrpc_main at ffffffffa071f4f6 [ptlrpc]
      #15 [ffff881949c07f48] kernel_thread at ffffffff8100412a
      

      They are running Lustre 2.1.6, which contains http://review.whamcloud.com/4743 from LU-1648.

      I attach two files containing the dmesg output and the crash backtrace of all threads.
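For anyone triaging a similar dump, the following is a minimal sketch of how the per-thread backtraces could be grouped by the function each thread is blocked in. It assumes the file uses the same `PID: ... COMMAND: "..."` header and `#N [addr] func at addr` frame layout as the excerpts in this ticket, and that several header lines in a row share the stack that follows them (as in a merged backtrace). The names `parse_threads`, `wait_site`, `summarize` and the `GENERIC` list are illustrative, not part of crash or Lustre.

```python
import re
from collections import Counter

# Header lines, e.g.:
#   PID: 29225 TASK: ffff88107c3bb040 CPU: 15 COMMAND: "jbd2/dm-2-8"
HEADER = re.compile(r'PID:\s*(\d+)\s+TASK:\s*\S+\s+CPU:\s*\d+\s+COMMAND:\s*"([^"]+)"')
# Frame lines, e.g.:
#   #1 [ffff88107a343d28] jbd2_journal_commit_transaction at ffffffffa006a94f [jbd2]
FRAME = re.compile(r'#(\d+)\s+\[[0-9a-f]+\]\s+(\S+)\s+at\s+[0-9a-f]+')

# Frames that only say "asleep", not why (assumed, non-exhaustive list).
GENERIC = {"schedule", "schedule_timeout", "io_schedule"}


def parse_threads(text):
    """Return groups of threads; each group shares one list of frames."""
    groups = []
    for line in text.splitlines():
        m = HEADER.search(line)
        if m:
            # A header directly after another header joins the same group.
            if not groups or groups[-1]["frames"]:
                groups.append({"threads": [], "frames": []})
            groups[-1]["threads"].append((int(m.group(1)), m.group(2)))
            continue
        m = FRAME.search(line)
        if m and groups:
            groups[-1]["frames"].append(m.group(2))
    return groups


def wait_site(frames, depth=2):
    """Innermost `depth` frames, skipping generic sleep frames."""
    useful = [f for f in frames if f not in GENERIC]
    return " <- ".join(useful[:depth])


def summarize(text):
    """Count threads per wait site."""
    counts = Counter()
    for group in parse_threads(text):
        counts[wait_site(group["frames"])] += len(group["threads"])
    return counts
```

Calling `summarize(open("bt-all.merged.txt").read()).most_common(10)` would then list the most common wait sites first; a large count at `start_this_handle <- jbd2_journal_start` next to a single `jbd2_journal_commit_transaction <- kjournald2` entry matches the pattern described above.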

      Attachments

        1. dmesg.txt
          125 kB
        2. bt-all.merged.txt
          230 kB

          Activity

            [LU-4794] MDS threads all stuck in jbd2_journal_start
            bobijam Zhenyu Xu added a comment -

            Patch tracked at http://review.whamcloud.com/10076 (b2_1 needs it, b2_4 does not).

            bobijam Zhenyu Xu added a comment -

            The competing thread in the deadlock is 15514:

            PID: 15514  TASK: ffff88199fe40850  CPU: 16  COMMAND: "mdt_432"
             #0 [ffff881cc6f1b548] schedule at ffffffff81485765
             #1 [ffff881cc6f1b610] rwsem_down_failed_common at ffffffff81487d65
             #2 [ffff881cc6f1b670] rwsem_down_read_failed at ffffffff81487f16
             #3 [ffff881cc6f1b6b0] call_rwsem_down_read_failed at ffffffff81262b24
             #4 [ffff881cc6f1b718] llog_cat_current_log.clone.0 at ffffffffa056ada5 [obdclass]
             #5 [ffff881cc6f1b7b8] llog_cat_add_rec at ffffffffa056baca [obdclass]
             #6 [ffff881cc6f1b808] llog_obd_origin_add at ffffffffa0571627 [obdclass]
             #7 [ffff881cc6f1b838] llog_add at ffffffffa0571801 [obdclass]
             #8 [ffff881cc6f1b888] lov_llog_origin_add at ffffffffa09f70fc [lov]
             #9 [ffff881cc6f1b908] llog_add at ffffffffa0571801 [obdclass]
            #10 [ffff881cc6f1b958] mds_llog_origin_add at ffffffffa0bf8d53 [mds]
            #11 [ffff881cc6f1b9a8] llog_add at ffffffffa0571801 [obdclass]
            #12 [ffff881cc6f1b9f8] mds_llog_add_unlink at ffffffffa0bf93ca [mds]
            #13 [ffff881cc6f1ba48] mds_log_op_unlink at ffffffffa0bf9a08 [mds]
            #14 [ffff881cc6f1baa8] mdd_unlink_log at ffffffffa0c2df31 [mdd]
            #15 [ffff881cc6f1bac8] mdd_object_kill at ffffffffa0c2526b [mdd]
            #16 [ffff881cc6f1baf8] mdd_finish_unlink at ffffffffa0c3b13e [mdd]
            #17 [ffff881cc6f1bb38] mdd_unlink at ffffffffa0c40696 [mdd]
            #18 [ffff881cc6f1bbf8] cml_unlink at ffffffffa0d82e07 [cmm]
            #19 [ffff881cc6f1bc38] mdt_reint_unlink at ffffffffa0cba0f4 [mdt]
            #20 [ffff881cc6f1bcb8] mdt_reint_rec at ffffffffa0cb7cb1 [mdt]
            #21 [ffff881cc6f1bcd8] mdt_reint_internal at ffffffffa0caeed4 [mdt]
            #22 [ffff881cc6f1bd28] mdt_reint at ffffffffa0caf2b4 [mdt]
            #23 [ffff881cc6f1bd48] mdt_handle_common at ffffffffa0ca3762 [mdt]
            #24 [ffff881cc6f1bd98] mdt_regular_handle at ffffffffa0ca4655 [mdt]
            #25 [ffff881cc6f1bda8] ptlrpc_main at ffffffffa071f4f6 [ptlrpc]
            #26 [ffff881cc6f1bf48] kernel_thread at ffffffff8100412a
            
            bobijam Zhenyu Xu added a comment -

            Relates to LU-1648.

            Thread 27469 stack trace from bt-all.merged.txt

            PID: 27469  TASK: ffff88199d25a080  CPU: 12  COMMAND: "ldlm_cn_88"
             #0 [ffff881c4f78b490] schedule at ffffffff81485765
             #1 [ffff881c4f78b558] start_this_handle at ffffffffa006908a [jbd2]
             #2 [ffff881c4f78b618] jbd2_journal_restart at ffffffffa00693d1 [jbd2]
             #3 [ffff881c4f78b668] ldiskfs_truncate_restart_trans at ffffffffa042791a [ldiskfs]
             #4 [ffff881c4f78b698] ldiskfs_clear_blocks at ffffffffa042cc3d [ldiskfs]
             #5 [ffff881c4f78b6f8] ldiskfs_free_data at ffffffffa042ce24 [ldiskfs]
             #6 [ffff881c4f78b758] ldiskfs_free_branches at ffffffffa042d063 [ldiskfs]
             #7 [ffff881c4f78b7b8] ldiskfs_free_branches at ffffffffa042cf56 [ldiskfs]
             #8 [ffff881c4f78b818] ldiskfs_truncate at ffffffffa042d659 [ldiskfs]
             #9 [ffff881c4f78b938] ldiskfs_delete_inode at ffffffffa042e9d0 [ldiskfs]
            #10 [ffff881c4f78b958] generic_delete_inode at ffffffff8117f0de
            #11 [ffff881c4f78b988] generic_drop_inode at ffffffff8117f235
            #12 [ffff881c4f78b9a8] iput at ffffffff8117df52
            #13 [ffff881c4f78b9c8] mds_obd_destroy at ffffffffa0bf717d [mds]
            #14 [ffff881c4f78bb08] llog_lvfs_destroy at ffffffffa05705cd [obdclass]
            #15 [ffff881c4f78bbd8] llog_cancel_rec at ffffffffa0566424 [obdclass]
            #16 [ffff881c4f78bc08] llog_cat_cancel_records at ffffffffa056a3a1 [obdclass]
            #17 [ffff881c4f78bc68] llog_origin_handle_cancel at ffffffffa072923b [ptlrpc]
            #18 [ffff881c4f78bd68] ldlm_cancel_handler at ffffffffa06ee8ff [ptlrpc]
            #19 [ffff881c4f78bda8] ptlrpc_main at ffffffffa071f4f6 [ptlrpc]
            #20 [ffff881c4f78bf48] kernel_thread at ffffffff8100412a
            

            The transaction does not have enough credits, so the thread restarts the transaction while holding log_handle::lgh_lock.
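The circular wait implied by the three traces in this ticket can be written down as a small wait-for graph. This is only a sketch of my reading of the stacks: the edge from the jbd2 commit thread to mdt_432 assumes mdt_432 already holds an open journal handle (it is past mdd_trans_start inside mdd_unlink), which the trace suggests but does not show directly. `WAITS_FOR` and `find_cycle` are illustrative names.

```python
# Each waiter maps to (what it waits on, which thread must act to release it).
# Edges are read from the stack traces in this ticket; see the stated assumption
# about mdt_432 holding an open journal handle.
WAITS_FOR = {
    "jbd2/dm-2-8 (29225)": ("open handles on the committing transaction",
                            "mdt_432 (15514)"),
    "mdt_432 (15514)": ("lgh_lock, read side",
                        "ldlm_cn_88 (27469)"),
    "ldlm_cn_88 (27469)": ("journal commit, via jbd2_journal_restart",
                           "jbd2/dm-2-8 (29225)"),
}


def find_cycle(graph, start):
    """Follow waiter -> holder edges from `start`; return the cycle or None."""
    path = []
    node = start
    while node in graph:
        if node in path:
            return path[path.index(node):]
        path.append(node)
        node = graph[node][1]
    return None


if __name__ == "__main__":
    for waiter in find_cycle(WAITS_FOR, "mdt_432 (15514)") or []:
        resource, holder = WAITS_FOR[waiter]
        print(f"{waiter} waits for {resource}, held up by {holder}")
```

Every other MDT thread blocked in jbd2_journal_start sits outside this cycle: those threads simply cannot open a new handle until the commit makes progress.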

            pjones Peter Jones added a comment -

            Bobijam

            Could you please advise?

            Thanks

            Peter


            People

              bobijam Zhenyu Xu
              patrick.valentin Patrick Valentin (Inactive)