Lustre / LU-4271

mds load goes very high and filesystem hangs after mounting mdt

Details

    • Type: Bug
    • Resolution: Won't Fix
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.1.5
    • None
    • 4
    • 11731
    Description

      After recovery of a crashed MDS, the system load goes to >800 and the filesystem is DOWN. We need help to bring the filesystem online!

      Here is the error:
      Lustre: Skipped 2 previous similar messages
      Lustre: Service thread pid 7014 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Pid: 7014, comm: mdt_01

      Call Trace:
      [<ffffffff8151d552>] schedule_timeout+0x192/0x2e0
      [<ffffffff8107bf80>] ? process_timeout+0x0/0x10
      [<ffffffffa04e45e1>] cfs_waitq_timedwait+0x11/0x20 [libcfs]
      [<ffffffffa0da2508>] osc_create+0x528/0xdc0 [osc]
      [<ffffffff8105fab0>] ? default_wake_function+0x0/0x20
      [<ffffffffa0e13337>] lov_check_and_create_object+0x187/0x570 [lov]
      [<ffffffffa0e13a1b>] qos_remedy_create+0x1db/0x220 [lov]
      [<ffffffffa0e1059a>] lov_fini_create_set+0x24a/0x1200 [lov]
      [<ffffffffa0dfa0f2>] lov_create+0x792/0x1400 [lov]
      [<ffffffffa11000d6>] ? mdd_get_md+0x96/0x2f0 [mdd]
      [<ffffffff8105fab0>] ? default_wake_function+0x0/0x20
      [<ffffffffa1120916>] ? mdd_read_unlock+0x26/0x30 [mdd]
      [<ffffffffa110490e>] mdd_lov_create+0x9ee/0x1ba0 [mdd]
      [<ffffffffa1116871>] mdd_create+0xf81/0x1a90 [mdd]
      [<ffffffffa121edf3>] ? osd_oi_lookup+0x83/0x110 [osd_ldiskfs]
      [<ffffffffa121956c>] ? osd_object_init+0xdc/0x3e0 [osd_ldiskfs]
      [<ffffffffa124f3f7>] cml_create+0x97/0x250 [cmm]
      [<ffffffffa118b5e1>] ? mdt_version_get_save+0x91/0xd0 [mdt]
      [<ffffffffa11a106e>] mdt_reint_open+0x1aae/0x28a0 [mdt]
      [<ffffffffa077a724>] ? lustre_msg_add_version+0x74/0xd0 [ptlrpc]
      [<ffffffffa111956e>] ? md_ucred+0x1e/0x60 [mdd]
      [<ffffffffa1189c81>] mdt_reint_rec+0x41/0xe0 [mdt]
      [<ffffffffa1180ed4>] mdt_reint_internal+0x544/0x8e0 [mdt]
      [<ffffffffa118153d>] mdt_intent_reint+0x1ed/0x530 [mdt]
      [<ffffffffa117fc09>] mdt_intent_policy+0x379/0x690 [mdt]
      [<ffffffffa0736351>] ldlm_lock_enqueue+0x361/0x8f0 [ptlrpc]
      [<ffffffffa075c1ad>] ldlm_handle_enqueue0+0x48d/0xf50 [ptlrpc]
      [<ffffffffa1180586>] mdt_enqueue+0x46/0x130 [mdt]
      [<ffffffffa1175772>] mdt_handle_common+0x932/0x1750 [mdt]
      [<ffffffffa1176665>] mdt_regular_handle+0x15/0x20 [mdt]
      [<ffffffffa078ab4e>] ptlrpc_main+0xc4e/0x1a40 [ptlrpc]
      [<ffffffffa0789f00>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
      [<ffffffff8100c0ca>] child_rip+0xa/0x20
      [<ffffffffa0789f00>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
      [<ffffffffa0789f00>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
      [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

          Activity

            pjones Peter Jones added a comment -

            As per NASA, OK to close ticket.


            jfc John Fuchs-Chesney (Inactive) added a comment -

            Hello Bruno and Mahmoud,

            Can I ask if there is going to be more progress on this ticket?

            Many thanks,
            ~ jfc.


            bfaccini Bruno Faccini (Inactive) added a comment -

            I have been working "blindly" with Bob, who has direct access to the crash-dump, and here is the analysis of the deadlock from the crash-dump you provided.
            Bob already posted two thread stacks earlier: one is stuck in jbd2_journal_stop()/jbd2_log_wait_commit(), synchronously waiting for the current transaction to be committed, while the other started the commit process in jbd2_journal_commit_transaction() but then schedule()'d.
            The committing thread schedule()'d because it is waiting for the last handle to complete (i.e., we are missing one final journal_stop() that would bring journal_t->j_running_transaction->t_updates from 1 down to 0).
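            To illustrate the handshake described above, here is a minimal user-space pthread model (not the actual jbd2 kernel code; only the t_updates and journal_stop() names are borrowed): the commit side cannot make progress until the last open handle has been stopped.

            /* Minimal pthread model of the "commit waits for t_updates == 0" handshake
             * described above; NOT the real jbd2 kernel code. */
            #include <pthread.h>
            #include <stdio.h>

            static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
            static pthread_cond_t  no_updates = PTHREAD_COND_INITIALIZER;
            static int t_updates = 1;               /* one handle still open, as in the dump */

            /* Roughly what the last jbd2_journal_stop() has to do. */
            static void journal_stop(void)
            {
                    pthread_mutex_lock(&lock);
                    if (--t_updates == 0)
                            pthread_cond_signal(&no_updates);  /* wake the committing thread */
                    pthread_mutex_unlock(&lock);
            }

            /* Roughly what the committing thread does before it can write the
             * transaction out: it sleeps until t_updates reaches 0. */
            static void *commit_thread(void *arg)
            {
                    (void)arg;
                    pthread_mutex_lock(&lock);
                    while (t_updates != 0)
                            pthread_cond_wait(&no_updates, &lock);
                    pthread_mutex_unlock(&lock);
                    printf("commit can proceed\n");
                    return NULL;
            }

            int main(void)
            {
                    pthread_t commit;
                    pthread_create(&commit, NULL, commit_thread, NULL);
                    /* If the owner of the last handle never reaches journal_stop()
                     * (here: it is blocked in the quota layer), the commit thread
                     * waits forever and every later journal_start() piles up behind it. */
                    journal_stop();
                    pthread_join(commit, NULL);
                    return 0;
            }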
            It took us some time to find who owns this last handle, but we finally found it, and it is stuck with the following stack:

            PID: 19803  TASK: ffff883ff3754080  CPU: 1   COMMAND: "mdt_504"
             #0 [ffff883dc31f76b0] schedule at ffffffff8151c712
             #1 [ffff883dc31f7778] cfs_waitq_wait at ffffffffa04f960e [libcfs]
             #2 [ffff883dc31f7788] qctxt_wait_pending_dqacq at ffffffffa09bfe1b [lquota]
             #3 [ffff883dc31f7878] qctxt_adjust_qunit at ffffffffa09c5d41 [lquota]
             #4 [ffff883dc31f7908] quota_acquire_common at ffffffffa09c9653 [lquota]
             #5 [ffff883dc31f7938] quota_chk_acq_common at ffffffffa09cb722 [lquota]
             #6 [ffff883dc31f7a78] lquota_chkquota.clone.3 at ffffffffa0c0300b [mdd]
             #7 [ffff883dc31f7ae8] mdd_attr_set at ffffffffa0c0dcf6 [mdd]
             #8 [ffff883dc31f7bc8] cml_attr_set at ffffffffa0d4fa76 [cmm]
             #9 [ffff883dc31f7bf8] mdt_attr_set at ffffffffa0c9ec68 [mdt]
            #10 [ffff883dc31f7c48] mdt_reint_setattr at ffffffffa0c9f2b5 [mdt]
            #11 [ffff883dc31f7cd8] mdt_reint_rec at ffffffffa0c98c81 [mdt]
            #12 [ffff883dc31f7cf8] mdt_reint_internal at ffffffffa0c8fed4 [mdt]
            #13 [ffff883dc31f7d48] mdt_reint at ffffffffa0c902b4 [mdt]
            #14 [ffff883dc31f7d68] mdt_handle_common at ffffffffa0c84772 [mdt]
            #15 [ffff883dc31f7db8] mdt_regular_handle at ffffffffa0c85665 [mdt]
            #16 [ffff883dc31f7dc8] ptlrpc_main at ffffffffa079fb4e [ptlrpc]
            #17 [ffff883dc31f7f48] kernel_thread at ffffffff8100c0ca
            

            This is confirmed by looking into the source code of mdd_attr_set(), where we can see the mdd_trans_start()/lquota_chkquota()/mdd_trans_stop() call sequence.

            This thread is waiting because the qunit it wants to process is presently hashed as "in flight".
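            As a sketch of the kind of wait involved here (hypothetical structure and names, not the actual 2.1 lquota code), a caller that finds the qunit already hashed as in-flight simply waits until the thread that issued the dqacq request completes it:

            /* Toy model of waiting on an "in-flight" qunit; hypothetical names,
             * not the real lquota implementation. */
            #include <pthread.h>
            #include <stdbool.h>

            struct qunit {
                    pthread_mutex_t lock;
                    pthread_cond_t  done;
                    bool            inflight;       /* set while a dqacq request is pending */
            };

            static struct qunit q = {
                    .lock     = PTHREAD_MUTEX_INITIALIZER,
                    .done     = PTHREAD_COND_INITIALIZER,
                    .inflight = true,
            };

            /* The mdt_504 path above: the qunit is already in flight, so just wait. */
            static void wait_pending_dqacq(struct qunit *qu)
            {
                    pthread_mutex_lock(&qu->lock);
                    while (qu->inflight)
                            pthread_cond_wait(&qu->done, &qu->lock);
                    pthread_mutex_unlock(&qu->lock);
            }

            /* The owner of the in-flight qunit wakes the waiters only once its
             * dqacq request has completed. */
            static void complete_dqacq(struct qunit *qu)
            {
                    pthread_mutex_lock(&qu->lock);
                    qu->inflight = false;
                    pthread_cond_broadcast(&qu->done);
                    pthread_mutex_unlock(&qu->lock);
            }

            static void *waiter(void *arg)
            {
                    wait_pending_dqacq(arg);
                    return NULL;
            }

            int main(void)
            {
                    pthread_t t;
                    pthread_create(&t, NULL, waiter, &q);
                    complete_dqacq(&q);     /* in the hung MDS this step never happens */
                    pthread_join(t, NULL);
                    return 0;
            }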

            Looking further into the other threads to explain this blocked situation, we found a thread that is working on the same qunit but is itself stuck trying to start a new journal/transaction handle (like hundreds of other threads ...), with the following stack:

            PID: 7460   TASK: ffff883f65bc9500  CPU: 25  COMMAND: "mdt_308"
             #0 [ffff883f65bcb2b0] schedule at ffffffff8151c712
             #1 [ffff883f65bcb378] start_this_handle at ffffffffa055c14a [jbd2]
             #2 [ffff883f65bcb438] jbd2_journal_start at ffffffffa055c570 [jbd2]
             #3 [ffff883f65bcb488] ldiskfs_journal_start_sb at ffffffffa08e6338 [ldiskfs]
             #4 [ffff883f65bcb498] lustre_quota_journal_start at ffffffffa072c61f [fsfilt_ldiskfs]
             #5 [ffff883f65bcb4a8] lustre_commit_dquot at ffffffffa0734968 [fsfilt_ldiskfs]
             #6 [ffff883f65bcb548] fsfilt_ldiskfs_dquot at ffffffffa072b604 [fsfilt_ldiskfs]
             #7 [ffff883f65bcb568] dqacq_handler at ffffffffa09d59fe [lquota]
             #8 [ffff883f65bcb628] schedule_dqacq at ffffffffa09c2529 [lquota]
             #9 [ffff883f65bcb728] qctxt_adjust_qunit at ffffffffa09c5cdc [lquota]
            #10 [ffff883f65bcb7b8] quota_acquire_common at ffffffffa09c9653 [lquota]
            #11 [ffff883f65bcb7e8] quota_chk_acq_common at ffffffffa09cb722 [lquota]
            #12 [ffff883f65bcb928] lquota_chkquota.clone.1 at ffffffffa0c1c28b [mdd]
            #13 [ffff883f65bcb998] mdd_rename at ffffffffa0c22c0a [mdd]
            #14 [ffff883f65bcbb08] cml_rename at ffffffffa0d52b04 [cmm]
            #15 [ffff883f65bcbb88] mdt_reint_rename at ffffffffa0c9da73 [mdt]
            #16 [ffff883f65bcbcd8] mdt_reint_rec at ffffffffa0c98c81 [mdt]
            #17 [ffff883f65bcbcf8] mdt_reint_internal at ffffffffa0c8fed4 [mdt]
            #18 [ffff883f65bcbd48] mdt_reint at ffffffffa0c902b4 [mdt]
            #19 [ffff883f65bcbd68] mdt_handle_common at ffffffffa0c84772 [mdt]
            #20 [ffff883f65bcbdb8] mdt_regular_handle at ffffffffa0c85665 [mdt]
            #21 [ffff883f65bcbdc8] ptlrpc_main at ffffffffa079fb4e [ptlrpc]
            #22 [ffff883f65bcbf48] kernel_thread at ffffffff8100c0ca
            

            and that thread is presently blocked because the current/running transaction is in the commit process!
            So, deadlock ...

            It seems this was possible because, in mdd_attr_set(), a journal_start()/mdd_trans_start() is done before starting a quota operation. This is never the case in the same layer's other routines, like mdd_rename()/mdd_[un]link()/mdd_create()/..., where all "heavy" quota operations are performed outside the mdd_trans_start()/mdd_trans_stop() boundaries, precisely to prevent such deadlock situations.

            So, a possible fix should be to change mdd_attr_set() by simply moving the #ifdef HAVE_QUOTA_SUPPORT/#endif block out of the mdd_trans_start()/mdd_trans_stop() boundaries. I gave it a try at http://review.whamcloud.com/10443.
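            In outline, and only as a sketch of the ordering change being proposed (the functions below are stubs standing in for the real Lustre 2.1 routines named above, with placeholder signatures), the idea is to run the quota check before the transaction is opened, as the other mdd entry points already do:

            #define HAVE_QUOTA_SUPPORT 1

            static void mdd_trans_start(void)  { /* opens a journal handle (t_updates++) */ }
            static void mdd_trans_stop(void)   { /* closes it again (t_updates--) */ }
            static void lquota_chkquota(void)  { /* may block on an in-flight qunit */ }

            /* Deadlock-prone ordering seen in the dump: the quota check can block
             * while a journal handle is held open, so the transaction never commits. */
            void mdd_attr_set_old(void)
            {
                    mdd_trans_start();
            #ifdef HAVE_QUOTA_SUPPORT
                    lquota_chkquota();      /* blocks -> mdd_trans_stop() is never reached */
            #endif
                    /* ... attribute update ... */
                    mdd_trans_stop();
            }

            /* Proposed ordering: the heavy quota work is done outside the transaction,
             * as mdd_rename()/mdd_link()/mdd_create() already do. */
            void mdd_attr_set_new(void)
            {
            #ifdef HAVE_QUOTA_SUPPORT
                    lquota_chkquota();      /* safe: no journal handle is held yet */
            #endif
                    mdd_trans_start();
                    /* ... attribute update ... */
                    mdd_trans_stop();
            }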

            BTW, this problem is purely a b2_1/2.1.x one, since all the code concerned has been rewritten in 2.4.

            bfaccini Bruno Faccini (Inactive) added a comment - edited

            Bob, can you check whether you find the same kind of deadlock scenario/stacks that Bobi detailed in LU-4794? BTW, I did not find it in the dmesg with full stack traces attached to this ticket.


            bfaccini Bruno Faccini (Inactive) added a comment -

            Sure Bob, this last thread should be the one blocking all the others in the JBD2 layer! And again, as in one of my original updates on 19/Nov/2013, the concerned device is "dm-3-8", which Mahmoud confirmed to be the MDT.

            And yes, it looks like a dup of LU-4794, and also of earlier tickets that were simply closed due to no new occurrences ...

            What would be interesting now is to identify whether this last thread has been scheduled recently: if not, why not, and if so, why it keeps looping and re-schedule()ing (t_updates != 0?) ...


            bogl Bob Glossman (Inactive) added a comment -

            Another odd outlier thread:

            PID: 6425   TASK: ffff883fdc6c9500  CPU: 5   COMMAND: "jbd2/dm-3-8"
             #0 [ffff883fec013c60] schedule at ffffffff8151c712
             #1 [ffff883fec013d28] jbd2_journal_commit_transaction at ffffffffa055d8df [jbd2]
             #2 [ffff883fec013e68] kjournald2 at ffffffffa05646c8 [jbd2]
             #3 [ffff883fec013ee8] kthread at ffffffff8108fb96
             #4 [ffff883fec013f48] kernel_thread at ffffffff8100c0ca
            

            I'm wondering if this might be a dup of LU-4794


            bogl Bob Glossman (Inactive) added a comment -

            I note that in addition to the 500+ threads stuck in jbd2_journal_start, there's one thread stuck elsewhere in jbd2 code. The stack trace of this thread looks like:

            PID: 7885   TASK: ffff881e506e2ae0  CPU: 6   COMMAND: "mdt_439"
             #0 [ffff881e506f38c0] schedule at ffffffff8151c712
             #1 [ffff881e506f3988] jbd2_log_wait_commit at ffffffffa0564325 [jbd2]
             #2 [ffff881e506f3a18] jbd2_journal_stop at ffffffffa055bacb [jbd2]
             #3 [ffff881e506f3a78] __ldiskfs_journal_stop at ffffffffa08e62a8 [ldiskfs]
             #4 [ffff881e506f3aa8] osd_trans_stop at ffffffffa0d26476 [osd_ldiskfs]
             #5 [ffff881e506f3ad8] mdd_trans_stop at ffffffffa0c2d4aa [mdd]
             #6 [ffff881e506f3ae8] mdd_attr_set at ffffffffa0c0ca5f [mdd]
             #7 [ffff881e506f3bc8] cml_attr_set at ffffffffa0d4fa76 [cmm]
             #8 [ffff881e506f3bf8] mdt_attr_set at ffffffffa0c9ec68 [mdt]
             #9 [ffff881e506f3c48] mdt_reint_setattr at ffffffffa0c9f2b5 [mdt]
            #10 [ffff881e506f3cd8] mdt_reint_rec at ffffffffa0c98c81 [mdt]
            #11 [ffff881e506f3cf8] mdt_reint_internal at ffffffffa0c8fed4 [mdt]
            #12 [ffff881e506f3d48] mdt_reint at ffffffffa0c902b4 [mdt]
            #13 [ffff881e506f3d68] mdt_handle_common at ffffffffa0c84772 [mdt]
            #14 [ffff881e506f3db8] mdt_regular_handle at ffffffffa0c85665 [mdt]
            #15 [ffff881e506f3dc8] ptlrpc_main at ffffffffa079fb4e [ptlrpc]
            #16 [ffff881e506f3f48] kernel_thread at ffffffff8100c0ca
            

            I can't see how this thread might be blocking all the others, but it is interesting that it's the only one I have found so far that is in similar code yet on a different path from all the others.


            bogl Bob Glossman (Inactive) added a comment -

            Have been looking over the crash dump from the customer. As noted in previous comments, there are quite a large number of threads all stuck in jbd2_journal_start with the common calling sequence ... osd_trans_start -> ldiskfs_journal_start_sb -> jbd2_journal_start -> start_this_handle. It seems very likely there's a single other thread holding a lock or otherwise blocking all these callers of jbd2_journal_start, but so far I haven't been able to find the culprit.
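            As a rough picture of why so many threads pile up there (a minimal pthread model, not the real jbd2 start_this_handle() logic), new handles cannot start while the running transaction is locked for commit, and the commit itself cannot finish until the last already-open handle is stopped:

            /* Toy model of the pile-up in start_this_handle(); not the real jbd2 code. */
            #include <pthread.h>
            #include <stdbool.h>

            static pthread_mutex_t j_lock = PTHREAD_MUTEX_INITIALIZER;
            static pthread_cond_t  j_unlocked = PTHREAD_COND_INITIALIZER;
            static bool commit_in_progress = true;  /* transaction locked for commit */

            /* What every blocked caller of jbd2_journal_start() is conceptually doing:
             * the 500+ mdt threads all sit in this loop. */
            static void journal_start(void)
            {
                    pthread_mutex_lock(&j_lock);
                    while (commit_in_progress)
                            pthread_cond_wait(&j_unlocked, &j_lock);
                    pthread_mutex_unlock(&j_lock);
            }

            /* Only a completed commit can release them, and per the analysis earlier
             * in this ticket the commit is itself waiting for one last handle whose
             * owner is stuck in the quota layer: hence the deadlock. */
            static void commit_done(void)
            {
                    pthread_mutex_lock(&j_lock);
                    commit_in_progress = false;
                    pthread_cond_broadcast(&j_unlocked);
                    pthread_mutex_unlock(&j_lock);
            }

            static void *blocked_mdt_thread(void *arg)
            {
                    (void)arg;
                    journal_start();        /* sits here until the commit completes */
                    return NULL;
            }

            int main(void)
            {
                    pthread_t t;
                    pthread_create(&t, NULL, blocked_mdt_thread, NULL);
                    commit_done();          /* in the hung MDS this never happens */
                    pthread_join(t, NULL);
                    return 0;
            }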


            cliffw Cliff White (Inactive) added a comment -

            Mahmoud, you included a large number of kernel RPMs in the tarball; can you please specify exactly which version corresponds to the kernel running on the node where the vmcore was taken?


            mhanafi Mahmoud Hanafi added a comment -

            Cliff,

            Email sent

            -Mahmoud


            People

              Assignee: bfaccini Bruno Faccini (Inactive)
              Reporter: mhanafi Mahmoud Hanafi
              Votes: 0
              Watchers: 9
