LU-81: Some JBD2 journaling deadlock at BULL

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.2.0, Lustre 2.1.2
    • Affects Version/s: Lustre 2.0.0
    • Labels: None
    • Severity: 2

    Description

      BULL reports in Bugzilla that there are possible deadlock issues on the MDS with jbd2 (or just
      runaway transactions?):

      At CEA, they have encountered several occurrences of the same scenario, in which all Lustre
      activity hangs. Each time they live-debug the problem, they end up on the MDS node, where all
      Lustre operations appear to be frozen.

      As a consequence, the MDS has to be rebooted and the Lustre layer restarted on it with recovery.

      The MDS threads that appear most strongly involved in the frozen situation have the following
      stack traces, taken from one of the forced crash dumps:
      ==================================

      There are about 234 tasks, all with the same stack:

      Pid: 5250 mdt_rdpg_143
      schedule()
      start_this_handle()
      jbd2_journal_start()
      ldiskfs_journal_start_sb()
      osd_trans_start()
      mdd_trans_start()
      cml_close()
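
      For reference, here is the jbd2 handle pattern that osd_trans_start()/osd_trans_stop() wrap; this
      is a minimal sketch, not the actual Lustre code (jbd2_journal_start()/jbd2_journal_stop() and the
      buffer-access calls named in the comments are the real jbd2 API, everything else is illustrative):

      #include <linux/err.h>
      #include <linux/jbd2.h>

      static int example_update(journal_t *journal)
      {
              handle_t *handle;

              /*
               * Reserve credits in the running transaction.  If that
               * transaction is full, or has been locked down for commit,
               * this call sleeps inside start_this_handle() until space
               * frees up; this is where the ~234 mdt_rdpg threads above
               * are parked.
               */
              handle = jbd2_journal_start(journal, 8 /* credits, illustrative */);
              if (IS_ERR(handle))
                      return PTR_ERR(handle);

              /*
               * ... jbd2_journal_get_write_access(), buffer updates and
               * jbd2_journal_dirty_metadata() would go here ...
               */

              /*
               * Release the handle.  The commit thread cannot finalize
               * a transaction until every handle against it has stopped.
               */
              return jbd2_journal_stop(handle);
      }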

      One is with:

      Pid: 4990 mdt_395
      schedule()
      jbd2_log_wait_commit()
      jbd2_journal_stop()
      __ldiskfs_journal_stop()
      osd_trans_stop()
      mdd_trans_stop()
      mdd_attr_set()
      cml_attr_set()
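
      This thread has finished its update and is releasing its handle. A simplified view of why
      jbd2_journal_stop() can sleep (abridged from the generic jbd2 logic, not verbatim, and details
      vary by kernel version): when a handle was marked synchronous, stopping it kicks a commit and
      then waits for that commit to complete:

      /* Abridged sketch of jbd2_journal_stop(), synchronous case only. */
      int sketch_journal_stop(handle_t *handle)
      {
              transaction_t *transaction = handle->h_transaction;
              journal_t *journal = transaction->t_journal;
              tid_t tid = transaction->t_tid;

              /* ... return unused credits, drop the transaction's count
               * of live handles (t_updates) ... */

              if (handle->h_sync) {
                      /* Ask the kjournald2 thread to commit ... */
                      jbd2_log_start_commit(journal, tid);
                      /* ... and wait for that commit to finish.  Pid 4990
                       * (mdt_395) above is sleeping in this wait. */
                      jbd2_log_wait_commit(journal, tid);
              }
              return 0;
      }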

      And another with:

      Pid: 4534 "jbd2/sdd-8"
      schedule()
      jbd2_journal_commit_transaction()
      kjournald2()
      kthread()
      kernel_thread()

      ==================================

      Analyzing the crash dump shows that the task hung in jbd2_journal_commit_transaction() has been
      in this state for a very long time.
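
      For context, the barrier the commit thread is stuck at looks like this (a simplified rendering of
      the first wait in jbd2_journal_commit_transaction(), fs/jbd2/commit.c; field types and locking
      differ between kernel versions): before the running transaction can be committed, every live
      handle against it must have been stopped:

              spin_lock(&commit_transaction->t_handle_lock);
              while (commit_transaction->t_updates) {
                      DEFINE_WAIT(wait);

                      prepare_to_wait(&journal->j_wait_updates, &wait,
                                      TASK_UNINTERRUPTIBLE);
                      if (commit_transaction->t_updates) {
                              spin_unlock(&commit_transaction->t_handle_lock);
                              /* Pid 4534 "jbd2/sdd-8" sleeps here. */
                              schedule();
                              spin_lock(&commit_transaction->t_handle_lock);
                      }
                      finish_wait(&journal->j_wait_updates, &wait);
              }
              spin_unlock(&commit_transaction->t_handle_lock);

      Taken together, the three stacks only close into a deadlock if some handle holder is itself
      waiting on this commit, for example a thread trying to start a second handle while still holding
      one against the committing transaction. In that case t_updates never drains, the commit never
      finishes, and every new jbd2_journal_start() backs up behind it, which would match the picture
      above.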

      This problem looks like bug 16667, but unfortunately that fix is not applicable 'as is', since it
      dates back to 1.6. Here there seems to be a race or deadlock between the Lustre and JBD2 layers.
      As a workaround, the customer deactivated the ChangeLog feature, and since then the problem has
      never reoccurred. Sadly, ChangeLogs are required by HSM, so this workaround cannot last...

      Can you see the reason for this deadlock?

      I must stress that this bug is critical, as it blocks normal cluster operation (i.e., with HSM).

    Attachments

    Issue Links

    Activity

    People

        Assignee: niu (Niu Yawei, Inactive)
        Reporter: green (Oleg Drokin)
        Votes: 0
        Watchers: 10
