
sanity-lfsck test_18e: MDS is not the expected 'completed'

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.8.0
    • Affects Version/s: Lustre 2.7.0
    • Labels: None
    • Severity: 3
    • Rank (Obsolete): 16378

    Description

      This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

      This issue relates to the following test suite run:
      https://testing.hpdd.intel.com/test_sets/81b27534-6234-11e4-8055-5254006e85c2
      https://testing.hpdd.intel.com/test_sets/9037c036-5d26-11e4-ae10-5254006e85c2

      The sub-test test_18e failed with the following error:

      (4) MDS4 is not the expected 'completed'
      

      Please provide additional information about the failure here.

      Info required for matching: sanity-lfsck 18e

        Activity

          [LU-5848] sanity-lfsck test_18e: MDS is not the expected 'completed'

          gerrit Gerrit Updater added a comment:
          Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13950/
          Subject: LU-5848 lfsck: debug log for sanity-lfsck test_18e
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: bb597d302fe0259ce60691677a6e79c3ff19bbb2

          gerrit Gerrit Updater added a comment:
          Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/13950
          Subject: LU-5848 lfsck: debug log for sanity-lfsck test_18e
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 249163ca2d2e605be1d7617ca1b0a778ccb05e5e

          yong.fan nasf (Inactive) added a comment:
          As I recall, Maloo collects the kernel stack trace automatically, but I do not know why it is not doing so now.

          bfaccini Bruno Faccini (Inactive) added a comment:
          Hello Nasf, after trying to debug my own failure case in more depth, I agree with you that there is not much of interest in the Lustre debug logs (other than the expected but missing log entries that you pointed out).
          You may want to add an "echo t > /proc/sysrq-trigger" command, sent to all nodes/MDSes running MDT facets, to sanity-lfsck test_18e upon error.
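
As a rough illustration of the suggestion above, the sketch below writes 't' to /proc/sysrq-trigger on each node in a list, which asks the kernel to dump the state and stack of every task into the kernel log. The function name dump_stacks_on_nodes, the RUNNER and DRY_RUN variables, and the use of ssh are assumptions made for this example only. They are not part of sanity-lfsck or the Lustre test framework, and a real test would more likely use the framework's own remote-execution helpers.

```shell
#!/bin/bash
# Hypothetical helper (not from the Lustre tree): trigger SysRq 't' on each
# node passed as an argument so that all task stacks land in the kernel log.
#   RUNNER  - remote execution command, assumed to be ssh by default
#   DRY_RUN - when set to 1, only print the commands instead of running them
dump_stacks_on_nodes() {
    local runner=${RUNNER:-ssh}
    local node

    for node in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "$runner $node 'echo t > /proc/sysrq-trigger'"
        else
            # Needs root on the remote node; read the result there with dmesg.
            # $runner is left unquoted on purpose so it may carry options.
            $runner "$node" 'echo t > /proc/sysrq-trigger'
        fi
    done
}
```

For instance, `DRY_RUN=1 dump_stacks_on_nodes mds1 mds2` only prints the two commands it would run; the host names are placeholders.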

          yong.fan nasf (Inactive) added a comment:
          According to the log, the layout LFSCK on MDT1 did NOT fall into dt_sync(); rather, the layout LFSCK assistant thread had not yet reached the dt_sync() call. There are two possible cases:
          1) The layout LFSCK assistant thread is blocked inside lfsck_assistant_notify_others().
          2) There is a deadlock between the master engine and the assistant thread.

          Unfortunately, we do not have the kernel stack traces for the failure instances, so we cannot know exactly what happened. Is there any way to get the stack trace?

          bfaccini Bruno Faccini (Inactive) added a comment:
          +1 at "https://testing.hpdd.intel.com/test_sets/f720aa60-c121-11e4-b948-5254006e85c2". Maybe there is something new to learn from the added debug logs?

          People

            Assignee: yong.fan nasf (Inactive)
            Reporter: maloo Maloo