sanity-lfsck test_18e: MDS is not the expected 'completed'

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.8.0
    • Lustre 2.7.0
    • None
    • 3
    • 16378

    Description

      This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

      This issue relates to the following test suite run:
      https://testing.hpdd.intel.com/test_sets/81b27534-6234-11e4-8055-5254006e85c2
      https://testing.hpdd.intel.com/test_sets/9037c036-5d26-11e4-ae10-5254006e85c2

      The sub-test test_18e failed with the following error:

      (4) MDS4 is not the expected 'completed'
      

      Please provide additional information about the failure here.

      Info required for matching: sanity-lfsck 18e

        Activity

          [LU-5848] sanity-lfsck test_18e: MDS is not the expected 'completed'

          gerrit Gerrit Updater added a comment:
          Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13950/
          Subject: LU-5848 lfsck: debug log for sanity-lfsck test_18e
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: bb597d302fe0259ce60691677a6e79c3ff19bbb2

          gerrit Gerrit Updater added a comment:
          Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/13950
          Subject: LU-5848 lfsck: debug log for sanity-lfsck test_18e
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 249163ca2d2e605be1d7617ca1b0a778ccb05e5e

          yong.fan nasf (Inactive) added a comment:
          As I remember, Maloo should collect the kernel stack trace automatically, but I do not know why it is not doing so now.

          bfaccini Bruno Faccini (Inactive) added a comment:
          Hello Nasf, after trying to debug my own failure case in more depth, I agree with you that there is not much of interest in the Lustre debug logs (other than the expected but missing log entries you pointed out).
          You may want to add an "echo t > /proc/sysrq-trigger" command, sent to all nodes/MDSes running MDT facets, to sanity-lfsck test_18e upon error.
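
The suggestion above could be sketched roughly as follows, in plain shell rather than against the Lustre test framework. The function name `dump_all_stacks`, the `RUN_ON_NODE` hook, and any node names passed to it are hypothetical; only the `echo t > /proc/sysrq-trigger` command comes from the comment. The sketch assumes passwordless root ssh to each node and that sysrq is enabled there.

```shell
#!/bin/bash
# Hypothetical helper: ask every listed node to dump all task stacks to its
# kernel log, so they can be collected after a test failure.
# RUN_ON_NODE is an assumed hook (defaults to ssh) so the remote runner
# can be swapped out.
RUN_ON_NODE=${RUN_ON_NODE:-ssh}

dump_all_stacks() {
	local node
	local rc=0

	for node in "$@"; do
		# 't' asks the kernel to log the stack of every task
		# (needs root and sysrq enabled on the node).
		$RUN_ON_NODE "$node" "echo t > /proc/sysrq-trigger" || rc=1
	done
	return $rc
}
```

In a failure path this would be called with the list of MDS nodes, for example `dump_all_stacks mds1 mds2` (placeholder host names), before the test reports its error, so the traces land in each node's kernel log.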

          yong.fan nasf (Inactive) added a comment:
          According to the log, the layout LFSCK on MDT1 did NOT get stuck in dt_sync(); instead, the layout LFSCK assistant thread had not yet reached the dt_sync() point. Two possible cases:
          1) The layout LFSCK assistant thread was blocked inside lfsck_assistant_notify_others().
          2) There is a deadlock between the master engine and the assistant thread.

          Unfortunately, we do not have kernel stack traces for the failure instances, so we cannot know exactly what happened. Is there any way to get the stack trace?

          bfaccini Bruno Faccini (Inactive) added a comment:
          +1 at "https://testing.hpdd.intel.com/test_sets/f720aa60-c121-11e4-b948-5254006e85c2". Maybe there is something new to learn from the added debug logs?

          adilger Andreas Dilger added a comment:
          Landed to master for 2.7.0

          gerrit Gerrit Updater added a comment:
          Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12573/
          Subject: LU-5848 debug: more debug log for dt_sync
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: ec37d78caf6cbd8118cfe1ff012c828950356e7a

          yong.fan nasf (Inactive) added a comment:
          Here is the debug patch: http://review.whamcloud.com/#/c/12573/

          yong.fan nasf (Inactive) added a comment:
          According to the test logs, the LFSCK status on MDT4 is 'scanning-phase2', not the expected 'completed'. That means LFSCK on MDT4 should have moved on to the second-stage scan. Normally, before the second-stage scan starts, the LFSCK assistant thread logs "LFSCK assistant phase2 scan start", but I cannot find that message in the MDT4 debug_log, which is abnormal. The most likely reason is that the LFSCK assistant thread was blocked inside dt_sync() before generating that log message. Currently, we do not know what happened during dt_sync(). I will make a debug patch to verify that.
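
The reasoning above combines two observations: the reported LFSCK status and the absence of the assistant's phase-2 start message in the debug log. A minimal sketch of that check over saved text files follows. The function names and file arguments are hypothetical, and the `status: <value>` line format is an assumption about how the status output was saved; only the status values and the log message text come from the comment.

```shell
#!/bin/bash
# Hypothetical helpers operating on saved copies of the LFSCK status
# output and the MDT debug log.

# Print the value of the first "status:" line in the given file.
# Assumes lines of the form "status: <value>".
lfsck_status() {
	awk '/^status:/ { print $2; exit }' "$1"
}

# Succeed if the debug log contains the assistant's phase-2 start message.
phase2_started() {
	grep -q "LFSCK assistant phase2 scan start" "$1"
}

# Classify the situation described in the comment.
diagnose_lfsck() {
	local status_file=$1
	local debug_log=$2
	local status

	status=$(lfsck_status "$status_file")
	if [ "$status" = "completed" ]; then
		echo "ok"
	elif [ "$status" = "scanning-phase2" ] && ! phase2_started "$debug_log"; then
		# Status says phase 2, but the assistant never logged its start.
		echo "assistant-stalled-before-phase2-log"
	else
		echo "status=$status"
	fi
}
```

Separating the status lookup from the log search keeps each observation testable on its own, which may help when only one of the two artifacts is available from a failed run.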

          People

            yong.fan nasf (Inactive)
            maloo Maloo