sanity-lfsck test_6b: namespace lfsck completed unexpectedly

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.7.0
    • Affects Version/s: Lustre 2.7.0
    • Labels: None
    • Severity: 3
    • Rank (Obsolete): 16356

    Description

      This issue was created by maloo for nasf <fan.yong@intel.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/29838f2c-60fb-11e4-a66b-5254006e85c2.

      The sub-test test_6b failed with the following error:

      (6) Expect 'scanning-phase1', but got 'completed'
      

      Please provide additional information about the failure here.

      Info required for matching: sanity-lfsck 6b

      Attachments

        Activity

          [LU-5833] sanity-lfsck test_6b: namespace lfsck completed unexpectedly

          gerrit Gerrit Updater added a comment -

          Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12533/
          Subject: LU-5833 lfsck: handle lfsck_open_dir() return-value properly
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: f935a36c035a20433669997f7d70b35073dff5f2

          yong.fan nasf (Inactive) added a comment -

          Inside lfsck_prep(), the value returned by lfsck_open_dir() was not handled properly before being passed back to the caller. For example, if the LFSCK has already reached the end of the current directory when it calls lfsck_open_dir(), then lfsck_open_dir() returns a positive number. If lfsck_prep() passes that value through to its caller unchanged, the whole first-stage LFSCK scan is wrongly regarded as done.

          Here is the patch to fix that:
          http://review.whamcloud.com/#/c/12533
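
          The return-value problem described in the comment above can be illustrated with a minimal, self-contained sketch. It is not the Lustre code and not the merged patch: open_dir_stub(), prep_buggy(), prep_fixed() and status_after_prep() are hypothetical stand-ins, and the assumption that the caller maps a positive value to 'completed' is taken only from the comment's description and the test error message.

```c
/*
 * Illustrative sketch only. All names here are hypothetical stand-ins;
 * the real lfsck_prep()/lfsck_open_dir() live in the Lustre tree and
 * may handle more cases than shown.
 */

/* Stand-in for lfsck_open_dir(): negative = error, 0 = directory opened,
 * positive = the scan was already at the end of the current directory. */
int open_dir_stub(int at_end_of_dir)
{
    return at_end_of_dir ? 1 : 0;
}

/* Buggy pattern: the positive "end of directory" value leaks to the caller. */
int prep_buggy(int at_end_of_dir)
{
    return open_dir_stub(at_end_of_dir);
}

/* One plausible handling (an assumption, not the actual patch):
 * propagate only real errors, treat a positive value as success. */
int prep_fixed(int at_end_of_dir)
{
    int rc = open_dir_stub(at_end_of_dir);

    return rc < 0 ? rc : 0;
}

/* Hypothetical caller: a positive value from prep is read as
 * "first-stage scanning already finished". */
const char *status_after_prep(int rc)
{
    if (rc < 0)
        return "failed";
    if (rc > 0)
        return "completed";
    return "scanning-phase1";
}
```

          With these stand-ins, status_after_prep(prep_buggy(1)) yields "completed", which mirrors the test_6b failure, while status_after_prep(prep_fixed(1)) yields "scanning-phase1".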
          yong.fan nasf (Inactive) added a comment - Another failure instance with the patch 11848: https://testing.hpdd.intel.com/test_sets/5a28e144-626e-11e4-b9a7-5254006e85c2

          yong.fan nasf (Inactive) added a comment - Here is another failure instance without the patch 11848: https://testing.hpdd.intel.com/test_sets/4ce9f344-5ca4-11e4-b9ce-5254006e85c2

          yong.fan nasf (Inactive) added a comment -

          According to the MDS log, when the namespace LFSCK was started for the last time (resuming from the former run), the low-layer iteration apparently returned no more objects, so the injected failure stub (OBD_FAIL_LFSCK_DELAY2) was never triggered as expected. With no delay, the LFSCK completed quickly. I have hit this situation before. Although I have not found the root cause, it should not be related to the patch http://review.whamcloud.com/#/c/11848/14, because it has also happened without this patch.

          00000004:00000080:0.0:1414753244.871791:0:9874:0:(mdt_handler.c:5682:mdt_iocontrol()) handling ioctl cmd 0xc00866e6
          00100000:10000000:1.0:1414753245.981101:0:9876:0:(lfsck_engine.c:1620:lfsck_assistant_engine()) lustre-MDT0000-osd: lfsck_namespace LFSCK assistant thread start
          00100000:10000000:1.0:1414753245.981150:0:9875:0:(lfsck_namespace.c:3966:lfsck_namespace_prep()) lustre-MDT0000-osd: namespace LFSCK prep done, start pos [732, [0x200000bd4:0xdf:0x0], 0xa6f862b9510000]: rc = 0
          00100000:10000000:1.0:1414753245.981685:0:9875:0:(lfsck_namespace.c:4181:lfsck_namespace_post()) lustre-MDT0000-osd: namespace LFSCK post done: rc = 0
          00100000:10000000:1.0:1414753245.981695:0:9876:0:(lfsck_engine.c:1691:lfsck_assistant_engine()) lustre-MDT0000-osd: lfsck_namespace LFSCK assistant thread post
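
          To make "completed quickly" concrete, the gap between the "prep done" and "post done" lines in the log above can be computed from their timestamps. The following is a rough sketch, not a Lustre tool: log_time_usec() and elapsed_usec() are hypothetical helpers, and they assume the timestamp is the fourth colon-separated field in the form seconds.microseconds with exactly six fractional digits, as in the lines quoted here.

```c
#include <stdio.h>

/*
 * Hypothetical helper: extract the timestamp (assumed to be the 4th
 * colon-separated field, "sec.usec" with six fractional digits) from a
 * debug log line and return it in microseconds, or -1 on parse failure.
 */
long long log_time_usec(const char *line)
{
    long long sec, usec;

    /* Skip the first three colon-separated fields, then read sec.usec. */
    if (sscanf(line, "%*[^:]:%*[^:]:%*[^:]:%lld.%6lld", &sec, &usec) != 2)
        return -1;

    return sec * 1000000LL + usec;
}

/* Microseconds elapsed between two log lines, or -1 if either fails to parse. */
long long elapsed_usec(const char *first, const char *second)
{
    long long a = log_time_usec(first);
    long long b = log_time_usec(second);

    if (a < 0 || b < 0)
        return -1;

    return b - a;
}
```

          Applied to the lfsck_namespace_prep() line (1414753245.981150) and the lfsck_namespace_post() line (1414753245.981685), this gives 535 microseconds, far too short for the test to observe 'scanning-phase1'.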

          People

            Assignee: yong.fan nasf (Inactive)
            Reporter: maloo Maloo