sanity-lfsck test_18f failed for unexpected layout LFSCK status

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.7.0
    • None
    • None
    • 3
    • 16053

    Description

      Recently, Maloo tests hit some sanity-lfsck test_18f failures because of unexpected layout LFSCK status:

      sanity-lfsck test 18f: Skip the failed OST(s) when handle orphan OST-objects == 12:43:28 (1407761008)

      https://testing.hpdd.intel.com/test_sets/29aaa042-4a50-11e4-880b-5254006e85c2
      10-02 11390/29 5d5f83237aba013d9bfaee0bc101fa403008e528 LU-5516 lfsck: repair the lost name entry
      11384/29 e9299a16ed3b8402ac951994a04aed0876c3a365 LU-5515 lfsck: repair bad file type in name entry
      11383/28 201b1e2570a43e0929454e46c2a8ef90df67d304 LU-5513 lfsck: repair multiple referenced name entry
      Error: '(8) MDS1 is not the expected 'completed''

      https://testing.hpdd.intel.com/test_sets/a4396266-3b47-11e4-a78f-5254006e85c2
      09-13 11384/20 f3b4b0d9c8e12d79ac7c76ea25685b8b78afb410 LU-5515 lfsck: repair bad file type in name entry
      11383/20 LU-5513 lfsck: repair multiple referenced name entry
      Error: '(8) MDS1 is not the expected 'completed''

      https://testing.hpdd.intel.com/test_sets/2ed4fa44-3a52-11e4-b82a-5254006e85c2
      09-11 11485/10 1b05a4f6d0e3b09bf50cf483a2b587f7f67242ac LU-5509 osd: get PFID from linkEA for remote dir on ldiskfs
      11382/10 091db2912495f692e38e2c20f40452e4925702af LU-5508 osp: RPC adjustment for remote transaction
      10996/18 ab05d3ba9c21125fc8194efb06545c358d962f3f LU-5506 lfsck: skip orphan OST-object handling for failed OSTs (Merged commit).
      Error: '(4) MDS4 is not the expected 'completed''

      https://testing.hpdd.intel.com/test_sets/aa055074-3543-11e4-9daf-5254006e85c2
      09-05 11384/14 54ac922c31c41a0752367587e2692ef3747012bf LU-5515 lfsck: repair bad file type in name entry
      11383/14 9cd74486d1e98ff6492c11e5f97fc873087ed7d4 LU-5513 lfsck: repair multiple referenced name entry
      Error: '(2) MDS4 is not the expected 'partial''

      https://testing.hpdd.intel.com/test_sets/2c7eb87c-2196-11e4-8700-5254006e85c2
      08-11 11391/1 LU-5516 lfsck: repair orphan parent MDT-object
      11390/1 5901fac8b083883dba6e396f73097b11a638659b LU-4788 lfsck: repair the lost name entry
      11384/2 b7f3359b8cd82d208ff427febce92b1202e50a72 LU-5515 lfsck: repair bad file type in name entry
      11383/3 290127554b47ed3871735d217e5c4c5b4d5fe365 LU-5513 lfsck: repair multiple referenced name entry
      Error: '(2) MDS1 is not the expected 'partial''

      https://testing.hpdd.intel.com/test_sets/876c3a80-2186-11e4-b153-5254006e85c2
      08-11 11390/1 5901fac8b083883dba6e396f73097b11a638659b LU-4788 lfsck: repair the lost name entry
      11384/2 b7f3359b8cd82d208ff427febce92b1202e50a72 LU-5515 lfsck: repair bad file type in name entry
      11383/3 290127554b47ed3871735d217e5c4c5b4d5fe365 LU-5513 lfsck: repair multiple referenced name entry
      Error: '(2) MDS1 is not the expected 'partial''

      https://testing.hpdd.intel.com/test_sets/f38fb5c8-216f-11e4-bd4e-5254006e85c2
      08-11 11383/3 290127554b47ed3871735d217e5c4c5b4d5fe365 LU-5513 lfsck: repair multiple referenced name entry
      Error: '(2) MDS1 is not the expected 'partial''

      Attachments

        Activity

          [LU-5721] sanity-lfsck test_18f failed for unexpected layout LFSCK status

          yong.fan nasf (Inactive) added a comment - The patch http://review.whamcloud.com/#/c/11516/ landed on master on Oct. 30; other related fixes will land on master via LU-5301 and LU-5731.
          bogl Bob Glossman (Inactive) added a comment - another seen in master: https://testing.hpdd.intel.com/test_sets/464f8b14-5fcb-11e4-9a8e-5254006e85c2
          yong.fan nasf (Inactive) added a comment - - edited

          1) The failures with "Error: '(2) MDS1 is not the expected 'partial''" in tests of older patch versions:
          https://testing.hpdd.intel.com/test_sets/f38fb5c8-216f-11e4-bd4e-5254006e85c2
          https://testing.hpdd.intel.com/test_sets/876c3a80-2186-11e4-b153-5254006e85c2
          https://testing.hpdd.intel.com/test_sets/2c7eb87c-2196-11e4-8700-5254006e85c2

          These failed because the injected failure stub was not triggered. That issue was resolved in later versions of the patches, which have landed on master.

          2) The failure with "Error: '(2) MDS4 is not the expected 'partial''" in the test:
          https://testing.hpdd.intel.com/test_sets/aa055074-3543-11e4-9daf-5254006e85c2

          This is another instance of LU-5301.

          3) The failure with "Error: '(4) MDS4 is not the expected 'completed''" in the test:
          https://testing.hpdd.intel.com/test_sets/2ed4fa44-3a52-11e4-b82a-5254006e85c2

          This is another instance of LU-5301.

          4) The failures with "Error: '(8) MDS1 is not the expected 'completed''" in the tests:
          https://testing.hpdd.intel.com/test_sets/a4396266-3b47-11e4-a78f-5254006e85c2
          https://testing.hpdd.intel.com/test_sets/29aaa042-4a50-11e4-880b-5254006e85c2

          The LFSCK on the OST got an abnormal status when it queried the LFSCK status from the MDT. The LFSCK on the OST then assumed that the LFSCK on the MDT had hit some unexpected trouble and marked it as having exited early, so the subsequent orphan MDT-object handling was skipped. This issue will be fixed by the patch: http://review.whamcloud.com/#/c/11516/


          yong.fan nasf (Inactive) added a comment - There are two sub-failures under this ticket:

          1) During the first part of the test, we inject an error stub to simulate the case where some OST fails to respond to an LFSCK request. The LFSCK on the MDT should then skip orphan OST-object handling for that OST and mark the LFSCK status as "partial". For some unknown reason, the injection did not produce the "partial" status. The potential impact is that the LFSCK may handle some orphan OST-objects unexpectedly. Because an OST-object usually contains its parent MDT-object's FID information, such unexpected LFSCK behaviour is harmless in most cases.

          2) During the second part of the test, we clear the previously injected error stub, so the LFSCK should run smoothly and handle all objects. The final LFSCK status should be "completed", but for some unknown reason the LFSCK failed to verify some OST-object(s), and the final LFSCK status was "partial". This failure may leave some orphan OST-objects unhandled.
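
          The two expectations described above can be modeled with a small toy sketch. This is not Lustre or sanity-lfsck code: the names OstReply and expected_layout_lfsck_status, and the OST labels, are hypothetical and exist only to illustrate the rule that any OST failing to respond should yield "partial", and otherwise "completed".

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class OstReply:
    """Hypothetical record of whether one OST answered the LFSCK request."""
    name: str
    responded: bool  # False models the injected "OST failed to respond" stub


def expected_layout_lfsck_status(replies: List[OstReply]) -> Tuple[str, List[str]]:
    """Return the status the test expects on the MDT and the OSTs to skip.

    Assumption (from the comment above): orphan OST-object handling is
    skipped for every OST that did not respond, and skipping any OST
    makes the final status "partial" instead of "completed".
    """
    skipped = [r.name for r in replies if not r.responded]
    status = "partial" if skipped else "completed"
    return status, skipped


# First part of the test: the error stub is injected for one OST.
phase1 = expected_layout_lfsck_status(
    [OstReply("OST0000", True), OstReply("OST0001", False)]
)

# Second part of the test: the stub is cleared, so every OST responds.
phase2 = expected_layout_lfsck_status(
    [OstReply("OST0000", True), OstReply("OST0001", True)]
)

print(phase1)
print(phase2)
```

          In these terms, sub-failure 1) corresponds to observing "completed" where the model gives "partial", and sub-failure 2) to observing "partial" where the model gives "completed".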

          adilger Andreas Dilger added a comment - What is the impact of this bug? It isn't at all clear from the description whether this is just causing test failures, or whether it is actually a bug that would cause problems for users.
          jhammond John Hammond added a comment -

          Here are the other 9/16 failures in maloo when I checked yesterday:

          ---- 11560 ----
          
          https://testing.hpdd.intel.com/test_sets/da2714ee-49f8-11e4-92b1-5254006e85c2
                  10-02 11560 a168cdd6e093d1cb6fc551202d6a53aab5f87fc8/5 LU-5451 lod: improve weird FID handling
                        7e000f8fcad8ed9023f502ca63c47f3bdcac8a6b         LU-5511 lfsck: repair unmatched parent-child pairs
                  Error: '(6.1) Expect 1 fixed on mds1, but got: 0'
          
          https://testing.hpdd.intel.com/test_sets/b6842b5e-4a06-11e4-adcb-5254006e85c2
                  10-01 11560 a168cdd6e093d1cb6fc551202d6a53aab5f87fc8 LU-5451 lod: improve weird FID handling
                  Error: '(6.1) Expect 1 fixed on mds1, but got: 0'
          
          https://testing.hpdd.intel.com/test_sets/47630d76-4560-11e4-8e96-5254006e85c2
                  09-26 11560 fa13c28d81ca917d1cfdfdefedb3a06845bb2386 LU-5451 lod: improve weird FID handling
                  Error: '(6.1) Expect 1 fixed on mds1, but got: 0'
          
          
          ---- 10996 ----
          
          https://testing.hpdd.intel.com/test_sets/883bc6b6-06be-11e4-8941-5254006e85c2
                  07-08 10996/4 2c4ffba41367e2cb850b2f7af1285641112c87fc LU-5506 lfsck: skip orphan OST-object handling for failed OSTs (Merged)
                  Error: '(6) Expect 2 fixed on mds{2}, but got: 3'
          
          https://testing.hpdd.intel.com/test_sets/aef7bb80-0645-11e4-8bf0-5254006e85c2
                  07-08 10996/3 e5d34ebcb64476bc9228551d68f08f0de4ae2944 LU-5506 lfsck: skip orphan OST-object handling for failed OSTs
                  Error: '(3) OST{1} Expect 'partial', but got 'scanning-phase2''
          
          https://testing.hpdd.intel.com/test_sets/6c7a8fea-0617-11e4-be6f-5254006e85c2
                  07-07 10996/2 e5d34ebcb64476bc9228551d68f08f0de4ae2944 LU-5506 lfsck: skip orphan OST-object handling for failed OSTs
                  Error: '(6) Expect 2 fixed on mds{2}, but got: 3'
          
          
          ---- full ----
          
          https://testing.hpdd.intel.com/test_sets/64eb3bde-4dfd-11e4-8fdd-5254006e85c2
                  10-06 full
                  Error: '(2) MDS1 is not the expected 'partial''
                  client 6039fc8fd47ffd73a31b073687f32cac0a35a8aa v2_6_53_0-12-g6039fc8
                  server 73ea776053d99f74a9f5679fe55ec5d9461b8a89 v2_6_0_0
          
          https://testing.hpdd.intel.com/test_sets/44f280d4-4d45-11e4-857c-5254006e85c2
                  10-05 full
                  Error: '(2) MDS1 is not the expected 'partial''
                  client 0b4b33592c09d37c0132d39c7823db78a3efcb3c v2_6_53_0-8-g0b4b335
                  server 73ea776053d99f74a9f5679fe55ec5d9461b8a89 v2_6_0_0
          
          
          ---- 9383 ----
          
          https://testing.hpdd.intel.com/test_sets/4b097264-48ce-11e4-b83b-5254006e85c2
                  09-30 9383 LU-4665 utils: lfs setstripe to specify OSTs
                  Error: '(3) Fail to repair unmatched pair: 0'
                  DUE TO REGRESSION IN PATCH.
          

          People

            yong.fan nasf (Inactive)
            yong.fan nasf (Inactive)