[LU-5721] sanity-lfsck test_18f failed for unexpected layout LFSCK status Created: 09/Oct/14  Updated: 03/Nov/14  Resolved: 03/Nov/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.7.0

Type: Bug Priority: Critical
Reporter: nasf (Inactive) Assignee: nasf (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 16053

 Description   

Recently, Maloo tests hit some sanity-lfsck test_18f failures because of unexpected layout LFSCK status:

sanity-lfsck test 18f: Skip the failed OST(s) when handle orphan OST-objects == 12:43:28 (1407761008)

https://testing.hpdd.intel.com/test_sets/29aaa042-4a50-11e4-880b-5254006e85c2
10-02 11390/29 5d5f83237aba013d9bfaee0bc101fa403008e528 LU-5516 lfsck: repair the lost name entry
11384/29 e9299a16ed3b8402ac951994a04aed0876c3a365 LU-5515 lfsck: repair bad file type in name entry
11383/28 201b1e2570a43e0929454e46c2a8ef90df67d304 LU-5513 lfsck: repair multiple referenced name entry
Error: '(8) MDS1 is not the expected 'completed''

https://testing.hpdd.intel.com/test_sets/a4396266-3b47-11e4-a78f-5254006e85c2
09-13 11384/20 f3b4b0d9c8e12d79ac7c76ea25685b8b78afb410 LU-5515 lfsck: repair bad file type in name entry
11383/20 LU-5513 lfsck: repair multiple referenced name entry
Error: '(8) MDS1 is not the expected 'completed''

https://testing.hpdd.intel.com/test_sets/2ed4fa44-3a52-11e4-b82a-5254006e85c2
09-11 11485/10 1b05a4f6d0e3b09bf50cf483a2b587f7f67242ac LU-5509 osd: get PFID from linkEA for remote dir on ldiskfs
11382/10 091db2912495f692e38e2c20f40452e4925702af LU-5508 osp: RPC adjustment for remote transaction
10996/18 ab05d3ba9c21125fc8194efb06545c358d962f3f LU-5506 lfsck: skip orphan OST-object handling for failed OSTs (Merged commit).
Error: '(4) MDS4 is not the expected 'completed''

https://testing.hpdd.intel.com/test_sets/aa055074-3543-11e4-9daf-5254006e85c2
09-05 11384/14 54ac922c31c41a0752367587e2692ef3747012bf LU-5515 lfsck: repair bad file type in name entry
11383/14 9cd74486d1e98ff6492c11e5f97fc873087ed7d4 LU-5513 lfsck: repair multiple referenced name entry
Error: '(2) MDS4 is not the expected 'partial''

https://testing.hpdd.intel.com/test_sets/2c7eb87c-2196-11e4-8700-5254006e85c2
08-11 11391/1 LU-5516 lfsck: repair orphan parent MDT-object
11390/1 5901fac8b083883dba6e396f73097b11a638659b LU-4788 lfsck: repair the lost name entry
11384/2 b7f3359b8cd82d208ff427febce92b1202e50a72 LU-5515 lfsck: repair bad file type in name entry
11383/3 290127554b47ed3871735d217e5c4c5b4d5fe365 LU-5513 lfsck: repair multiple referenced name entry
Error: '(2) MDS1 is not the expected 'partial''

https://testing.hpdd.intel.com/test_sets/876c3a80-2186-11e4-b153-5254006e85c2
08-11 11390/1 5901fac8b083883dba6e396f73097b11a638659b LU-4788 lfsck: repair the lost name entry
11384/2 b7f3359b8cd82d208ff427febce92b1202e50a72 LU-5515 lfsck: repair bad file type in name entry
11383/3 290127554b47ed3871735d217e5c4c5b4d5fe365 LU-5513 lfsck: repair multiple referenced name entry
Error: '(2) MDS1 is not the expected 'partial''

https://testing.hpdd.intel.com/test_sets/f38fb5c8-216f-11e4-bd4e-5254006e85c2
08-11 11383/3 290127554b47ed3871735d217e5c4c5b4d5fe365 LU-5513 lfsck: repair multiple referenced name entry
Error: '(2) MDS1 is not the expected 'partial''



 Comments   
Comment by John Hammond [ 09/Oct/14 ]

Here are the other 9/16 failures in maloo when I checked yesterday:

---- 11560 ----

https://testing.hpdd.intel.com/test_sets/da2714ee-49f8-11e4-92b1-5254006e85c2
        10-02 11560 a168cdd6e093d1cb6fc551202d6a53aab5f87fc8/5 LU-5451 lod: improve weird FID handling
              7e000f8fcad8ed9023f502ca63c47f3bdcac8a6b         LU-5511 lfsck: repair unmatched parent-child pairs
        Error: '(6.1) Expect 1 fixed on mds1, but got: 0'

https://testing.hpdd.intel.com/test_sets/b6842b5e-4a06-11e4-adcb-5254006e85c2
        10-01 11560 a168cdd6e093d1cb6fc551202d6a53aab5f87fc8 LU-5451 lod: improve weird FID handling
        Error: '(6.1) Expect 1 fixed on mds1, but got: 0'

https://testing.hpdd.intel.com/test_sets/47630d76-4560-11e4-8e96-5254006e85c2
        09-26 11560 fa13c28d81ca917d1cfdfdefedb3a06845bb2386 LU-5451 lod: improve weird FID handling
        Error: '(6.1) Expect 1 fixed on mds1, but got: 0'


---- 10996 ----

https://testing.hpdd.intel.com/test_sets/883bc6b6-06be-11e4-8941-5254006e85c2
        07-08 10996/4 2c4ffba41367e2cb850b2f7af1285641112c87fc LU-5506 lfsck: skip orphan OST-object handling for failed OSTs (Merged)
        Error: '(6) Expect 2 fixed on mds{2}, but got: 3'

https://testing.hpdd.intel.com/test_sets/aef7bb80-0645-11e4-8bf0-5254006e85c2
        07-08 10996/3 e5d34ebcb64476bc9228551d68f08f0de4ae2944 LU-5506 lfsck: skip orphan OST-object handling for failed OSTs
        Error: '(3) OST{1} Expect 'partial', but got 'scanning-phase2''

https://testing.hpdd.intel.com/test_sets/6c7a8fea-0617-11e4-be6f-5254006e85c2
        07-07 10996/2 e5d34ebcb64476bc9228551d68f08f0de4ae2944 LU-5506 lfsck: skip orphan OST-object handling for failed OSTs
        Error: '(6) Expect 2 fixed on mds{2}, but got: 3'


---- full ----

https://testing.hpdd.intel.com/test_sets/64eb3bde-4dfd-11e4-8fdd-5254006e85c2
        10-06 full
        Error: '(2) MDS1 is not the expected 'partial''
        client 6039fc8fd47ffd73a31b073687f32cac0a35a8aa v2_6_53_0-12-g6039fc8
        server 73ea776053d99f74a9f5679fe55ec5d9461b8a89 v2_6_0_0

https://testing.hpdd.intel.com/test_sets/44f280d4-4d45-11e4-857c-5254006e85c2
        10-05 full
        Error: '(2) MDS1 is not the expected 'partial''
        client 0b4b33592c09d37c0132d39c7823db78a3efcb3c v2_6_53_0-8-g0b4b335
        server 73ea776053d99f74a9f5679fe55ec5d9461b8a89 v2_6_0_0


---- 9383 ----

https://testing.hpdd.intel.com/test_sets/4b097264-48ce-11e4-b83b-5254006e85c2
        09-30 9383 LU-4665 utils: lfs setstripe to specify OSTs
        Error: '(3) Fail to repair unmatched pair: 0'
        DUE TO REGRESSION IN PATCH.
Comment by Andreas Dilger [ 10/Oct/14 ]

What is the impact of this bug? It isn't at all clear from the description whether this is just causing test failures, or if it is actually a bug that would cause problems for users?

Comment by nasf (Inactive) [ 12/Oct/14 ]

There are two sub-failures under this ticket:

1) During the first part of the test, we inject an error stub to simulate the case where some OST fails to respond to an LFSCK request. The LFSCK on the MDT should then skip orphan OST-object handling for that OST and mark the LFSCK status as "partial", but for some unknown reason the injection did not result in the "partial" status. The potential impact is that the LFSCK may handle some orphan OST-objects unexpectedly. However, because an OST-object usually contains its parent MDT-object's FID information, such unexpected LFSCK behaviour is harmless in most cases.

2) During the second part of the test, we clear the previously injected error stub, so the LFSCK should run smoothly and handle all objects. The final LFSCK status should be "completed", but for some unknown reason the LFSCK failed to verify some OST-object(s), so the final LFSCK status was "partial". This failure may leave some orphan OST-objects unhandled.
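
The two expectations above can be sketched as a toy model. This is not Lustre code: the function name expected_layout_lfsck_status, the OST labels, and the set-based bookkeeping are hypothetical, and only the "partial" and "completed" status strings come from the test output.

```python
def expected_layout_lfsck_status(osts, failed_osts):
    """Toy model of what test_18f expects from the layout LFSCK on an MDT.

    Hypothetical helper, not part of Lustre: `osts` is every OST the MDT
    talks to, `failed_osts` is the subset that does not answer LFSCK requests.
    """
    failed = set(failed_osts) & set(osts)
    # Orphan OST-object handling is skipped for every OST that failed.
    handled = [ost for ost in osts if ost not in failed]
    skipped = [ost for ost in osts if ost in failed]
    # Any skipped OST means the scan cannot claim full coverage.
    status = "partial" if skipped else "completed"
    return status, handled, skipped


osts = ["OST0000", "OST0001"]  # made-up OST labels

# Part 1: the error stub makes one OST fail, so "partial" is expected.
print(expected_layout_lfsck_status(osts, {"OST0001"}))

# Part 2: the stub is cleared, so "completed" is expected.
print(expected_layout_lfsck_status(osts, set()))
```

In these terms, sub-failure 1 is the model returning "partial" while the real LFSCK did not, and sub-failure 2 is the model returning "completed" while the real LFSCK reported "partial".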

Comment by nasf (Inactive) [ 22/Oct/14 ]

1) The failure "Error: '(2) MDS1 is not the expected 'partial''" in the older-version tests:
https://testing.hpdd.intel.com/test_sets/f38fb5c8-216f-11e4-bd4e-5254006e85c2
https://testing.hpdd.intel.com/test_sets/876c3a80-2186-11e4-b153-5254006e85c2
https://testing.hpdd.intel.com/test_sets/2c7eb87c-2196-11e4-8700-5254006e85c2

They failed because the injected failure stub was not triggered. This issue was resolved in subsequent versions, and the fix has landed on master.

2) The failure "Error: '(2) MDS4 is not the expected 'partial''" in the test:
https://testing.hpdd.intel.com/test_sets/aa055074-3543-11e4-9daf-5254006e85c2

It is another instance of the LU-5301 failure.

3) The failure "Error: '(4) MDS4 is not the expected 'completed''" in the test:
https://testing.hpdd.intel.com/test_sets/2ed4fa44-3a52-11e4-b82a-5254006e85c2

It is another instance of the LU-5301 failure.

4) The failure "Error: '(8) MDS1 is not the expected 'completed''" in the tests:
https://testing.hpdd.intel.com/test_sets/a4396266-3b47-11e4-a78f-5254006e85c2
https://testing.hpdd.intel.com/test_sets/29aaa042-4a50-11e4-880b-5254006e85c2

When the LFSCK on the OST queried the LFSCK status from the MDT, it got an abnormal status. The LFSCK on the OST then assumed that the LFSCK on the MDT had hit some unexpected trouble and marked it as having exited early, so the subsequent orphan MDT-object handling was skipped. This issue will be fixed by the patch: http://review.whamcloud.com/#/c/11516/
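
The failure mode in item 4 can be illustrated with a small sketch. Everything here is an assumption made for illustration, not the Lustre implementation: the function name ost_react_to_mdt_replies, the set of replies treated as healthy, the use of None for an abnormal reply, and the rule that a single MDT marked as exited is enough to skip orphan handling. It does not describe what patch 11516 changes.

```python
# Replies the toy OST accepts as "the MDT-side LFSCK is still running".
# Assumed for illustration; the real set of accepted replies is not given
# in this ticket.
MDT_ALIVE_STATUSES = {"scanning-phase2"}


def ost_react_to_mdt_replies(mdt_replies):
    """Model how the OST-side LFSCK reacts to status query replies.

    `mdt_replies` maps an MDT name to the status string the OST got back
    (None models an abnormal reply). Returns the MDTs that the OST marks
    as exited early and whether orphan handling still runs.
    """
    exited = sorted(
        mdt for mdt, reply in mdt_replies.items()
        if reply not in MDT_ALIVE_STATUSES
    )
    # Assumed rule: once any MDT is marked as exited, orphan handling
    # is skipped.
    orphan_handling_runs = not exited
    return exited, orphan_handling_runs


# A healthy query keeps orphan handling enabled.
print(ost_react_to_mdt_replies({"MDT0000": "scanning-phase2"}))

# An abnormal reply makes the OST give up on that MDT, which is the path
# that left the final status short of "completed" in this sketch.
print(ost_react_to_mdt_replies({"MDT0000": None}))
```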

Comment by Bob Glossman (Inactive) [ 30/Oct/14 ]

Another instance seen on master:
https://testing.hpdd.intel.com/test_sets/464f8b14-5fcb-11e4-9a8e-5254006e85c2

Comment by nasf (Inactive) [ 03/Nov/14 ]

The patch http://review.whamcloud.com/#/c/11516/ landed on master on Oct. 30; other related fixes will land on master via LU-5301/LU-5731.

Generated at Sat Feb 10 01:53:56 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.