[LU-2012] replay-dual test_14b: after 846984 > before 846980 Created: 07/Aug/12  Updated: 16/May/22

Status: Reopened
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.3.0
Fix Version/s: Lustre 2.4.0

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Mikhail Pershin
Resolution: Unresolved Votes: 0
Labels: always_except, ldiskfs, zfs

Issue Links:
Duplicate
duplicates LU-10052 replay-single test_20b fails with 'af... Resolved
is duplicated by LU-10608 replay-dual test_14b: FAIL: after 185... Resolved
is duplicated by LU-3214 Interop 2.3.0<->2.4 failure on test s... Closed
Related
Severity: 3
Rank (Obsolete): 3080

 Description   

This issue was created by maloo for Li Wei <liwei@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/3e022626-dd35-11e1-a041-52540035b04c.

The sub-test test_14b failed with the following error:

after 846984 > before 846980

This was with LDiskFS, so not ORI-396.

Info required for matching: replay-dual 14b



 Comments   
Comment by Mikhail Pershin [ 09/Aug/12 ]

This test should be disabled for now, since we have disabled gap handling.
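
(Context: Lustre test suites normally skip a sub-test by listing its number in the test script's ALWAYS_EXCEPT variable, which the shared test framework checks before running each test. The line below is an illustrative sketch of such a change to replay-dual.sh, not the actual patch; the exact variable layout is an assumption.)

  # Illustrative only: adding 14b to the exception list makes the framework skip it.
  ALWAYS_EXCEPT="14b $REPLAY_DUAL_EXCEPT"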

Comment by Li Wei (Inactive) [ 09/Sep/12 ]

https://maloo.whamcloud.com/test_sets/044f5114-fa05-11e1-8ea7-52540035b04c

Comment by Li Wei (Inactive) [ 23/Sep/12 ]

https://maloo.whamcloud.com/test_sets/189f38d8-049a-11e2-bfd4-52540035b04c

This was master with OFD and LDiskFS OSTs.

Comment by Li Wei (Inactive) [ 24/Sep/12 ]

https://maloo.whamcloud.com/test_sets/0e1568ba-0611-11e2-9b17-52540035b04c

This was master with OFD and LDiskFS OSTs.

Comment by Li Wei (Inactive) [ 08/Oct/12 ]

https://maloo.whamcloud.com/test_sets/0b57c26e-11ae-11e2-9408-52540035b04c

Comment by Li Wei (Inactive) [ 09/Oct/12 ]

https://maloo.whamcloud.com/test_sets/e7e7add4-1203-11e2-a663-52540035b04c

Comment by Andreas Dilger [ 09/Oct/12 ]

I hit this with ldiskfs in local testing as well:

replay-dual test_14b: @@@@@@ FAIL: after 76856 > before 76852

but it isn't easy to determine whether the small number of extra blocks is a clear indication of test failure or just some other background usage (e.g. llog records). The orphan files should be created with a noticeable amount of data so that, if they fail to be deleted, the failure is obvious.

Comment by Andreas Dilger [ 09/Oct/12 ]

Submitted patch to disable test_14b until orphan handling is fixed: http://review.whamcloud.com/4237

The test itself is improved over the previous version: it writes a large orphan file and allows a small margin of error in the df output to account for other allocated blocks (logs, OI, etc.).
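
(A rough sketch of the kind of check described above; the variable names, OST index, and margin below are assumptions for illustration, not the actual replay-dual.sh code:)

  # Sketch only: compare OST space usage before and after orphan cleanup,
  # allowing a small margin for unrelated allocations (llogs, OI files, etc.).
  before=$($LFS df $MOUNT | awk '/OST0000/ { print $3; exit }')  # KB used before
  # ... write a multi-MB file, turn it into an orphan across the failover,
  # then let orphan recovery clean it up ...
  after=$($LFS df $MOUNT | awk '/OST0000/ { print $3; exit }')   # KB used after
  margin=100  # hypothetical allowance, in KB
  (( after <= before + margin )) || error "after $after > before $before"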

Comment by Jian Yu [ 26/Feb/13 ]

Lustre b2_1 client build: http://build.whamcloud.com/job/lustre-b2_1/176
Lustre master server build: http://build.whamcloud.com/job/lustre-master/1269
Distro/Arch: RHEL6.3/x86_64

The same failure occurred:
https://maloo.whamcloud.com/test_sets/a19d48ee-7d78-11e2-85d0-52540035b04c

Comment by Jian Yu [ 26/Feb/13 ]

The same failure also occurred on Lustre b2_1 branch:
https://maloo.whamcloud.com/test_sets/198fd69c-7e6d-11e2-8b88-52540035b04c
https://maloo.whamcloud.com/test_sets/1a65e4fc-7d65-11e2-ac80-52540035b04c
https://maloo.whamcloud.com/test_sets/dca53842-7c48-11e2-897d-52540035b04c

Comment by Jian Yu [ 01/Mar/13 ]

Lustre Branch: b2_1
Lustre Build: http://build.whamcloud.com/job/lustre-b2_1/181
The issue occurred again: https://maloo.whamcloud.com/test_sets/511b9cd0-8251-11e2-8172-52540035b04c

Hi Andreas,

Does the fix in http://review.whamcloud.com/4237 need to be ported to Lustre b2_1 branch?

Comment by Jian Yu [ 03/Mar/13 ]

The failure occurs consistently on the Lustre b2_1 branch: https://maloo.whamcloud.com/test_sets/38643f8c-826d-11e2-ba47-52540035b04c

Patch for Lustre b2_1 branch: http://review.whamcloud.com/5571

Comment by Andreas Dilger [ 05/Mar/13 ]

Assigning to Yu Jian for a follow-up patch to remove test_14b from ALWAYS_EXCEPT, to see whether the changes to the test itself (a larger test file and a small allocation allowance) are enough to let it pass regularly.

Comment by Jian Yu [ 08/Mar/13 ]

Lustre Branch: master

After removing replay-dual 14b from the ALWAYS_EXCEPT list, the test still failed:
https://maloo.whamcloud.com/test_sets/aeeec35c-87d6-11e2-961a-52540035b04c

Comment by Andreas Dilger [ 08/Mar/13 ]

This test failure is clearly caused by a defect in the code, since the size difference is 5120 kB = 5 MB, which is the size of the file that should have been deleted by orphan recovery. This is definitely not a case of other blocks being allocated during testing.

I think Niu was working on a patch related to orphan recovery for 2.1, and Mike was also working on this. Perhaps they already know what the problem is here.

Comment by Mikhail Pershin [ 11/Mar/13 ]

In fact, we have just disabled gap handling, which is why orphans may now stay on the OST. The last time we discussed this, the agreed solution was 'run lfsck'.
Nevertheless, I had some ideas for how to handle gaps and destroy orphans, so if lfsck is no longer the plan, we can reconsider fixing this another way.

Comment by Andreas Dilger [ 11/Mar/13 ]

The LFSCK MDS-OSS checking will not be available until at least 2.6, so if we can get a solution in the meantime that would be good. Note that there was a bug that Niu was working on where non-orphan files were being deleted, so we don't want to reintroduce this.

Comment by Mikhail Pershin [ 11/Mar/13 ]

Yes, that is exactly why we disabled it. I will refresh my memory about alternative gap handling, then.

Comment by Jian Yu [ 04/Sep/13 ]

Lustre client: http://build.whamcloud.com/job/lustre-b2_3/41/ (2.3.0)
Lustre server: http://build.whamcloud.com/job/lustre-b2_4/44/ (2.4.1 RC1)

replay-dual test 14b hit the same failure:
https://maloo.whamcloud.com/test_sets/43411d98-14fe-11e3-9828-52540035b04c

Comment by Jian Yu [ 19/Dec/13 ]

Lustre client: http://build.whamcloud.com/job/lustre-b2_3/41/ (2.3.0)
Lustre server: http://build.whamcloud.com/job/lustre-b2_4/69/ (2.4.2 RC1)

replay-dual test 14b hit the same failure:
https://maloo.whamcloud.com/test_sets/9369eef2-6860-11e3-a16f-52540035b04c

Comment by Andreas Dilger [ 16/May/22 ]

This test is in ALWAYS_EXCEPT, which is why it has not failed recently.
