[LU-2012] replay-dual test_14b: after 846984 > before 846980 Created: 07/Aug/12 Updated: 16/May/22 |
|
| Status: | Reopened |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.3.0 |
| Fix Version/s: | Lustre 2.4.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | Mikhail Pershin |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | always_except, ldiskfs, zfs |
| Issue Links: |
| Severity: | 3 |
| Rank (Obsolete): | 3080 |
| Description |
|
This issue was created by maloo for Li Wei <liwei@whamcloud.com>. It relates to the following test suite run: https://maloo.whamcloud.com/test_sets/3e022626-dd35-11e1-a041-52540035b04c. The sub-test test_14b failed with the following error (as recorded in the issue title): after 846984 > before 846980
This was with LDiskFS, so not ORI-396. Info required for matching: replay-dual 14b |
| Comments |
| Comment by Mikhail Pershin [ 09/Aug/12 ] |
|
This test should be disabled, since gap handling is disabled for now. |
| Comment by Li Wei (Inactive) [ 09/Sep/12 ] |
|
https://maloo.whamcloud.com/test_sets/044f5114-fa05-11e1-8ea7-52540035b04c |
| Comment by Li Wei (Inactive) [ 23/Sep/12 ] |
|
https://maloo.whamcloud.com/test_sets/189f38d8-049a-11e2-bfd4-52540035b04c This was master with OFD and LDiskFS OSTs. |
| Comment by Li Wei (Inactive) [ 24/Sep/12 ] |
|
https://maloo.whamcloud.com/test_sets/0e1568ba-0611-11e2-9b17-52540035b04c This was master with OFD and LDiskFS OSTs. |
| Comment by Li Wei (Inactive) [ 08/Oct/12 ] |
|
https://maloo.whamcloud.com/test_sets/0b57c26e-11ae-11e2-9408-52540035b04c |
| Comment by Li Wei (Inactive) [ 09/Oct/12 ] |
|
https://maloo.whamcloud.com/test_sets/e7e7add4-1203-11e2-a663-52540035b04c |
| Comment by Andreas Dilger [ 09/Oct/12 ] |
|
I hit this with ldiskfs in local testing as well:
replay-dual test_14b: @@@@@@ FAIL: after 76856 > before 76852
However, it isn't easy to determine whether the small number of extra blocks clearly indicates a test failure or reflects some other background usage (e.g. llog records). The orphan files should be created with a noticeable amount of data, so that it is obvious if they fail to be deleted. |
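A minimal sketch of how the extra usage can be read off a failure line like the one quoted above. The helper name `usage_delta` and the regular expression are illustrative assumptions, not part of the Lustre test framework; the units are whatever the test reports (the comment above refers to blocks).

```python
import re

# Hypothetical pattern for the "after N > before M" part of a failure line.
FAIL_RE = re.compile(r"after (\d+) > before (\d+)")


def usage_delta(line):
    """Return after - before from a failure line, or None if it does not match."""
    match = FAIL_RE.search(line)
    if match is None:
        return None
    after, before = int(match.group(1)), int(match.group(2))
    return after - before


line = "replay-dual test_14b: @@@@@@ FAIL: after 76856 > before 76852"
print(usage_delta(line))  # 4
```

A delta this small is what makes the failure hard to interpret: it is far below the size of any file a test would deliberately leave behind.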
| Comment by Andreas Dilger [ 09/Oct/12 ] |
|
Submitted a patch to disable test_14b until orphan handling is fixed: http://review.whamcloud.com/4237
The test itself is improved over the previous version: it writes a large orphan file and allows a small margin of error in the df output for allocated blocks (logs, OI, etc.). |
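The margin-of-error check described in this comment might look roughly like the following sketch. The function name, the `MARGIN_KB` value, and the sample figures are assumptions for illustration only; the actual allowance is defined in the patch and is not stated in this ticket.

```python
def space_reclaimed(before_kb: int, after_kb: int, margin_kb: int) -> bool:
    """True if usage after recovery is at most margin_kb above usage before."""
    return after_kb <= before_kb + margin_kb


MARGIN_KB = 1024  # hypothetical allowance for logs, OI, and similar allocations

# A few extra blocks fall within the allowance ...
print(space_reclaimed(76852, 76856, MARGIN_KB))         # True
# ... while a large leftover orphan file (here 5120 kB) does not.
print(space_reclaimed(76852, 76852 + 5120, MARGIN_KB))  # False
```

Writing a large orphan file is what makes a single threshold workable: background allocations stay well under the margin, and an undeleted orphan lands well over it.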
| Comment by Jian Yu [ 26/Feb/13 ] |
|
Lustre b2_1 client build: http://build.whamcloud.com/job/lustre-b2_1/176 The same failure occurred: |
| Comment by Jian Yu [ 26/Feb/13 ] |
|
The same failure also occurred on Lustre b2_1 branch: |
| Comment by Jian Yu [ 01/Mar/13 ] |
|
Lustre Branch: b2_1
Hi Andreas, does the fix in http://review.whamcloud.com/4237 need to be ported to the Lustre b2_1 branch? |
| Comment by Jian Yu [ 03/Mar/13 ] |
|
The failure occurs consistently on the Lustre b2_1 branch: https://maloo.whamcloud.com/test_sets/38643f8c-826d-11e2-ba47-52540035b04c
Patch for the Lustre b2_1 branch: http://review.whamcloud.com/5571 |
| Comment by Andreas Dilger [ 05/Mar/13 ] |
|
Assign to Yu Jian for follow-up patch to remove test_14b from ALWAYS_EXCEPT to see if the changes to the test itself (larger test file and a small allocation allowance) are enough to allow it to pass regularly. |
| Comment by Jian Yu [ 08/Mar/13 ] |
|
Lustre Branch: master
After removing replay-dual 14b from the ALWAYS_EXCEPT list, the test still failed: |
| Comment by Andreas Dilger [ 08/Mar/13 ] |
|
This failed test is clearly caused by a defect in the code, since the size difference is 5120kB = 5MB, which is the size of the file that should have been deleted by orphan recovery. This is definitely not a case of some other blocks being allocated during testing. I think Niu was working on a patch related to orphan recovery for 2.1, and Mike was also working on this. Perhaps they already know what the problem is here. |
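The arithmetic behind this conclusion can be sketched as follows. The helper name is hypothetical, and the sketch assumes 1 MB = 1024 kB, consistent with the "5120kB = 5MB" figure above.

```python
KB_PER_MB = 1024


def whole_orphans_in_delta(delta_kb: int, orphan_size_mb: int) -> int:
    """Number of whole orphan files of the given size that the usage delta accounts for."""
    return delta_kb // (orphan_size_mb * KB_PER_MB)


print(whole_orphans_in_delta(5120, 5))  # 1: delta matches exactly one 5 MB orphan file
print(whole_orphans_in_delta(4, 5))     # 0: a few stray blocks, not a leftover file
```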
| Comment by Mikhail Pershin [ 11/Mar/13 ] |
|
In fact, we have just disabled gap handling, which is why orphans may now remain on the OST. Last time we discussed this, the solution was 'run lfsck'. |
| Comment by Andreas Dilger [ 11/Mar/13 ] |
|
The LFSCK MDS-OSS checking will not be available until at least 2.6, so if we can get a solution in the meantime that would be good. Note that there was a bug that Niu was working on where non-orphan files were being deleted, so we don't want to reintroduce this. |
| Comment by Mikhail Pershin [ 11/Mar/13 ] |
|
Yes, that is exactly why we disabled it. I will refresh my memory about alternative gap handling then. |
| Comment by Jian Yu [ 04/Sep/13 ] |
|
Lustre client: http://build.whamcloud.com/job/lustre-b2_3/41/ (2.3.0) replay-dual test 14b hit the same failure: |
| Comment by Jian Yu [ 19/Dec/13 ] |
|
Lustre client: http://build.whamcloud.com/job/lustre-b2_3/41/ (2.3.0) replay-dual test 14b hit the same failure: |
| Comment by Andreas Dilger [ 16/May/22 ] |
|
This test is in ALWAYS_EXCEPT, which is why it has not failed recently. |