[LU-1867] replay-single test_89: @@@@@@ FAIL: 4 blocks leaked Created: 10/Sep/12 Updated: 28/Nov/17 Resolved: 16/Oct/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | Lustre 2.4.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Xuezhao Liu | Assignee: | Yang Sheng |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Attachments: |
| Issue Links: |
| Severity: | 3 |
| Rank (Obsolete): | 4419 |
| Description |
|
Hit this problem in a Maloo test on the latest master branch: It is similar to ORI-412 reported on Orion. Test logs of test_89 are attached. |
| Comments |
| Comment by Xuezhao Liu [ 10/Sep/12 ] |
|
Again https://maloo.whamcloud.com/test_sets/de889ae4-fb49-11e1-8e05-52540035b04c test_89 |
| Comment by Peter Jones [ 12/Sep/12 ] |
|
https://maloo.whamcloud.com/test_sessions/dff25e82-fc94-11e1-b09c-52540035b04c |
| Comment by Peter Jones [ 12/Sep/12 ] |
|
Yangsheng, could you please look into this one? Thanks, Peter |
| Comment by Yang Sheng [ 12/Sep/12 ] |
|
Another one: https://maloo.whamcloud.com/test_sets/f621c0de-fc95-11e1-b09c-52540035b04c |
| Comment by Bob Glossman (Inactive) [ 12/Sep/12 ] |
|
Another one: https://maloo.whamcloud.com/test_sets/b8eb255e-fcca-11e1-b09c-52540035b04c |
| Comment by Andreas Dilger [ 13/Sep/12 ] |
|
Increasing priority, since this is causing a fairly high failure rate in tests - about 1 of 6 in the last few days. Searching through the Maloo history for this test, it seems the majority of bugs hit were testing USE_OFD=yes, for |
| Comment by Andreas Dilger [ 23/Sep/12 ] |
|
Failed https://maloo.whamcloud.com/test_sets/e3948898-040b-11e2-aec7-52540035b04c, which didn't run with USE_OFD, though it is an ofd related patch. |
| Comment by Li Wei (Inactive) [ 23/Sep/12 ] |
|
https://maloo.whamcloud.com/test_sets/e0643d4a-0496-11e2-bfd4-52540035b04c This was master with OFD and LDiskFS OSTs. |
| Comment by Andreas Dilger [ 24/Sep/12 ] |
|
Has any work been done yet to determine why this test is failing, and what needs to be done to fix it? |
| Comment by Xuezhao Liu [ 24/Sep/12 ] |
|
Can http://review.whamcloud.com/#change,1704 on Orion help to resolve/reduce this issue on master? |
| Comment by Li Wei (Inactive) [ 24/Sep/12 ] |
|
Liu Xuezhao, Yes, I think the extra wait_delete_completed() should help reduce the failure rate. That change is already included in http://review.whamcloud.com/2982, which hopefully could land soon. |
| Comment by Li Wei (Inactive) [ 24/Sep/12 ] |
|
https://maloo.whamcloud.com/test_sets/abc64ce8-060f-11e2-9b17-52540035b04c This was master with OFD and LDiskFS OSTs. |
| Comment by Ian Colle (Inactive) [ 26/Sep/12 ] |
|
https://maloo.whamcloud.com/test_sets/a22a10ee-07df-11e2-9e76-52540035b04c |
| Comment by Alex Zhuravlev [ 26/Sep/12 ] |
|
Let's disable this test until the root cause is understood. The issue looks pretty local and does not affect other tests or functionality. |
| Comment by Yang Sheng [ 28/Sep/12 ] |
|
I have done some investigation, summarized below. This issue is caused by config-llog data not being in sync between the MGS and the OST. We first count the free blocks as BLOCK1, then write data to the OST, etc. At that point some config data has been written on the MGS but not yet on the OST. The OST is then unmounted, and the config data is synced when the OST is remounted. After that we count the free blocks again as BLOCK2. So BLOCK2 - BLOCK1 reflects the config-data changes, which may or may not cause a new block (4k) to be allocated. That is why we hit this issue very randomly and why the leak is always 4k. |
| Comment by Andreas Dilger [ 28/Sep/12 ] |
|
In this case, the test pass condition should be changed to allow 4 blocks (16kB) difference between BLOCKS2 and BLOCKS1 and still pass, along with a comment explaining this. I guess this doesn't explain the "1536 blocks leaked" problem seen in other test failures. |
| Comment by Yang Sheng [ 29/Sep/12 ] |
|
Patch committed to: http://review.whamcloud.com/#change,4130 |
| Comment by Yang Sheng [ 01/Oct/12 ] |
|
Patch landed. Closing the bug. |
| Comment by Andreas Dilger [ 01/Oct/14 ] |
|
replay-single test_89 is still being skipped on ZFS due to this bug. It looks like the landed patch may resolve the test failure, so a patch to re-enable it should be submitted. |
| Comment by Yang Sheng [ 08/Oct/14 ] |
|
Patch to re-enable the test: http://review.whamcloud.com/12227 |
| Comment by Andreas Dilger [ 15/Oct/14 ] |
|
This is still failing for ZFS so the above patch only re-enables test_89 for ldiskfs: |
| Comment by Yang Sheng [ 16/Oct/14 ] |
|
I think this ticket is for the ldiskfs issue. ZFS has a similar issue, but it shows more leaked blocks, and ORI-412 is more appropriate for handling it. So I will close this one first. |
| Comment by Andreas Dilger [ 17/Oct/14 ] |
|
Note that the ORI project is closed and those tickets cannot be used to land patches on master. I opened |
| Comment by Jinshan Xiong (Inactive) [ 25/Nov/17 ] |
|
This problem is being seen again at: https://testing.hpdd.intel.com/test_sets/a6ab66ac-d1ad-11e7-9c63-52540065bddc |
| Comment by Andreas Dilger [ 26/Nov/17 ] |
|
Jinshan, I strongly suspect that if this ancient issue is being seen again that it is a new issue that needs a new Jira ticket, even if the error message is the same. It is also most likely that the problem is FLR related, since this hasn't been reported in over 3 years. |
| Comment by Jinshan Xiong (Inactive) [ 27/Nov/17 ] |
|
It has happened only once across the many recent test runs. Let's see how it goes. |
| Comment by Andreas Dilger [ 28/Nov/17 ] |
|
This failure is being tracked under |