[LU-1867] replay-single test_89: @@@@@@ FAIL: 4 blocks leaked Created: 10/Sep/12  Updated: 28/Nov/17  Resolved: 16/Oct/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.4.0

Type: Bug Priority: Critical
Reporter: Xuezhao Liu Assignee: Yang Sheng
Resolution: Fixed Votes: 0
Labels: None

Attachments: File replay-single.test_89.rar    
Issue Links:
Duplicate
Related
is related to LU-5761 replay-single test_89: @@@@@@ FAIL: 2... Resolved
Severity: 3
Rank (Obsolete): 4419

 Description   

Hit this problem on Maloo test on latest master branch:
https://maloo.whamcloud.com/test_sets/07148716-fae1-11e1-a03c-52540035b04c

It is similar to ORI-412, reported on Orion.

Test logs of test_89 attached.



 Comments   
Comment by Xuezhao Liu [ 10/Sep/12 ]

Again https://maloo.whamcloud.com/test_sets/de889ae4-fb49-11e1-8e05-52540035b04c

test_89
Error: '4 blocks leaked'
Failure Rate: 8.00% of last 100 executions [all branches]

Comment by Peter Jones [ 12/Sep/12 ]

https://maloo.whamcloud.com/test_sessions/dff25e82-fc94-11e1-b09c-52540035b04c

Comment by Peter Jones [ 12/Sep/12 ]

Yangsheng

Could you please look into this one?

Thanks

Peter

Comment by Yang Sheng [ 12/Sep/12 ]

Another one: https://maloo.whamcloud.com/test_sets/f621c0de-fc95-11e1-b09c-52540035b04c

Comment by Bob Glossman (Inactive) [ 12/Sep/12 ]

Another one: https://maloo.whamcloud.com/test_sets/b8eb255e-fcca-11e1-b09c-52540035b04c

Comment by Andreas Dilger [ 13/Sep/12 ]

Increasing priority, since this is causing a fairly high failure rate in testing - about 1 in 6 runs in the last few days.

Searching through the Maloo history for this test, it seems the majority of failures were hit while testing with USE_OFD=yes, for LU-1871 around 09/08.

Comment by Andreas Dilger [ 23/Sep/12 ]

Failed at https://maloo.whamcloud.com/test_sets/e3948898-040b-11e2-aec7-52540035b04c, which did not run with USE_OFD, though the patch under test is OFD-related.

Comment by Li Wei (Inactive) [ 23/Sep/12 ]

https://maloo.whamcloud.com/test_sets/e0643d4a-0496-11e2-bfd4-52540035b04c

This was master with OFD and LDiskFS OSTs.

Comment by Andreas Dilger [ 24/Sep/12 ]

Has any work been done yet to determine why this test is failing, and what needs to be done to fix it?

Comment by Xuezhao Liu [ 24/Sep/12 ]

Could http://review.whamcloud.com/#change,1704 from Orion help resolve or reduce this issue on master?

Comment by Li Wei (Inactive) [ 24/Sep/12 ]

Liu Xuezhao,

Yes, I think the extra wait_delete_completed() should help reduce the failure rate. That change is already included in http://review.whamcloud.com/2982, which hopefully could land soon.
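
For context, a rough sketch of how an extra wait_delete_completed() call fits into a block-leak check of this kind. This is not the actual replay-single.sh test_89 body, only an illustration; it assumes the standard test-framework.sh helpers (wait_delete_completed, wait_mds_ost_sync, error) and the $MOUNT variable:

# Hedged sketch of a block-leak check along the lines of test_89; the real
# test body differs. wait_delete_completed and wait_mds_ost_sync are
# test-framework.sh helpers that wait for pending unlinks/orphans to be
# destroyed on the OSTs, so the "before" used-block count is stable.
check_block_leak_sketch() {
	sync; sleep 5; sync                  # flush dirty data
	wait_mds_ost_sync
	wait_delete_completed                # the extra wait discussed above

	local blocks1=$(df -P $MOUNT | tail -n 1 | awk '{ print $3 }')

	# ... test body: write a file, fail over the OST, replay, unlink ...

	sync; sleep 5; sync
	wait_delete_completed                # and again before the "after" count
	local blocks2=$(df -P $MOUNT | tail -n 1 | awk '{ print $3 }')

	[ $blocks2 -le $blocks1 ] ||
		error "$((blocks2 - blocks1)) blocks leaked"
}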

Comment by Li Wei (Inactive) [ 24/Sep/12 ]

https://maloo.whamcloud.com/test_sets/abc64ce8-060f-11e2-9b17-52540035b04c

This was master with OFD and LDiskFS OSTs.

Comment by Ian Colle (Inactive) [ 26/Sep/12 ]

https://maloo.whamcloud.com/test_sets/a22a10ee-07df-11e2-9e76-52540035b04c

Comment by Alex Zhuravlev [ 26/Sep/12 ]

Let's disable this test until the root cause is understood. The issue looks fairly local and does not affect other tests or functionality.

Comment by Yang Sheng [ 28/Sep/12 ]

I have done some investigation, as below. This issue is caused by config-llog data not being in sync between the MGS and the OST. We count free blocks first as BLOCK1, then write data to the OST, etc. At that point some config-llog data has been written on the MGS but not yet on the OST. Then the OST is unmounted, and the config data is synced when the OST is remounted. We then count free blocks again as BLOCK2, so BLOCK2 - BLOCK1 includes the config-data changes. That may or may not cause a new block (4k) to be allocated, which is why we hit this issue very randomly and the leaked amount is always 4k.
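
One way to check this hypothesis (my suggestion, not something done in this ticket) would be to compare the on-disk config llogs under /CONFIGS on the OST across the remount using debugfs in read-only mode; the device path below is illustrative:

# Hypothetical diagnostic: watch the config-llog files under /CONFIGS on the
# ldiskfs OST device across the failover/remount. Growth by one 4k block here
# would match the "4 blocks leaked" delta seen by the test.
OSTDEV=/dev/lustre/ost1                       # illustrative device name
debugfs -c -R "ls -l /CONFIGS" $OSTDEV        # before running test_89
# ... run test_89 (write, fail over the OST, remount) ...
debugfs -c -R "ls -l /CONFIGS" $OSTDEV        # after: compare file sizes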

Comment by Andreas Dilger [ 28/Sep/12 ]

In this case, the test pass condition should be changed so that a difference of up to 4 blocks (16kB) between BLOCKS2 and BLOCKS1 still passes, along with a comment explaining why. I guess this doesn't explain the "1536 blocks leaked" problem seen in other test failures.
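
A minimal sketch of such a relaxed pass condition (an illustration only, not the patch that was later pushed); BLOCKS1/BLOCKS2 are assumed to be the used-block counts taken before and after, and error is the test-framework.sh failure helper:

# Hedged sketch of the relaxed check described above; the landed patch may
# differ in detail. Tolerate up to 4 blocks of slack so a config-llog block
# written to the OST during the remount does not fail the test.
leaked=$((BLOCKS2 - BLOCKS1))
if [ $leaked -gt 4 ]; then
	error "$leaked blocks leaked"
elif [ $leaked -gt 0 ]; then
	echo "ignoring $leaked leaked block(s), likely config-llog data"
fi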

Comment by Yang Sheng [ 29/Sep/12 ]

Patch committed to: http://review.whamcloud.com/#change,4130

Comment by Yang Sheng [ 01/Oct/12 ]

Patch landed. Closing bug.

Comment by Andreas Dilger [ 01/Oct/14 ]

replay-single test_89 is still being skipped on ZFS due to this bug. It looks like the landed patch may resolve the test failure, so a patch to re-enable it should be submitted.

Comment by Yang Sheng [ 08/Oct/14 ]

Patch to re-enable the test: http://review.whamcloud.com/12227

Comment by Andreas Dilger [ 15/Oct/14 ]

This is still failing for ZFS, so the above patch only re-enables test_89 for ldiskfs:
https://testing.hpdd.intel.com/test_sets/f4f00d1a-4ffe-11e4-8734-5254006e85c2

Comment by Yang Sheng [ 16/Oct/14 ]

I think this ticket is for the ldiskfs issue. ZFS has a similar issue, but it shows more leaked blocks, and ORI-412 is the more appropriate place to handle it. So I'll close this one first.

Comment by Andreas Dilger [ 17/Oct/14 ]

Note that the ORI project is closed and those tickets cannot be used to land patches on master. I opened LU-5761 for tracking the ZFS issue.

Comment by Jinshan Xiong (Inactive) [ 25/Nov/17 ]

This problem is being seen again at:

https://testing.hpdd.intel.com/test_sets/a6ab66ac-d1ad-11e7-9c63-52540065bddc

Comment by Andreas Dilger [ 26/Nov/17 ]

Jinshan, if this ancient issue is being seen again, I strongly suspect it is a new issue that needs a new Jira ticket, even if the error message is the same. It is also most likely that the problem is FLR-related, since this has not been reported in over 3 years.

Comment by Jinshan Xiong (Inactive) [ 27/Nov/17 ]

It has happened only once across many recent tests. Let's see how it goes.

Comment by Andreas Dilger [ 28/Nov/17 ]

This failure is being tracked under LU-5761.
