
[LU-16420] sanity test_51d: FAIL: stripecount=3: OST 1 has more objects vs. OST 0

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Affects Version/s: Lustre 2.16.0
    • Fix Version/s: Lustre 2.16.0
    • Components: None
    • Severity: 3

    Description

      This issue was created by maloo for jianyu <yujian@whamcloud.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/19ce9196-c2db-40c4-9140-ebea855a919f

      test_51d failed with the following error:

      OST0 has 265 objects, 86 are index 0
      OST1 has 322 objects, 145 are index 0
      OST2 has 320 objects, 85 are index 0
      OST3 has 322 objects, 86 are index 0
      OST4 has 265 objects, 87 are index 0
      OST5 has 323 objects, 143 are index 0
      OST6 has 324 objects, 88 are index 0
      OST7 has 323 objects, 88 are index 0
      
      'stripecount=3:  OST 1 has more objects vs. OST 0  (341 > 197 x 5/4)'
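
      For context, the failure message reflects a max/min imbalance check: the test counts the objects allocated on each OST and fails if the most-used OST holds more than 5/4 as many objects as the least-used one. The following is only a minimal sketch of that kind of check, with illustrative variable names, not the actual sanity.sh test_51d code:
      {noformat}
      # Sketch of the imbalance check implied by "341 > 197 x 5/4"
      # (illustrative values and names, not the real test_51d code): fail if
      # the busiest OST has more than 5/4 the objects of the least-used OST.
      MAX_OBJS=341   # objects counted on the most-used OST
      MIN_OBJS=197   # objects counted on the least-used OST
      if (( MAX_OBJS * 4 > MIN_OBJS * 5 )); then
              echo "FAIL: OST has more objects vs. OST 0 ($MAX_OBJS > $MIN_OBJS x 5/4)"
      fi
      {noformat}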
      

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      sanity test_51d - 'stripecount=3: OST 1 has more objects vs. OST 0 (341 > 197 x 5/4)'

    Attachments

    Issue Links

    Activity

            [LU-16420] sanity test_51d: FAIL: stripecount=3: OST 1 has more objects vs. OST 0
            adilger Andreas Dilger made changes -
            Link New: This issue is related to LU-16877 [ LU-16877 ]
            adilger Andreas Dilger made changes -
            Link New: This issue is duplicated by LU-16902 [ LU-16902 ]
            pjones Peter Jones made changes -
            Fix Version/s New: Lustre 2.16.0 [ 15190 ]
            Assignee Original: WC Triage [ wc-triage ] New: Andreas Dilger [ adilger ]
            Resolution New: Fixed [ 1 ]
            Status Original: Open [ 1 ] New: Resolved [ 5 ]
            pjones Peter Jones added a comment -

            Landed for 2.16

            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51268/
            Subject: LU-16420 tests: move overstriping test to sanity-pfl
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 0a21aa0623b87c643d4b56b9bd45fdcc1e566d4b

            gerrit Gerrit Updater added a comment -

            "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51268
            Subject: LU-16420 tests: move overstriping test to sanity-pfl
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: b05e0394254ca95c0e516acec1a12c945fea2f42

            adilger Andreas Dilger made changes -
            Priority Original: Minor [ 4 ] New: Major [ 3 ]

            adilger Andreas Dilger added a comment -

            Note that a passing run has almost perfect distribution:

            OST0 has 302 objects, 101 are index 0
            OST1 has 302 objects, 101 are index 0
            OST2 has 302 objects, 100 are index 0
            OST3 has 302 objects, 100 are index 0
            OST4 has 302 objects, 100 are index 0
            OST5 has 302 objects, 100 are index 0
            OST6 has 302 objects, 100 are index 0
            

            unlike a failing run, which (AFAICS) always has two OSTs with fewer objects, while the OSTs immediately following them have more first objects, which is what causes this test to fail:

            OST0 has 265 objects, 86 are index 0
            OST1 has 322 objects, 145 are index 0
            OST2 has 320 objects, 85 are index 0
            OST3 has 322 objects, 86 are index 0
            OST4 has 265 objects, 87 are index 0
            OST5 has 323 objects, 143 are index 0
            OST6 has 324 objects, 88 are index 0
            OST7 has 323 objects, 88 are index 0
            

            I also see that the passing results are with 7 OSTs (review-ldiskfs-*, 0/176 failed) and the failing ones are with 8 OSTs (review-dne-part-1, 65/269 runs failed in the past week). That lends some credence to "phase-shifted MDT creates" causing the problem, since creates between MDTs are not coordinated, for performance reasons.
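
            For reference, per-OST tallies like the ones above can be produced from lfs getstripe output. This is only a sketch that assumes the standard obdidx table format and an illustrative directory path, not the exact commands test_51d runs:
            {noformat}
            # Sketch (illustrative path, not the exact test code): count how many
            # objects each OST holds and how many files have that OST as stripe
            # index 0, matching the "OSTn has X objects, Y are index 0" lines.
            lfs getstripe --recursive /mnt/lustre/d51d.sanity | awk '
                /lmm_stripe_offset:/ { first[$2]++ }        # OST holding stripe 0
                NF == 4 && $1 ~ /^[0-9]+$/ { objs[$1]++ }   # one row per obdidx entry
                END {
                    for (i in objs)
                        printf "OST%d has %d objects, %d are index 0\n",
                               i, objs[i], first[i] + 0
                }'
            {noformat}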

            adilger Andreas Dilger made changes -
            Description Original: failure message only New: added the per-OST object counts from the failed run (as shown in the Description above)
            adilger Andreas Dilger added a comment - edited

            It seems possible that patch https://review.whamcloud.com/50532 "LU-13748 mdt: remove LASSERT in mdt_dump_lmm()" has caused this test to start failing much more often than before. It was failing only occasionally on master before 2023-05-22, with the exception that the above patch was failing test_51d regularly before it landed on 2023-05-19.

            The code change in that patch is very unlikely to be the cause (it just removes an LASSERT), but the added test_27Ch(), which creates two fully overstriped files, appears to affect test_51d() in some obscure way. I don't really know why this is the case, since it runs about 20 minutes earlier. The patch stopped failing in testing once the overstriped file and directory were removed at the end of the test, though I also don't know why that helped (it could also have been a coincidence that all of the test runs passed in that case).

            Since I don't think the code change itself is causing the test failures, I'm instead going to try skipping the new test and/or moving it later in the test series to see whether this avoids the problem. It would also be good to understand exactly why this test is causing the problem.

            Some unsupported theories include:

            • lingering exhaustion of objects on some OST after overstriping consumed all objects (but for 20 minutes?)
            • create thread stuck or in a bad state (but that wouldn't result in just an imbalance?)
            • MDS delete thread is still deleting OST objects from overstriped files, somehow affecting create performance?
            • creating two overstriped files (on two different MDTs) somehow got the OST starting index "in sync" between the MDTs, imbalancing the starting OST for new files just enough to trigger the failure?
            • MDS is somehow stuck in QOS allocation mode, even after the overstriped files have been removed?
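
            To make the suspected interaction concrete, here is an illustrative sketch of creating and removing a fully overstriped file of the kind test_27Ch() is described as creating; the paths, counts, and parameter name are assumptions, not the actual test code:
            {noformat}
            # Illustrative sketch (assumed paths/counts, not the actual test_27Ch
            # code): create a fully overstriped file, i.e. more stripes than OSTs,
            # then remove it, the cleanup that made the patch stop failing.
            OSTCOUNT=$(lfs df /mnt/lustre | grep -c OST)
            mkdir -p /mnt/lustre/d27Ch
            lfs setstripe -C $((OSTCOUNT * 2)) /mnt/lustre/d27Ch/overstriped
            lfs getstripe /mnt/lustre/d27Ch/overstriped   # typically two stripes per OST
            rm -rf /mnt/lustre/d27Ch
            # If the "stuck in QOS allocation mode" theory applied, the allocator
            # threshold on the MDS could be inspected (assumed parameter name):
            #   lctl get_param lod.*.qos_threshold_rr
            {noformat}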

            People

              Assignee: Andreas Dilger
              Reporter: Maloo