[LU-16420] sanity test_51d: FAIL: stripecount=3: OST 1 has more objects vs. OST 0 Created: 21/Dec/22 Updated: 16/Jun/23 Resolved: 14/Jun/23 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.16.0 |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Maloo | Assignee: | Andreas Dilger |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for jianyu <yujian@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/19ce9196-c2db-40c4-9140-ebea855a919f

test_51d failed with the following error:

OST0 has 265 objects, 86 are index 0
OST1 has 322 objects, 145 are index 0
OST2 has 320 objects, 85 are index 0
OST3 has 322 objects, 86 are index 0
OST4 has 265 objects, 87 are index 0
OST5 has 323 objects, 143 are index 0
OST6 has 324 objects, 88 are index 0
OST7 has 323 objects, 88 are index 0
'stripecount=3: OST 1 has more objects vs. OST 0 (341 > 197 x 5/4)'

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV |
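The failure message implies a comparison of one OST's count against 5/4 of its neighbor's count. The following is only a minimal sketch of that comparison, inferred from the message text; it is not the code from sanity.sh, and the helper name `exceeds_neighbor` and the adjacent-OST loop are assumptions made for illustration.

```python
from fractions import Fraction

# Assumption: the threshold is the "x 5/4" factor quoted in the error message.
RATIO = Fraction(5, 4)


def exceeds_neighbor(count, prev_count, ratio=RATIO):
    """Return True when count is more than ratio times prev_count."""
    return count > prev_count * ratio


# Numbers quoted in the failure message for stripecount=3.
print(exceeds_neighbor(341, 197))

# "index 0" counts from the listing in the description, OST0..OST7.
first_objects = [86, 145, 85, 86, 87, 143, 88, 88]

# Assumption: each OST is compared with the OST immediately before it.
flagged = [
    i
    for i in range(1, len(first_objects))
    if exceeds_neighbor(first_objects[i], first_objects[i - 1])
]
print(flagged)
```

Under these assumptions, only OST1 and OST5 are flagged on the index-0 counts, which matches the two outliers visible in the listing.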
| Comments |
| Comment by Serguei Smirnov [ 31/May/23 ] |
|
+1 on master: https://testing.whamcloud.com/test_sets/628ac757-7fa7-4dcb-9940-091fea92816f |
| Comment by Andreas Dilger [ 10/Jun/23 ] |
|
There is some chance that patch https://review.whamcloud.com/50996 " |
| Comment by Andreas Dilger [ 10/Jun/23 ] |
|
It seems possible that patch https://review.whamcloud.com/50532 " The code change in that patch is very unlikely to be the cause (it just removes an LASSERT), but the added test_27Ch(), which creates two fully overstriped files, appears to affect test_51d() in some obscure way. I don't really know why this is the case, since it runs about 20 minutes earlier. The patch stopped failing its testing once the overstriped file and directory were removed at the end of the test, though I don't know why that helped either (it could also have been coincidence that all of the test runs passed in that case).

Since I don't think the code change itself is causing the test failures, I'm instead going to try skipping the new test and/or moving it later in the test series to see if this avoids the problem. It would also be good to understand exactly why this test is causing the problem. Some unsupported theories include:
|
| Comment by Andreas Dilger [ 10/Jun/23 ] |
|
Note that a passing run has an almost perfect distribution:

OST0 has 302 objects, 101 are index 0
OST1 has 302 objects, 101 are index 0
OST2 has 302 objects, 100 are index 0
OST3 has 302 objects, 100 are index 0
OST4 has 302 objects, 100 are index 0
OST5 has 302 objects, 100 are index 0
OST6 has 302 objects, 100 are index 0

unlike a failing run, which (AFAICS) always has two OSTs with fewer objects, and the OSTs immediately following them have more first objects, which is what causes this test to fail:

OST0 has 265 objects, 86 are index 0
OST1 has 322 objects, 145 are index 0
OST2 has 320 objects, 85 are index 0
OST3 has 322 objects, 86 are index 0
OST4 has 265 objects, 87 are index 0
OST5 has 323 objects, 143 are index 0
OST6 has 324 objects, 88 are index 0
OST7 has 323 objects, 88 are index 0

I also see that the passing results are with 7 OSTs (review-ldiskfs-*, 0/176 failed) and the failing ones are with 8 OSTs (review-dne-part-1, 65/269 failed runs in the past week). That lends some credence to "phase shifted MDT creates" causing the problem, since creates between MDTs are not coordinated for performance reasons. |
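To make the contrast between the two runs concrete, the per-OST lines can be parsed and summarized. This is a rough sketch only: the regular expression matches the output format quoted in this ticket, and the helper names `parse_counts` and `spread` are hypothetical, not part of the Lustre test framework.

```python
import re

# Matches lines such as "OST1 has 322 objects, 145 are index 0".
LINE_RE = re.compile(r"OST(\d+) has (\d+) objects, (\d+) are index 0")


def parse_counts(text):
    """Return {ost_index: (total_objects, first_objects)} from test output."""
    return {int(m[1]): (int(m[2]), int(m[3])) for m in LINE_RE.finditer(text)}


def spread(values):
    """Ratio of the largest value to the smallest value."""
    return max(values) / min(values)


passing = """
OST0 has 302 objects, 101 are index 0
OST1 has 302 objects, 101 are index 0
OST2 has 302 objects, 100 are index 0
OST3 has 302 objects, 100 are index 0
OST4 has 302 objects, 100 are index 0
OST5 has 302 objects, 100 are index 0
OST6 has 302 objects, 100 are index 0
"""

failing = """
OST0 has 265 objects, 86 are index 0
OST1 has 322 objects, 145 are index 0
OST2 has 320 objects, 85 are index 0
OST3 has 322 objects, 86 are index 0
OST4 has 265 objects, 87 are index 0
OST5 has 323 objects, 143 are index 0
OST6 has 324 objects, 88 are index 0
OST7 has 323 objects, 88 are index 0
"""

for name, text in (("passing", passing), ("failing", failing)):
    counts = parse_counts(text)
    totals = [total for total, _ in counts.values()]
    firsts = [first for _, first in counts.values()]
    print(name, len(counts), "OSTs,",
          "total spread", round(spread(totals), 2),
          "first-object spread", round(spread(firsts), 2))
```

The max/min ratio is just one convenient summary; it shows the first-object counts in the failing run diverging far more than the total object counts do.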
| Comment by Gerrit Updater [ 10/Jun/23 ] |
|
"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51268 |
| Comment by Gerrit Updater [ 14/Jun/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51268/ |
| Comment by Peter Jones [ 14/Jun/23 ] |
|
Landed for 2.16 |