[LU-12350] sanity-flr test_33: file content error: expected: ost1, actual: ost2 Created: 28/May/19 Updated: 20/Jun/19 Resolved: 01/Jun/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.13.0, Lustre 2.12.3 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | Patrick Farrell (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>. This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/dfadbab6-2668-11e9-a318-52540065bddc test_33 failed with the following error: 'file content error: expected: ost1, actual: ost2' The first test failure was on 2019-02-01 on patch 34160, which didn't land until 2019-05-24 (so could not have been the cause). The second test failure was on 2019-02-11 on patch 34186, which hasn't landed as of 2019-05-28, so it must have been a patch landed to master. Not to be confused with [linked issue]. There were a bunch of patches landed on 2019-01-30, but looking through the patch summaries doesn't show anything that is related. Since it fails only test_33, in about 0.4% of all sanity-flr test runs (about 5x per month), it could have been a patch that landed any time in the previous week or two, but is unlikely to be anything before that (unless some external environment change contributed to the failure). The test itself was added on 2017-09-15, so it had been passing for a long time. My first guess is some kind of a test problem, so dumping "lfs getstripe $DIR/ […] |
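As a rough sketch of the kind of layout dump being suggested above (the quoted command is truncated in the export, so the target path $DIR/$tdir/$tfile is an assumption, not taken from the ticket):

```bash
# Sketch only: dump the full FLR layout of the test file so that a later
# "file content" failure also shows which mirror/component maps to which OST.
# The path below is assumed; the original command in the description is cut off.
lfs getstripe -v $DIR/$tdir/$tfile
```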
| Comments |
| Comment by Andreas Dilger [ 28/May/19 ] |
|
Update: there were no test failures in 2019-01 or 2018-12, so I thought that was the start of the failures, since it was failing about 5x per month after that. However, searching further back there are again about 2-3 failures per month, and an old ticket [linked issue]. The first ~250 runs between 2017-08 and 2018-01 appear to be directly on the flr branch under [link]. In summary, it doesn't look like this can be isolated to a specific patch; instead the cause has to be worked back from the test failure itself to see whether it is a test bug or a code bug. |
| Comment by Andreas Dilger [ 28/May/19 ] |
|
It is a bit sad that we've had this test failure for over 18 months and nobody who has hit the failure on their patch has bothered to file an LU ticket... |
| Comment by Patrick Farrell (Inactive) [ 28/May/19 ] |
|
The reason for this seems likely to be simple: the test does 'fail ost2 &' followed by 'sleep 1'. That is clearly non-deterministic. It basically assumes that ost2 will be unavailable for the subsequent operations because it is being failed over in the background, but nothing is done to ensure that the failover has actually started (other than that sleep) or that it has not already completed. Seems simple enough: this needs to be 'stop ost2', like the test already does with 'stop ost1' above. I'll push a patch. |
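A minimal sketch of the race described above and of the proposed change, assuming the test's surrounding flow; 'fail', 'stop' and 'start' are Lustre test-framework helpers, while the read/restart lines are illustrative placeholders rather than the actual sanity-flr test_33 code:

```bash
# Racy original flow (paraphrased from the comment above): ost2 is failed
# over in the background and the test only sleeps, so ost2 may or may not
# be unavailable when the mirrored file is read back.
fail ost2 &
sleep 1
# ... read the file, expecting the ost1 mirror to serve the data ...

# Proposed fix: stop ost2 outright, as the test already does for ost1,
# so the read deterministically cannot be served from ost2.
stop ost2 || error "failed to stop ost2"
# ... read the file, verify the content comes from the ost1 mirror ...
# Restart invocation is illustrative; the exact arguments are an assumption.
start ost2 $(ostdevname 2) $OST_MOUNT_OPTS || error "failed to start ost2"
```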
| Comment by Gerrit Updater [ 28/May/19 ] |
|
Patrick Farrell (pfarrell@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34985 |
| Comment by Gerrit Updater [ 01/Jun/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34985/ |
| Comment by Peter Jones [ 01/Jun/19 ] |
|
Landed for 2.13 |
| Comment by Gerrit Updater [ 06/Jun/19 ] |
|
Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35086 |
| Comment by Gerrit Updater [ 20/Jun/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35086/ |