[LU-12350] sanity-flr test_33: file content error: expected: ost1, actual: ost2 Created: 28/May/19  Updated: 20/Jun/19  Resolved: 01/Jun/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.13.0, Lustre 2.12.3

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Patrick Farrell (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
is duplicated by LU-10925 sanity-flr test_33: ''file content e... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/dfadbab6-2668-11e9-a318-52540065bddc

test_33 failed with the following error:

'file content error: expected: ost1, actual: ost2'

The first test failure was on 2019-02-01, on patch 34160, which didn't land until 2019-05-24 (so it could not have been the cause). The second test failure was on 2019-02-11, on patch 34186, which hasn't landed as of 2019-05-28, so the cause must be a patch that had already landed to master. Not to be confused with LU-10100, a PPC-specific failure that causes many sanity-flr and other test failures.

There were a bunch of patches landed on 2019-01-30, but looking through the patch summaries doesn't show anything that is related. Since test_33 fails in only about 0.4% of all sanity-flr test runs (about 5x per month), the cause could have been a patch that landed any time in the previous week or two, but is unlikely to be earlier than that (unless some external environment change contributed to the failure). The test itself was added on 2017-09-15, so it had been passing for a long time.

My first guess is some kind of a test problem, so dumping "{{lfs getstripe $DIR/

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
sanity-flr test_33 - 'file content error: expected: ost1, actual: ost2'



 Comments   
Comment by Andreas Dilger [ 28/May/19 ]

Update: there were no test failures in 2019-01 or 2018-12, so I thought 2019-02 was the start of the failures, since the test was failing about 5x per month after that. However, searching further back, there are again about 2-3 failures per month, and an old ticket, LU-10925, shows the problem has existed for a long time already, going back to almost when the test was first landed.

The first ~250 runs between 2017-08 and 2018-01 appear to be directly on the flr branch under LU-9771 and all pass. The first failure is 2018-01 shortly after the FLR branch landed to master, with patch 20387 "LU-10287 flr: lfs mirror verify command" but looking at that patch it seems unlikely to be the culprit (the test does not use "lfs mirror verify" at all, and that patch doesn't appear to affect any other code).

In summary, it doesn't look like this can be isolated to a specific patch; instead it has to be traced back from the test failure to see if it is a test bug or a code bug.

Comment by Andreas Dilger [ 28/May/19 ]

It is a bit sad that we've had this test failure for over 18 months and nobody who has hit the failure on their patch has bothered to file an LU ticket...

Comment by Patrick Farrell (Inactive) [ 28/May/19 ]

The reason for this seems likely to be simple:

 fail ost2 &
 sleep 1

It's clearly non-deterministic here.

It seems to basically assume that ost2 will be unavailable for the subsequent operations, because it's being failed over in the background.  Nothing is done to ensure that failover has either actually started (except that sleep) or that it has not completed yet.
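
To make the race concrete, here is a minimal sketch of a toy timing model. It is not Lustre test-framework code: `ost_state_at` and its three arguments (observation time, failover start, failover duration, all in whole seconds) are hypothetical names introduced only for illustration. The test looks at ost2 at t=1 (after the `sleep 1`), and what it sees depends entirely on when the background failover happens to start and how long it takes.

```shell
#!/usr/bin/env bash
# Toy model only (assumption): ost2 is "down" for the interval
# [start, start + duration) and "up" at any other time.
ost_state_at() {
    local t=$1 start=$2 duration=$3
    if (( t >= start && t < start + duration )); then
        echo down
    else
        echo up
    fi
}

# The test observes ost2 at t=1, i.e. right after "sleep 1".
ost_state_at 1 3 5   # failover has not started yet  -> up
ost_state_at 1 0 5   # failover is in progress       -> down
ost_state_at 1 0 1   # failover has already finished -> up
```

Only the middle case matches what the test assumes. In the other two, ost2 is still (or again) available when the subsequent operations run, which would be consistent with the "expected: ost1, actual: ost2" error.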

Seems simple enough - needs to be 'stop ost2', like it does 'stop ost1' above.

I'll push a patch.

Comment by Gerrit Updater [ 28/May/19 ]

Patrick Farrell (pfarrell@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34985
Subject: LU-12350 tests: Do not use background failover
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 2bd3fe05689017d26996a02d2231930dd67255ba

Comment by Gerrit Updater [ 01/Jun/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34985/
Subject: LU-12350 tests: Do not use background failover
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 4ac0324fb9d824915b3dd11b75e81e609d9e8e84

Comment by Peter Jones [ 01/Jun/19 ]

Landed for 2.13

Comment by Gerrit Updater [ 06/Jun/19 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35086
Subject: LU-12350 tests: Do not use background failover
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: c7494b6ef36a1e831d3ccac07c66204538a8130c

Comment by Gerrit Updater [ 20/Jun/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35086/
Subject: LU-12350 tests: Do not use background failover
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 020f774e0b0ff0f96173655744d976beb5af4a83
