[LU-12350] sanity-flr test_33: file content error: expected: ost1, actual: ost2

    Description

      This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/dfadbab6-2668-11e9-a318-52540065bddc

      test_33 failed with the following error:

      'file content error: expected: ost1, actual: ost2'
      

      The first test failure was on 2019-02-01, on patch 34160, which didn't land until 2019-05-24 (so it could not have been the cause). The second test failure was on 2019-02-11, on patch 34186, which hasn't landed as of 2019-05-28, so the cause must have been a patch that had already landed to master. This is not to be confused with LU-10100, which is a PPC-specific failure that causes many sanity-flr and other test failures.

      There were a bunch of patches landed on 2019-01-30, but looking through the patch summaries doesn't show anything that looks related. Since test_33 fails in only about 0.4% of all sanity-flr test runs (about 5x per month), the cause could have been a patch that landed any time in the previous week or two, but is unlikely to be from before that (unless some external environment change contributed to the failure). The test itself was added on 2017-09-15, so it had been passing for a long time.

      My first guess is some kind of a test problem, so dumping "{{lfs getstripe $DIR/

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      sanity-flr test_33 - 'file content error: expected: ost1, actual: ost2'

          Activity

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35086/
            Subject: LU-12350 tests: Do not use background failover
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set:
            Commit: 020f774e0b0ff0f96173655744d976beb5af4a83

            gerrit Gerrit Updater added a comment -

            Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35086
            Subject: LU-12350 tests: Do not use background failover
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: c7494b6ef36a1e831d3ccac07c66204538a8130c

            pjones Peter Jones added a comment -

            Landed for 2.13

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34985/
            Subject: LU-12350 tests: Do not use background failover
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 4ac0324fb9d824915b3dd11b75e81e609d9e8e84

            gerrit Gerrit Updater added a comment -

            Patrick Farrell (pfarrell@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34985
            Subject: LU-12350 tests: Do not use background failover
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 2bd3fe05689017d26996a02d2231930dd67255ba

            pfarrell Patrick Farrell (Inactive) added a comment -

            The reason for this seems likely to be simple:

             fail ost2 &
             sleep 1

            This is clearly non-deterministic.

            The test seems to assume that ost2 will be unavailable for the subsequent operations because it's being failed over in the background. Nothing is done to ensure that failover has actually started (except that sleep) or that it has not already completed.

            Seems simple enough - it needs to be 'stop ost2', like the 'stop ost1' the test does above.

            I'll push a patch.
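To illustrate the race described in the comment above, here is a minimal sketch in bash. It does not touch a real Lustre target: fake_stop, fake_start, fake_fail, racy_check, safe_check and the state file are hypothetical stand-ins invented for this illustration, not the test-framework.sh helpers, and the sleep durations are arbitrary. It only shows why checking a target's state one second after starting a background failover can give either answer, while a synchronous stop cannot.

```shell
#!/bin/bash
# Hypothetical stand-ins: they only flip a word in a temp file between
# "up" and "down" and have nothing to do with the real Lustre helpers.
state=$(mktemp)
trap 'rm -f "$state"' EXIT

fake_stop()  { echo down > "$state"; }   # returns once the target is down
fake_start() { echo up > "$state"; }
# A complete failover: take the target down, wait $1 seconds, bring it back.
fake_fail()  { fake_stop; sleep "$1"; fake_start; }

# Racy pattern, shaped like "fail ost2 &; sleep 1": what is observed after
# the sleep depends entirely on how long the background failover takes.
racy_check() {
    echo up > "$state"
    fake_fail "$1" &
    sleep 1
    cat "$state"
    wait    # let the background failover finish before returning
}

# Deterministic pattern: stop synchronously, observe, then restart.
safe_check() {
    echo up > "$state"
    fake_stop
    cat "$state"
    fake_start
}

echo "background failover taking 2s: $(racy_check 2)"
echo "background failover taking 0s: $(racy_check 0)"
echo "synchronous stop:              $(safe_check)"
```

In this sketch the slow background failover is still "down" when checked, the fast one is normally already back "up", and the synchronous stop is always "down". If the real failover of ost2 occasionally completes within the sleep, that would be consistent with the test reading from ost2 when it expected ost1, though this is an inference from the comment rather than something the ticket confirms.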

            adilger Andreas Dilger added a comment -

            It is a bit sad that we've had this test failure for over 18 months and nobody who has hit the failure on their patch has bothered to file an LU ticket...
            adilger Andreas Dilger added a comment - edited

            Update: there were no test failures in 2019-01 or 2018-12 so I thought that was the start of the failures, since it was failing about 5x per month after that. However, searching further back there are again about 2-3 failures per month, and an old ticket LU-10925 that shows the problem has existed for a long time already, going back to almost when the test was first landed.

            The first ~250 runs between 2017-08 and 2018-01 appear to be directly on the flr branch under LU-9771 and all pass. The first failure is 2018-01 shortly after the FLR branch landed to master, with patch 20387 "LU-10287 flr: lfs mirror verify command" but looking at that patch it seems unlikely to be the culprit (the test does not use "lfs mirror verify" at all, and that patch doesn't appear to affect any other code).

            In summary, it doesn't look like this can be isolated to a specific patch. Instead, it has to be debugged by working back from the test failure to see if it is a test bug or a code bug.


            People

              pfarrell Patrick Farrell (Inactive)
              maloo Maloo