Lustre / LU-11127

sanity-flr test_34b: @@@@@@ FAIL: can't put import for osc into FULL state after 40 sec, have REPLAY_WAIT

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.12.0
    • Affects Version/s: Lustre 2.12.0
    • Labels: None
    • Severity: 3

    Description

      The failure rate for sanity-flr.sh test_34b is about 10%. The test always fails with the same error, shown below:

      [12008.812898] Lustre: DEBUG MARKER: trevis-17vm1.trevis.whamcloud.com: executing wait_import_state FULL osc.lustre-OST0001-osc-ffff926024c37800.ost_server_uuid 40
      [12049.499056] Lustre: DEBUG MARKER: /usr/sbin/lctl mark rpc test_34b: @@@@@@ FAIL: can't put import for osc.lustre-OST0001-osc-ffff926024c37800.ost_server_uuid into FULL state after 40 sec, have REPLAY_WAIT
      

      Examples:
      https://testing.whamcloud.com/sub_tests/82aec676-7dd9-11e8-8b8a-52540065bddc
      https://testing.whamcloud.com/sub_tests/08e9d360-7ead-11e8-8fe6-52540065bddc
      https://testing.whamcloud.com/sub_tests/20d551f2-7f07-11e8-97ff-52540065bddc
      https://testing.whamcloud.com/sub_tests/1ca17f50-7fea-11e8-b441-52540065bddc
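For context, the failing check is a poll loop of this general shape. The following is a minimal sketch, assuming a generic state-reading command; it is not the actual wait_import_state from Lustre's test-framework.sh, which derives the state from `lctl get_param -n osc.*.ost_server_uuid`:

```shell
# Hedged sketch of the polling pattern behind wait_import_state; the real
# Lustre helper lives in test-framework.sh and differs in detail.
# "$1" is any command that prints the current import state.
wait_state_sketch() {
    local get_state=$1 expected=$2 maxtime=${3:-40}
    local i state
    for ((i = 0; i < maxtime; i++)); do
        state=$($get_state)
        # success as soon as the import reaches the expected state
        [ "$state" = "$expected" ] && return 0
        sleep 1
    done
    echo "can't put import into $expected state after $maxtime sec, have $state"
    return 1
}
```

With a 40-second cap, an OST import still sitting in REPLAY_WAIT when the loop expires produces exactly the failure message quoted above.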

Activity

            pjones Peter Jones added a comment -

            Looks like we don't!

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/32922/
            Subject: LU-11127 test: sanity-flr OST not recovery fast enough
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 5c8dfe2bdf8313b6bbb2055dd22290f328d94549
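The merged patch's subject says the OST was "not recovery fast enough". A common remedy in that situation is to wait explicitly for recovery to complete before asserting the import state. The sketch below illustrates that idea only; the status-reading command is a placeholder (on a real OST it would read obdfilter.*.recovery_status via lctl get_param), and this is not the code from commit 5c8dfe2b:

```shell
# Hypothetical sketch: block until recovery finishes before checking
# import state. "$1" is any command that prints the recovery status.
wait_recovery_sketch() {
    local get_status=$1 maxtime=${2:-60}
    local i status
    for ((i = 0; i < maxtime; i++)); do
        status=$($get_status)
        case "$status" in
            COMPLETE|INACTIVE) return 0 ;;  # recovery done, or not needed
        esac
        sleep 1
    done
    return 1
}
```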

            gerrit Gerrit Updater added a comment -

            Bobi Jam (bobijam@hotmail.com) uploaded a new patch: https://review.whamcloud.com/32922
            Subject: LU-11127 test: sanity-flr OST not recovery fast enough
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: acb4dd8dac495e554f3b86645a65bc4de80cbe87

            jamesanunez James Nunez (Inactive) added a comment -

            Uploaded patch https://review.whamcloud.com/32917 to stop running test 34b in case we need to employ these drastic measures!

            gerrit Gerrit Updater added a comment -

            James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/32917
            Subject: LU-11127 tests: stop running sanity-flr test 34b
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 2e588819a7d1291622326ccafcf163ad4bc86492

            adilger Andreas Dilger added a comment -

            There isn't anything in test_34b() that seems any different from test_34a(), but 34a has not had any failures.

            This might be related to a problem with stopping and restarting the OSTs twice in a row quickly under DNE (e.g. MDT reconnections are slow, or the previous restart has reset at_min). One option would be to increase the minimum time that _wait_osc_import_state() waits with multiple MDTs by some amount to compensate.
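The workaround idea above, growing the minimum wait with the MDT count, could be sketched as follows. The function name, base timeout, and scaling factor are illustrative assumptions, not code from _wait_osc_import_state:

```shell
# Hypothetical sketch: scale the import-wait timeout with the number of MDTs
# so staggered MDT reconnections under DNE fit inside the wait window.
scaled_import_timeout() {
    local base=${1:-40} mdscount=${2:-1}
    # add ~25% of the base timeout per extra MDT (factor chosen for illustration)
    echo $((base + base * (mdscount - 1) / 4))
}
```

For example, with a 40-second base this yields 40s on a single-MDT setup and 60s with three MDTs, while leaving non-DNE runs unchanged.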

            tappro Mikhail Pershin added a comment -

            It is failing a lot of runs:

            Failure Rate: 22.41% of most recent 58 runs, 42 skipped (all branches)

            jamesanunez James Nunez (Inactive) added a comment -

            This test started failing on July 2, 2018 and is failing only in DNE testing.

            People

              Assignee: bobijam Zhenyu Xu
              Reporter: tappro Mikhail Pershin
              Votes: 0
              Watchers: 5
