Test failure on test suite sanity-hsm, subtest test_200

    Description

      This issue was created by maloo for Bob Glossman <bob.glossman@intel.com>

      This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/4c3bcdec-4025-11e3-bfaf-52540035b04c.

      test_200 was only recently enabled by commit 38695729d61958ab10e9e108175298f8a7d40536. Before that it was always skipped because it was listed in ALWAYS_EXCEPT. I'm wondering if it was a mistake to turn this test on at all. maloo reports:

      Failure Rate: 66.00% of last 100 executions [all branches]

      This failure does not appear to be related to the change under test, at least in this case.

      The sub-test test_200 failed with the following error:

      request on sanity-hsm is not @@@@@@

      Info required for matching: sanity-hsm 200

          Activity

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13173/
            Subject: LU-4178 tests: Wait requests to reach CDT before Cancel
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 6a31cf92555182a23f14d3385c8c14266887070a

            gerrit Gerrit Updater added a comment

            Closing ticket because sanity-hsm tests 200, 201 and 202 have been passing on master for the past month. If any more work needs to be done for this ticket, please open a new ticket and we'll track the work there.

            jamesanunez James Nunez (Inactive) added a comment

            James Nunez (james.a.nunez@intel.com) uploaded a new patch: http://review.whamcloud.com/13826
            Subject: LU-4178 tests: increase sanity-hsm wait_request_state tiemout
            Project: fs/lustre-release
            Branch: b2_5
            Current Patch Set: 1
            Commit: 9c56a8f64d2ab4b8db8b3f38dff2d019b8cd3e40

            gerrit Gerrit Updater added a comment

            James Nunez (james.a.nunez@intel.com) uploaded a new patch: http://review.whamcloud.com/13825
            Subject: LU-4178 tests: add messages to sanity-hsm
            Project: fs/lustre-release
            Branch: b2_5
            Current Patch Set: 1
            Commit: 75d6d35bcc48eefe490e8b4efd673c58b3373507

            gerrit Gerrit Updater added a comment

            Reopening ticket because there is one more patch for this ticket that has not landed. The patch is at: http://review.whamcloud.com/#/c/13173/

            jamesanunez James Nunez (Inactive) added a comment

            Patches landed to Master.

            jlevi Jodi Levi (Inactive) added a comment

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13206/
            Subject: LU-4178 tests: increase sanity-hsm wait_request_state tiemout
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 4cb51c76ed2afa168f19e999190a315803580258

            gerrit Gerrit Updater added a comment

            jhammond John Hammond added a comment -

            The cancel action does not succeed until the CT reports that the archive is complete. In test_200 we use make_large_for_cancel(), which gives us a 100MB file. Because of the 1MB/s bandwidth limit, the CT will take at least 100s to archive the file. Since wait_request_state() uses a 100s timeout, this makes for a very racy test. And since most of these tests still use NFS for the archive, there can be additional delays.

            I suggest that we double the timeout in wait_request_state(). Please see http://review.whamcloud.com/13206.
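
The arithmetic behind this race can be sketched as follows. This is only an illustration in Python, not code from sanity-hsm: the function names are hypothetical, and the only inputs taken from the comment above are the 100MB file size, the 1MB/s bandwidth limit, the 100s timeout, and the proposed doubling to 200s.

```python
def min_archive_seconds(file_size_mb: float, bandwidth_mb_per_s: float) -> float:
    """Lower bound on the copy time: file size divided by the bandwidth cap.

    Real runs can only be slower (for example when the archive is on NFS).
    """
    if bandwidth_mb_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return file_size_mb / bandwidth_mb_per_s


def timeout_margin_seconds(file_size_mb: float,
                           bandwidth_mb_per_s: float,
                           timeout_s: float) -> float:
    """Slack left between the timeout and the fastest possible archive."""
    return timeout_s - min_archive_seconds(file_size_mb, bandwidth_mb_per_s)


# Values quoted in the comment above.
FILE_SIZE_MB = 100
BANDWIDTH_MB_PER_S = 1
ORIGINAL_TIMEOUT_S = 100
DOUBLED_TIMEOUT_S = 2 * ORIGINAL_TIMEOUT_S

if __name__ == "__main__":
    # With the original timeout there is no slack at all.
    print(timeout_margin_seconds(FILE_SIZE_MB, BANDWIDTH_MB_PER_S, ORIGINAL_TIMEOUT_S))
    # Doubling the timeout leaves room for delays beyond the minimum.
    print(timeout_margin_seconds(FILE_SIZE_MB, BANDWIDTH_MB_PER_S, DOUBLED_TIMEOUT_S))
```

With zero slack, any extra delay pushes the archive past the timeout, which is consistent with the intermittent failures reported here.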

            John L. Hammond (john.hammond@intel.com) uploaded a new patch: http://review.whamcloud.com/13206
            Subject: LU-4178 tests: increase sanity-hsm wait_request_state tiemout
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 278c1fb845c2cdd7905f717435176e94a4ad7057

            gerrit Gerrit Updater added a comment

            The analysis and patch look good but I am surprised because the CDT command registration should be synchronous. So after hsm_archive the CDT entry should be recorded.

            jcl jacques-charles lafoucriere added a comment

            John,
            I think that sub-tests 200-202 verify that requests can be canceled when CDT operations have been started but disabled.
            BTW, your finding that sometimes (very likely due to some threading cause on the Agent and/or MDS side!) the Cancel request is handled by the CDT before the action it targets needs to be addressed, and it also explains why these failures are not consistent.

            A patch that adds verification that the operation has already been registered at the CDT before sending the Cancel in sanity-hsm/test_[200-202] is at http://review.whamcloud.com/13173.

            bfaccini Bruno Faccini (Inactive) added a comment
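
The idea of waiting for a request to be registered before canceling it can be sketched as a simple poll-then-act pattern. This is a generic Python illustration under stated assumptions, not the content of the patch: `wait_until`, `FakeCoordinator`, and `cancel_after_registration` are hypothetical names, and the fake coordinator merely stands in for the CDT.

```python
import time
from typing import Callable


def wait_until(predicate: Callable[[], bool],
               timeout_s: float,
               interval_s: float = 0.1) -> bool:
    """Poll predicate until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


class FakeCoordinator:
    """Hypothetical stand-in for the CDT.

    The request only becomes visible after a number of polls, which
    mimics the delay between submitting an action and its registration.
    """

    def __init__(self, polls_until_registered: int) -> None:
        self._remaining = polls_until_registered
        self.cancelled = False

    def is_registered(self) -> bool:
        if self._remaining > 0:
            self._remaining -= 1
            return False
        return True

    def cancel(self) -> None:
        self.cancelled = True


def cancel_after_registration(cdt: FakeCoordinator,
                              timeout_s: float = 1.0,
                              interval_s: float = 0.01) -> bool:
    """Send the cancel only once the target request is registered."""
    if not wait_until(cdt.is_registered, timeout_s, interval_s):
        return False  # never registered; do not send a cancel blindly
    cdt.cancel()
    return True
```

Gating the cancel on registration removes the ordering race described above, at the cost of one more bounded wait in the test.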

            People

              bfaccini Bruno Faccini (Inactive)
              maloo Maloo