  1. Lustre
  2. LU-8899

Restart test-group from next test when a test hangs

Details

    • Improvement
    • Resolution: Low Priority
    • Major
    • None
    • None
    • 5
    • 7
    • Test Infrastructure
    • 5338
    • DCO-2016_Jun20_Jul10

    Description

      As an engineer, when Autotest restarts after hanging on a particular test, I want it to resume from the next test rather than starting over completely, so that getting through the entire test suite takes less time.

      QUESTION: IN TT-926, WE ASK THAT WHEN A TEST FAILS WE STOP TESTING AND NOT CONTINUE TO NEXT TEST. IS THIS CONTRADICTING THAT REQUEST?

        Activity


          colmstea Charlie Olmstead added a comment - The --start-at option only applies to test suites listed in the command and does not apply to the "-g GROUP Test group file (Overrides tests listed on command line)" option which Autotest uses. Can this be changed to apply when -g is used? It would need to factor in the value of the -S option as well.
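One way to picture the requested behavior is the minimal Python sketch below. It is not auster code: the function `plan_group_run`, the in-memory list standing in for a group file, and the suite names are all hypothetical. It only illustrates how a start suite (the -S value) and a start subtest (the --start-at value) might be combined when running a group.

```python
def plan_group_run(group_suites, start_suite=None, start_at=None):
    """Return (suite, suite_options) pairs for a resumed group run.

    group_suites: ordered suite names from the group (hypothetical in-memory form)
    start_suite:  first suite to run, mirroring the -S option
    start_at:     subtest to resume from, applied to the first suite run only
    """
    # Without -S, the run begins at the first suite in the group.
    first_suite = start_suite if start_suite is not None else group_suites[0]
    if first_suite not in group_suites:
        raise ValueError(f"suite {first_suite!r} is not in the group")

    first = group_suites.index(first_suite)
    plan = []
    for i, suite in enumerate(group_suites[first:]):
        # Only the suite being resumed gets --start-at; later suites run in full.
        opts = ["--start-at", start_at] if i == 0 and start_at is not None else []
        plan.append((suite, opts))
    return plan


# Illustrative values only; the real group contents are not given in this ticket.
group = ["sanity", "sanityn", "replay-single"]
for suite, opts in plan_group_run(group, start_suite="sanityn", start_at="24"):
    print(suite, *opts)
```

In this sketch, suites listed before the start suite are skipped entirely, and every suite after it runs from its first subtest.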
          yujian Jian Yu added a comment -

          "The Lustre TF is responsible for writing the results.yml file; it should also be responsible for knowing where it should resume from when this flag is given."

          Writing the results.yml file is handled by init_logging(), which each test suite (e.g., sanity.sh, sanityn.sh) calls.
          Knowing where to resume is handled by the "--start-at" option of auster.

          colmstea Charlie Olmstead added a comment - I understand that the Lustre test-framework isn't responsible for rebooting or re-provisioning the nodes. Once a test hangs, Autotest would re-provision the group of nodes and then call auster with a flag to restart the tests from where it left off. The Lustre TF is responsible for writing the results.yml file; it should also be responsible for knowing where it should resume from when this flag is given.
          yujian Jian Yu added a comment -

          Hi Charlie,
          After a subtest hangs, the test nodes usually need to be rebooted or re-provisioned by the autotest system, which the Lustre test framework cannot do. Autotest knows which subtest hung and can simply start running the next subtest by invoking auster with the "--start-at" option.
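The resume step described in this comment can be sketched in Python as follows. This is a hedged illustration, not Autotest's implementation: the helpers `next_subtest` and `resume_command` are hypothetical, the ordered subtest list is assumed to be known to the caller, and the command omits any configuration options a real auster invocation would need.

```python
import shlex


def next_subtest(ordered_subtests, hung):
    """Return the subtest that follows `hung`, or None if `hung` was the last one."""
    idx = ordered_subtests.index(hung)  # raises ValueError for an unknown subtest
    if idx + 1 < len(ordered_subtests):
        return ordered_subtests[idx + 1]
    return None


def resume_command(suite, ordered_subtests, hung):
    """Build an argument list that resumes `suite` at the subtest after `hung`."""
    nxt = next_subtest(ordered_subtests, hung)
    if nxt is None:
        return None  # the hung subtest was the last one; nothing left to resume
    return ["auster", suite, "--start-at", nxt]


# Illustrative subtest order and suite name; both are assumptions.
subtests = ["23a", "23b", "24", "25", "26"]
cmd = resume_command("sanity", subtests, "23b")
print(shlex.join(cmd))
```

Returning an argument list rather than a single string keeps quoting out of the caller's hands if the command is later passed to a process launcher.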


          colmstea Charlie Olmstead added a comment - This should be an option in the Lustre test-framework, not the responsibility of Autotest.

          jlevi Jodi Levi (Inactive) added a comment - Reopening into TEI backlog

          agarcia Andrea Garcia (Inactive) added a comment - this is related to the modularization effort

          jlevi Jodi Levi (Inactive) added a comment - Will be part of upgrade

          jlevi Jodi Levi (Inactive) added a comment - Confirmed with Yu Jian that this change could save quite a lot of testing time, depending on when the test hangs and how many subtests remain. The remaining subtests must be run manually in order to be tested at all, so if the person testing does not run them, that coverage is lost.
          yujian Jian Yu added a comment -

          QUESTION: IN TT-926, WE ASK THAT WHEN A TEST FAILS WE STOP TESTING AND NOT CONTINUE TO NEXT TEST. IS THIS CONTRADICTING THAT REQUEST?

          No contradiction. This ticket is typically required for build/release testing sessions, which need to cover as many tests as possible.

          E.g., in the following "full" test group session, test 23b hung. The requirement is that after the nodes are rebooted, autotest can start testing tests 24, 25, 26, etc. instead of skipping them.
          https://maloo.whamcloud.com/test_sets/9b37618e-41db-11e2-adcf-52540035b04c

          People

            colmstea Charlie Olmstead
            chris Chris Gearing (Inactive)
            Votes: 3
            Watchers: 7
