[LU-8899] Restart test-group from next test when a test hangs - Whamcloud Community JIRA

Details

Type: Improvement
Resolution: Low Priority
Priority: Major
Fix Version/s: None
Affects Version/s: None
Labels:
- groomed-lustre-test
- triaged

Story Points:
5
Business Value:
7
Project:
Test Infrastructure
Rank (Obsolete):
5338
Sprint:
DCO-2016_Jun20_Jul10

Description

As an engineer, when autotest restarts after a test hangs, it should resume from the next test rather than start over from the beginning, so that getting through the entire test suite takes less time.

QUESTION: IN TT-926, WE ASK THAT WHEN A TEST FAILS WE STOP TESTING AND NOT CONTINUE TO NEXT TEST. IS THIS CONTRADICTING THAT REQUEST?

Activity

Charlie Olmstead added a comment - 05/Dec/16 10:13 PM

The --start-at option only applies to test suites listed in the command and does not apply to the "-g GROUP Test group file (Overrides tests listed on command line)" option which Autotest uses. Can this be changed to apply when -g is used? It would need to factor in the value of the -S option as well.
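One way the requested behaviour could work is sketched below in Python. This is only an illustration, not auster's implementation: the group file is modelled as an ordered list of suite names, `plan_group_run` is a hypothetical helper, and it assumes that -S names the first suite to run and that --start-at should affect only that first suite.

```python
from typing import List, Optional, Tuple


def plan_group_run(
    group_suites: List[str],
    start_suite: Optional[str] = None,
    start_at: Optional[str] = None,
) -> List[Tuple[str, Optional[str]]]:
    """Return (suite, start_at) pairs for a resumed group run.

    Assumptions (not confirmed by the ticket):
    - start_suite plays the role of the -S value, the first suite to run
    - start_at plays the role of --start-at and applies only to that suite
    """
    if start_suite is None:
        remaining = list(group_suites)
    else:
        # Skip every suite listed before the one we resume from.
        idx = group_suites.index(start_suite)  # ValueError if not in group
        remaining = group_suites[idx:]

    plan = []
    for pos, suite in enumerate(remaining):
        # Only the first suite resumes mid-way; later suites run in full.
        plan.append((suite, start_at if pos == 0 else None))
    return plan


# Hypothetical group contents and resume point.
group = ["sanity", "sanityn", "replay-single"]
for suite, subtest in plan_group_run(group, start_suite="sanityn", start_at="24"):
    print(suite, subtest)
```

Applying the subtest offset to the first suite only keeps later suites from being skipped into by a subtest name that happens to exist in them as well.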

Jian Yu added a comment - 02/Dec/16 11:20 PM

The Lustre TF is responsible for writing the results.yml file, so it should also be responsible for knowing where it should resume from when this flag is given.

Writing the results.yml file is handled by init_logging(), which each test suite (e.g., sanity.sh, sanityn.sh) calls.
Knowing where to resume is handled by the "--start-at" option of auster.

Charlie Olmstead added a comment - 02/Dec/16 10:40 PM

I understand the Lustre test-framework isn't responsible for rebooting/re-provisioning the nodes. Once a test hangs, Autotest would re-provision the group of nodes and then call auster with a flag to restart the tests from where it left off. The Lustre TF is responsible for writing the results.yml file, so it should also be responsible for knowing where it should resume from when this flag is given.

Jian Yu added a comment - 02/Dec/16 10:32 PM

Hi Charlie,
After a subtest hangs, the test nodes usually need to be rebooted or re-provisioned by the autotest system, which the Lustre test framework cannot do. Autotest knows which subtest hung and can simply start from the next subtest by running auster with the "--start-at" option.
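As a rough illustration of that call, the Python sketch below assembles the command Autotest might issue after re-provisioning. `build_resume_command` is a hypothetical helper, and the argument layout (suite name followed by "--start-at" and the subtest) is an assumption; only the "--start-at" option itself comes from this thread.

```python
import shlex
from typing import List


def build_resume_command(suite: str, resume_from: str, auster: str = "auster") -> List[str]:
    """Build an argument list that resumes `suite` at subtest `resume_from`.

    The ordering "<auster> <suite> --start-at <subtest>" is assumed here
    for illustration and should be checked against auster's usage text.
    """
    return [auster, suite, "--start-at", resume_from]


# Hypothetical values: the sanity suite, resuming at subtest 24.
cmd = build_resume_command("sanity", "24")
print(shlex.join(cmd))
```

Returning a list rather than a single string lets the caller hand it directly to a process launcher without shell quoting problems.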

Charlie Olmstead added a comment - 02/Dec/16 10:09 PM

This should be an option in the Lustre test-framework, not the responsibility of Autotest.

Jodi Levi (Inactive) added a comment - 18/Feb/16 3:09 PM

Reopening into TEI backlog

Andrea Garcia (Inactive) added a comment - 09/Jul/15 9:42 PM

this is related to the modularization effort

Jodi Levi (Inactive) added a comment - 09/Jul/15 9:42 PM

Will be part of the upgrade.

Jodi Levi (Inactive) added a comment - 14/Dec/12 8:29 AM

Confirmed with Yu Jian that this change could save quite a lot of testing time, depending on when the test hangs and how many subtests remain. Currently, the remaining subtests must be run manually in order to be tested at all, so if the person testing does not run them, that coverage is lost.

Jian Yu added a comment - 10/Dec/12 10:46 PM

QUESTION: IN TT-926, WE ASK THAT WHEN A TEST FAILS WE STOP TESTING AND NOT CONTINUE TO NEXT TEST. IS THIS CONTRADICTING THAT REQUEST?

No contradiction. This ticket is typically required for build/release testing sessions, which need to cover as many tests as possible.

E.g., in the following "full" test group session, test 23b hung. The requirement is that after the nodes are rebooted, autotest can continue with tests 24, 25, 26, etc. instead of skipping them.
https://maloo.whamcloud.com/test_sets/9b37618e-41db-11e2-adcf-52540035b04c
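The resume point in that example can be worked out with a few lines of Python. This is a minimal sketch: `next_subtest` is a hypothetical helper, and the subtest list is made up for illustration rather than taken from a real suite.

```python
from typing import List, Optional


def next_subtest(ordered_subtests: List[str], hung: str) -> Optional[str]:
    """Return the subtest that follows `hung`, or None if none remain."""
    idx = ordered_subtests.index(hung)  # ValueError if `hung` is unknown
    remaining = ordered_subtests[idx + 1:]
    return remaining[0] if remaining else None


# Hypothetical ordering of subtests around the hang described above.
subtests = ["23a", "23b", "24", "25", "26"]
print(next_subtest(subtests, "23b"))  # prints 24
```

The value returned is what would be passed on as the point to start at once the nodes are back up.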

People

Assignee: Charlie Olmstead

Reporter: Chris Gearing (Inactive)

Votes: 3

Watchers: 7

Dates

Created: 29/Oct/12 4:04 PM

Updated: 06/Jul/21 6:55 PM

Resolved: 06/Jul/21 6:55 PM