[LU-8899] Restart test-group from next test when a test hangs Created: 29/Oct/12  Updated: 06/Jul/21  Resolved: 06/Jul/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major
Reporter: Chris Gearing (Inactive) Assignee: Charlie Olmstead
Resolution: Low Priority Votes: 3
Labels: groomed-lustre-test, triaged

Issue Links:
Related
Story Points: 5
Business Value: 7
Project: Test Infrastructure
Rank (Obsolete): 5338
Sprint: DCO-2016_Jun20_Jul10

 Description   

As an engineer, when autotest restarts a test group after a particular test hangs, I want it to resume from the next test rather than starting over completely, so that getting through the entire test suite takes less time.

QUESTION: IN TT-926, WE ASK THAT WHEN A TEST FAILS WE STOP TESTING AND NOT CONTINUE TO NEXT TEST. IS THIS CONTRADICTING THAT REQUEST?



 Comments   
Comment by Jian Yu [ 10/Dec/12 ]

QUESTION: IN TT-926, WE ASK THAT WHEN A TEST FAILS WE STOP TESTING AND NOT CONTINUE TO NEXT TEST. IS THIS CONTRADICTING THAT REQUEST?

No contradiction. This ticket is typically required for build/release testing sessions, which need to cover as many tests as possible.

E.g., in the following "full" test group session, test 23b hung. The requirement is that after the nodes are rebooted, autotest can continue with tests 24, 25, 26, etc. instead of skipping them.
https://maloo.whamcloud.com/test_sets/9b37618e-41db-11e2-adcf-52540035b04c

Comment by Jodi Levi (Inactive) [ 14/Dec/12 ]

Confirmed with Yu Jian that this change could save quite a lot of time in testing, depending on when the test hangs and how many subtests remain. Currently, the remaining subtests must be run manually in order to be tested at all, so if the person testing does not run them, that coverage is lost.

Comment by Jodi Levi (Inactive) [ 09/Jul/15 ]

Will be part of upgrade.

Comment by Andrea Garcia (Inactive) [ 09/Jul/15 ]

This is related to the modularization effort.

Comment by Jodi Levi (Inactive) [ 18/Feb/16 ]

Reopening into TEI backlog

Comment by Charlie Olmstead [ 02/Dec/16 ]

This should be an option in the Lustre test-framework, not the responsibility of Autotest.

Comment by Jian Yu [ 02/Dec/16 ]

Hi Charlie,
After a subtest hangs, the test nodes usually need to be rebooted or re-provisioned by the autotest system, which the Lustre test framework cannot do. Autotest knows which subtest hung and can simply resume from the next subtest by running auster with the "--start-at" option.
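
As a rough illustration of the resume step described above, the following Python sketch picks the subtest after the one that hung and builds the suite arguments to pass to auster. The helper names (next_subtest, resume_args) and the subtest ordering are hypothetical; only the "--start-at" option comes from this ticket, and how Autotest would actually learn the ordering is an assumption.

```python
from typing import List, Optional, Sequence


def next_subtest(ordered: Sequence[str], hung: str) -> Optional[str]:
    """Return the subtest that follows `hung`, or None if it was the last one."""
    idx = ordered.index(hung)  # raises ValueError if `hung` is not in the list
    return ordered[idx + 1] if idx + 1 < len(ordered) else None


def resume_args(suite: str, ordered: Sequence[str], hung: str) -> List[str]:
    """Build the per-suite arguments for resuming after a hang (hypothetical helper)."""
    nxt = next_subtest(ordered, hung)
    if nxt is None:
        return []  # the hung subtest was the last one; nothing left in this suite
    return [suite, "--start-at", nxt]


# Hypothetical ordering, mirroring the example where test 23b hung.
subtests = ["23a", "23b", "24", "25", "26"]
print(resume_args("sanity", subtests, "23b"))  # ['sanity', '--start-at', '24']
```

The sketch assumes the ordered subtest list is available to the caller; if it is not, the resume point would have to be derived some other way.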

Comment by Charlie Olmstead [ 02/Dec/16 ]

I understand the Lustre test-framework isn't responsible for rebooting/re-provisioning the nodes. Once a test hangs, Autotest would re-provision the group of nodes and then call auster with a flag to restart the tests from where they left off. The Lustre TF is responsible for writing the results.yml file; it should also be responsible for knowing where it should resume from when this flag is given.

Comment by Jian Yu [ 02/Dec/16 ]

The Lustre TF is responsible for writing the results.yml file; it should also be responsible for knowing where it should resume from when this flag is given.

Writing the results.yml file is handled by init_logging(), which each test suite (e.g., sanity.sh, sanityn.sh, etc.) calls.
Knowing where to resume is handled by the "--start-at" option for auster.

Comment by Charlie Olmstead [ 05/Dec/16 ]

The --start-at option only applies to test suites listed on the command line and does not apply to the "-g GROUP Test group file (Overrides tests listed on command line)" option, which Autotest uses. Can this be changed to apply when -g is used? It would need to factor in the value of the -S option as well.
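
One possible reading of the requested behavior, sketched in Python: when a group is given, skip every suite before the one named by -S, apply --start-at only to that first suite, and run the remaining suites in full. This is an assumption about the desired semantics, not a description of how auster currently behaves; plan_resume and the group contents are hypothetical.

```python
from typing import List, Optional


def plan_resume(group: List[str], first_suite: str,
                start_at: Optional[str] = None) -> List[List[str]]:
    """Return per-suite argument lists for a resumed group run (hypothetical)."""
    start = group.index(first_suite)  # raises ValueError if the suite is not in the group
    plan = []
    for i, suite in enumerate(group[start:]):
        if i == 0 and start_at is not None:
            # Only the suite that hung resumes partway through.
            plan.append([suite, "--start-at", start_at])
        else:
            # Later suites have not run yet, so they run in full.
            plan.append([suite])
    return plan


# Hypothetical group contents; "example-suite" is a placeholder name.
group = ["sanity", "sanityn", "example-suite"]
print(plan_resume(group, "sanityn", "5"))
# [['sanityn', '--start-at', '5'], ['example-suite']]
```

Applying --start-at to every suite in the group would skip subtests in suites that never ran, which is why the sketch limits it to the first resumed suite.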

Generated at Sat Feb 10 02:21:30 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.