[LU-11841] Reduce the run time of the failover test group - Whamcloud Community JIRA

Details

Type: Task
Resolution: Unresolved
Priority: Minor
Fix Version/s: None
Affects Version/s: Lustre 2.11.0, Lustre 2.12.0, Lustre 2.13.0, Lustre 2.10.6
Labels:
- test_script_improvements
- tests

Description

Right now, the failover test group contains recovery-mds-scale, recovery-random-scale, recovery-double-scale, recovery-small, replay-ost-single, replay-dual, replay-vbr, mmp and replay-single. Note: for the failover group, FAILURE_MODE=HARD. The duration for each test in each of the test scripts recovery-mds-scale, recovery-random-scale, and recovery-double-scale is 24 hours.

recovery-mds-scale is composed of two tests, failover_mds and failover_ost. In the failover test group that is run for every branch build, each test takes 24 hours to run. Since these tests are run serially, it takes 48 hours to run recovery-mds-scale and there is no hope of running these tests in parallel. Thus, we should break up recovery-mds-scale into recovery-mds-scale, containing MDS failover/recovery, and recovery-oss-scale containing OSS failover/recovery.
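
The run-time argument above can be checked with a little arithmetic. The following is only a sketch: the 24-hour figures come from this description, while the variable names are made up for illustration and are not part of any Lustre script.

```shell
#!/bin/bash
# Durations in hours, as stated in the description above.
# Variable names are illustrative only.
failover_mds_hours=24
failover_ost_hours=24

# Serial: the two tests run back to back, so the durations add up.
serial_hours=$((failover_mds_hours + failover_ost_hours))

# Parallel (after a split): wall-clock time is bounded by the longer test.
if [ "$failover_mds_hours" -ge "$failover_ost_hours" ]; then
    parallel_hours=$failover_mds_hours
else
    parallel_hours=$failover_ost_hours
fi

echo "serial: ${serial_hours}h, parallel: ${parallel_hours}h"
```

With the numbers above this reports 48 hours for the serial run and 24 hours once the two tests live in separate scripts that can be scheduled side by side.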

If you look at the time it takes to run the failover test group that is run for every branch build, it takes approximately 3 days. We should break up this test group.

We should do this in several steps:
1. Submit a patch that duplicates recovery-mds-scale and names the copy recovery-oss-scale. We can make some cosmetic cleanups to the code, but the current script will otherwise be kept as is so that the patch is easy to review and land, provided we can get the recovery tests to pass.
2. At this point, we should break up the failover test group. After step 1 is complete, we could break up the failover test group into four different failover test groups:
failover_mds test group containing recovery-mds-scale with approximate run time of 24 hours
failover_oss test group containing recovery-oss-scale with approximate run time of 24 hours
failover_random test group containing recovery-random-scale with approximate run time of 24 hours
(new) failover test group containing recovery-small, replay-ost-single, replay-dual, replay-vbr, mmp and replay-single with approximate run time of 10 hours

3. We can now clean up the duplicate code created by copying recovery-mds-scale.sh. Let’s create a new test function library in lustre/tests, called recovery-scale-lib (or something like that), that contains all code common to recovery-mds-scale, recovery-oss-scale, and recovery-random-scale.
There are also functions in test-framework.sh that are only used by the recovery test suites, including start_client_load(), check_client_load(), restart_client_loads(), print_end_run_file(), etc., marked with “# recovery-scale functions” and “# End recovery-scale functions” that can be moved into the new recovery library.
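
As a starting point for the move, the marked block could be pulled out of test-framework.sh mechanically. This is a rough sketch, not a tested patch: it assumes the two marker comments appear in the file exactly as quoted above, and the function name extract_recovery_scale_block and the output file name are hypothetical.

```shell
#!/bin/bash
# Print every line between the two marker comments, markers included.
# The marker text is taken from this ticket; the function name is hypothetical.
extract_recovery_scale_block() {
    local framework_file=$1
    sed -n '/# recovery-scale functions/,/# End recovery-scale functions/p' \
        "$framework_file"
}

# Example invocation (paths are assumptions, adjust to the actual tree):
# extract_recovery_scale_block lustre/tests/test-framework.sh > recovery-scale-lib.sh
```

The extracted functions would still need to be removed from test-framework.sh and the new library sourced by each recovery script, which this sketch does not attempt.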

Sub-Tasks

split recovery-mds-scale into two test sets: recovery-mds-scale and recovery-ost-scale (Status: In Progress; Assignee: Alex Deiter)

People

Assignee: Alex Deiter

Reporter: James Nunez (Inactive)

Votes: 0

Watchers: 3

Dates

Created: 09/Jan/19 2:05 AM

Updated: 26/Jan/23 1:52 AM