  1. Lustre
  2. LU-11841

Reduce the run time of the failover test group

    Details

    • Type: Task
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: Lustre 2.11.0, Lustre 2.12.0, Lustre 2.13.0, Lustre 2.10.6
    • Fix Version/s: None

      Description

      Right now, the failover test group contains recovery-mds-scale, recovery-random-scale, recovery-double-scale, recovery-small, replay-ost-single, replay-dual, replay-vbr, mmp and replay-single. Note: for the failover group, FAILURE_MODE=HARD. The duration for each test in each of the test scripts recovery-mds-scale, recovery-random-scale, and recovery-double-scale is 24 hours.

      recovery-mds-scale is composed of two tests, failover_mds and failover_ost. In the failover test group that is run for every branch build, each test takes 24 hours to run. Since these tests are run serially, it takes 48 hours to run recovery-mds-scale and there is no hope of running these tests in parallel. Thus, we should break up recovery-mds-scale into recovery-mds-scale, containing MDS failover/recovery, and recovery-oss-scale containing OSS failover/recovery.
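As a rough illustration of the saving, the arithmetic can be written out as follows. This is only a sketch: the 24-hour figures come from the description above, and it assumes the two resulting suites can be scheduled concurrently on separate test clusters, which this ticket does not guarantee. The variable names are illustrative.

```shell
# Hours per test, taken from the description above.
failover_mds_hours=24
failover_ost_hours=24

# Today: one suite runs both tests back to back.
serial_hours=$((failover_mds_hours + failover_ost_hours))

# After the split: two suites run side by side (assumption), so the
# wall-clock time is bounded by the longer of the two tests.
if [ "$failover_mds_hours" -gt "$failover_ost_hours" ]; then
	parallel_hours=$failover_mds_hours
else
	parallel_hours=$failover_ost_hours
fi

echo "one suite, serial:    ${serial_hours}h"
echo "two suites, parallel: ${parallel_hours}h"
```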

      The failover test group that is run for every branch build takes approximately 3 days to complete. We should break up this test group.

      We should do this in several steps:
      1. Submit a patch that duplicates recovery-mds-scale and names the copy recovery-oss-scale. We can make some cosmetic cleanup to the code, but the current script will be kept as is so that the patch is easy to review and land, provided we can get the recovery tests to pass.
      2. At this point, we should break up the failover test group. After step 1 is complete, we could split it into four different failover test groups:
      failover_mds test group containing recovery-mds-scale with approximate run time of 24 hours
      failover_oss test group containing recovery-oss-scale with approximate run time of 24 hours
      failover_random test group containing recovery-random-scale with approximate run time of 24 hours
      (new) failover test group containing recovery-small, replay-ost-single, replay-dual, replay-vbr, mmp and replay-single with approximate run time of 10 hours

      3. We can now clean up the duplicate code created by copying recovery-mds-scale.sh. Let’s create a new test function library in lustre/tests, called recovery-scale-lib (or something like that), that contains all code common to recovery-mds-scale, recovery-oss-scale, and recovery-random-scale.
      There are also functions in test-framework.sh that are only used by the recovery test suites, including start_client_load(), check_client_load(), restart_client_loads(), print_end_run_file(), etc., marked with “# recovery-scale functions” and “# End recovery-scale functions” that can be moved into the new recovery library.
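One possible shape for step 3 is sketched below. It is not the existing Lustre code: recovery_scale_loop, FAIL_CMD, and the facet names are hypothetical and only show how a shared library function could be parameterized so that each suite becomes a thin wrapper. FAIL_CMD defaults to echo so the sketch runs without a cluster; a real suite would call the test framework's failover helper instead.

```shell
# --- would live in lustre/tests/recovery-scale-lib.sh (hypothetical) ---

# Fail over the given facets in round-robin order.
# $1:   number of failovers to perform (stand-in for the real
#       duration-based loop condition)
# $2..: facet names to fail over
# FAIL_CMD: command run once per failover (hypothetical hook)
recovery_scale_loop() {
	rs_count=$1
	shift
	[ $# -gt 0 ] || return 1	# need at least one facet

	rs_done=0
	while [ "$rs_done" -lt "$rs_count" ]; do
		for rs_facet in "$@"; do
			[ "$rs_done" -lt "$rs_count" ] || break
			${FAIL_CMD:-echo} "$rs_facet"
			rs_done=$((rs_done + 1))
		done
	done
}

# --- would live in recovery-mds-scale.sh ---
test_failover_mds() {
	recovery_scale_loop "$1" mds1
}

# --- would live in recovery-oss-scale.sh ---
test_failover_ost() {
	recovery_scale_loop "$1" ost1 ost2
}

test_failover_mds 2
test_failover_ost 3
```

With a layout like this, each suite script keeps only its facet list and test wrapper, so a fix to the shared loop lands once instead of once per copied script.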

            People

            • Assignee:
              wc-triage WC Triage
              Reporter:
              jamesanunez James Nunez