[LU-6773] DNE2 Failover and recovery soak testing

Details

    • Type: Task
    • Resolution: Fixed
    • Priority: Blocker
    Description

      With async update, cross-MDT operations do not need to synchronize updates on each target. Instead, updates are recorded on each target, and recovery of the file system from failure takes place using these update records. All operations across MDTs are enabled; for example, cross-MDT rename and link succeed and do not return -EXDEV, so a workload like dbench that performs renames should function correctly in a striped directory.
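      For illustration, a minimal sketch of the kind of cross-MDT rename that async update enables (the mount point /mnt/lustre and directory names are assumptions, not part of the test plan):

          # Create two directories on different MDTs (MDT0000 and MDT0001).
          lfs mkdir -i 0 /mnt/lustre/dir_mdt0
          lfs mkdir -i 1 /mnt/lustre/dir_mdt1

          # Rename a file across the two MDTs; with async update this should
          # succeed in place rather than returning -EXDEV to the application.
          touch /mnt/lustre/dir_mdt0/file
          mv /mnt/lustre/dir_mdt0/file /mnt/lustre/dir_mdt1/file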
      1. Set up Lustre with 4 MDSes (each MDS has one MDT), 4 OSTs, and at least 8 clients.
      2. Each client creates a striped directory (with stripe count = 4). Under its striped directory:
         a. 1/2 of the clients repeatedly run tar and untar.
         b. 1/2 of the clients run dbench.
      3. Randomly reboot one of the MDSes at least once every 30 minutes and fail over to the backup MDS if the test configuration allows it.
      4. The test should keep running for at least 24 hours without reporting application errors. (A rough per-client workload sketch follows this description.)
      The goal of the failover and recovery soak testing is not necessarily to resolve every issue found during testing, especially non-DNE issues, but rather to have a good idea of the relative stability of DNE + Async Commits during recovery.
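      A rough sketch of the per-client workload in steps 2 and 4, assuming a client mount at /mnt/lustre and a source tarball at /tmp/src.tar (both hypothetical). The MDS reboot/failover in step 3 is driven separately (the logs below come from the test framework's failover_mds runs):

          #!/bin/bash
          # Per-client workload sketch: create a 4-way striped directory and
          # run either a tar/untar loop or dbench in it for 24 hours.
          MNT=/mnt/lustre                      # assumed client mount point
          DIR=$MNT/stripe_$(hostname)
          END=$((SECONDS + 24*3600))

          # Stripe the new directory across all 4 MDTs (DNE2 striped directory).
          lfs mkdir -c 4 "$DIR"

          if [ "$1" = "tar" ]; then
              # Half of the clients: repeatedly untar and re-tar a source tree.
              while [ "$SECONDS" -lt "$END" ]; do
                  tar -xf /tmp/src.tar -C "$DIR" || exit 1
                  tar -cf /dev/null -C "$DIR" . || exit 1
                  rm -rf "$DIR"/*
              done
          else
              # The other half: dbench with 10 processes, restarted until time is up.
              while [ "$SECONDS" -lt "$END" ]; do
                  dbench -D "$DIR" -t 600 10 || exit 1
              done
          fi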

          Activity

            di.wang Di Wang added a comment -

            The build https://build.hpdd.intel.com/job/lustre-reviews/33759/ is based on master with the following patches:

                • http://review.whamcloud.com/#/c/15812/ (LU-6928)
                • http://review.whamcloud.com/#/c/15793/ (LU-6924)
                • http://review.whamcloud.com/#/c/15730/ (LU-6846)
                • http://review.whamcloud.com/#/c/15725/ (LU-6905)
                • http://review.whamcloud.com/#/c/15721/ (LU-6896)
                • http://review.whamcloud.com/#/c/15691/ (LU-6875)
                • http://review.whamcloud.com/#/c/15690/ (LU-6881)
                • http://review.whamcloud.com/#/c/15682/ (LU-6882)
                • http://review.whamcloud.com/#/c/15595/ (LU-6846)
                • http://review.whamcloud.com/#/c/15594/ (LU-6819)
                • http://review.whamcloud.com/#/c/15576/ (LU-6840)
                • http://review.whamcloud.com/#/c/14497/ (LU-6475)
                • http://review.whamcloud.com/#/c/13224/ (LU-6852)
            di.wang Di Wang added a comment - edited

            OK, this test just passed with build https://build.hpdd.intel.com/job/lustre-reviews/33759/ . Here is the test log:

            ==== Checking the clients loads AFTER failover -- failure NOT OK
            mds4 has failed over 9 times, and counting...
            2015-08-01 16:59:58 Terminating clients loads ...
            Duration:               86400
            Server failover period: 1800 seconds
            Exited after:           84832 seconds
            Number of failovers before exit:
            mds1: 16 times
            mds2: 7 times
            mds3: 16 times
            mds4: 9 times
            ost1: 0 times
            ost2: 0 times
            ost3: 0 times
            ost4: 0 times
            Status: PASS: rc=0
            PASS failover_mds (84837s)
            
            di.wang Di Wang added a comment - edited

            The test fails after 41 failovers with build https://build.hpdd.intel.com/job/lustre-reviews/33612/

            Duration:               86400
            Server failover period: 1800 seconds
            Exited after:           72283 seconds
            Number of failovers before exit:
            mds1: 10 times
            mds2: 10 times
            mds3: 10 times
            mds4: 11 times
            ost1: 0 times
            ost2: 0 times
            ost3: 0 times
            ost4: 0 times
            Status: FAIL: rc=7
            
            di.wang Di Wang added a comment -

            With build https://build.hpdd.intel.com/job/lustre-reviews/33580/ (hard reboot, 10-minute reboot interval), the test fails after 35 failovers. The target is 48 failovers.

            Duration:               86400
            Server failover period: 600 seconds
            Exited after:           22249 seconds
            Number of failovers before exit:
            mds1: 6 times
            mds2: 12 times
            mds3: 9 times
            mds4: 8 times
            ost1: 0 times
            ost2: 0 times
            ost3: 0 times
            ost4: 0 times
            Status: FAIL: rc=7
            

            rhenwood Richard Henwood (Inactive) added a comment - edited

            During discussion, we have identified three stages of testing for this ticket (a rough sketch of the first two stages follows this list):

            1. Soft recovery: forced unmount of an active file system; remount after a period of time.
            2. Hard recovery: hard reboot of an MDS of an active file system.
            3. Hard recovery with failover: hard reboot of an MDS of an active file system; the file system remains available throughout.
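            As a rough illustration of the first two stages (the device path, mount point, and MDS hostname below are assumptions, not the actual cluster configuration):

                # Stage 1, soft recovery: forced unmount of an MDT on an active
                # file system, then remount after a period of time.
                ssh mds1 'umount -f /mnt/lustre-mds1'
                sleep 300
                ssh mds1 'mount -t lustre /dev/mapper/mdt0 /mnt/lustre-mds1'

                # Stage 2, hard recovery: hard reboot of the MDS while the file
                # system is active (sysrq 'b' reboots without a clean shutdown).
                ssh mds1 'echo b > /proc/sysrq-trigger'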

            rhenwood Richard Henwood (Inactive) added a comment -

            An update on this activity:

            James Nunez has been engaged with recovery testing on the OpenSFS cluster hosted by IU for the last two weeks. Over the past week, IU have been fully engaged in helping us with a stretch goal to enable a failover configuration. No mechanism to gracefully force the logical drives to all run on a single controller could be identified. It is expected that physically pulling a controller may force a failover, but this activity cannot be scheduled for the 24-hour duration required by the test.

            I'm investigating alternatives.


            rhenwood Richard Henwood (Inactive) added a comment -

            Also: please collect logs from this work during the run and attach them to this ticket.
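            One possible way to do this, sketched with assumed hostnames and paths ('lctl dk' dumps the Lustre kernel debug buffer):

                # After each failover event, dump the Lustre debug buffer and
                # dmesg on every server so the logs can be attached to this ticket.
                for node in mds1 mds2 mds3 mds4 ost1 ost2 ost3 ost4; do
                    ssh "$node" 'lctl dk /tmp/lustre-debug-$(hostname)-$(date +%s).log'
                    ssh "$node" 'dmesg > /tmp/dmesg-$(hostname)-$(date +%s).log'
                done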

            di.wang Di Wang added a comment -

            The RPMs from https://build.hpdd.intel.com/job/lustre-reviews/33136/ have been installed on all of the nodes.


            People

              jamesanunez James Nunez (Inactive)
              rhenwood Richard Henwood (Inactive)
              Votes: 0
              Watchers: 5
