DNE2 Failover and recovery soak testing

Details

    • Task
    • Resolution: Fixed
    • Blocker

    Description

      With async updates, cross-MDT operations do not need to synchronize updates on each target. Instead, updates are recorded on each target, and recovery of the filesystem after a failure uses these update records. All cross-MDT operations are enabled; for example, cross-MDT rename and link succeed and do not return -EXDEV, so a workload such as dbench that performs renames should function correctly in a striped directory.
      1. Set up Lustre with 4 MDSes (each MDS has one MDT), 4 OSTs, and at least 8 clients.
      2. Each client creates a striped directory (stripe count = 4). Under each striped directory:
         a. Half of the clients repeatedly run tar and untar in the striped directory.
         b. The other half of the clients run dbench under the striped directory.
      3. Randomly reboot one of the MDSes at least once every 30 minutes, failing over to the backup MDS if the test configuration allows it.
      4. The test should keep running for at least 24 hours without reporting an application error.
      The goal of the failover and recovery soak testing is not necessarily to resolve every issue found during testing, especially non-DNE issues, but rather to have a good idea of the relative stability of DNE + Async Commits during recovery.
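
      The test plan in this description can be sketched as a small planning helper. This is only an illustration, not the harness used for this ticket: the function names, client and MDS names, mount path, and the 10-minute minimum gap between reboots are all assumptions introduced here. The sketch splits the clients evenly between the two workloads, builds the `lfs mkdir -c 4` command for a striped directory, and draws a random reboot schedule in which no interval exceeds 30 minutes over a 24-hour run.

```python
import random


def assign_workloads(clients):
    """Give the first half of the clients tar/untar and the rest dbench."""
    half = len(clients) // 2
    return {
        client: ("tar_untar" if index < half else "dbench")
        for index, client in enumerate(clients)
    }


def striped_dir_command(path, stripe_count=4):
    """Build the command that creates a directory striped over several MDTs."""
    return ["lfs", "mkdir", "-c", str(stripe_count), path]


def failover_schedule(mds_nodes, duration_s=24 * 3600,
                      max_gap_s=30 * 60, min_gap_s=10 * 60, seed=0):
    """Return (time_in_seconds, mds_name) reboot events for one run.

    Gaps are drawn uniformly between min_gap_s and max_gap_s, so an MDS is
    rebooted at least once every max_gap_s seconds. min_gap_s is a
    hypothetical lower bound, not a value taken from the ticket.
    """
    rng = random.Random(seed)
    events = []
    now = 0
    while True:
        now += rng.randint(min_gap_s, max_gap_s)
        if now >= duration_s:
            break
        events.append((now, rng.choice(mds_nodes)))
    return events


if __name__ == "__main__":
    clients = [f"client{i}" for i in range(8)]          # hypothetical names
    mds_nodes = ["mds1", "mds2", "mds3", "mds4"]        # hypothetical names
    for client, workload in assign_workloads(clients).items():
        command = " ".join(striped_dir_command(f"/mnt/lustre/{client}"))
        print(client, workload, command)
    for when, node in failover_schedule(mds_nodes)[:5]:
        print(f"t={when}s reboot {node}")
```

      A fixed seed keeps the reboot schedule reproducible, which makes it easier to rerun the same sequence of failovers when investigating a failure.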

          Activity

            [LU-6773] DNE2 Failover and recovery soak testing
            donut-crowd Donut Crowd (Inactive) made changes -
            Remote Link Original: This issue links to "Page (HPDD Community Wiki)" [ 14886 ]
            di.wang Di Wang made changes -
            Resolution New: Fixed [ 1 ]
            Status Original: Open [ 1 ] New: Closed [ 6 ]
            di.wang Di Wang made changes -
            Description: step 1 changed from "each MDS has two MDTs" (Original) to "each MDS has one MDT" (New); the rest of the description is unchanged.
            di.wang Di Wang made changes -
            Attachment New: recovery-mds-scale.suite_log.c24.log [ 18538 ]
            Attachment New: test_logs.tgz [ 18539 ]
            di.wang Di Wang made changes -
            Link New: This issue is related to LU-6831 [ LU-6831 ]
            di.wang Di Wang made changes -
            Attachment New: recovery-mds-scale.suite_log.c24.log [ 18481 ]
            rhenwood Richard Henwood (Inactive) made changes -
            Link New: This issue is blocking LU-6858 [ LU-6858 ]
            rhenwood Richard Henwood (Inactive) made changes -
            Link Original: This issue is blocking LU-5658 [ LU-5658 ]
            rhenwood Richard Henwood (Inactive) made changes -
            Link New: This issue is blocking LU-5658 [ LU-5658 ]

            People

              jamesanunez James Nunez (Inactive)
              rhenwood Richard Henwood (Inactive)
              Votes: 0
              Watchers: 5
