[LU-6773] DNE2 Failover and recovery soak testing Created: 29/Jun/15  Updated: 14/Jun/18  Resolved: 26/Aug/15

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Task Priority: Blocker
Reporter: Richard Henwood (Inactive) Assignee: James Nunez (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Attachments: Text File recovery-mds-scale.suite_log.c24.log     Text File recovery-mds-scale.test_failover_mds.test_log.c24.log     Text File test_logs.tgz
Issue Links:
Blocker
is blocking LU-6858 Demonstrate DNE2 functionality Open
is blocked by LU-6837 MDS panic during 24 hours failover test. Resolved
is blocked by LU-6840 update memory reply data in DNE updat... Resolved
is blocked by LU-6852 MDS is evicted during 24-24 hours fai... Resolved
Related
is related to LU-6831 The ticket for tracking all DNE2 bugs Reopened

 Description   

With async update, cross-MDT operations do not need to synchronize updates on each target. Instead, updates are recorded on each target, and recovery of the filesystem after a failure is performed using these update records. All operations across MDTs are enabled; for example, cross-MDT rename and link succeed instead of returning -EXDEV, so a workload like dbench that performs renames should function correctly in a striped directory.
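
To illustrate (the mount point and directory names below are hypothetical, not taken from the test configuration), with directories on two different MDTs, a cross-MDT rename now completes rather than failing:

# Create directories on two different MDTs (assumes a client
# mount point of /mnt/lustre; paths are illustrative).
lfs mkdir -i 0 /mnt/lustre/dir-mdt0
lfs mkdir -i 1 /mnt/lustre/dir-mdt1

# With DNE2 async updates this cross-MDT rename completes
# instead of returning -EXDEV.
touch /mnt/lustre/dir-mdt0/file
mv /mnt/lustre/dir-mdt0/file /mnt/lustre/dir-mdt1/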
1. Set up Lustre with 4 MDSes (one MDT per MDS), 4 OSTs, and at least 8 clients.
2. Each client creates a striped directory (stripe count = 4; see the sketch at the end of this description). Under each striped directory:
   1. half of the clients repeatedly tar and untar files in the striped directory;
   2. the other half run dbench in the striped directory.
3. Randomly reboot one of the MDSes at least once every 30 minutes, failing over to the backup MDS if the test configuration allows it.
4. The test should keep running for at least 24 hours without reporting application errors.
The goal of the failover and recovery soak testing is not necessarily to resolve every issue found during testing, especially non-DNE issues, but rather to have a good idea of the relative stability of DNE + Async Commits during recovery.
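
For reference, here is a minimal sketch of steps 1-2 from the client side (the mount point, directory name, worker count, and source tarball are illustrative assumptions, not the actual test scripts):

# Create a directory striped across all 4 MDTs (stripe count = 4).
lfs setdirstripe -c 4 /mnt/lustre/stripedir
lfs getdirstripe /mnt/lustre/stripedir      # verify the layout

# Half of the clients: repeated tar/untar inside the striped directory.
cd /mnt/lustre/stripedir
while true; do
    tar xzf /tmp/src.tgz && tar czf out.tgz src && rm -rf src out.tgz
done

# The other half: dbench inside the striped directory (8 workers).
dbench -D /mnt/lustre/stripedir 8

dbench is relevant here precisely because it renames files, which exercises the cross-MDT rename path in a striped directory.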



 Comments   
Comment by Di Wang [ 06/Jul/15 ]

The RPMs from https://build.hpdd.intel.com/job/lustre-reviews/33136/ have been installed on all of the nodes.

Comment by Richard Henwood (Inactive) [ 09/Jul/15 ]

Also: please collect logs from this work during the run and attach them to this ticket.

Comment by Richard Henwood (Inactive) [ 15/Jul/15 ]

An update on this activity:

James Nunez has been engaged in recovery testing on the OpenSFS cluster hosted by IU for the last two weeks. Over the past week, IU has been fully engaged in helping us with a stretch goal to enable a fail-over configuration. No mechanism could be identified to gracefully force the logical drives to all run on a single controller. Physically pulling a controller is expected to force a fail-over, but that activity cannot be sustained for the 24-hour duration required by the test.

I'm investigating alternatives.

Comment by Richard Henwood (Inactive) [ 15/Jul/15 ]

During discussion, we have identified three stages of testing for this ticket:

  1. soft recovery: forced unmount of an active file system, followed by a remount after a period of time.
  2. hard recovery: hard reboot of an MDS of an active file system.
  3. hard recovery with fail-over: hard reboot of an MDS of an active file system; the file system remains available throughout.
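
For stage 1, a minimal sketch of the soft-recovery cycle on one MDS (device path, mount point, and sleep interval are illustrative assumptions):

# Forcibly unmount the MDT while the file system is active.
umount -f /mnt/lustre-mds1

# Wait for a period, then remount; clients reconnect and
# the MDT goes through recovery.
sleep 300
mount -t lustre /dev/mapper/mds1 /mnt/lustre-mds1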
Comment by Di Wang [ 25/Jul/15 ]

With build https://build.hpdd.intel.com/job/lustre-reviews/33580/ and hard reboots (10-minute reboot interval), the test failed after 35 failovers. The target is 48 failovers.

Duration:               86400
Server failover period: 600 seconds
Exited after:           22249 seconds
Number of failovers before exit:
mds1: 6 times
mds2: 12 times
mds3: 9 times
mds4: 8 times
ost1: 0 times
ost2: 0 times
ost3: 0 times
ost4: 0 times
Status: FAIL: rc=7
Comment by Di Wang [ 28/Jul/15 ]

The test failed after 41 failovers with build https://build.hpdd.intel.com/job/lustre-reviews/33612/

Duration:               86400
Server failover period: 1800 seconds
Exited after:           72283 seconds
Number of failovers before exit:
mds1: 10 times
mds2: 10 times
mds3: 10 times
mds4: 11 times
ost1: 0 times
ost2: 0 times
ost3: 0 times
ost4: 0 times
Status: FAIL: rc=7
Comment by Di Wang [ 02/Aug/15 ]

OK, this test just passed with build https://build.hpdd.intel.com/job/lustre-reviews/33759/. Here is the test log:

==== Checking the clients loads AFTER failover -- failure NOT OK
mds4 has failed over 9 times, and counting...
2015-08-01 16:59:58 Terminating clients loads ...
Duration:               86400
Server failover period: 1800 seconds
Exited after:           84832 seconds
Number of failovers before exit:
mds1: 16 times
mds2: 7 times
mds3: 16 times
mds4: 9 times
ost1: 0 times
ost2: 0 times
ost3: 0 times
ost4: 0 times
Status: PASS: rc=0
PASS failover_mds (84837s)
Comment by Di Wang [ 03/Aug/15 ]

The build https://build.hpdd.intel.com/job/lustre-reviews/33759/ is based on master with the following patches:

http://review.whamcloud.com/#/c/15812/ (LU-6928)
http://review.whamcloud.com/#/c/15793/ (LU-6924)
http://review.whamcloud.com/#/c/15730/ (LU-6846)
http://review.whamcloud.com/#/c/15725/ (LU-6905)
http://review.whamcloud.com/#/c/15721/ (LU-6896)
http://review.whamcloud.com/#/c/15691/ (LU-6875)
http://review.whamcloud.com/#/c/15690/ (LU-6881)
http://review.whamcloud.com/#/c/15682/ (LU-6882)
http://review.whamcloud.com/#/c/15595/ (LU-6846)
http://review.whamcloud.com/#/c/15594/ (LU-6819)
http://review.whamcloud.com/#/c/15576/ (LU-6840)
http://review.whamcloud.com/#/c/14497/ (LU-6475)
http://review.whamcloud.com/#/c/13224/ (LU-6852)
