[LU-6773] DNE2 Failover and recovery soak testing Created: 29/Jun/15 Updated: 14/Jun/18 Resolved: 26/Aug/15 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Task | Priority: | Blocker |
| Reporter: | Richard Henwood (Inactive) | Assignee: | James Nunez (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
| Issue Links: |
|
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
With async update, cross-MDT operations do not need to synchronize updates on each target. Instead, updates are recorded on each target, and recovery of the filesystem after a failure is carried out using these update records. All operations across MDTs are enabled; for example, cross-MDT rename and link succeed instead of returning -EXDEV, so a workload like dbench that performs renames should function correctly in a striped directory. |
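For illustration only (the mount point, directory names, and stripe count below are assumptions, not taken from this ticket), a cross-MDT rename in a DNE2 setup can be exercised roughly as follows:

    # Create directories on two different MDTs (hypothetical /mnt/lustre mount point)
    lfs mkdir -i 0 /mnt/lustre/dir_mdt0
    lfs mkdir -i 1 /mnt/lustre/dir_mdt1

    # Create a directory striped across two MDTs
    lfs mkdir -c 2 /mnt/lustre/striped_dir

    # With async update enabled, a rename across MDT boundaries should succeed
    # instead of failing with -EXDEV
    touch /mnt/lustre/dir_mdt0/testfile
    mv /mnt/lustre/dir_mdt0/testfile /mnt/lustre/dir_mdt1/testfile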
| Comments |
| Comment by Di Wang [ 06/Jul/15 ] |
|
The RPMs from https://build.hpdd.intel.com/job/lustre-reviews/33136/ have been installed on all of the nodes. |
| Comment by Richard Henwood (Inactive) [ 09/Jul/15 ] |
|
Also: please collect logs from this work during the run and attach them to this ticket. |
| Comment by Richard Henwood (Inactive) [ 15/Jul/15 ] |
|
An update on this activity: James Nunez has been engaged in recovery testing on the OpenSFS cluster hosted by IU for the last two weeks. Over the past week, IU has been fully engaged in helping us with a stretch goal to enable a failover configuration. No mechanism to gracefully force the logical drives to all run on a single controller could be identified. It is expected that physically pulling a controller may force a failover, but this activity cannot be scheduled for the 24-hour duration required by the test. I'm investigating alternatives. |
| Comment by Richard Henwood (Inactive) [ 15/Jul/15 ] |
|
During discussion, we have identified three stages of testing for this ticket:
|
| Comment by Di Wang [ 25/Jul/15 ] |
|
With build https://build.hpdd.intel.com/job/lustre-reviews/33580/ , hard reboot (10-minute reboot interval). The test fails after 35 failovers; the target is 48 failovers.
Duration: 86400
Server failover period: 600 seconds
Exited after: 22249 seconds
Number of failovers before exit:
mds1: 6 times
mds2: 12 times
mds3: 9 times
mds4: 8 times
ost1: 0 times
ost2: 0 times
ost3: 0 times
ost4: 0 times
Status: FAIL: rc=7 |
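For context, summaries in this format come from the failover soak test in the Lustre test framework; a run like the one above would typically be launched along these lines (a rough sketch only — the script path and environment variable values here are assumptions based on the reported numbers, not taken from this ticket):

    # Hedged sketch of launching the MDS failover soak test from an installed tree
    cd /usr/lib64/lustre/tests
    DURATION=86400 SERVER_FAILOVER_PERIOD=600 FAILURE_MODE=HARD \
        bash recovery-mds-scale.sh

FAILURE_MODE=HARD is assumed here to correspond to the hard-reboot failover described above.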
| Comment by Di Wang [ 28/Jul/15 ] |
|
The test fails after 41 failovers with build https://build.hpdd.intel.com/job/lustre-reviews/33612/ .
Duration: 86400
Server failover period: 1800 seconds
Exited after: 72283 seconds
Number of failovers before exit:
mds1: 10 times
mds2: 10 times
mds3: 10 times
mds4: 11 times
ost1: 0 times
ost2: 0 times
ost3: 0 times
ost4: 0 times
Status: FAIL: rc=7 |
| Comment by Di Wang [ 02/Aug/15 ] |
|
OK, this test just passed with build https://build.hpdd.intel.com/job/lustre-reviews/33759/ . Here is the test log.
==== Checking the clients loads AFTER failover -- failure NOT OK
mds4 has failed over 9 times, and counting...
2015-08-01 16:59:58 Terminating clients loads ...
Duration: 86400
Server failover period: 1800 seconds
Exited after: 84832 seconds
Number of failovers before exit:
mds1: 16 times
mds2: 7 times
mds3: 16 times
mds4: 9 times
ost1: 0 times
ost2: 0 times
ost3: 0 times
ost4: 0 times
Status: PASS: rc=0
PASS failover_mds (84837s) |
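For reference, the failovers observed in this passing run sum to the target mentioned earlier: 16 + 7 + 16 + 9 = 48 MDS failovers, which appears to match the expected number of failover cycles for this configuration (86400 s duration / 1800 s failover period = 48).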
| Comment by Di Wang [ 03/Aug/15 ] |
|
The build https://build.hpdd.intel.com/job/lustre-reviews/33759/ is based on master with the patch http://review.whamcloud.com/#/c/15812/ ( |