[LU-2605] reduce the failover time when the MDS and OSS are on the same machine in High Availability Created: 11/Jan/13 Updated: 11/Jun/20 Resolved: 11/Jun/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.0 |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Trivial |
| Reporter: | gmsw | Assignee: | WC Triage |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | server | | |
| Environment: | Linux, 2 physical servers |
| Issue Links: | |
| Rank (Obsolete): | 6074 |
| Description |
|
If the 1st server fails, MDS1 and OSS1 fail at the same time. The failover to the 2nd node then waits an additional 5 minutes, because the relocated OST waits out the recovery timeout for the failed MDS1 client. This case of 2 servers with failover and 2 active OSSs is important for small configurations. (BULL) |
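For context, here is a minimal sketch of the two-node, cross-failover layout described above, with each target able to mount on either node. The filesystem name, NIDs, block devices, and mount points are hypothetical examples, not taken from the ticket:

```
# Node 1 normally runs MDT0000 and OST0000; node 2 normally runs OST0001.
# Each target lists both nodes as service nodes so it can fail over to the peer.
# NIDs and devices below are illustrative only.
mkfs.lustre --fsname=testfs --mgs --mdt --index=0 \
    --servicenode=192.168.1.1@tcp --servicenode=192.168.1.2@tcp /dev/mapper/mdt0
mkfs.lustre --fsname=testfs --ost --index=0 \
    --mgsnode=192.168.1.1@tcp --mgsnode=192.168.1.2@tcp \
    --servicenode=192.168.1.1@tcp --servicenode=192.168.1.2@tcp /dev/mapper/ost0
mkfs.lustre --fsname=testfs --ost --index=1 \
    --mgsnode=192.168.1.1@tcp --mgsnode=192.168.1.2@tcp \
    --servicenode=192.168.1.1@tcp --servicenode=192.168.1.2@tcp /dev/mapper/ost1

# Normal operation:
#   node 1: mount -t lustre /dev/mapper/mdt0 /mnt/mdt
#           mount -t lustre /dev/mapper/ost0 /mnt/ost0
#   node 2: mount -t lustre /dev/mapper/ost1 /mnt/ost1
# If node 1 dies, node 2 also mounts mdt0 and ost0; the ticket reports that
# recovery of the moved OST then stalls waiting for the (also failed) MDS1 client.
```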
| Comments |
| Comment by Andreas Dilger [ 11/Jan/13 ] |
|
This is somewhat expected behaviour if the MDT and OST are on the same node, which is why we do not recommend this configuration. However, the MDT connection to OST1 (regardless of which node it is running on) should always use the same connection UUID even after a reboot (unlike a client node, which always gets a new UUID for each mount), so the failed-over MDT should be able to participate in recovery on node 2 as well. Do you mount the MDT on node 2 at the same time as OST1? Do you also have a Lustre client mounted locally on the MDS/OSS server node? That would be a more likely cause of this recovery slowdown. Does the slow recovery also happen if you restart the MDT and OST on node 1 again, instead of failing over to node 2? Having your actual console logs from during the recovery would make this clearer. |
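As a rough illustration of how one might observe this during failover, the standard recovery_status parameters on the failover node show which clients recovery is still waiting for; the target names here are examples:

```
# Run on the failover node while the moved targets are recovering.
# Target names (testfs-MDT0000, testfs-OST0000) are illustrative.
lctl get_param mdt.testfs-MDT0000.recovery_status
lctl get_param obdfilter.testfs-OST0000.recovery_status
# The output lists connected vs. completed clients and the time remaining,
# which should make it clear whether recovery is waiting on a stale MDS export
# or on a locally mounted client from the failed node.
```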
| Comment by Johann Lombardi (Inactive) [ 15/Jan/13 ] |
|
This ticket is actually related to a question I got during a training class (i.e. is it safe to run the MDT on the same node as OSTs?). |
| Comment by Andreas Dilger [ 11/Jun/20 ] |
|
This should be fixed with patch https://review.whamcloud.com/36025 |