Details
- Task
- Resolution: Unresolved
- Medium
Description
We should verify that all known fast-recovery features are enabled and working correctly. A number of existing Lustre features could be used (and further improved) to reduce recovery time.
Depending on the IO model of the applications running on the cluster (e.g. shared-file writers in a monolithic MPI application vs. independent "ensemble" processes working on their own files and directories), it should be possible to tune recovery to be more responsive, and potentially to avoid waiting for unresponsive clients if they are not using directories or files of interest to the recovered clients. The hard part is determining automatically whether clients have overlapping domains of interest.
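As a rough illustration of the overlap check, here is a minimal sketch in Python. It is not Lustre code: the function name, the `interest` map, and the idea that the server can list the files and directories each client was using are all assumptions made for the example. Under those assumptions, recovery would only need to wait for an unresponsive client if its set of objects intersects the set used by clients that have already reconnected.

```python
from typing import Dict, Set


def clients_worth_waiting_for(
    interest: Dict[str, Set[str]],
    reconnected: Set[str],
) -> Set[str]:
    """Return the unresponsive clients that share objects with reconnected ones.

    `interest` is a hypothetical map from client name to the set of
    files/directories that client was using before the server restart.
    """
    # Union of everything the already-reconnected clients care about.
    active: Set[str] = set()
    for client in reconnected:
        active |= interest.get(client, set())

    # An unresponsive client matters only if its objects overlap that union.
    return {
        client
        for client, objects in interest.items()
        if client not in reconnected and objects & active
    }


# Hypothetical example: c1 and c2 write a shared file, c3 works alone.
sample_interest = {
    "c1": {"/job/shared.out", "/job/c1.log"},
    "c2": {"/job/shared.out"},
    "c3": {"/ensemble/run3/data"},
}
waiting_on = clients_worth_waiting_for(sample_interest, reconnected={"c1"})
```

In this example only `c2` overlaps with the reconnected client `c1`, so `c3` would not need to hold up recovery. A real implementation would have to derive the interest sets from state that survives the restart, which is the open question in this ticket.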
Issue Links
- is related to
  - LU-20135 Client reconnection extending recovery causes unnecessary delays and can lead to extra evictions (Open)
  - LU-18681 Histogram of client reconnection times during recovery (Open)
- is related to
  - LU-13643 FLR3-IM: Immediate write mirroring (Open)
  - LU-10911 FLR2: Read only erasure coding (In Progress)
  - LU-17819 LMR1a: Replicate MGS Services to Multiple MDS (Open)
  - LU-13932 reduce maximum wait_time for MMP recovery (Resolved)