Details
- Type: Epic
- Resolution: Unresolved
- Priority: Minor
Description
FLR currently implements delayed file write mirroring: the initial write goes to a single mirror, and an external tool eventually synchronizes the data to the other mirror(s) of the file. If the file is later modified, every mirror component other than the one modified is marked stale and must be resynchronized by the external tool. This mechanism saves client bandwidth and still provides data availability, unless the file is lost immediately after it is written (that is, before the other mirrors have been synchronized).
However, if the file is modified afterward, resyncing the mirrors of the modified component may cause a large amount of write amplification, and if the file is continuously being modified the stale mirrors may never be resynced at all. It would be preferable to implement immediate file write mirroring, so that the client submits the same page in multiple RPCs to different OST objects and keeps all mirrors updated concurrently.
Design Summary
Immediate write mirroring (IWM) adds a per-component LCME_FL_IMMEDIATE flag. When set, clients duplicate all writes to all mirrors in real time, providing immediate data redundancy without requiring external resync.
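As a rough illustration of what duplicating a write across mirrors involves when the mirrors have different stripe geometry, the following sketch maps one file-level extent onto per-mirror object writes using ordinary RAID-0 round-robin striping arithmetic. It is only a byte-extent model under stated assumptions: `MirrorLayout`, `split_write`, and `fan_out` are hypothetical names invented for this example, and the real fan-out operates on page arrays at the BRW/RPC layer rather than on byte ranges.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MirrorLayout:
    """Hypothetical stand-in for one mirror's RAID-0 stripe geometry."""
    name: str
    stripe_size: int   # bytes per stripe chunk
    stripe_count: int  # number of OST objects in this mirror


def split_write(layout, offset, length):
    """Split a file extent into (object_index, object_offset, length) chunks."""
    chunks = []
    end = offset + length
    while offset < end:
        chunk_no = offset // layout.stripe_size      # global chunk number
        within = offset % layout.stripe_size         # offset inside that chunk
        obj_index = chunk_no % layout.stripe_count   # round-robin object choice
        obj_offset = (chunk_no // layout.stripe_count) * layout.stripe_size + within
        n = min(layout.stripe_size - within, end - offset)
        chunks.append((obj_index, obj_offset, n))
        offset += n
    return chunks


def fan_out(mirrors, offset, length):
    """Duplicate one file-level write into per-mirror object writes."""
    return {m.name: split_write(m, offset, length) for m in mirrors}


MIB = 1 << 20
mirrors = [
    MirrorLayout("primary", stripe_size=1 * MIB, stripe_count=2),
    MirrorLayout("secondary", stripe_size=4 * MIB, stripe_count=1),
]
# A 1 MiB write starting at file offset 512 KiB.
plan = fan_out(mirrors, offset=MIB // 2, length=MIB)
```

In this example the same write crosses a stripe boundary on the primary (two object writes of 512 KiB each) but lands in a single object on the secondary, which is why the duplication cannot simply replay the primary's per-object RPCs against the other mirror.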
The core mechanism is the Active Writer (AW) lock — a CW lock on a new ACTIVE_WRITERS IBITs bit on the MDS, separate from the layout lock. Multiple clients hold it concurrently while writing. The AW lock defines a write epoch: the period during which mirrors are being actively written. Key properties:
- INFLIGHT flag: Secondary mirrors are marked LCME_FL_INFLIGHT (not STALE) during writes, making them unreadable but distinguishing active writes from errors. A single primary mirror remains readable during the epoch.
- IO duplication: Writes are duplicated at the BRW/RPC layer via a fan-out from the primary mirror's assembled page array. This supports heterogeneous mirror layouts (different stripe count/size across mirrors).
- Write ordering: Enforced by requiring client-side LDLM extent locks on the primary mirror for all IO, including DIO. This ensures consistent write order across mirrors.
- Error reporting: Per-mirror errors are reported to the MDS via a lock value block (LVB) on AW lock cancellation. The MDS clears INFLIGHT from clean mirrors and sets STALE on failed ones at epoch close.
- Epoch close: When all clients release their AW locks (or the MDS forces release via EX mode), the MDS transitions the layout back to RDONLY. Forced close is used on error or for administrative operations (resync, mirror replacement).
- MDS recovery: A durable epoch FID set tracks files with active write epochs, enabling safe epoch close after MDS failover even when clients are evicted.
- Fast-fail on secondaries: Secondary mirror write failures do not block the application — the primary succeeds and the failed mirror degrades to delayed replication (marked STALE at epoch close).
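The epoch lifecycle described in the properties above can be modelled as a small state machine. This is a simplified sketch, not the MDS implementation: `WriteEpoch`, `MirrorFlag`, and the method names are hypothetical, forced close and recovery are left out, and the model assumes that a mirror already marked STALE before the epoch opens stays STALE.

```python
from enum import Flag, auto


class MirrorFlag(Flag):
    NONE = 0
    INFLIGHT = auto()  # stands in for LCME_FL_INFLIGHT
    STALE = auto()     # stands in for the existing stale flag


class WriteEpoch:
    """Toy model of one file's write epoch as tracked on the MDS."""

    def __init__(self, mirror_ids, primary_id):
        self.primary_id = primary_id
        self.flags = {m: MirrorFlag.NONE for m in mirror_ids}
        self.holders = set()   # clients currently holding the AW lock
        self.failed = set()    # mirrors reported as failed during this epoch

    @property
    def is_open(self):
        return bool(self.holders)

    def acquire(self, client):
        """First AW lock opens the epoch and marks in-sync secondaries INFLIGHT."""
        if not self.holders:
            for m, flag in self.flags.items():
                if m != self.primary_id and flag == MirrorFlag.NONE:
                    self.flags[m] = MirrorFlag.INFLIGHT
        self.holders.add(client)

    def release(self, client, failed_mirrors=()):
        """AW lock cancel: record per-mirror errors, close when the last holder leaves."""
        if client not in self.holders:
            raise KeyError(f"{client!r} does not hold the AW lock")
        self.holders.remove(client)
        self.failed.update(failed_mirrors)
        if not self.holders:
            self._close()

    def _close(self):
        """Epoch close: clean mirrors lose INFLIGHT, failed ones become STALE."""
        for m, flag in self.flags.items():
            if flag == MirrorFlag.INFLIGHT:
                self.flags[m] = MirrorFlag.STALE if m in self.failed else MirrorFlag.NONE
        self.failed.clear()

    def readable(self, mirror_id):
        return self.flags[mirror_id] == MirrorFlag.NONE


epoch = WriteEpoch(mirror_ids=[1, 2, 3], primary_id=1)
epoch.acquire("client-a")
epoch.acquire("client-b")
epoch.release("client-a", failed_mirrors=[3])  # client-a saw a write error on mirror 3
epoch.release("client-b")                      # last holder leaves, epoch closes
```

After the final release in this example, mirror 2 is readable again and mirror 3 is left STALE for delayed resync, while the primary stayed readable throughout.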
The AW lock is deliberately separate from the layout lock: layout changes (component instantiation, SEL extension) do not close the write epoch, avoiding expensive flushes at component boundaries. The epoch spans layout changes naturally.
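One way to see why a separate inodebits bit decouples the two locks: under the usual rule for inodebits locks, two locks conflict only when their bit masks overlap and their modes are incompatible. The sketch below encodes that rule with the standard DLM mode compatibility table (NL omitted). The `Lock` class and `conflicts` helper are hypothetical, and both bit values are illustrative placeholders rather than values taken from the design.

```python
from dataclasses import dataclass

# Standard DLM compatibility: for each mode, the modes it can coexist with.
COMPAT = {
    "CR": {"CR", "CW", "PR", "PW"},
    "CW": {"CR", "CW"},
    "PR": {"CR", "PR"},
    "PW": {"CR"},
    "EX": set(),
}

# Illustrative bit values only; the real ACTIVE_WRITERS bit is defined by the design.
LAYOUT = 0x08
ACTIVE_WRITERS = 0x100


@dataclass(frozen=True)
class Lock:
    mode: str
    bits: int


def conflicts(a, b):
    """Inodebits locks conflict only if bits overlap and modes are incompatible."""
    return bool(a.bits & b.bits) and b.mode not in COMPAT[a.mode]


aw_client1 = Lock("CW", ACTIVE_WRITERS)
aw_client2 = Lock("CW", ACTIVE_WRITERS)
layout_change = Lock("EX", LAYOUT)
forced_close = Lock("EX", ACTIVE_WRITERS)

print(conflicts(aw_client1, aw_client2))     # CW vs CW on the same bit
print(conflicts(aw_client1, layout_change))  # EX on a different bit
print(conflicts(aw_client1, forced_close))   # EX on the AW bit
```

In this model, concurrent writers coexist because CW is compatible with CW, an EX request on the layout bit leaves the AW locks alone, and only an EX request on the ACTIVE_WRITERS bit forces the writers to release.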
IWM is the foundation for immediate erasure coding — the write duplication infrastructure, AW lock protocol, epoch management, page consistency mechanisms, and error reporting all carry over. Immediate EC will be covered in a separate design document.
References
- Design document: Immediate Mirror Design
- Development epic: EX-12429
- Requirement: EXR-687
- Fault domain mapping: LU-19066