[LU-13643] FLR3: Immediate file write mirroring Created: 06/Jun/20  Updated: 20/Jul/20

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: New Feature Priority: Minor
Reporter: Andreas Dilger Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
is related to LU-12649 Tracker for ongoing FLR improvements Open

 Description   

FLR currently implements delayed file write mirroring, where the initial write is done to a single mirror, and an external tool eventually synchronizes the data to the other mirror(s) of the file. If the file is later modified, the mirror component(s) other than the one modified are marked stale and need to be resynchronized by the external tool. This mechanism saves client bandwidth and still provides data availability, unless the file is lost immediately after it is written.

However, if the file is modified afterward, resyncing the mirrors of the modified component may cause a large amount of write amplification, or may prevent the stale mirrors from ever being resync'd if the file is continuously being modified. It would be preferable to implement immediate file write mirroring, so that the client can submit the same page in multiple RPCs to different OST objects and keep the mirrors updated concurrently.

Immediate file write (IFW) mirroring may not be desired for all applications, so it should have an LCME_FL_IMMEDIATE flag stored in the component(s), indicating that clients should keep those copies up to date. Having per-component flags will allow configurations where e.g. two flash mirrors are written immediately, but a third/fourth disk mirror uses delayed resync for emergency recovery and/or cold storage (e.g. if the flash mirrors are only short-term hot copies of the file).
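
As a rough illustration of how per-component flags could drive a write, the sketch below models each mirror as a flags word and applies the effect of one write. The names sk_mirror, sk_apply_write and the SK_FL_* values are hypothetical and are not taken from the Lustre headers; SK_FL_STALE and SK_FL_IMMEDIATE only stand in for LCME_FL_STALE and the proposed LCME_FL_IMMEDIATE. It also assumes the mirror chosen for the write is not itself stale.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative bits only; these values are NOT taken from Lustre headers. */
enum sk_flags {
    SK_FL_STALE     = 1u << 0,  /* stands in for LCME_FL_STALE */
    SK_FL_IMMEDIATE = 1u << 1   /* stands in for the proposed LCME_FL_IMMEDIATE */
};

struct sk_mirror {
    uint32_t flags;
};

/*
 * Apply the effect of one client write to the mirror flags.
 * 'primary' is the mirror the client writes in any case.
 * 'client_ifw' is nonzero if the client is able to do immediate mirroring.
 * Returns the number of mirrors that receive the data right away;
 * every other mirror is marked stale for the external resync tool.
 */
size_t sk_apply_write(struct sk_mirror *m, size_t count,
                      size_t primary, int client_ifw)
{
    size_t written = 0;

    assert(primary < count);
    for (size_t i = 0; i < count; i++) {
        int immediate = client_ifw &&
                        (m[i].flags & SK_FL_IMMEDIATE) &&
                        !(m[i].flags & SK_FL_STALE);

        if (i == primary || immediate)
            written++;                  /* kept in sync by this write */
        else
            m[i].flags |= SK_FL_STALE;  /* left for delayed resync */
    }
    return written;
}
```

With two flash mirrors flagged immediate and two disk mirrors left unflagged, a write from an IFW-capable client would update both flash mirrors and mark only the disk mirrors stale. The same write from a client without IFW support falls back to today's behavior of marking every mirror but one stale.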

The IFW client will need to notify the MDS whether it can keep the mirrors in sync; otherwise, the current behavior of marking all but one mirror LCME_FL_STALE when the client writes to the file must be maintained. At the coarsest grain, this means the client needs an OBD_CONNECT2_IMMEDIATE connection-time flag, but it may also be able to indicate its intent at a per-file level (e.g. with the initial write intent) so that it can decide on a case-by-case basis (e.g. remote clients should probably not do double writes).



 Comments   
Comment by Nathan Rutman [ 20/Jul/20 ]

There are a lot of complicated questions related to this feature.
1. Are locks held for each OST mirror? What if the extents don't match? Pages are assumed to be under a single lock at the moment; would now need to be checked against multiple locks. What if one lock is called back, but not the other? Can we use a "primary mirror" as a lock proxy for all other stripes instead? Does that mean layouts have to match?
2. What happens on write failure? Success if successfully written to a majority of mirrors, or EIO if any one of them fails? What happens if one mirror is unresponsive? Choose a new mirror layout? Revert to unmirrored? How long before we give up? How do we track which OST has the latest write? What if the write to mirror2 succeeds, the write to mirror1 fails, and the client dies before retrying – how do we know which copy is correct?
3. How do we detect errors? Verify-on-read everything? What if two mirrors disagree? How do we determine which is correct? Who initiates/executes a reconstruction?
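
To make the write-failure question in item 2 concrete, the two completion policies it names (fail if any mirror fails, or succeed on a majority) can be sketched as below. Neither policy has been chosen in this ticket; sk_policy and sk_write_status are hypothetical names, and the per-mirror result array is an assumed representation.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum sk_policy {
    SK_POLICY_ALL,      /* any single mirror failure fails the write */
    SK_POLICY_MAJORITY  /* more than half of the mirrors must succeed */
};

/*
 * Combine per-mirror results (0 on success, negative errno on failure)
 * into one status for the application: 0 or -EIO.
 */
int sk_write_status(const int *rc, size_t count, enum sk_policy policy)
{
    size_t ok = 0;

    assert(count > 0);
    for (size_t i = 0; i < count; i++)
        if (rc[i] == 0)
            ok++;

    if (policy == SK_POLICY_ALL)
        return ok == count ? 0 : -EIO;
    return ok * 2 > count ? 0 : -EIO;   /* strict majority */
}
```

With two mirrors, a strict majority is the same as requiring both writes to succeed, so the policies only differ for three or more mirrors. The sketch says nothing about which copy is authoritative after a partial failure, which is the harder part of the question.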

I think IFW is a very desirable feature to improve Lustre's data integrity and availability; I'm just concerned that there is a whole lot of detail that needs to be ironed out in an HLD before anything can be implemented.
