[LU-12310] MDT Device-level Replication/Mirroring Created: 16/May/19  Updated: 15/Apr/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: New Feature Priority: Minor
Reporter: Joe Mervini Assignee: Peter Jones
Resolution: Unresolved Votes: 0
Labels: LMR

Attachments: Microsoft PowerPoint Lustre_Metadata_Redundancy-202112.pptx    
Issue Links:
Related
is related to LU-10329 DNE3: REMOTE_PARENT_DIR scalability Open
is related to LU-7426 DNE3: Current llog format for remote ... Open
is related to LU-4215 Some expected improvements for OUT Open
is related to LU-7319 OUT: continue updates processing upon... Open
is related to LU-7427 DNE3: multiple entries for BATCHID Open
is related to LU-9206 DNE - allow partial access to striped... Resolved
is related to LU-16742 e2fsck/ldiskfs/ZFS to allow multiple ... Open
Rank (Obsolete): 9223372036854775807

 Description   

During a discussion at lunch today at LUG we were talking about the work being done on DOM/DNE/PFL/FLR. We were also talking about Lustre becoming more than just a scratch file system, and it occurred to me that one thing that really hampers that concept is the vulnerability of metadata in its present state.

I don't recall metadata replication ever being mentioned, but I think it would be a valuable feature to explore.



 Comments   
Comment by Andreas Dilger [ 16/May/19 ]

We've discussed this internally and worked up an initial design, but ended up deciding against that implementation. We have not had the resources to work up a new design or to work on the implementation.

In the meantime, my recommendation would be to use MD-RAID/dm-mirror (or a hardware-based mirror) for ldiskfs, or a ZFS VDEV mirror, to replicate the MDT storage across nodes (assuming non-shared storage is the goal), and continue to use failover to manage which MDS is exporting the MDT. The MD-RAID or ZFS VDEV would use 2x or 3x devices per mirror (each one in a separate node), and a network storage transport like NVMe-over-Fabrics, SRP, or iSCSI. With modern storage transports, local storage and remote storage are equally fast, and the upper-layer RAID handles the replication and recovery of the storage devices in the same manner as local disks.
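A rough sketch of that setup for a two-MDS pair, assuming NVMe-over-Fabrics as the transport; all device names, NQNs, addresses, and NIDs below are illustrative assumptions, not values from this ticket:

```shell
# Sketch only -- hostnames, NQN, addresses, and device paths are assumptions.
# On mds01: attach the peer node's NVMe namespace over fabrics (nvme-cli)
nvme connect -t rdma -a 192.168.1.2 -s 4420 -n nqn.2019-05.example:mds02-mdt0

# Mirror the local device with the remote one using MD-RAID level 1
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1

# Format the mirrored device as an ldiskfs MDT, declaring both MDS nodes
# as service nodes so failover can move the MDT between them
mkfs.lustre --mdt --fsname=testfs --index=0 \
    --servicenode=mds01@o2ib --servicenode=mds02@o2ib /dev/md0

# ZFS alternative: a mirrored vdev over the same pair of devices
# zpool create mdt0pool mirror /dev/nvme0n1 /dev/nvme1n1
# mkfs.lustre --mdt --backfstype=zfs --fsname=testfs --index=0 mdt0pool/mdt0
```

Either way, the replication is handled below Lustre, so the failover MDS sees an up-to-date copy of the MDT without any Lustre-level mirroring.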

This approach is described for ldiskfs at High Availability Lustre Using SRP-mirrored LUNs, and similar approaches have been discussed for ZFS, but I don't have a link handy.

Comment by Andreas Dilger [ 15/Dec/21 ]

I think that parts of this work could be split into some smaller features/implementation tasks in order to reduce the amount of effort needed to get something usable out of the development:

  • improvements to performance of distributed transactions (LU-7426) so that synchronous/ordered disk transactions are not needed. This would be very useful independent of MDT mirroring to improve creation of remote and striped directories, cross-MDT rename/link, etc.
  • improve handling of distributed recovery when an MDT is offline (e.g. save transaction logs, don't block filesystem access for unrelated MDTs) (LU-9206 and others)
  • fault-tolerance for services that run on MDT0000, such as the quota master, FLDB, MGT, etc.
  • scalability of REMOTE_PARENT_DIR to allow handling more disconnected filesystem objects (LU-10329)
  • mirroring of top-level directories in the filesystem (initially ROOT/, then the first level of subdirectories below it, etc.) so that the filesystem is "more" available if MDT0000 or other MDTs in a top-level striped directory are unavailable. This would not include mirroring of the regular inodes for files, only the directories themselves. Since the top-level directories change less often than lower-level subdirectories, some extra overhead creating directories at this level is worthwhile for higher availability.
    • mirrored directories would be similar to striped directories, but each directory entry name could be looked up in at least two different directory shards (e.g. lmv_locate_tgt_by_name(), ...+1, ...+2), depending on replication level, allowing the target to be found even if one MDT is offline (LU-9206)
    • each mirrored directory entry would contain two or more different FIDs referencing inodes on separate MDTs (for subdirectories), or the same FID (for regular files), similar to how a ZFS Block Pointer can reference up to three different DVAs (block addresses) that hold copies of the same data
    • each mirrored directory inode would have the full layout of all shards in the directory, and the client can determine which shard to use for lookup
    • updates to the mirrored directory would always require distributed transactions that insert or remove the redundant dirents together
    • normal DNE distributed transaction recovery would apply to recover incomplete transactions if an MDT is offline during an update
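The mirrored-lookup scheme above can be sketched in a few lines. This is a toy simulation, not Lustre code: the class, the CRC-based hash, and the FID strings are all hypothetical stand-ins for the real LMV-layer logic around lmv_locate_tgt_by_name().

```python
# Toy model of redundant directory shards: each name hashes to a primary
# shard, and the same dirent is also inserted on the next replication-1
# shards, so lookup can fall back when an MDT (shard) is offline.
from zlib import crc32


class ShardOffline(Exception):
    pass


class MirroredDir:
    def __init__(self, nshards, replication=2):
        self.nshards = nshards
        self.replication = replication
        self.shards = [dict() for _ in range(nshards)]  # shard -> {name: fid}
        self.offline = set()  # indices of simulated failed MDTs

    def _primary(self, name):
        # stand-in for the real directory-name hash function
        return crc32(name.encode()) % self.nshards

    def insert(self, name, fid):
        # "distributed transaction": insert the dirent on all replica shards
        p = self._primary(name)
        for i in range(self.replication):
            self.shards[(p + i) % self.nshards][name] = fid

    def lookup(self, name):
        # try the primary shard first, then each replica in turn
        p = self._primary(name)
        for i in range(self.replication):
            shard = (p + i) % self.nshards
            if shard not in self.offline:
                return self.shards[shard][name]
        raise ShardOffline(name)
```

With replication=2, taking any single shard offline still lets every name resolve, which is the availability property the bullet points are after; the cost is the distributed transaction on every insert/remove.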
Generated at Sat Feb 10 02:51:27 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.