
MDT Device-level Replication/Mirroring

Details

    • Type: New Feature
    • Resolution: Unresolved
    • Priority: Minor

    Description

      During a discussion at lunch today at LUG we were talking about the work being done on DOM/DNE/PFL/FLR. We were also talking about Lustre becoming more than just a scratch file system, and it occurred to me that one thing that really hampers that concept is the vulnerability of metadata in its present state.

      I don't recall metadata replication ever being mentioned, but I think that it would be a valuable feature to explore.


          Activity

            [LU-12310] MDT Device-level Replication/Mirroring
            adilger Andreas Dilger added a comment - edited

            I think that parts of this work could be split into some smaller features/implementation tasks in order to reduce the amount of effort needed to get something usable out of the development:

            • improvements to performance of distributed transactions (LU-7426) so that synchronous/ordered disk transactions are not needed. This would be very useful independent of MDT mirroring to improve creation of remote and striped directories, cross-MDT rename/link, etc.
            • improve handling of distributed recovery when an MDT is offline (e.g. save transaction logs, don't block filesystem access for unrelated MDTs) (LU-9206 ++)
            • fault-tolerance for services that run on MDT0000, such as the quota master, FLDB, MGT, etc.
            • scalability of REMOTE_PARENT_DIR to allow handling more disconnected filesystem objects (LU-10329)
            • mirroring of top-level directories in the filesystem (initially ROOT/, and then the first level of subdirectories below it, etc.) so that the filesystem is "more" available if MDT0000 or other MDTs in a top-level striped directory are unavailable. This would not include mirroring of the regular inodes for files, only the directories themselves. Since the top-level directories change less often than lower-level subdirectories, some extra overhead when creating directories at this level is worthwhile for higher availability.
              • mirrored directories would be similar to striped directories, but each directory entry name could be looked up in at least two different directory shards (e.g. lmv_locate_tgt_by_name(), ...+1, ...+2), depending on replication level, allowing the target to be found even if one MDT is offline (LU-9206); a rough sketch of this fallback follows this list
              • each mirrored directory entry would contain two or more different FIDs referencing inodes on separate MDTs (for subdirectories), or the same FID (for regular files), similar to how a ZFS Block Pointer can hold up to 3 different DVAs (block addresses) referencing copies of the same data
              • each mirrored directory inode would have the full layout of all shards in the directory, and the client can determine which shard to use for lookup
              • updates to the mirrored directory would always need distributed transactions that inserted or removed the redundant dirents together
              • normal DNE distributed transaction recovery would apply to recover incomplete transactions if an MDT is offline during an update
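
            As a rough illustration of the shard-fallback lookup described in the sub-list above, the sketch below (plain C, not real Lustre/LMV code; all names, types, and helpers are hypothetical placeholders) hashes an entry name to a primary shard and probes the +1/+2 replica shards when the MDT backing the primary copy is unavailable:

            /*
             * Hypothetical sketch of the mirrored-directory lookup idea above.
             * A name hashes to a primary shard; if that shard's MDT is offline,
             * the client retries the +1/+2 shards, which hold redundant copies
             * of the same entry. None of these names are real Lustre interfaces.
             */
            #include <stdbool.h>
            #include <stdint.h>
            #include <stdio.h>

            struct mirror_dirent {
                uint64_t fids[3];           /* up to 3 replica FIDs per entry (simplified) */
                const char *name;
            };

            struct mirror_dir {
                unsigned int shard_count;   /* shards in the directory layout */
                unsigned int replica_count; /* redundant copies per entry, e.g. 2 or 3 */
                const int *shard_mdt;       /* MDT index backing each shard */
            };

            /* Stand-in for an MDT availability check; always "up" in this sketch. */
            static bool mdt_is_available(int mdt_index)
            {
                (void)mdt_index;
                return true;
            }

            /* Simple FNV-1a hash standing in for the real directory name hash. */
            static uint32_t name_hash(const char *name)
            {
                uint32_t h = 2166136261u;

                while (*name) {
                    h ^= (unsigned char)*name++;
                    h *= 16777619u;
                }
                return h;
            }

            /*
             * Pick the MDT to send a lookup to: try the primary shard for this
             * name, then fall back to the +1/+2 replica shards if it is offline.
             * Returns the MDT index, or -1 if every replica is unreachable.
             */
            static int mirror_dir_locate(const struct mirror_dir *dir, const char *name)
            {
                uint32_t hash = name_hash(name);
                unsigned int i;

                for (i = 0; i < dir->replica_count; i++) {
                    unsigned int shard = (hash + i) % dir->shard_count;
                    int mdt = dir->shard_mdt[shard];

                    if (mdt_is_available(mdt))
                        return mdt;
                }
                return -1;
            }

            int main(void)
            {
                static const int shard_mdt[] = { 0, 1, 2, 3 };
                struct mirror_dir dir = {
                    .shard_count = 4,
                    .replica_count = 2,
                    .shard_mdt = shard_mdt,
                };

                printf("lookup of \"etc\" goes to MDT%04d\n",
                       mirror_dir_locate(&dir, "etc"));
                return 0;
            }

            In the actual proposal the redundant FIDs would live in the dirent itself (as the mirror_dirent struct above suggests), and all copies would be inserted or removed together under a DNE distributed transaction, per the last two bullets above.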
            adilger Andreas Dilger added a comment - edited

            We've discussed this internally and worked up an initial design, but ended up deciding against that implementation. We have not had the resources to work up a new design or to work on the implementation.

            In the meantime, my recommendation would be to use MD-RAID/dm-mirror (or a hardware-based mirror) for ldiskfs, or a ZFS VDEV mirror, to replicate the MDT storage across nodes (assuming non-shared storage is the goal), and continue to use failover to manage which MDS is exporting the MDT. The MD-RAID or ZFS VDEV would use 2x or 3x devices per mirror (each one in a separate node) and a network storage transport like NVMe-over-Fabrics, SRP, or iSCSI. With modern storage transports, local and remote storage are equally fast, and the upper-layer RAID handles the replication and recovery of the storage devices in the same manner as local disks.

            This approach is described for ldiskfs at High Availability Lustre Using SRP-mirrored LUNs, and similar approaches have been discussed for ZFS, but I don't have a link handy.


            People

              pjones Peter Jones
              jamervi Joe Mervini
              Votes: 0
              Watchers: 11
