Lustre / LU-19066

FLR2: identify rack/PSU failure domains for servers

Details

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Minor
    • Lustre 2.16.0

    Description

      To improve OST object allocation, and to minimize (as much as possible) data, mirror, and EC parity stripes sharing the same fault domain, it would be desirable to identify independent fault domains for Lustre OSTs that the MDS can use during layout creation or extension.

      The MDS will always avoid allocating file stripes on the same OST, unless layout overstriping is used. The MDS can already identify whether OSTs are on the same OSS by comparing the (current) peer NID(s), and should avoid allocating mirror and EC stripes on OSTs sharing the same OSS nodes as the data stripes. The MDS can additionally identify whether OSTs are in the same HA failover domain (controller pair) by comparing the OST's failover NIDs, and should prefer not to allocate mirror and EC stripes in the same HA group, when possible.
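
      As a rough illustration of the grouping the MDS can already derive from connection data, the following sketch (not Lustre code; the OST names, NIDs, and field names are invented) buckets OSTs by their current peer NID (same OSS) and by their failover NID set (same HA controller pair):

        from collections import defaultdict

        # Hypothetical per-OST connection data: current peer NID plus failover NIDs.
        osts = {
            "OST0000": {"nid": "10.0.0.1@tcp", "failover": {"10.0.0.1@tcp", "10.0.0.2@tcp"}},
            "OST0001": {"nid": "10.0.0.1@tcp", "failover": {"10.0.0.1@tcp", "10.0.0.2@tcp"}},
            "OST0002": {"nid": "10.0.0.2@tcp", "failover": {"10.0.0.1@tcp", "10.0.0.2@tcp"}},
            "OST0003": {"nid": "10.0.0.3@tcp", "failover": {"10.0.0.3@tcp", "10.0.0.4@tcp"}},
        }

        # Same OSS: identical current peer NID.
        by_oss = defaultdict(list)
        for name, info in osts.items():
            by_oss[info["nid"]].append(name)

        # Same HA failover domain (controller pair): identical failover NID set.
        by_ha = defaultdict(list)
        for name, info in osts.items():
            by_ha[frozenset(info["failover"])].append(name)

        print(dict(by_oss))  # OST0000 and OST0001 share an OSS
        print(dict(by_ha))   # OST0000..OST0002 share an HA failover group

      The allocator would then treat membership in the same bucket as a (soft) conflict when placing mirror or EC stripes relative to the data stripes.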

      To maximize file-level availability (e.g. rack-level fault tolerance) and data redundancy it is necessary to provide additional fault domain information to the MDS that Lustre cannot determine for itself. One option would be to add a "domain" tunable parameter (possibly a rack unit with the same PDU/switch, or something else) that can be set on an OST (and passed to the MDS/client with the connection data) that indicates which fault domain an OST is in (and presumably cannot escape from due to hardware-level constraints).

      Then, when the MDS is allocating OST objects for an FLR-EC file, it should preferentially select OSTs with different domain values for each data and EC/mirror stripe (within the other constraints of pool, stripe_count, space, etc). Similarly, when extending a RAID-0 file with an EC/mirror component, the MDS should prefer to add stripes on OSTs that do not overlap with the same domain as the data stripes as much as possible.
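
      A minimal sketch of that preference, assuming each OST reports an integer fault_domain value (the values, OST names, and the pick_osts() helper below are hypothetical): pick one OST per unused domain first, and only reuse a domain when no unused domain remains.

        # Hypothetical fault_domain value per OST (e.g. from a per-target tunable).
        fault_domain = {
            "OST0000": 1, "OST0001": 1, "OST0002": 2, "OST0003": 2,
            "OST0004": 3, "OST0005": 3, "OST0006": 4, "OST0007": 4,
        }

        def pick_osts(candidates, count, used_domains=frozenset()):
            """Greedy sketch: prefer OSTs whose fault_domain is not yet used,
            falling back to domain reuse only when unavoidable."""
            chosen, seen = [], set(used_domains)
            for ost in candidates:                      # one OST per unused domain
                if len(chosen) == count:
                    break
                if fault_domain[ost] not in seen:
                    chosen.append(ost)
                    seen.add(fault_domain[ost])
            for ost in candidates:                      # fill the rest, reusing domains
                if len(chosen) == count:
                    break
                if ost not in chosen:
                    chosen.append(ost)
            return chosen

        data = pick_osts(sorted(fault_domain), 4)
        parity = pick_osts([o for o in sorted(fault_domain) if o not in data], 2,
                           used_domains={fault_domain[o] for o in data})
        print(data, parity)   # parity must reuse domains here: only 4 domains exist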

      In some cases, pre-existing RAID-0 files may not have stripes on different fault domains, nor even on different OSS nodes (depending on OST availability and capacity at the time of allocation). With smaller OST counts it may not be possible to fully isolate the data and parity stripes into different top-level fault domains, but the MDS object allocator should strive to minimize the overlap of fault domains, HA failover groups, and OSS nodes, within the constraints of OST pools (e.g. NVMe vs. HDD).
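
      One way to express that "minimize the overlap" preference is a tiered penalty, weighting a shared fault domain more heavily than a shared HA group, and that more heavily than a shared OSS. The topology, weights, and names below are illustrative, not an actual allocator interface:

        # ost: (oss, ha_group, fault_domain) -- invented topology
        topo = {
            "OST0000": ("oss1", "ha1", 1), "OST0001": ("oss1", "ha1", 1),
            "OST0002": ("oss2", "ha1", 1), "OST0003": ("oss3", "ha2", 2),
            "OST0004": ("oss4", "ha2", 2), "OST0005": ("oss5", "ha3", 3),
        }
        W_DOMAIN, W_HA, W_OSS = 100, 10, 1

        def overlap_penalty(candidate, data_osts):
            oss, ha, dom = topo[candidate]
            return (W_DOMAIN * (dom in {topo[o][2] for o in data_osts}) +
                    W_HA     * (ha  in {topo[o][1] for o in data_osts}) +
                    W_OSS    * (oss in {topo[o][0] for o in data_osts}))

        data_stripes = ["OST0000", "OST0001"]   # pre-existing RAID-0 stripes
        candidates = [o for o in topo if o not in data_stripes]
        ranked = sorted(candidates, key=lambda o: overlap_penalty(o, data_stripes))
        print([(o, overlap_penalty(o, data_stripes)) for o in ranked])
        # OST0003/OST0004/OST0005 (no overlap) rank ahead of OST0002,
        # which shares both the fault domain and the HA group with the data.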

      There may also be some benefit even from EC parity that overlaps the same OSTs as the data stripes, if no other options exist. It would still be possible to achieve resilience against any single-OST failure with 2-parity EC. For example, d=[0,1,2,3,4,5,6,7] and p=[0,4] would allow any single OST to be unavailable, or any two OSTs other than [0,4], while still maintaining data accessibility and reconstruction. Using 4-parity EC in this example (e.g. p=[0,2,4,6]) would allow any 2 OSTs to be unavailable, or any one of p=[0,2,4,6] plus any two of d=[1,3,5,7], or all four of d=[1,3,5,7], at 50% data overhead vs. 100% overhead from mirroring, though this is still only half as efficient as having the data and parity stripes on disjoint OSTs/domains.
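
      The arithmetic above can be checked mechanically: with 8 data stripes on OSTs 0..7 and k parity stripes co-located on a subset of those OSTs, a set of failed OSTs is survivable exactly when the number of lost stripes (data plus co-located parity) does not exceed k. A small sketch, with placements taken from the example above:

        from itertools import combinations

        def survives(failed, data, parity):
            lost = len(failed & data) + len(failed & parity)
            return lost <= len(parity)

        data = set(range(8))
        for parity in ({0, 4}, {0, 2, 4, 6}):
            tolerated = 0
            for n in range(1, 9):
                if all(survives(set(f), data, parity) for f in combinations(range(8), n)):
                    tolerated = n
                else:
                    break
            print(sorted(parity), "tolerates any", tolerated, "OST failure(s)")

        # Specific cases with p=[0,2,4,6]: one parity OST plus data-only OSTs.
        print(survives({0, 1, 3}, data, {0, 2, 4, 6}))     # True: 4 stripes lost
        print(survives({0, 1, 3, 5}, data, {0, 2, 4, 6}))  # False: 5 stripes lost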

      For FLR-EC with widely-striped files, the EC parity stripes only need to be on unique OSTs within a single RAID-Set (e.g. 8+2), not necessarily across all OSTs in the filesystem. This would allow a 40-stripe file on a 40-OST filesystem (e.g. 5 controllers x (4 OSS x 2 OSTs/OSS)/controller) to still add robust EC by ensuring that the 2 EC parity stripes do not overlap with the 8 data stripes in their own RAID-Set, even though the EC OSTs may overlap with data stripes belonging to other RAID-Sets within that file.
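
      A sketch of that per-RAID-Set constraint (the OST numbering and the parity_for() helper are illustrative): each 8-OST RAID-Set only needs its 2 parity OSTs to sit outside its own 8 data OSTs, even if those parity OSTs carry data stripes of other RAID-Sets.

        data_osts = list(range(40))                     # 40 data stripes on 40 OSTs
        raid_sets = [data_osts[i:i + 8] for i in range(0, 40, 8)]

        def parity_for(rset, all_osts, nparity=2):
            """Pick parity OSTs outside this RAID-Set's own data OSTs; they may
            still hold data stripes belonging to *other* RAID-Sets of the file."""
            return [o for o in all_osts if o not in rset][:nparity]

        for rset in raid_sets:
            p = parity_for(rset, data_osts)
            assert not set(p) & set(rset)               # disjoint within the RAID-Set
            print("data OSTs", rset[0], "..", rset[-1], "-> parity OSTs", p)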

      Attachments

        Issue Links

          Activity

            [LU-19066] FLR2: identify rack/PSU failure domains for servers

            adilger Andreas Dilger added a comment -

            Nodemaps are for managing client ID maps and authentication/permission controls. I don't think they have much relation to target fault domains.

            I would think that each target would have an {obdfilter,mdt}.*.fault_domain tunable parameter, and setting it either via an explicit "lctl set_param -P" or as a formatting option stored in the local target config mount data would be reasonable.

            It could be a simple integer, or we could do something more complex like "rack, row, dc" but I don't think we need that yet.


            paf0186 Patrick Farrell added a comment -

            Are we thinking the failure domain would just be a parameter set on the MDS? Or should it be stored somewhere else, e.g. nodemap? (nodemap seems like it might be a good candidate?)

            I'm also thinking maybe we don't try to autodetect anything related to failover pairs, since that adds to the complexity, but...


            People

              Assignee: wc-triage WC Triage
              Reporter: adilger Andreas Dilger
              Votes: 0
              Watchers: 4

              Dates

                Created:
                Updated: