Lustre / LU-19066

FLR2: identify rack/PSU failure domains for servers

Details

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Minor
    • Lustre 2.16.0

    Description

      To improve OST object allocation, and to minimize (as much as possible) data, mirror, and EC parity stripes sharing the same fault domain, it would be desirable to identify independent fault domains for Lustre OSTs that the MDS can use during layout creation or extension.

      The MDS will always avoid allocating file stripes on the same OST, unless layout overstriping is used. The MDS can already identify whether OSTs are on the same OSS by comparing the (current) peer NID(s), and should avoid allocating mirror and EC stripes on OSTs sharing the same OSS nodes as the data stripes. The MDS can additionally identify whether OSTs are in the same HA failover domain (controller pair) by comparing the OST's failover NIDs, and should prefer not to allocate mirror and EC stripes in the same HA group, when possible.
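
      As a rough illustration of the grouping the MDS can already derive from connection data, the following sketch (not Lustre code; the OST names, NIDs, and field names are invented) buckets OSTs by their current peer NID (same OSS) and by their failover NID set (same HA controller pair):

        from collections import defaultdict

        # Hypothetical per-OST connection data: current peer NID plus failover NIDs.
        osts = {
            "OST0000": {"nid": "10.0.0.1@tcp", "failover": {"10.0.0.1@tcp", "10.0.0.2@tcp"}},
            "OST0001": {"nid": "10.0.0.1@tcp", "failover": {"10.0.0.1@tcp", "10.0.0.2@tcp"}},
            "OST0002": {"nid": "10.0.0.2@tcp", "failover": {"10.0.0.1@tcp", "10.0.0.2@tcp"}},
            "OST0003": {"nid": "10.0.0.3@tcp", "failover": {"10.0.0.3@tcp", "10.0.0.4@tcp"}},
        }

        # Same OSS: identical current peer NID.
        by_oss = defaultdict(list)
        for name, info in osts.items():
            by_oss[info["nid"]].append(name)

        # Same HA failover domain (controller pair): identical failover NID set.
        by_ha = defaultdict(list)
        for name, info in osts.items():
            by_ha[frozenset(info["failover"])].append(name)

        print(dict(by_oss))  # OST0000 and OST0001 share an OSS
        print(dict(by_ha))   # OST0000..OST0002 share an HA failover group

      The allocator would then treat membership in the same bucket as a (soft) conflict when placing mirror or EC stripes relative to the data stripes.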

      To maximize file-level availability (e.g. rack-level fault tolerance) and data redundancy it is necessary to provide additional fault domain information to the MDS that Lustre cannot determine for itself. One option would be to add a "domain" tunable parameter (possibly a rack unit with the same PDU/switch, or something else) that can be set on an OST (and passed to the MDS/client with the connection data) that indicates which fault domain an OST is in (and presumably cannot escape from due to hardware-level constraints).

      Then, when the MDS is allocating OST objects for an FLR-EC file, it should preferentially select OSTs with different domain values for each data and EC/mirror stripe (within the other constraints of pool, stripe_count, space, etc). Similarly, when extending a RAID-0 file with an EC/mirror component, the MDS should prefer to add stripes on OSTs that do not overlap with the same domain as the data stripes as much as possible.
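
      A minimal sketch of that preference, assuming each OST reports an integer fault_domain value (the values, OST names, and the pick_osts() helper below are hypothetical): pick one OST per unused domain first, and only reuse a domain when no unused domain remains.

        # Hypothetical fault_domain value per OST (e.g. from a per-target tunable).
        fault_domain = {
            "OST0000": 1, "OST0001": 1, "OST0002": 2, "OST0003": 2,
            "OST0004": 3, "OST0005": 3, "OST0006": 4, "OST0007": 4,
        }

        def pick_osts(candidates, count, used_domains=frozenset()):
            """Greedy sketch: prefer OSTs whose fault_domain is not yet used,
            falling back to domain reuse only when unavoidable."""
            chosen, seen = [], set(used_domains)
            for ost in candidates:                      # one OST per unused domain
                if len(chosen) == count:
                    break
                if fault_domain[ost] not in seen:
                    chosen.append(ost)
                    seen.add(fault_domain[ost])
            for ost in candidates:                      # fill the rest, reusing domains
                if len(chosen) == count:
                    break
                if ost not in chosen:
                    chosen.append(ost)
            return chosen

        data = pick_osts(sorted(fault_domain), 4)
        parity = pick_osts([o for o in sorted(fault_domain) if o not in data], 2,
                           used_domains={fault_domain[o] for o in data})
        print(data, parity)   # parity must reuse domains here: only 4 domains exist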

      In some cases, pre-existing RAID-0 files may not have stripes on different fault domains, nor even on different OSS nodes (depending on OST availability and capacity at the time of allocation). With smaller OST counts it may not be possible to fully isolate the data and parity stripes into different top-level fault domains, but the MDS object allocator should strive to minimize the overlap of fault domains, HA failover groups, and OSS nodes, within the constraints of OST pools (e.g. NVMe vs. HDD).
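
      One way to express that "minimize the overlap" preference is a tiered penalty, weighting a shared fault domain more heavily than a shared HA group, and that more heavily than a shared OSS. The topology, weights, and names below are illustrative, not an actual allocator interface:

        # ost: (oss, ha_group, fault_domain) -- invented topology
        topo = {
            "OST0000": ("oss1", "ha1", 1), "OST0001": ("oss1", "ha1", 1),
            "OST0002": ("oss2", "ha1", 1), "OST0003": ("oss3", "ha2", 2),
            "OST0004": ("oss4", "ha2", 2), "OST0005": ("oss5", "ha3", 3),
        }
        W_DOMAIN, W_HA, W_OSS = 100, 10, 1

        def overlap_penalty(candidate, data_osts):
            oss, ha, dom = topo[candidate]
            return (W_DOMAIN * (dom in {topo[o][2] for o in data_osts}) +
                    W_HA     * (ha  in {topo[o][1] for o in data_osts}) +
                    W_OSS    * (oss in {topo[o][0] for o in data_osts}))

        data_stripes = ["OST0000", "OST0001"]   # pre-existing RAID-0 stripes
        candidates = [o for o in topo if o not in data_stripes]
        ranked = sorted(candidates, key=lambda o: overlap_penalty(o, data_stripes))
        print([(o, overlap_penalty(o, data_stripes)) for o in ranked])
        # OST0003/OST0004/OST0005 (no overlap) rank ahead of OST0002,
        # which shares both the fault domain and the HA group with the data.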

      There may also be some benefit even from EC parity that overlaps the same OSTs as the data stripes, if no other options exist. It would still be possible to achieve resilience against any single-OST failure with 2-parity EC. For example, d=[0,1,2,3,4,5,6,7] and p=[0,4] would allow any single OST to be unavailable, or any two OSTs other than [0,4], while still maintaining data accessibility and reconstruction. Using 4-parity EC in this example (e.g. p=[0,2,4,6]) would allow any 2 OSTs to be unavailable, or any one of p=[0,2,4,6] plus any two of d=[1,3,5,7], or all four of d=[1,3,5,7], at 50% data overhead vs. 100% overhead from mirroring, though this is still only half as efficient as having the data and parity stripes on disjoint OSTs/domains.
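
      The arithmetic above can be checked mechanically: with 8 data stripes on OSTs 0..7 and k parity stripes co-located on a subset of those OSTs, a set of failed OSTs is survivable exactly when the number of lost stripes (data plus co-located parity) does not exceed k. A small sketch, with placements taken from the example above:

        from itertools import combinations

        def survives(failed, data, parity):
            lost = len(failed & data) + len(failed & parity)
            return lost <= len(parity)

        data = set(range(8))
        for parity in ({0, 4}, {0, 2, 4, 6}):
            tolerated = 0
            for n in range(1, 9):
                if all(survives(set(f), data, parity) for f in combinations(range(8), n)):
                    tolerated = n
                else:
                    break
            print(sorted(parity), "tolerates any", tolerated, "OST failure(s)")

        # Specific cases with p=[0,2,4,6]: one parity OST plus data-only OSTs.
        print(survives({0, 1, 3}, data, {0, 2, 4, 6}))     # True: 4 stripes lost
        print(survives({0, 1, 3, 5}, data, {0, 2, 4, 6}))  # False: 5 stripes lost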

      For FLR-EC with widely-striped files, the EC parity stripes only need to be on unique OSTs within a single RAID-Set (e.g. 8+2), not necessarily across all OSTs in the filesystem. This would allow a 40-stripe file on a 40-OST filesystem (e.g. 5 controllers x (4 OSS x 2 OSTs/OSS)/controller) to still add robust EC by ensuring that the 2 EC parity stripes do not overlap with the 8 data stripes in their own RAID-Set, even though the EC OSTs may overlap with data stripes belonging to other RAID-Sets within that file.
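
      A sketch of that per-RAID-Set constraint (the OST numbering and the parity_for() helper are illustrative): each 8-OST RAID-Set only needs its 2 parity OSTs to sit outside its own 8 data OSTs, even if those parity OSTs carry data stripes of other RAID-Sets.

        data_osts = list(range(40))                     # 40 data stripes on 40 OSTs
        raid_sets = [data_osts[i:i + 8] for i in range(0, 40, 8)]

        def parity_for(rset, all_osts, nparity=2):
            """Pick parity OSTs outside this RAID-Set's own data OSTs; they may
            still hold data stripes belonging to *other* RAID-Sets of the file."""
            return [o for o in all_osts if o not in rset][:nparity]

        for rset in raid_sets:
            p = parity_for(rset, data_osts)
            assert not set(p) & set(rset)               # disjoint within the RAID-Set
            print("data OSTs", rset[0], "..", rset[-1], "-> parity OSTs", p)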

      Attachments

        Issue Links

          Activity

            [LU-19066] FLR2: identify rack/PSU failure domains for servers

            adilger Andreas Dilger added a comment -

            Nodemaps are for managing client ID maps and authentication/permission controls. I don't think they have much relation to target fault domains.

            I would think that each target would have an {obdfilter,mdt}.*.fault_domain tunable parameter, and setting it either via an explicit "lctl set_param -P" or as a formatting option stored in the local target config mount data would be reasonable.

            It could be a simple integer, or we could do something more complex like "rack, row, dc" but I don't think we need that yet.


            paf0186 Patrick Farrell added a comment -

            Are we thinking the failure domain would just be a parameter set on the MDS? Or should it be stored somewhere else, e.g. nodemap? (nodemap seems like it might be a good candidate?)

            I'm also thinking maybe we don't try to autodetect anything related to failover pairs, since that adds to the complexity, but...


            People

              Assignee: wc-triage WC Triage
              Reporter: adilger Andreas Dilger
              Votes: 0
              Watchers: 4

              Dates

                Created:
                Updated: