Description
Overview
Erasure coding provides a more space-efficient method of adding data redundancy than mirroring, at a somewhat higher computational cost. It would typically be used to add redundancy to large, longer-lived files while minimizing space overhead. For example, a 10+2 erasure-coded layout adds only 20% space overhead while tolerating two OST failures, whereas mirroring adds 100% overhead for single-failure redundancy or 200% for double-failure redundancy. Erasure coding can add redundancy for an arbitrary number of drive failures (e.g. any 3 drives in a group of 16) at a fraction of the overhead.
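The overhead arithmetic above follows directly from the k data / m parity definitions; a minimal sketch (function name hypothetical):

```python
def ec_overhead(k: int, m: int) -> float:
    """Space overhead of a k-data + m-parity layout, as a fraction of data size."""
    return m / k

# 10 data + 2 parity: 20% overhead, survives any 2 OST failures
assert ec_overhead(10, 2) == 0.2
# 2-way mirroring is effectively 1+1: 100% overhead for single-failure redundancy
assert ec_overhead(1, 1) == 1.0
# any 3 failures in a group of 16 (13+3): under 25% overhead
assert abs(ec_overhead(13, 3) - 3 / 13) < 1e-12
```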
The parity stripes are stored in a separate FLR mirror in the file, with a layout (LCME_FL_PARITY flag and LOV_PATTERN_PARITY pattern) that indicates the mirror contains parity data, the number of data and parity stripes, etc. The encoding is similar to RAID-4, with specific "data" stripes (the traditional Lustre RAID-0 layout) in the primary component, and one or more "parity" stripes stored in a separate parity mirror using LOV_PATTERN_RAID0 | LOV_PATTERN_PARITY, unlike RAID-5/6 that have the parity interleaved.
This choice of RAID-4 is essential for a few reasons:
- We need to add parity to existing files without rewriting all the data, which interleaved parity would require.
- We will use FLR state management to track parity staleness. This works naturally with the current code if we place all parity in a separate mirror.
See https://en.wikipedia.org/wiki/Standard_RAID_levels#RAID_4
Each parity component stores the mirror ID of the data mirror it protects via lcme_data_id (stored in the upper 16 bits of the timestamp field). This makes it easy to respect the relationship between data and parity mirrors — for example, ensuring that OSTs used in the data mirror do not overlap with those in the parity mirror within the same RAID set (see OST Allocation below).
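A minimal sketch of that bit packing, assuming a 64-bit timestamp field (the helper names are hypothetical; only lcme_data_id and its upper-16-bit placement come from the design above):

```python
DATA_ID_SHIFT = 48      # upper 16 bits of an assumed 64-bit timestamp field
DATA_ID_MASK = 0xffff

def pack_data_id(timestamp: int, data_mirror_id: int) -> int:
    """Store the protected data mirror's ID in the upper 16 bits of the field."""
    assert 0 <= data_mirror_id <= DATA_ID_MASK
    return (timestamp & ((1 << DATA_ID_SHIFT) - 1)) | (data_mirror_id << DATA_ID_SHIFT)

def unpack_data_id(field: int) -> int:
    """Recover lcme_data_id from the packed field."""
    return (field >> DATA_ID_SHIFT) & DATA_ID_MASK

packed = pack_data_id(1234567890, 2)
assert unpack_data_id(packed) == 2
assert packed & ((1 << DATA_ID_SHIFT) - 1) == 1234567890  # timestamp preserved
```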
RAID Sets
Lustre erasure coding will divide each data component into a set of RAID Sets, each of which is its own redundancy group. Consider, for example, an 80-stripe file with 1 MiB stripes. To configure 8+2 parity on this file, we will divide the file into 10 RAID sets of 8 OSTs, each corresponding to 2 stripes in the parity mirror.
Without dividing the file into RAID sets, we would end up with impractically large RAID groups, where parity generation requires reading very large amounts of data (80 MiB in our example).
This also handles the case where every OST in a file system is used by a file. If a file is striped across all OSTs (an 80-OST file system in our example), it is not possible to select parity stripes that provide redundancy for all 80 stripes in one redundancy group, since all OSTs are already in use. By dividing the file into 10 x 8+2 RAID sets, we can select 2 OSTs not used in each specific RAID set and provide redundancy.
If the data stripe count does not divide evenly by k, the ec_split_stripes() algorithm produces two sizes of RAID sets: n0 sets of k0 stripes and n1 sets of k1 = k0 - 1 stripes, balancing the groups.
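The balanced split can be sketched as follows (an illustration of the described behavior, not the actual ec_split_stripes() implementation):

```python
import math

def split_raid_sets(stripe_count: int, k: int) -> list[int]:
    """Split stripe_count data stripes into balanced RAID sets of at most k
    stripes: n0 sets of k0 stripes followed by n1 sets of k1 = k0 - 1 stripes."""
    nsets = math.ceil(stripe_count / k)    # minimum number of RAID sets
    k0 = math.ceil(stripe_count / nsets)   # size of the larger sets
    n1 = nsets * k0 - stripe_count         # how many sets shrink by one stripe
    n0 = nsets - n1
    return [k0] * n0 + [k0 - 1] * n1

# The 80-stripe, 8+2 example divides evenly into ten sets of eight:
assert split_raid_sets(80, 8) == [8] * 10
# 10 stripes with k = 4 balances as one set of 4 and two sets of 3:
assert split_raid_sets(10, 4) == [4, 3, 3]
```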
CLI Interface
EC layouts are created with lfs setstripe using the --ec option:
lfs setstripe -E -1 -c 8 --ec 4+2 /mnt/lustre/file
This creates a data mirror with 8 stripes and a parity mirror with the appropriate number of parity stripes for 4+2 RAID sets. The --ec-expert option allows exceeding standard limits (k > 32 or m > 4).
EC components appear in lfs getstripe output with lcme_flags: parity, lcme_data_id, lcme_dstripe_count, and lcme_cstripe_count fields.
Degraded reads
Reads from an erasure-coded file would normally use only the primary RAID-0 component (unless data verification on read was also desired), as with non-redundant files. If a stripe in the primary component for the file fails, the client would read the data stripes and one or more parity stripes from the parity mirror and reconstruct the data from parity on the fly. A stale parity mirror cannot be used for recovery — the parity must be resynced first.
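For the single-parity (m = 1) case, on-the-fly reconstruction reduces to plain XOR, as in RAID-4. A minimal sketch of the idea (for m > 1 the parity math uses Reed-Solomon codes, e.g. via ISA-L):

```python
def xor_parity(stripes: list) -> bytes:
    """Compute a single XOR parity stripe over equal-sized data stripes."""
    out = bytearray(len(stripes[0]))
    for s in stripes:
        for i, b in enumerate(s):
            out[i] ^= b
    return bytes(out)

def reconstruct(stripes: list, parity: bytes) -> bytes:
    """Rebuild the one missing (None) data stripe from the survivors + parity."""
    missing = [i for i, s in enumerate(stripes) if s is None]
    assert len(missing) == 1, "XOR parity recovers at most one lost stripe"
    survivors = [s for s in stripes if s is not None]
    return xor_parity(survivors + [parity])

data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
p = xor_parity(data)
# Stripe 1 "fails"; it is recovered from the other stripes plus parity:
assert reconstruct([data[0], None, data[2], data[3]], p) == b"BBBB"
```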
Writes
Writes to an erasure-coded file would mark the parity mirror stale over the extent of the data component that was modified, as with a regular mirrored file, and writes would continue on the primary RAID-0 striped file. The main difference from an FLR data-mirrored file is that writes would always go to the primary data component, and the parity mirror would always be marked stale. Marking parity stale is recorded in the Lustre ChangeLog so that a resync tool or policy engine can detect and resync files with stale parity. It would not be possible to write to an erasure-coded file with a failed primary stripe without first reconstructing that stripe from parity. A parity stripe failure would not prevent reads.
Older clients that do not understand EC can safely read and write the data mirror; they will skip the parity mirror because they see an unknown flag. They simply cannot reconstruct data if an OST fails.
Parity Resync and Verify
The actual parity generation will be done with the lfs mirror resync tool in userspace. The Lustre client will do normal reads from the RAID-0 data component, unless there is an OST failure or other error reading from a data stripe. Data reconstruction from the data and parity components leverages existing functionality for reading mirrored files.
lfs mirror resync handles both regular mirror resync and EC parity resync in a single operation. EC resync is block-granular: it uses SEEK_DATA/SEEK_HOLE to build a coverage map and only recomputes parity for regions containing actual data. RAID sets fully covered by holes are skipped entirely. Parity is computed using Intel ISA-L.
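The block-granular coverage idea can be sketched as a pure function over the data extents a SEEK_DATA/SEEK_HOLE scan returns (the per-row granularity and extent representation here are assumptions for illustration):

```python
def rows_needing_parity(extents: list, row_size: int) -> set:
    """Given data extents [(offset, length), ...] from a SEEK_DATA/SEEK_HOLE
    scan, return the indices of parity rows containing any data.  Rows that
    lie entirely inside holes are skipped, so no parity is computed for them."""
    rows = set()
    for off, length in extents:
        first = off // row_size
        last = (off + length - 1) // row_size
        rows.update(range(first, last + 1))
    return rows

# A sparse file with data at [0, 1 MiB) and [5 MiB, 6 MiB), 1 MiB rows:
MiB = 1 << 20
assert rows_needing_parity([(0, MiB), (5 * MiB, MiB)], MiB) == {0, 5}
# An all-hole region needs no parity work at all:
assert rows_needing_parity([], MiB) == set()
```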
lfs mirror verify validates parity correctness by recomputing parity from data and comparing against the stored parity, using the same block-granular coverage map.
OST Allocation
The raidset-aware stripe allocator ensures that within each RAID set, all data and parity OSTs are unique. For parity allocation:
- If enough unique OSTs are available, each parity stripe gets a unique OST.
- Otherwise, OSTs from the data component may be reused for parity, but only from a different RAID set, preserving single-failure isolation within each RAID set.
- Overstriping (-C) allows OST reuse within the data component as long as the same OST is not reused within the same RAID set.
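The allocator invariant described above can be expressed as a small checker (a sketch; the function and parameter names are hypothetical):

```python
def raidset_isolated(data_osts: list, parity_osts: list, k: int, m: int) -> bool:
    """Check the allocator invariant: within every k+m RAID set, the data and
    parity OST indices are pairwise distinct.  OSTs may repeat across
    *different* RAID sets (e.g. with overstriping or few OSTs)."""
    nsets = len(data_osts) // k
    for s in range(nsets):
        members = data_osts[s * k:(s + 1) * k] + parity_osts[s * m:(s + 1) * m]
        if len(set(members)) != len(members):
            return False   # an OST appears twice within one RAID set
    return True

# 8 data stripes as two 4+1 RAID sets on a 6-OST system: OSTs 0-2 repeat
# across the two data sets, and set 1's parity reuses OST 3, which holds
# data only in set 0 -- single-failure isolation still holds per set.
assert raidset_isolated([0, 1, 2, 3, 4, 0, 1, 2], [5, 3], k=4, m=1)
# Reusing OST 3 for parity within its own RAID set violates the invariant:
assert not raidset_isolated([0, 1, 2, 3], [3], k=4, m=1)
```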
Repair
Separately, we will provide a tool which can be used to repair a file with a degraded stripe. In the most basic version, this can simply be lfs migrate — creating a new layout from scratch, copying the data and the parity, relying on the degraded read functionality in the client to regenerate the data. In the future, this could use single stripe replacement (once that feature exists) and/or depend on the resync tool to reconstruct the failed stripe from parity.
Future Development
- Immediate EC writes: For new files written linearly by a single client, it may be possible to compute and write parity inline, marking EC components uptodate at close if all writes succeeded. This combines the delayed write and resync steps.
- Single stripe replacement: Reconstruct a failed stripe from parity directly, without full file migration.
- Immediate write mirroring + EC: Use the IMW infrastructure (Active Writer locks, stale/inconsistent state management) to support concurrent EC writes from multiple clients.
- Write redundancy via mirroring: As an alternative to immediate EC writes, the Mirrored File Writes functionality could be used during writes to erasure-coded files, so the data mirroring provides redundancy. Changes would then be merged into the erasure coded component after the file is closed, using the ChangeLog consumer, and the mirror component can be dropped.
Issue Links
- is related to:
  - LU-12649 Tracker for ongoing FLR improvements (Open)
  - LU-13643 FLR3-IM: Immediate write mirroring (Open)
  - LU-19562 FLR-EC: Add connect flag support and enable/disable (In Progress)
  - LUDOC-463 Add feature documentation for Erasure Coding (Open)
  - LU-19298 Improve PFL handling in LOD / LOV layer (Open)
  - LU-19826 FLR-EC: increase maximum number of mirrors allowed (Open)
  - LU-19066 FLR2: identify rack/PSU failure domains for servers (Reopened)
  - LU-16837 interop: client skip unknown component mirror (Resolved)
  - LU-9961 FLR-EC: Relocating individual objects to a new OST (Open)
  - LU-19100 FLR-EC: restripe PFL components when adding EC (Open)