Lustre / LU-10911

FLR2: Read only erasure coding


Details

    • Type: Epic
    • Resolution: Unresolved
    • Priority: Minor

    Description

      Overview

      Erasure coding provides a more space-efficient method of adding data redundancy than mirroring, at a somewhat higher computational cost. It would typically be used to add redundancy to large, longer-lived files in order to minimize space overhead. For example, a 10+2 parity layout adds only 20% space overhead while tolerating two OST failures, compared to mirroring, which adds 100% overhead for single-failure redundancy or 200% overhead for double-failure redundancy. Erasure coding can provide redundancy against an arbitrary number of drive failures (e.g. any 3 drives in a group of 16) at a fraction of the overhead.
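      The overhead figures above follow directly from the ratio of parity stripes to data stripes; a quick illustrative sketch (not part of Lustre):

```python
def parity_overhead(data_stripes: int, parity_stripes: int) -> float:
    """Space overhead of a k-data + m-parity layout, as a fraction of data size."""
    return parity_stripes / data_stripes

# 10 data + 2 parity: 20% overhead, survives any 2 OST failures
assert parity_overhead(10, 2) == 0.2
# 2-way mirror: 100% overhead, survives 1 failure
assert parity_overhead(1, 1) == 1.0
# 3-way mirror: 200% overhead, survives 2 failures
assert parity_overhead(1, 2) == 2.0
```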

      It is possible to implement delayed erasure coding on striped files in a manner similar to Phase 1 mirrored files, by storing the parity stripes in a separate mirror within the file, with a layout that indicates that the mirror contains parity data, the number of data and parity stripes, etc. The encoding would be similar to RAID-4, with specific "data" stripes (the traditional Lustre RAID-0 file layout) in the primary component and one or more "parity" stripes stored in a separate parity mirror, unlike RAID-5/6, which interleave the parity with the data.

      This choice of RAID-4 is essential for two reasons:
      1. We need to add parity to existing files without rewriting all the data, which interleaved parity would require.
      2. We will use FLR state management to manage parity staleness. This works naturally with the current code if we place all parity in a separate mirror.

      See https://en.wikipedia.org/wiki/Standard_RAID_levels#RAID_4
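      To illustrate the RAID-4 geometry (a dedicated parity stripe computed over fixed data stripes, not interleaved), here is a single-parity XOR sketch in Python; actual multi-parity encoding would use Reed-Solomon codes, e.g. via ISA-L:

```python
def raid4_parity(data_stripes: list[bytes]) -> bytes:
    """Compute a dedicated XOR parity stripe over equal-length data stripes.

    This is single-parity RAID-4: the parity lives in its own stripe,
    so it can be added later without rewriting the data stripes.
    """
    parity = bytearray(len(data_stripes[0]))
    for stripe in data_stripes:
        for i, byte in enumerate(stripe):
            parity[i] ^= byte
    return bytes(parity)

# parity of [0x01 0x02] and [0x03 0x04] is their bytewise XOR
assert raid4_parity([b'\x01\x02', b'\x03\x04']) == b'\x02\x06'
```

      Because the parity is kept in a separate component, computing it is purely additive: the existing RAID-0 data stripes are read but never rewritten.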

      RAID Sets

      Lustre erasure coding will divide each data component into a set of RAID sets, each of which is its own redundancy group.
      Consider, e.g., an 80-stripe file with 1 MiB stripes. To configure 8+2 parity on this file, we will divide the file into 10 RAID sets of 8 OSTs each, with 2 corresponding stripes in the parity mirror per set.
      Without dividing the file into RAID sets, we would end up with impractically large RAID groups, where parity generation requires reading very large amounts of data (80 MiB in our example).

      This also allows us to handle the case where all OSTs in a file system are used by one file. If a file is striped across all the OSTs in the file system (an 80-OST file system in our example), it is not possible to select parity stripes that provide redundancy for all 80 stripes in one redundancy group, since every OST is already in use. By dividing the file into 10 x 8+2 RAID sets, we can select 2 OSTs that are not used in each specific RAID set, and provide redundancy.
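      The stripe-to-RAID-set mapping and the parity OST selection described above might look roughly like this (hypothetical helper names, Python for illustration only):

```python
def raid_sets(stripe_count: int, k: int) -> list[list[int]]:
    """Partition a file's data stripe indices into RAID sets of k stripes each."""
    assert stripe_count % k == 0, "stripe count must be a multiple of k"
    return [list(range(s, s + k)) for s in range(0, stripe_count, k)]

def pick_parity_osts(set_osts: set[int], all_osts: range, m: int) -> list[int]:
    """Choose m parity OSTs that hold no data stripe in this RAID set."""
    return [ost for ost in all_osts if ost not in set_osts][:m]

# 80-stripe file, 8+2 parity: 10 RAID sets of 8 data stripes each
sets = raid_sets(80, 8)
assert len(sets) == 10
# Even on an 80-OST file system, each set leaves 72 OSTs free for its parity
assert pick_parity_osts(set(sets[0]), range(80), 2) == [8, 9]
```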

      Degraded reads

      Reads from an erasure-coded file would normally use only the primary RAID-0 component (unless data verification on read was also desired), as with non-redundant files. If a stripe in the primary component for the file fails, the client would read the data stripes and one or more parity stripes from the parity mirror and reconstruct the data from parity on the fly.
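      For the single-XOR-parity case, the on-the-fly reconstruction is just the XOR of the surviving data stripes with the parity stripe (a sketch; the multi-parity case requires Reed-Solomon decoding):

```python
def reconstruct_stripe(surviving: list[bytes], parity: bytes) -> bytes:
    """Rebuild one missing data stripe from the surviving data stripes
    plus an XOR parity stripe: missing = parity XOR (all survivors)."""
    out = bytearray(parity)
    for stripe in surviving:
        for i, byte in enumerate(stripe):
            out[i] ^= byte
    return bytes(out)

# data stripes were [0x01, 0x03], so parity = 0x01 ^ 0x03 = 0x02;
# losing the second stripe, we recover it from the first plus parity
assert reconstruct_stripe([b'\x01'], b'\x02') == b'\x03'
```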

      Writes

      Writes to an erasure-coded file would mark the parity mirror stale matching the extent of the data component that was modified, as with a regular mirrored file, and writes would continue on the primary RAID-0 striped file. The main difference from an FLR data mirrored file is that the writes would always need to go to the primary data component, and the parity mirror would always be marked stale. It would not be possible to write to an erasure-coded file that has a failure in a primary stripe without first reconstructing it from parity. A parity stripe failure would not prevent reads.
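      The extent-matched staleness described above could be modeled as merging each written byte range into the parity mirror's stale-extent list (a simplified sketch of the bookkeeping, not Lustre's actual on-disk representation):

```python
def mark_stale(stale: list[tuple[int, int]], start: int, end: int) -> list[tuple[int, int]]:
    """Merge a newly written [start, end) extent into the stale-extent list,
    coalescing any overlapping or adjacent-by-overlap extents."""
    merged: list[tuple[int, int]] = []
    for s, e in sorted(stale + [(start, end)]):
        if merged and s <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged

# a write at 512 KiB..2 MiB extends an existing stale extent at 0..1 MiB
assert mark_stale([(0, 1 << 20)], 512 << 10, 2 << 20) == [(0, 2 << 20)]
```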

      Repair

      Separately, we will provide a tool which can be used to repair a file with a degraded stripe. In the most basic version, this can simply be lfs migrate: creating a new layout from scratch and copying the data and the parity, relying on the degraded-read functionality in the kernel client to regenerate the data. In the future, this could use single-stripe replacement (once that feature exists) and/or depend on the resync tool to reconstruct the failed stripe from parity.

      Space Efficient Data Redundancy

      Erasure coding will make it possible to add full redundancy to large files or whole filesystems without resorting to full mirroring. This will allow striped Lustre files to store redundancy in parity components that allow recovery from a specified number of OST failures (e.g. 3 OST failures per 12 stripes, or 4 OST failures per 24 stripes) in a manner similar to RAID-4 with fixed parity stripes.

      Required Lustre Functionality

      Erasure Coded File Read

      The actual parity generation will be done with the lfs mirror resync tool in userspace. The Lustre client will do normal reads from the RAID-0 data component, unless there is an OST failure or other error reading from a data stripe. Support will be added for data reconstruction from the data and parity components, leveraging existing functionality for reading mirrored files.

      Writes to Erasure Coded Files

      To avoid losing redundancy on erasure-coded files that are modified, the Mirrored File Writes functionality could be used during writes to such files. The data mirroring provides redundancy while the file is being written; the changes would then be merged into the erasure-coded component after the file is closed, using the Phase 1 ChangeLog consumer, after which the mirror component can be dropped.
      Phase 3 FLR will consider immediate erasure coding on writes.

      External Components

      Erasure Coded Resync Tool

      The lfs mirror resync tool needs to be updated to generate the erasure code for the striped file, storing the parity in a separate component from the main RAID-0 striped file. There are CPU-optimized implementations of the erasure coding algorithms available, so the majority of the work would be integrating these optimized routines into the Lustre kernel modules and userspace tools, rather than developing the encoding algorithms themselves. We plan to use Intel's ISA-L library for this, as it is widely used, performant, and has an appropriate license.
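      Under the hood the encoding is Galois-field arithmetic. A minimal pure-Python sketch of the GF(2^8) multiply and the matrix-style encode that libraries like ISA-L implement with SIMD lookup tables (the coefficient matrix here is a toy example, not ISA-L's actual generator matrix):

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply in GF(2^8) with the polynomial x^8+x^4+x^3+x^2+1 (0x11d),
    a field commonly used by Reed-Solomon erasure codes."""
    product = 0
    for _ in range(8):
        if b & 1:
            product ^= a
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1D
        b >>= 1
    return product

def ec_encode(data: list[int], coeffs: list[list[int]]) -> list[int]:
    """parity[r] = XOR over j of coeffs[r][j] * data[j] in GF(2^8);
    an optimized library performs this a whole stripe at a time."""
    parity = []
    for row in coeffs:
        acc = 0
        for c, d in zip(row, data):
            acc ^= gf_mul(c, d)
        parity.append(acc)
    return parity

# two parity symbols over two data symbols with a toy coefficient matrix
assert ec_encode([5, 3], [[1, 1], [1, 2]]) == [6, 3]
```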


        Issue Links

          1. FLR-EC: add necessary structure to adopt erasure coding layout (Technical task, Resolved, Zhenyu Xu)
          2. FLR-EC: erasure coding layout handling (Technical task, Open, Zhenyu Xu)
          3. FLR-EC: Parity stripe count from data stripe count (Technical task, Open, Ronnie Sahlberg)
          4. FLR-EC: Basic do-no-harm lov IO support (Technical task, Open, Patrick Farrell)
          5. FLR-EC: Implement FLR state transition logic for EC files (Technical task, Resolved, Patrick Farrell)
          6. FLR-EC: Direct EC component read/write (Technical task, Resolved, Patrick Farrell)
          7. FLR-EC: Add lfs ec resync and lfs ec verify commands (Technical task, Resolved, WC Triage)
          8. FLR-EC: import ISA-L library in Lustre build (Technical task, Resolved, James A Simmons)
          9. FLR-EC: resync parity components (Technical task, In Progress, Ronnie Sahlberg)
          10. FLR-EC: recover data from parity code (Technical task, Open, Zhenyu Xu)
          11. FLR-EC: Add/modify conf-sanity test_32 for erasure coding (Technical task, Open, WC Triage)
          12. FLR-EC: Don't read parity components on old clients (Technical task, Open, Zhenyu Xu)
          13. FLR-EC: Prevent stranding of parity mirror (Technical task, Open, Patrick Farrell)
          14. FLR-EC: Never select EC mirror as primary for write or allow setting PREFER flags (Technical task, In Progress, Patrick Farrell)
          15. FLR-EC: Test FLR state transitions and mirror read/write for EC files (Technical task, In Progress, Patrick Farrell)
          16. FLR-EC: Tight binding between erasure code and parity mirrors (Technical task, Open, Zhenyu Xu)
          17. FLR-EC: lfs setstripe support for erasure coding (Technical task, In Progress, Patrick Farrell)
          18. FLR-EC: support for other lfs mirror commands (Technical task, Open, Marc Vef)
          19. FLR-EC: Add connect flag support and enable/disable (Technical task, In Progress, Patrick Farrell)
          20. FLR-EC: mark EC OST objects for LFSCK, rebuild EC components (Technical task, In Progress, Patrick Farrell)
          21. FLR-EC: add 'lfs find' support for EC files (Technical task, Open, Marc Vef)
          22. Fix EOF handling for parity mirrors (Technical task, Open, Patrick Farrell)
          23. FLR-EC: lfs setstripe --ec creates ec mirror with single stripe (Technical task, Open, Ronnie Sahlberg)
          24. Add documentation for the ec feature (Technical task, In Progress, Ronnie Sahlberg)


            People

              paf0186 Patrick Farrell
              adilger Andreas Dilger
              Votes: 0
              Watchers: 23
