  1. Lustre
  2. LU-8310

Checksum/erasure code of the EAs for better recovery of Lustre


Details

    • New Feature
    • Resolution: Duplicate
    • Minor

    Description

      To store private information about the files and objects that make up a
      Lustre file, Lustre saves a series of extended attributes (EAs) on inodes of
      the lower file system. All of these EAs are important for the correct
      behavior of Lustre functions and features, and some are so critical that if
      they are lost or corrupted, all the data/metadata of the Lustre file is no
      longer available. For example, if the "trusted.lov" EA has an incorrect
      value, the data of the Lustre file might point to a nonexistent object or,
      even worse, to another file's data.

      Unfortunately, this situation can happen if a Lustre server or its storage
      crashes. What makes it worse is that it is sometimes hard to determine which
      component is the root cause of the inconsistency when recovering the system.
      For example, a "trusted.lov" EA pointing to a nonexistent object could mean
      either 1) the value of the EA is corrupted, or 2) the object on the OST has
      been removed although it shouldn't have been. When this happens, the LFSCK
      mechanism, which is supposed to fix inconsistencies of the Lustre file system
      online, might have to make its repairs based on wrong EA values. Such a
      repair obviously won't help.

      For these reasons, I am wondering whether a checksum/erasure code of the
      Lustre EAs could be introduced to improve the situation. The idea is as
      follows:

      1) A checksum/erasure code of the Lustre EAs (e.g. trusted.lov + trusted.lma
      + ...) will be calculated and saved as a new EA (e.g. "trusted.mdt_checksum"
      and "trusted.ost_checksum") when the Lustre file is created. Since most (or
      all) of the Lustre EAs are not updated by normal file system operations on
      the file, the EAs are almost immutable, which means almost no performance
      regression will be introduced (except maybe at file creation).

      2) When the OST/MDT objects of a Lustre file are accessed/repaired, the
      checksum/erasure code could be used to check (and fix, if using an erasure
      code) the EAs.

      3) When the Lustre EAs are updated, the checksum/erasure code will be updated.
      As said before, this won't happen frequently. And if some Lustre EAs change
      too frequently (e.g. trusted.hsm when HSM is under heavy use), we could
      exclude those EAs from the checksum. Thus, filter flags could be specified to
      include only part of the Lustre EAs.

      4) The checksum/erasure code of the MDT EAs (i.e. "trusted.mdt_checksum") will
      also be saved on the OST objects that belong to the same Lustre file. In this
      way, LFSCK could use the checksum to check the consistency of the file between
      the OSTs and the MDT. If the checksum/erasure code of the MDT EAs is
      inconsistent between the MDT and the OSTs, LFSCK needs to either smartly
      determine which one is broken or just leave it to a manual decision. Ideally,
      this file should become read-only to prevent any further corruption.

      5) A series of utilities should be provided for better recovery of Lustre
      files, including support for the checksum/erasure code of EAs. Given that
      Lustre is so complex and is still evolving rapidly, it is ideal but not
      currently true that LFSCK is able to fix all problems online without any
      manual intervention. It is not rare that a Lustre file system needs to be
      recovered offline, directly on the lower file system (i.e. ldiskfs/zfs).
      The checksum/erasure code of EAs would make it harder to fix a broken file
      offline, since the changed values of the EAs need to be consistent with the
      checksum/erasure code. A lot of tools and scripts should be provided for
      this purpose even if LFSCK is doing well, because, as has been proven,
      userspace tools are much more flexible than online mechanisms when
      recovering data. Also, for online recovery, LFSCK should provide interfaces
      for administrators to make manual decisions on the recovery of the file
      system.
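
      Steps 1) to 4) can be sketched in a few lines of Python. This is only an
      illustration under stated assumptions, not a proposed on-disk format: CRC32
      stands in for whatever checksum would actually be chosen, the EAs of an
      object are modelled as a plain dict of name to bytes, and the function
      names, the PROTECTED_EAS filter and the verdict strings are all
      hypothetical.

```python
import struct
import zlib

# Assumption: the filter from step 3) is a tuple of EA names to protect.
# trusted.hsm is deliberately left out because it may change frequently.
PROTECTED_EAS = ("trusted.lov", "trusted.lma")
CHECKSUM_EA = "trusted.mdt_checksum"  # EA name taken from the proposal


def ea_checksum(eas, include=PROTECTED_EAS):
    """Return a 4-byte CRC32 over the selected EAs, in sorted name order.

    Names and values are length-prefixed, and a missing EA is encoded
    differently from an empty one, so that boundaries are unambiguous.
    """
    crc = 0
    for name in sorted(include):
        key = name.encode()
        crc = zlib.crc32(struct.pack("<I", len(key)) + key, crc)
        value = eas.get(name)
        if value is None:
            crc = zlib.crc32(b"\x00", crc)
        else:
            crc = zlib.crc32(b"\x01" + struct.pack("<I", len(value)) + value, crc)
    return struct.pack("<I", crc)


def verify_eas(eas, include=PROTECTED_EAS):
    """Step 2): check the protected EAs against the stored checksum EA."""
    stored = eas.get(CHECKSUM_EA)
    return stored is not None and stored == ea_checksum(eas, include)


def cross_check(mdt_eas, ost_eas_list):
    """Step 4): compare the MDT EAs with the checksum copies kept on OSTs.

    The verdicts are a hypothetical policy, not actual LFSCK behavior.
    """
    actual = ea_checksum(mdt_eas)
    mdt_ok = mdt_eas.get(CHECKSUM_EA) == actual
    copies = [ost.get(CHECKSUM_EA) for ost in ost_eas_list]
    if mdt_ok and all(c == actual for c in copies):
        return "consistent"
    if mdt_ok:
        return "ost-copy-suspect"
    if copies and copies[0] is not None and all(c == copies[0] for c in copies):
        # The OST copies agree with each other but not with the MDT EAs.
        return "mdt-suspect"
    return "needs-manual-decision"
```

      Leaving trusted.hsm out of PROTECTED_EAS means that HSM state changes never
      invalidate the checksum, which is the filtering idea from step 3). An
      offline repair tool would have to call something like ea_checksum() again
      after changing any protected EA.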

      We could use a similar mechanism from the lower file system, for example the
      metadata checksum of ext4. However, a Lustre-level checksum of EAs still has
      some advantages. First of all, the selected Lustre EAs are almost constant,
      which means the performance regression is likely to be minimal. Also, this
      implementation won't depend on any internal feature of the lower file system,
      and thus it can be used on both ZFS and ldiskfs.
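
      Because the scheme only operates on EA byte values, the "fix if using
      erasure code" part can also be illustrated without reference to any
      particular lower file system. The sketch below uses a single XOR parity
      block, the simplest possible erasure code, purely as an assumption for
      illustration; the proposal does not name a specific code. It can rebuild
      one EA that is known to be lost, but it cannot tell by itself which EA is
      bad, so it would be combined with a checksum. The helper names and the
      "trusted.example" EA are hypothetical.

```python
def xor_parity(values):
    """XOR byte strings together, zero-padding each to the longest length."""
    width = max((len(v) for v in values), default=0)
    parity = bytearray(width)
    for value in values:
        for i, byte in enumerate(value):
            parity[i] ^= byte
    return bytes(parity)


def build_parity(eas, names):
    """Record the original length of each protected EA plus one parity block."""
    lengths = {name: len(eas[name]) for name in names}
    return lengths, xor_parity([eas[name] for name in names])


def recover_missing(eas, names, lengths, parity, missing):
    """Rebuild one lost EA from the surviving EAs and the parity block."""
    survivors = [eas[name] for name in names if name != missing]
    rebuilt = xor_parity(survivors + [parity])
    # Drop the zero padding that was added when the parity was computed.
    return rebuilt[:lengths[missing]]


names = ("trusted.lov", "trusted.lma", "trusted.example")  # last one is a placeholder
eas = {
    "trusted.lov": b"layout-bytes",
    "trusted.lma": b"fid",
    "trusted.example": b"other",
}
lengths, parity = build_parity(eas, names)

damaged = {k: v for k, v in eas.items() if k != "trusted.lov"}
restored = recover_missing(damaged, names, lengths, parity, "trusted.lov")
```

      A single parity block tolerates the loss of only one EA; recovering from
      several damaged EAs at once would need a stronger code.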


          People

            pjones Peter Jones
            lixi Li Xi (Inactive)