[LU-8315] Checksum/erasure code of EAs for better recovery of Lustre Created: 21/Jun/16  Updated: 23/Jun/16  Resolved: 23/Jun/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: New Feature Priority: Minor
Reporter: Li Xi (Inactive) Assignee: WC Triage
Resolution: Incomplete Votes: 0
Labels: None

Rank (Obsolete): 9223372036854775807

 Description   

In order to store private information about the file/object belonging to a
Lustre file, both the OST and the MDT save a series of extended attributes
(EAs) on inodes of the lower file system. All of these EAs are important for
the correct behavior of Lustre functions and features, and some are so
critical that if they are lost or corrupted, all the data/metadata of the
Lustre file becomes unavailable. For example, if the "trusted.lov" EA has an
incorrect value, the data of the Lustre file might point to a non-existent
object or, even worse, to another file's data.

Unfortunately, this situation can happen when a server or its storage
crashes. What makes it worse is that it is sometimes hard to determine which
component is the root cause of the inconsistency when recovering the system.
For example, a "trusted.lov" EA pointing to a non-existent object could mean
either 1) the value of the EA is corrupted, or 2) the object on the OST has
been removed although it should not have been. When this happens, LFSCK,
which is supposed to fix inconsistencies of the Lustre file system online,
might try to repair the problem based on wrong EA values, and such an attempt
obviously will not help.

For these reasons, I am wondering whether a checksum/erasure code of the
Lustre EAs could be introduced to improve the situation. The idea is as
follows:

1) A checksum/erasure code of the Lustre EAs (e.g. trusted.lov + trusted.lma
+ ...) will be calculated and saved as a new EA (e.g. "trusted.mdt_checksum"
and "trusted.ost_checksum") when the Lustre file is created. Since most (or
all) of the Lustre EAs are not updated by normal file system operations on
the file, they are almost immutable, which means almost no performance
regression will be introduced (except maybe at file creation). A minimal
sketch of this calculation is given after this list.

2) When the OST/MDT objects of a Lustre file are accessed/repaired, the
checksum/erasure code can be used to check (and, if an erasure code is used,
fix) the EAs.

3) When the Lustre EAs are updated, the checksum/erasure code will be updated
as well. As mentioned above, this will not happen frequently. If some Lustre
EAs change too often (e.g. trusted.hsm when HSM is under heavy use), we could
exclude them from the checksum; filter flags could be specified to include
only a subset of the Lustre EAs.

4) The checksum/erasure code of the MDT EAs (i.e. "trusted.mdt_checksum")
will also be saved on the OST objects that belong to the same Lustre file. In
this way, LFSCK can use the checksum to check the consistency of the file
between the OSTs and the MDT. If the checksum/erasure code of the MDT EAs is
inconsistent between the MDT and the OSTs, LFSCK needs to either determine
which side is broken or leave the case to a manual decision (a cross-check
sketch is given at the end of this description). Ideally, such a file should
become read-only to prevent any further corruption.

5) A series of utilities should be provided for better recovery of Lustre
files, including their EA checksums/erasure codes. Given that Lustre is
complex and still evolving rapidly, it would be ideal, but is not currently
true, that LFSCK can fix all problems online without any manual intervention.
It is not rare that a Lustre file system needs to be recovered offline
directly on the lower file system (i.e. ldiskfs/ZFS), and the
checksum/erasure code of the EAs would make it harder to fix a broken file
offline, since any changed EA values need to be kept consistent with the
checksum/erasure code. Plenty of tools and scripts should be provided for
this purpose even if LFSCK works well, because, as has been proven, userspace
tools are much more flexible than online mechanisms when recovering data.
Also, for online recovery, LFSCK should provide interfaces for administrators
to make manual decisions about the recovery of the file system.
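To make ideas 1) and 3) above more concrete, here is a minimal userspace
sketch of how such a checksum could be calculated over a filtered set of EAs
and stored in a new EA. Only the "trusted.mdt_checksum" name and the notion
of filter flags come from the proposal; the EA table, the flag values, the
buffer size and the use of the generic xattr syscalls plus zlib's crc32() are
my own assumptions for illustration, not an actual Lustre implementation.
Reading and writing trusted.* EAs requires root on the backing file system.

/*
 * Sketch only: compute a CRC32 over a selected ("filtered") set of
 * Lustre EAs on a backing-filesystem inode and store it in a new EA.
 * Build with: cc -o ea_csum ea_csum.c -lz
 */
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <sys/xattr.h>
#include <zlib.h>

#define EA_VALUE_MAX 4096

/* Hypothetical filter flags: which EAs participate in the checksum. */
#define EA_CSUM_LMA  0x01
#define EA_CSUM_LOV  0x02
#define EA_CSUM_LINK 0x04
/* trusted.hsm deliberately has no flag here, as suggested in 3). */

struct ea_desc {
        const char   *name;
        unsigned int  flag;
};

static const struct ea_desc ea_table[] = {
        { "trusted.lma",  EA_CSUM_LMA },
        { "trusted.lov",  EA_CSUM_LOV },
        { "trusted.link", EA_CSUM_LINK },
};

/* Compute a CRC32 over the selected EAs (name + value) of @path. */
static uint32_t ea_checksum(const char *path, unsigned int filter)
{
        unsigned char buf[EA_VALUE_MAX];
        uint32_t crc = crc32(0L, Z_NULL, 0);
        size_t i;

        for (i = 0; i < sizeof(ea_table) / sizeof(ea_table[0]); i++) {
                ssize_t len;

                if (!(ea_table[i].flag & filter))
                        continue;

                len = lgetxattr(path, ea_table[i].name, buf, sizeof(buf));
                if (len < 0)
                        continue; /* EA absent on this object: skip it */

                /* Mix in the name so values cannot be swapped silently. */
                crc = crc32(crc, (const unsigned char *)ea_table[i].name,
                            strlen(ea_table[i].name));
                crc = crc32(crc, buf, len);
        }

        return crc;
}

/* Store the checksum as a new EA, e.g. at file creation time (idea 1). */
static int ea_checksum_store(const char *path, unsigned int filter)
{
        uint32_t csum = ea_checksum(path, filter);

        return lsetxattr(path, "trusted.mdt_checksum",
                         &csum, sizeof(csum), 0);
}

int main(int argc, char *argv[])
{
        if (argc != 2) {
                fprintf(stderr, "usage: %s <backing-fs object path>\n",
                        argv[0]);
                return 1;
        }

        if (ea_checksum_store(argv[1],
                              EA_CSUM_LMA | EA_CSUM_LOV | EA_CSUM_LINK)) {
                perror("ea_checksum_store");
                return 1;
        }
        return 0;
}

A real implementation would presumably do this in the OSD layer when the EAs
are declared/committed, and would store the checksum in a fixed endianness,
but the structure of the calculation would be the same.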

We could rely on a similar mechanism in the lower file system, for example
the metadata checksums of ext4. However, a Lustre-level checksum of the EAs
still has some advantages. First of all, the selected Lustre EAs are almost
constant, which means the performance regression is likely to be minimal.
Also, this implementation would not depend on any internal feature of the
lower file system, so it could be used on both ZFS and ldiskfs.
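As a further illustration of 4), and of the point above about not depending
on the lower file system, here is a small sketch of the cross-check that an
offline recovery tool (or LFSCK) could perform between the checksum EA stored
on the MDT object and the copy stored on an OST object. The paths, the
"trusted.mdt_checksum" name and the fixed 32-bit checksum size are again
assumptions; since only generic xattr calls are used, the same check works on
an ldiskfs mount or on the ZFS POSIX layer.

/* Sketch only: compare the MDT-side checksum EA with the copy saved on
 * an OST object, as proposed in 4).  On mismatch the tool just reports
 * and leaves the decision to the administrator. */
#include <stdio.h>
#include <inttypes.h>
#include <sys/xattr.h>

static int read_csum(const char *path, uint32_t *csum)
{
        ssize_t len;

        len = lgetxattr(path, "trusted.mdt_checksum", csum, sizeof(*csum));
        return len == (ssize_t)sizeof(*csum) ? 0 : -1;
}

int main(int argc, char *argv[])
{
        uint32_t mdt_csum, ost_csum;

        if (argc != 3) {
                fprintf(stderr,
                        "usage: %s <MDT object path> <OST object path>\n",
                        argv[0]);
                return 2;
        }

        if (read_csum(argv[1], &mdt_csum) || read_csum(argv[2], &ost_csum)) {
                perror("lgetxattr");
                return 2;
        }

        if (mdt_csum != ost_csum) {
                /* Per 4): report and leave it to a manual decision (or
                 * mark the file read-only) rather than guessing which
                 * side is broken. */
                printf("MISMATCH: 0x%08" PRIx32 " (MDT) != 0x%08" PRIx32
                       " (OST)\n", mdt_csum, ost_csum);
                return 1;
        }

        printf("OK: 0x%08" PRIx32 "\n", mdt_csum);
        return 0;
}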



 Comments   
Comment by Andreas Dilger [ 22/Jun/16 ]

There is already a feature in upstream ext4 that adds checksums for all metadata (metacsum) that should be used. For ZFS the checksum is already present for all metadata. I don't think that random EA corruption is likely to be a source of problems here, since there are 128-bit identifiers for the objects on the MDT and OST, so it is unlikely that random data will match any existing object. The PFID EA on the OST is essentially already a distributed copy of the LOV EA that can be used by LFSCK to recover the LOV EA if it is corrupted. If LFSCK is run periodically on a filesystem, it will ensure that the PFID EA copy is updated and correct on the OSTs.

One area that could be improved is the connection between the LOV EA on the MDT and the PFID on the OST. For the PFL project, more information will be stored in the PFID EA, including the stripe size and total stripe count for the file (or component) layout, which will help in recovery of the LOV EA.

For the File Level Redundancy project there will also be a layout generation stored with each object that matches the MDT EA to the OST EA to ensure they are in sync, but this will not help with recovering the layout if it is corrupted.

Comment by Li Xi (Inactive) [ 22/Jun/16 ]

Thank you Andreas for the information!

Comment by Andreas Dilger [ 22/Jun/16 ]

Li Xi, I think that all of your feature requests are already being addressed in other ways, can this ticket be closed?

Comment by Li Xi (Inactive) [ 23/Jun/16 ]

OK. Please close it.

Comment by Andreas Dilger [ 23/Jun/16 ]

According to https://ext4.wiki.kernel.org/index.php/Ext4_Metadata_Checksums the metadata_csum feature started to land in 3.6, and since RHEL 7 is 3.10 based it is possible that this would be usable already with e2fsprogs 1.43?

Please reopen if you try any of this out.
