To store private information about the files/objects belonging to a Lustre
file, Lustre saves a series of extended attributes (EAs) on the inodes of the
lower file system. All of these EAs matter for the correct behavior of Lustre
functions and features, and some are so critical that if they are lost or
corrupted, all the data/metadata of the Lustre file becomes unavailable. For
example, if the "trusted.lov" EA has an incorrect value, the data of the
Lustre file might point to a nonexistent object or, even worse, to another
file's data.
Unfortunately, this situation can happen when a server or its storage crashes.
Making matters worse, it is sometimes hard to determine which component is the
root cause of the inconsistency when recovering the system. For example, a
"trusted.lov" EA pointing to a nonexistent object could mean either 1) the
value of the EA is corrupted, or 2) the object on the OST has been removed
although it shouldn't have been. When this happens, LFSCK, the mechanism that
is supposed to fix inconsistencies of the Lustre file system online, might try
to fix the problem based on wrong EA values, and such an attempt obviously
won't help.
For these reasons, I am wondering whether a checksum/erasure code of the
Lustre EAs could be introduced to improve the situation. The idea is as
follows:
1) A checksum/erasure code of the Lustre EAs (e.g. trusted.lov + trusted.lma
+ ...) will be calculated and saved as a new EA (e.g. "trusted.mdt_checksum"
and "trusted.ost_checksum") when the Lustre file is created. Since most (or
all) of the Lustre EAs are not updated by normal file system operations on
the file, the EAs are almost immutable, which means almost no performance
regression will be introduced (except perhaps at file creation). A sketch of
the computation is given after this list.
2) When the OST/MDT objects of a Lustre file are accessed/repaired, the
checksum/erasure code can be used to check (and, if an erasure code is used,
fix) the EAs.
3) When the Lustre EAs are updated, the checksum/erasure code will be updated
as well. As said before, this won't happen frequently. And if some Lustre EAs
change too frequently (e.g. trusted.hsm when HSM is under heavy use), we could
exclude those EAs from the checksum. Thus, filter flags could be specified to
include only a subset of the Lustre EAs.
4) The checksum/erasure code of the MDT EAs (i.e. "trusted.mdt_checksum")
will also be saved on the OST objects that belong to the same Lustre file. In
this way, LFSCK can use the checksum to check the consistency of the file
between the OSTs and the MDT. If the checksum/erasure code of the MDT EAs is
inconsistent between the MDT and the OSTs, LFSCK needs to either determine
intelligently which copy is broken or leave the decision to manual
intervention. Ideally, such a file should become read-only to prevent any
further corruption.
5) A series of utilities should be provided for better recovery of Lustre
files, including the checksum/erasure code of their EAs. Given that Lustre is
complex and still evolving rapidly, it is ideal but not currently true that
LFSCK can fix all problems online without any manual intervention. It is not
rare that a Lustre file system needs to be recovered offline, directly on the
lower file system (i.e. ldiskfs/ZFS). A checksum/erasure code of the EAs would
make it harder to fix a broken file offline, since the changed EA values need
to stay consistent with the checksum/erasure code. Plenty of tools and scripts
should be provided for this purpose even if LFSCK works well, because, as
experience has shown, userspace tools are much more flexible than online
mechanisms when recovering data; a minimal offline verifier is sketched at
the end of this note. Also, for online recovery, LFSCK should provide
interfaces for administrators to make manual decisions about how the file
system is repaired.
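
To make steps 1) to 3) concrete, below is a minimal userspace sketch of how
such a checksum could be computed and stamped. Everything here is an
assumption of this note rather than an agreed-on format: the EA table, the
filter flags, the choice of CRC32, and the "trusted.mdt_checksum" name.

        #include <stdint.h>
        #include <string.h>
        #include <sys/types.h>
        #include <sys/xattr.h>
        #include <zlib.h>

        /* Hypothetical filter flags: which EAs join the checksum. */
        #define EA_CK_LOV   0x01
        #define EA_CK_LMA   0x02
        #define EA_CK_LINK  0x04

        static const struct {
                const char *name;
                unsigned int flag;
        } ea_tab[] = {
                { "trusted.lov",  EA_CK_LOV },
                { "trusted.lma",  EA_CK_LMA },
                { "trusted.link", EA_CK_LINK },
        };

        /*
         * CRC32 over name+value pairs in fixed table order, so the result
         * does not depend on the order in which the lower file system
         * lists EAs. EAs excluded by @filter (e.g. fast-changing ones
         * like trusted.hsm) and absent EAs contribute nothing.
         */
        static int ea_checksum(const char *path, unsigned int filter,
                               uint32_t *out)
        {
                uint32_t crc = crc32(0L, Z_NULL, 0);
                char buf[4096];
                size_t i;

                for (i = 0; i < sizeof(ea_tab) / sizeof(ea_tab[0]); i++) {
                        ssize_t len;

                        if (!(ea_tab[i].flag & filter))
                                continue;
                        len = lgetxattr(path, ea_tab[i].name, buf,
                                        sizeof(buf));
                        if (len < 0)
                                continue;
                        crc = crc32(crc, (const Bytef *)ea_tab[i].name,
                                    strlen(ea_tab[i].name));
                        crc = crc32(crc, (const Bytef *)buf, (uInt)len);
                }
                *out = crc;
                return 0;
        }

        /* Stamp the checksum at file creation time (step 1). */
        static int ea_checksum_stamp(const char *path, unsigned int filter)
        {
                uint32_t crc;

                if (ea_checksum(path, filter, &crc) < 0)
                        return -1;
                return lsetxattr(path, "trusted.mdt_checksum", &crc,
                                 sizeof(crc), 0);
        }

Of course, reading trusted.* EAs requires root, and a real implementation
would live inside the MDT/OST stack rather than going through the VFS; the
sketch is only meant to pin down the byte-level scheme (fixed EA order,
name+value hashing, filter mask).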
We could instead rely on a similar mechanism in the lower file system, for
example the metadata checksums of ext4. However, a Lustre-level checksum of
the EAs still has some advantages. First of all, the selected Lustre EAs are
almost constant, which means the performance regression is likely to be
minimal. Also, this design doesn't depend on any internal feature of the lower
file system, so it can be used on both ZFS and ldiskfs.
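
As an example of the offline tooling from step 5), here is a trivial verifier
that could run directly on a mounted ldiskfs backend, reusing ea_checksum()
and ea_checksum_stamp() from the sketch above (again, the tool name, the -f
semantics, and the EA layout are only assumptions):

        #include <stdint.h>
        #include <stdio.h>
        #include <string.h>
        #include <sys/xattr.h>

        /* ea_checksum() and ea_checksum_stamp() as defined above */

        int main(int argc, char **argv)
        {
                uint32_t stored, computed;
                int restamp = 0;

                if (argc > 1 && strcmp(argv[1], "-f") == 0) {
                        restamp = 1;
                        argv++;
                        argc--;
                }
                if (argc != 2) {
                        fprintf(stderr, "usage: ea_ck [-f] FILE\n");
                        return 2;
                }

                /* -f: trust the manually repaired EAs, rewrite checksum */
                if (restamp)
                        return ea_checksum_stamp(argv[1], ~0U) ? 1 : 0;

                if (ea_checksum(argv[1], ~0U, &computed) < 0)
                        return 1;
                if (lgetxattr(argv[1], "trusted.mdt_checksum", &stored,
                              sizeof(stored)) != (ssize_t)sizeof(stored) ||
                    stored != computed) {
                        fprintf(stderr, "%s: EA checksum mismatch\n",
                                argv[1]);
                        return 1; /* candidate for LFSCK or manual repair */
                }
                return 0;
        }

A real tool would also need to handle the erasure-code variant, the per-OST
copies of "trusted.mdt_checksum" from step 4), and batch scanning, but the
verify/re-stamp split above is the core of the offline workflow.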