Details
- Improvement
- Resolution: Unresolved
- Medium
- Lustre 2.18.0
- 3
- 9223372036854775807
Description
The goal of FLR-EC is to allow a file to be accessible for reads when a file's OST object is inaccessible (for whatever reason), by reconstructing data from a parity object on another OST.
If the file data can be reconstructed from parity in limited cases, but some other common operation (e.g. open() or stat()) is blocked by the inaccessible OST object, then the operational benefits of FLR-EC are lost.
Testing should be done with a variety of workloads that access the filesystem in a read-only manner to check that they do not become blocked by RPC timeout and retry. For example, scanning and reading with "[lfs] find /mnt/testfs -size +1G -print0 | xargs -0 md5sum" or "grep -Rq test /mnt/testfs" or running "/mnt/testfs/executable" should not hang, and ideally should not pause more than a few seconds before the EC read reconstruction is activated.
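A check like this could be scripted as a small harness that runs each read-only workload under a deadline and flags the ones that do not finish in time. The following is a minimal sketch, not an existing Lustre test: the helper names (run_with_deadline, check_workloads) and the 10-second deadline are assumptions for illustration, and the command strings are the examples quoted above, using plain "find".

```python
import os
import signal
import subprocess
import time

# Read-only workloads taken from the examples above; each runs through "sh -c"
# so that pipelines work. DEADLINE_S is an assumed bound, not a Lustre tunable.
WORKLOADS = [
    "find /mnt/testfs -size +1G -print0 | xargs -0 md5sum",
    "grep -Rq test /mnt/testfs",
    "/mnt/testfs/executable",
]
DEADLINE_S = 10.0


def run_with_deadline(argv, deadline_s):
    """Run argv and report whether it finished within deadline_s seconds."""
    start = time.monotonic()
    # A new session gives the child its own process group, so the whole
    # pipeline can be signalled at once if it overruns the deadline.
    proc = subprocess.Popen(
        argv,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,
    )
    try:
        returncode = proc.wait(timeout=deadline_s)
        blocked = False
    except subprocess.TimeoutExpired:
        returncode = None
        blocked = True
        try:
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass
        # A process stuck in an uninterruptible RPC wait may not exit even
        # after SIGKILL, so only wait briefly when reaping it.
        try:
            proc.wait(timeout=1.0)
        except subprocess.TimeoutExpired:
            pass
    return {
        "argv": argv,
        "blocked": blocked,
        "returncode": returncode,
        "elapsed_s": time.monotonic() - start,
    }


def check_workloads(workloads=WORKLOADS, deadline_s=DEADLINE_S):
    """Run every workload and return only the results that overran the deadline."""
    results = [run_with_deadline(["sh", "-c", cmd], deadline_s) for cmd in workloads]
    return [r for r in results if r["blocked"]]
```

Calling check_workloads() on a client with one OST made unavailable would return an empty list if none of the workloads hang. A non-zero exit status (for example, grep finding no match) is not treated as a failure here, since only the elapsed time is of interest.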
The stat() calls should be handled by the SOM xattr on the MDS to return the file size and blocks instead of sending RPCs to OSTs to fetch these attributes.
LU-20211 contains a proposal to handle open(O_TRUNC) and truncate(0) in a way that would allow "read-only" EC to remove inaccessible OST objects from the file completely.
statfs()/df already has a "lazy" mechanism that should time out if the OST is not responsive, and LU-20200 proposes that "lfs df" also send OST_STATFS RPCs in parallel to avoid long sequential waits, though the latter is not critical functionality for most workloads.
Other file and filesystem access calls should be systematically reviewed and tested to ensure that file operations do not block; any that do should be fixed, or the wait minimized if at all possible.
Ideally, only write(), truncate(), and maybe fallocate() to missing OST objects would block access waiting on OST recovery.
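The systematic review could start from a per-call probe that issues individual read-only system calls against a file on a degraded OST and records which ones exceed a deadline. The sketch below is only an illustration under stated assumptions: the function names (probe, probe_read_only_ops), the 5-second default, and the particular set of calls probed are hypothetical choices, not a list taken from this ticket beyond open() and stat().

```python
import os
import threading
import time


def probe(name, func, deadline_s=5.0):
    """Run func() in a daemon thread and report whether it returned in time."""
    outcome = {}

    def worker():
        try:
            outcome["value"] = func()
        except OSError as exc:
            outcome["error"] = exc

    start = time.monotonic()
    # A daemon thread is used so that a call stuck waiting on an OST RPC
    # does not prevent the probing script itself from exiting.
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(deadline_s)
    return {
        "op": name,
        "blocked": thread.is_alive(),
        "elapsed_s": time.monotonic() - start,
        "error": outcome.get("error"),
    }


def read_first_block(path, nbytes=4096):
    """Open path read-only and read up to nbytes from the start of the file."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.read(fd, nbytes)
    finally:
        os.close(fd)


def probe_read_only_ops(path, deadline_s=5.0):
    """Probe a few read-only calls on path; the set of calls is an assumption."""
    ops = [
        ("stat", lambda: os.stat(path)),
        ("access", lambda: os.access(path, os.R_OK)),
        ("statvfs", lambda: os.statvfs(path)),
        ("open+read", lambda: read_first_block(path)),
    ]
    return [probe(name, func, deadline_s) for name, func in ops]
```

Any entry with "blocked" set to True marks a call that is a candidate for the review. Calls that modify the file, such as write(), truncate(), and fallocate(), are deliberately left out of the probe set.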
Issue Links
- is related to
  - LU-10911 FLR2: Read only erasure coding (In Progress)
  - LU-20200 FLR-EC: parallel 'lfs df' RPC submission (Open)
  - LU-20211 FLR-EC: allow 'truncate(file, 0)' and 'open(file, O_TRUNC)' on degraded EC file (Open)
  - LU-11962 File LSOM updates to store proper size via FLR for regular stat() usage (Reopened)