Details
-
Improvement
-
Resolution: Unresolved
-
Medium
-
Lustre 2.18.0
-
3
Description
To improve FLR-ECRO usability in production when an OST is inaccessible, it would be desirable to handle operations like open("file", O_TRUNC) and truncate("file", 0) transparently to the application. Since these are data-discarding operations, there is no need to reconstruct any data, and open("file", O_TRUNC) is generally followed by write() on the same file, so it presents an ideal opportunity to "repair" a file with an inaccessible stripe without having full FLR-ECIW. Otherwise, in the FLR-ECRO paradigm, that file access would immediately hang the application until the OST is recovered.
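For reference, the two data-discarding patterns look like this from the application side. This is only a minimal sketch using Python's standard `os` wrappers around the same system calls; the helper names are made up for illustration and nothing here is Lustre-specific.

```python
import os


def overwrite_with_open_trunc(path, data):
    """open(path, O_WRONLY | O_TRUNC) followed by write(): the old contents
    are discarded at open time, before any new data arrives."""
    fd = os.open(path, os.O_WRONLY | os.O_TRUNC)
    try:
        # A single write() is enough for the small buffers used here;
        # a real application would loop on short writes.
        os.write(fd, data)
    finally:
        os.close(fd)


def discard_with_truncate(path):
    """truncate(path, 0): drop all file data without opening the file."""
    os.truncate(path, 0)
```

In both cases no old data needs to be read back, which is why neither call needs reconstruction from parity.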
The MDS always gets the MDS_REINT_SETATTR RPC first, before the OST objects are truncated, in order to check and block truncate on files opened for execution.
I think there are two possible options for replacing inaccessible objects during truncate:
- the MDS replaces objects individually for OSTs (OSPs) that are inactive and puts them in the delete queue (possibly attached to a temporary inode)
- the MDS swaps the whole file layout in this case and reallocates it from the current set of OSTs
While replacing individual objects seems appealing at first glance because of its lower overhead, I think there are several arguments in favor of replacing the whole file layout:
- replacing individual objects is a more complex layout operation (which we don't implement) compared to replacing the whole file layout (which could easily be done with layout swap today)
- in some configurations it may not be possible to allocate replacement object(s) that obey the failure-domain restrictions. For example, with a 6+2 layout on an 8-OST system, if there are 1 or 2 OSTs offline, there is no other OST that could be used for allocation, while a full layout reallocation could fall back to 4+2 and continue operation uninterrupted.
- "deleting" the whole file layout actually integrates better with Trash Can, because it would allow an older version of a file to be preserved in case of "overwrite in place" workloads, in a similar manner to how rename("file.new", "file") is currently also handled by TCU by moving the over-renamed file into trash.
- depending on your point of view, recreating the file layout in this case also (potentially) allows the MDS to make a better layout decision for the file the next time around, assuming that it will be a similar size to before. For example, if the file has a PFL 1-4-40 stripe layout, and the file size has grown into the 4- or 40-stripe component, then it would probably make sense to skip the 1-stripe component entirely and go straight to the wider-stripe component on the second try. Not only would wider striping help bandwidth and space balance for early file offsets that were previously on the 1-stripe component, it would also allow EC to use parity stripes from offset 0 immediately instead of a mirror for the first component.
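Two of the arguments above come down to small calculations, sketched here in Python. This is an illustration under stated assumptions, not MDS code: it assumes one stripe per OST for the failure-domain rule, the candidate geometry list contains only the 6+2 and 4+2 cases named above, and the PFL component extents (64 MiB and 1 GiB) are hypothetical values chosen for the example. All function and constant names are invented.

```python
MIB = 1 << 20
GIB = 1 << 30

# (data, parity) geometries in order of preference; only the two cases
# mentioned in the ticket are listed.
EC_CANDIDATES = ((6, 2), (4, 2))

# (extent_end, stripe_count) for a 1-4-40 PFL layout. The extent ends are
# hypothetical; None marks the final component, which extends to EOF.
PFL_1_4_40 = ((64 * MIB, 1), (1 * GIB, 4), (None, 40))


def pick_ec_geometry(available_osts, candidates=EC_CANDIDATES):
    """Return the first (data, parity) geometry that fits on the available
    OSTs, assuming every stripe must land on a distinct OST."""
    for data, parity in candidates:
        if data + parity <= available_osts:
            return (data, parity)
    return None  # nothing fits; the caller has to block or fail


def initial_stripe_count(previous_size, components=PFL_1_4_40):
    """Return the stripe count of the component that held the last byte of
    the old file, i.e. the width a reallocated layout could start with."""
    for extent_end, stripes in components:
        if extent_end is None or previous_size <= extent_end:
            return stripes
    return components[-1][1]  # size beyond all bounded components


if __name__ == "__main__":
    # 8-OST system with 0, 1, or 2 OSTs offline.
    for offline in (0, 1, 2):
        print(offline, "offline:", pick_ec_geometry(8 - offline))
    # Old file sizes falling into each of the three components.
    for size in (10 * MIB, 200 * MIB, 5 * GIB):
        print(size, "bytes:", initial_stripe_count(size), "stripes")
```

With 8 OSTs and 1 or 2 offline, `pick_ec_geometry` returns `(4, 2)`, matching the fallback described above, whereas in-place replacement of a single object would have no spare OST to use.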