Details
Improvement
Resolution: Unresolved
Minor
Description
It would be possible to implement the fallocate() FALLOC_FL_INSERT_RANGE and FALLOC_FL_COLLAPSE_RANGE modes for Lustre, with specific restrictions:
- in all cases a DLM write lock must be held on the object from offset to OBD_OBJECT_EOF, to flush any dirty cache from the clients and to prevent access while the data is being shifted. If the lock were taken on the OST, it would ensure that the client cache is flushed and fetched again with the correct data.
- as a potentially separate step, if the client sends a lock handle in the RPC for the appropriate range of the object, then the OST could assume that the client is adjusting its page cache appropriately and avoid flushing the entire file from cache. The ext4_fallocate() call handles flushing the page cache for local operations, so the llite code would need to do the same for the local page cache to keep it consistent (probably after the OST RPC is successful so that it does not lose local state if the RPC failed for any reason).
- for 1-stripe plain layout (non-PFL) files it would only require blocksize alignment of offset and len, which appears to be enforced by the backing ldiskfs filesystem code itself. This could basically be implemented today as a straight pass-through, with no effort beyond checking the flags and layout, and continuing to return -EOPNOTSUPP from both the client and server for these modes on files with more than one stripe.
- for multi-striped plain layouts the offset must be an integer multiple of lmm_stripe_size, and len must be an integer multiple of stride = lmm_stripe_count * lmm_stripe_size. This ensures that whole "stride units" of the file are added/removed at once, so the data does not need to be moved between OST stripes of the file when it is shifted. At this stage the client would continue to return -EOPNOTSUPP for PFL files. The offset should be mapped in LOV to the proper starting offset of each OST object, and len should be divided by lmm_stripe_count so that an appropriate amount of space is added to/removed from each object by calling fallocate() on each object individually.
- for a PFL file, this alignment/size restriction applies to both the layout of the current component (and any overlapping mirror components at that offset) and any later components in the file (if allocated), to ensure that any data shifts in the later components can also be handled without data movement since they will also need to have fallocate() called on all allocated objects for the component. It would also be necessary to shift the lcme_extent.e_start and .e_end for the following component(s) so that the file layout is suitably mapped to the new data offset. It is also necessary for the OST to update ost_layout.ol_comp_start and ol_comp_end in the filter_fid xattr on the OST object as part of the same transaction as fallocate() so that the data stays consistent.
- in a far distant future where this feature is heavily used and important for some workload, it might be possible to reduce the len alignment from the full stride to only lmm_stripe_size (the largest in the current and later components) by reordering the OST objects in the current and later component layouts. That still avoids the need to move data between objects on different OSTs, but is of course much more complex to get right, and can be avoided by selecting a suitable stride = stripe_count * stripe_size for all components of a file (e.g. a smaller stripe_size for later components to compensate for a larger stripe_count).
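The alignment rule and per-object mapping for plain striped layouts described above could be sketched roughly as follows. This is a minimal userspace illustration, not Lustre code: struct plain_layout, struct obj_range, shift_check_aligned() and shift_map_stripe() are hypothetical names, and returning -EINVAL for a misaligned request is an assumption that mirrors what fallocate() documents for local filesystems.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical, simplified view of a plain (non-PFL) striped layout. */
struct plain_layout {
	uint64_t stripe_size;	/* lmm_stripe_size in bytes */
	uint32_t stripe_count;	/* lmm_stripe_count */
};

/* Range to pass to fallocate() on one OST object (hypothetical type). */
struct obj_range {
	uint64_t offset;
	uint64_t len;
};

/*
 * Check that an insert/collapse of [offset, offset + len) only shifts
 * whole stride units, so no data has to move between stripes.
 * The -EINVAL return for misalignment is an assumption.
 */
static int shift_check_aligned(const struct plain_layout *lo,
			       uint64_t offset, uint64_t len)
{
	uint64_t stride;

	if (lo->stripe_size == 0 || lo->stripe_count == 0 || len == 0)
		return -EINVAL;

	stride = lo->stripe_size * lo->stripe_count;
	if (offset % lo->stripe_size != 0 || len % stride != 0)
		return -EINVAL;

	return 0;
}

/*
 * Map an aligned file-level range onto the object of stripe 'idx'.
 * Stripes before the one holding 'offset' have no data at or after
 * 'offset' in that stride, so their shift starts one stripe_size later.
 */
static struct obj_range shift_map_stripe(const struct plain_layout *lo,
					 uint64_t offset, uint64_t len,
					 uint32_t idx)
{
	struct obj_range r;
	uint64_t chunk, row;
	uint32_t first;

	assert(shift_check_aligned(lo, offset, len) == 0);
	assert(idx < lo->stripe_count);

	chunk = offset / lo->stripe_size;		/* stripe-sized chunk number */
	row = chunk / lo->stripe_count;			/* stride containing offset */
	first = (uint32_t)(chunk % lo->stripe_count);	/* stripe holding offset */

	r.offset = (idx < first ? row + 1 : row) * lo->stripe_size;
	r.len = len / lo->stripe_count;
	return r;
}
```

For example, with a 1 MiB stripe size and 4 stripes, a shift of len = 8 MiB at offset = 6 MiB maps to a 2 MiB shift on every object, starting at object offset 2 MiB for stripes 0 and 1 and at 1 MiB for stripes 2 and 3. For a 1-stripe file the mapping reduces to a straight pass-through of offset and len.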
There would be some risk when using these operations on a file, since at least fallocate(FALLOC_FL_COLLAPSE_RANGE) cannot be reversed if there is an error when the operation is partially applied to multiple stripes/components of the file. This is not totally different from truncate() or fallocate(FALLOC_FL_PUNCH_HOLE), but the added risk is that a partially-applied data shift would leave any incomplete parts of the file with the wrong data (as opposed to stale but formerly correct data). It may be necessary to implement recoverability for a partial operation via a logged transaction from the MDS to ensure that it is applied to an OST object after recovery (in an idempotent way, since repeated shifts would also corrupt the data). For a mirrored file, one option would be to mark a mirror stale if the shift partly fails, and leave it up to a resync agent to copy the data again with the correct offset in that case.
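One way to picture the idempotent replay mentioned above is a per-object record of the last shift that was applied, checked before each replay. The sketch below is a toy in-memory model under stated assumptions, not a proposed on-disk format: struct obj_state, struct shift_record, the shift_id sequence number and shift_replay() are all hypothetical, and the object size update stands in for the real fallocate() call on the OST object. It assumes the sequence number would be stored with the object in the same transaction as the shift.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-object state, updated atomically with each shift. */
struct obj_state {
	uint64_t last_shift_id;	/* id of the last shift applied to this object */
	uint64_t size;		/* object size in bytes */
};

/* Hypothetical logged record describing one shift on one object. */
struct shift_record {
	uint64_t shift_id;	/* increasing id assigned when the shift is logged */
	uint64_t obj_offset;	/* object offset where the shift starts */
	uint64_t obj_len;	/* bytes inserted or removed in this object */
	bool collapse;		/* true: collapse range, false: insert range */
};

/*
 * Apply a logged shift at most once.
 * Returns 1 if applied, 0 if it was already applied (replay skipped),
 * or a negative errno if the record does not fit the object.
 */
static int shift_replay(struct obj_state *obj, const struct shift_record *rec)
{
	assert(obj != NULL && rec != NULL);

	/* Already applied before the failure: repeating it would corrupt data. */
	if (rec->shift_id <= obj->last_shift_id)
		return 0;

	if (rec->collapse) {
		/* The collapsed range must lie inside the object. */
		if (rec->obj_len > obj->size ||
		    rec->obj_offset > obj->size - rec->obj_len)
			return -EINVAL;
		obj->size -= rec->obj_len;
	} else {
		obj->size += rec->obj_len;
	}

	/* In a real implementation this would commit with the shift itself. */
	obj->last_shift_id = rec->shift_id;
	return 1;
}
```

Replaying the same record twice changes the object only once, which is the property recovery would need. How such a record would be logged and cancelled between the MDS and the OSTs is left open here.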