Details
-
Improvement
-
Resolution: Unresolved
-
Minor
Description
The current "-o discard" mount option for ldiskfs enables on-the-fly TRIM of underlying flash devices (or thinly-provisioned LUNs). However, the current implementation hurts performance because it tracks each block free request explicitly in memory, and submits trim requests to storage on transaction commit.
It would be better to have an async worker thread to issue the TRIM commands using the standard fstrim mechanism, and do this on a per-blockgroup basis, rather than tracking and issuing the trim on a per-extent basis. This reduces both memory and IO overhead, by aggregating TRIM commands for many blocks in a single group.
This would be based on the patches in LU-14712 that make the TRIM state for a block group persistent, so that running TRIM with mke2fs does not also lead to fstrim resubmitting TRIM requests for all of the groups again immediately after mount/remount.
The persistent trim patch https://review.whamcloud.com/51923 ("LU-14712 ldiskfs: introduce EXT4_BG_TRIMMED to optimize fstrim") and patch https://review.whamcloud.com/55567 ("LU-14712 ldiskfs: add bg_trimmed_threshold interface") landed to master for 2.15.65.
The next step here would be to trigger automatic "fstrim" functionality in the background so that users don't need to manage this themselves. It is already possible to schedule the "fstrim" command periodically (e.g. every 6h) to do a full filesystem scan, so there is relatively little benefit to triggering a full scan from within the kernel.
A more significant optimization would be to keep an in-memory list of groups that have exceeded a threshold for the number of blocks that can be trimmed, and then execute ext4_trim_all_free(sb, group, group_start, group_end, minlen) for each group in the list.
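As a rough illustration of the in-memory list idea, the following userspace C sketch counts freed blocks per group and queues a group once it crosses a threshold. It is not kernel code: NGROUPS, TRIM_THRESHOLD, struct trim_tracker, and the helper names are all hypothetical, and the callback only stands in for the ext4_trim_all_free() call on that group.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NGROUPS        8u    /* hypothetical: tiny filesystem for illustration */
#define TRIM_THRESHOLD 128u  /* hypothetical: freed blocks before a group is queued */

struct trim_tracker {
	uint32_t freed[NGROUPS];   /* blocks freed in each group since its last trim */
	bool     pending[NGROUPS]; /* group has crossed the threshold and awaits trim */
};

/* Stand-in for the real per-group trim; `arg` is caller-defined context. */
typedef void (*trim_fn)(uint32_t group, void *arg);

/* Record `count` freed blocks in `group`. Returns true if newly queued. */
static bool tracker_note_free(struct trim_tracker *t, uint32_t group,
			      uint32_t count)
{
	assert(group < NGROUPS);
	t->freed[group] += count;
	if (!t->pending[group] && t->freed[group] >= TRIM_THRESHOLD) {
		t->pending[group] = true;
		return true;
	}
	return false;
}

/* Trim every queued group exactly once; returns how many were trimmed. */
static unsigned tracker_drain(struct trim_tracker *t, trim_fn trim, void *arg)
{
	unsigned trimmed = 0;

	for (uint32_t g = 0; g < NGROUPS; g++) {
		if (!t->pending[g])
			continue;
		trim(g, arg);          /* one TRIM pass covers the whole group */
		t->pending[g] = false;
		t->freed[g] = 0;
		trimmed++;
	}
	return trimmed;
}

/* Example callback: record which groups were trimmed in a bitmask. */
static void record_group(uint32_t group, void *arg)
{
	*(uint32_t *)arg |= 1u << group;
}
```

The memory cost here is one counter and one flag per group regardless of how many extents are freed, which is the point of tracking per group rather than per extent.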
This should be done after waiting some short delay (60s?) after deletes in this group have stopped, to allow aggregating deletes within a single group instead of doing the trim on a single group multiple times, or after the next journal commit for groups that have had all of the (non-metadata) blocks freed.
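The timing rule above could be expressed roughly as follows. This is a userspace C sketch under stated assumptions, not ldiskfs code: struct group_state, QUIET_SECS, and both helpers are hypothetical names, time is a plain seconds counter, and the `committed` flag is assumed to mean that a journal commit has completed since the last free in the group.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define QUIET_SECS 60u /* the "60s?" delay floated above; a tunable, not settled */

struct group_state {
	bool     dirty;     /* blocks were freed since the last trim */
	bool     all_free;  /* every non-metadata block in the group is free */
	uint64_t last_free; /* time in seconds of the most recent free */
};

/* Each free restarts the quiet period, so bursts of deletes are aggregated. */
static void group_note_free(struct group_state *g, uint64_t now, bool all_free)
{
	g->dirty = true;
	g->all_free = all_free;
	g->last_free = now;
}

/* Decide whether the group should be trimmed now. */
static bool group_trim_due(const struct group_state *g, uint64_t now,
			   bool committed)
{
	assert(now >= g->last_free);
	if (!g->dirty)
		return false;
	/* A fully freed group cannot gain more free blocks: trim after commit. */
	if (g->all_free && committed)
		return true;
	/* Otherwise wait until deletes in this group have been quiet long enough. */
	return now - g->last_free >= QUIET_SECS;
}
```

For example, a group that sees frees at t=1000 and t=1030 is not due at t=1060 but is due at t=1090, while a group that becomes entirely free is due as soon as the next commit is observed.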
As a further enhancement this automatic fstrim activity could be tied to idle filesystem detection (e.g. small transaction size) to avoid impacting running workloads. It might be possible to tie this "recently freed block group" into the new mballoc allocator, so that groups which are seeing a lot of deletes are excluded from new allocations for some time, to maximize the number of freed blocks, rather than doing repeated alloc/free? However, I'm not sure whether this would be important for performance or not.
Doing the fstrim only on recently-freed groups keeps the incremental cost of fstrim relatively low, while doing the trim often enough to maintain performance and to avoid waiting on on-demand FTL erase block cleanup when the filesystem is busy.