[LU-17980] improve ldiskfs "-o discard" performance - Whamcloud Community JIRA

Details

Type: Improvement
Resolution: Unresolved
Priority: Minor
Fix Version/s: None
Affects Version/s: None
Labels:
- NVME
- ldiskfs
- performance

Description

The current "-o discard" mount option for ldiskfs enables on-the-fly TRIM of underlying flash devices (or thinly-provisioned LUNs). However, the current implementation hurts performance because it tracks each block free request explicitly in memory, and submits trim requests to storage on transaction commit.

It would be better to have an async worker thread to issue the TRIM commands using the standard fstrim mechanism, and do this on a per-blockgroup basis, rather than tracking and issuing the trim on a per-extent basis. This reduces both memory and IO overhead, by aggregating TRIM commands for many blocks in a single group.
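To make the aggregation argument concrete, here is a minimal user-space model in Python. It is not ldiskfs code. The function names, the sample extents, and the assumption of 32768 blocks per group (the usual ext4 value for 4 KiB blocks) are illustrative only. It also assumes that a freed extent never crosses a group boundary, and it counts one trim pass per group, although a real pass may still issue several discard commands.

```python
from collections import defaultdict

BLOCKS_PER_GROUP = 32768  # assumed: typical ext4 value for 4 KiB blocks


def per_extent_requests(freed_extents):
    """Model of per-extent tracking: one TRIM request per freed extent."""
    return len(freed_extents)


def per_group_requests(freed_extents, blocks_per_group=BLOCKS_PER_GROUP):
    """Model of per-group tracking: only a freed-block count per group is kept.

    Returns {group: freed_blocks}; each key would get one trim pass.
    Assumes an extent does not cross a group boundary.
    """
    freed_per_group = defaultdict(int)
    for start, length in freed_extents:
        freed_per_group[start // blocks_per_group] += length
    return dict(freed_per_group)


# Hypothetical (start_block, length) extents freed in one transaction.
extents = [(100, 8), (5000, 16), (300, 4), (40000, 32), (32778, 8)]

print(per_extent_requests(extents))  # requests tracked and issued per extent
print(per_group_requests(extents))   # groups that need one trim pass each
```

In this model the memory cost grows with the number of touched groups instead of the number of freed extents.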

This would be based on the patches in ~~LU-14712~~ that make the TRIM state for a block group persistent, so that running TRIM with mke2fs does not also lead to fstrim resubmitting TRIM requests for all of the groups again immediately after mount/remount.

Attachments

Issue Links

Clones

LU-14712 make TRIM state persistent across reboots

Resolved

is related to

LU-16750 optimize ldiskfs internal metadata allocation for hybrid storage LUNs

Open

is related to

LU-14438 backport ldiskfs mballoc patches

Resolved

Activity

Andreas Dilger added a comment - 21/Jan/25 9:39 PM

The persistent trim patch https://review.whamcloud.com/51923 ("LU-14712 ldiskfs: introduce EXT4_BG_TRIMMED to optimize fstrim") and patch https://review.whamcloud.com/55567 ("LU-14712 ldiskfs: add bg_trimmed_threshold interface") landed to master for 2.15.65.

The next step here would be to trigger automatic "fstrim" functionality in the background so that users don't need to manage this themselves. It is already possible to schedule the "fstrim" command periodically (e.g. every 6h) to do a full filesystem scan, so there is relatively little benefit in triggering this from within the kernel. A more significant optimization would be to keep an in-memory list of groups that have exceeded a threshold for the number of blocks that can be trimmed, and then execute ext4_trim_all_free(sb, group, group_start, group_end, minlen) for each group in the list.
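The in-memory list of groups over a threshold could be modelled roughly as follows. This is a user-space Python sketch, not kernel code. The class `TrimCandidates`, its methods, and the threshold value are hypothetical, and the `trim_group` callback only stands in for the per-group call to ext4_trim_all_free() mentioned above.

```python
class TrimCandidates:
    """Tracks freed blocks per group and queues groups that reach a threshold."""

    def __init__(self, threshold_blocks):
        self.threshold_blocks = threshold_blocks
        self.freed = {}    # group -> blocks freed since the last trim
        self.pending = []  # groups at or over the threshold, in crossing order

    def note_free(self, group, nblocks):
        """Record that nblocks were freed in group."""
        total = self.freed.get(group, 0) + nblocks
        self.freed[group] = total
        if total >= self.threshold_blocks and group not in self.pending:
            self.pending.append(group)

    def drain(self, trim_group):
        """Call trim_group(group) once per queued group and reset its counter."""
        trimmed = []
        while self.pending:
            group = self.pending.pop(0)
            trim_group(group)  # placeholder for the real per-group trim
            self.freed.pop(group, None)
            trimmed.append(group)
        return trimmed


tracker = TrimCandidates(threshold_blocks=1000)  # hypothetical threshold
tracker.note_free(3, 600)    # below threshold, not queued yet
tracker.note_free(7, 1200)   # queued immediately
tracker.note_free(3, 500)    # group 3 now totals 1100, queued
trim_calls = []
trimmed_groups = tracker.drain(trim_calls.append)
print(trimmed_groups)
```

Groups that never reach the threshold stay in the counter map and are left for a later pass or for a regular full fstrim run.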

This should be done after a short delay (60s?) once deletes in the group have stopped, so that deletes within a single group are aggregated instead of trimming the same group multiple times, or after the next journal commit for groups that have had all of their (non-metadata) blocks freed.
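The two trigger conditions (a quiet period after the last delete, or the next commit for fully freed groups) can be sketched with explicit timestamps. This is again a Python model and not kernel code. `QuietTracker` and its methods are hypothetical names, and the 60 second default is simply the value floated in the comment.

```python
QUIET_DELAY_S = 60  # the "60s?" suggested above; assumed to be tunable


class QuietTracker:
    """Decides when a group is ready to trim, based on when it was last freed."""

    def __init__(self, quiet_delay=QUIET_DELAY_S):
        self.quiet_delay = quiet_delay
        self.last_free = {}  # group -> timestamp of the most recent free
        self.empty = set()   # groups whose non-metadata blocks are all free

    def note_free(self, group, now, fully_freed=False):
        """Record a free; each new free restarts the group's quiet period."""
        self.last_free[group] = now
        if fully_freed:
            self.empty.add(group)

    def _forget(self, group):
        self.last_free.pop(group, None)
        self.empty.discard(group)

    def ready_after_delay(self, now):
        """Groups with no frees for at least quiet_delay seconds."""
        ready = sorted(g for g, t in self.last_free.items()
                       if now - t >= self.quiet_delay)
        for g in ready:
            self._forget(g)
        return ready

    def ready_at_commit(self):
        """Fully freed groups, which can be trimmed at the next commit."""
        ready = sorted(self.empty)
        for g in ready:
            self._forget(g)
        return ready


q = QuietTracker()
q.note_free(1, now=0)
q.note_free(2, now=10)
q.note_free(1, now=30)             # group 1 is still seeing deletes
first = q.ready_after_delay(70)    # only group 2 has been quiet for 60s
second = q.ready_after_delay(90)   # group 1 has now been quiet for 60s
q.note_free(5, now=100, fully_freed=True)
at_commit = q.ready_at_commit()
print(first, second, at_commit)
```

Restarting the timer on every free is what lets a burst of deletes in one group collapse into a single trim pass.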

As a further enhancement this automatic fstrim activity could be tied to idle filesystem detection (e.g. small transaction size) to avoid impacting running workloads. It might be possible to tie this "recently freed block group" into the new mballoc allocator, so that groups which are seeing a lot of deletes are excluded from new allocations for some time, to maximize the number of freed blocks, rather than doing repeated alloc/free? However, I'm not sure whether this would be important for performance or not.

Doing the fstrim only on recently-freed groups keeps the incremental cost of fstrim relatively low, while doing the trim often enough to maintain performance, so that writes do not have to wait for on-demand FTL erase block cleanup when the filesystem is busy.

Andreas Dilger added a comment - 27/Jun/24 3:50 PM

I think Shuichi's testing showed that this still hurt performance, even with the upstream discard patches. It would be better to just change the "-o discard" option to track which groups have recently had a large number of blocks freed and call ext4_trim_all_free() on those groups after the commit, instead of using the extent status tree to submit individual TRIM requests for each extent.

Not only is it inefficient to submit many separate TRIM requests, but if the blocks in any group are freed in a non-linear order then many small TRIM requests may be discarded by the device and not merged. IMHO, it is better to use the per-group EXT4_BG_TRIMMED state and call TRIM when there are enough free blocks to make it worthwhile.
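The effect of freeing blocks in a non-linear order can be illustrated with a small Python model. This is a sketch, not kernel code. The function names and block numbers are made up, and the model looks only at the blocks freed in the example, whereas a real per-group pass would scan the whole block bitmap and also pick up neighbouring blocks that were already free.

```python
def per_free_requests(free_order):
    """One single-block request per free, in the order the blocks were freed."""
    return [(block, 1) for block in free_order]


def merged_runs(free_order, minlen=1):
    """Merge freed blocks into contiguous (start, length) runs.

    Runs shorter than minlen are skipped, in the spirit of fstrim's minlen.
    """
    runs = []
    for block in sorted(set(free_order)):
        if runs and runs[-1][0] + runs[-1][1] == block:
            start, length = runs[-1]
            runs[-1] = (start, length + 1)  # extend the current run
        else:
            runs.append((block, 1))         # start a new run
    return [run for run in runs if run[1] >= minlen]


# Hypothetical blocks freed out of order within one group.
order = [12, 10, 15, 11, 13, 14, 40, 17, 16]

print(len(per_free_requests(order)))  # many small requests
print(merged_runs(order))             # a few contiguous runs
print(merged_runs(order, minlen=4))   # short runs filtered out
```

Deferring the trim until the group has accumulated enough free blocks is what gives the out-of-order frees a chance to coalesce into large ranges.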

Dongyang Li added a comment - 27/Jun/24 10:44 AM

The patches were merged into mainline in 5.14, and they also appear in the RHEL kernel since 9.1, but not in 9.0, which is odd.
Will create an ldiskfs patch for the rhel8 series.
