Project: Lustre

[LU-17980] improve ldiskfs "-o discard" performance

Details

    • Improvement
    • Resolution: Unresolved
    • Minor

    Description

      The current "-o discard" mount option for ldiskfs enables on-the-fly TRIM of underlying flash devices (or thinly-provisioned LUNs). However, the current implementation hurts performance because it tracks each block free request explicitly in memory, and submits trim requests to storage on transaction commit.

      It would be better to have an async worker thread to issue the TRIM commands using the standard fstrim mechanism, and do this on a per-blockgroup basis, rather than tracking and issuing the trim on a per-extent basis. This reduces both memory and IO overhead, by aggregating TRIM commands for many blocks in a single group.

      This would be based on the patches in LU-14712 that make the TRIM state for a block group persistent, so that running TRIM with mke2fs does not also lead to fstrim resubmitting TRIM requests for all of the groups again immediately after mount/remount.

          Activity

            adilger Andreas Dilger added a comment -

            The persistent trim patch https://review.whamcloud.com/51923 ("LU-14712 ldiskfs: introduce EXT4_BG_TRIMMED to optimize fstrim") and patch https://review.whamcloud.com/55567 ("LU-14712 ldiskfs: add bg_trimmed_threshold interface") landed to master for 2.15.65.

            The next step here would be to trigger automatic "fstrim" functionality in the background so that users don't need to manage this themselves. It is already possible to schedule the "fstrim" command periodically (e.g. every 6h) to do a full filesystem scan, so there is relatively little benefit to triggering that same full scan from within the kernel. A more significant optimization would be to keep an in-memory list of groups that have exceeded a threshold for the number of blocks that can be trimmed, and then execute ext4_trim_all_free(sb, group, group_start, group_end, minlen) for each group in the list.
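The threshold-based group list described above could be sketched in userspace C roughly as follows. This is only an illustration of the bookkeeping, not ldiskfs code: `struct trim_tracker`, `tracker_note_free()`, `tracker_drain()`, `record_trim()` and `NGROUPS` are all hypothetical names, and the callback merely stands in for a per-group call such as ext4_trim_all_free().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NGROUPS 8	/* hypothetical: tiny filesystem for illustration */

/* Hypothetical per-group bookkeeping; not the real ldiskfs structures. */
struct trim_tracker {
	uint32_t freed[NGROUPS];	/* blocks freed since the last trim */
	bool pending[NGROUPS];		/* group is on the "needs trim" list */
	uint32_t threshold;		/* freed blocks that make a trim worthwhile */
};

/* Record a free: only a counter changes, no per-extent state is kept. */
void tracker_note_free(struct trim_tracker *t, unsigned group, uint32_t nblocks)
{
	assert(group < NGROUPS);
	t->freed[group] += nblocks;
	if (t->freed[group] >= t->threshold)
		t->pending[group] = true;
}

/* Stand-in for the per-group trim call made by the kernel worker. */
typedef void (*trim_group_fn)(unsigned group, void *arg);

/* Trim each pending group once, reset it, and return the number trimmed. */
unsigned tracker_drain(struct trim_tracker *t, trim_group_fn trim, void *arg)
{
	unsigned done = 0;

	for (unsigned g = 0; g < NGROUPS; g++) {
		if (!t->pending[g])
			continue;
		trim(g, arg);
		t->pending[g] = false;
		t->freed[g] = 0;
		done++;
	}
	return done;
}

/* Example callback: remember which groups were trimmed in a bitmask. */
void record_trim(unsigned group, void *arg)
{
	*(unsigned *)arg |= 1u << group;
}
```

The memory cost in this sketch is one counter and one flag per group, regardless of how many extents were freed, which is the property the proposal relies on.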

            This should be done after waiting some short delay (60s?) after deletes in this group have stopped, to allow aggregating deletes within a single group instead of doing the trim on a single group multiple times, or after the next journal commit for groups that have had all of the (non-metadata) blocks freed.
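One way to express this timing rule is a small predicate evaluated by the background worker. The sketch below is an assumption-laden illustration: `struct group_trim_state`, its fields, and `group_ready_for_trim()` are hypothetical, and it additionally assumes that no group is trimmed before the transaction that freed its blocks has committed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-group state; field names are illustrative only. */
struct group_trim_state {
	uint64_t last_free_time;	/* seconds, most recent free in this group */
	uint32_t free_blocks;		/* currently free non-metadata blocks */
	uint32_t data_blocks;		/* total non-metadata blocks in the group */
	bool committed;			/* journal commit seen since the last free */
};

/* Decide whether the worker should trim this group now. */
bool group_ready_for_trim(const struct group_trim_state *g, uint64_t now,
			  uint64_t quiet_secs)
{
	assert(now >= g->last_free_time);

	/* Assumption: never trim blocks whose free has not committed yet. */
	if (!g->committed)
		return false;

	/* Fully freed group: nothing more to aggregate, trim after commit. */
	if (g->free_blocks == g->data_blocks)
		return true;

	/* Otherwise wait until deletes in the group have been quiet. */
	return now - g->last_free_time >= quiet_secs;
}
```

With `quiet_secs` set to 60, a group that is still seeing deletes keeps pushing its trim back, so repeated trims of the same group are avoided, while a completely emptied group is handled at the next commit.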

            As a further enhancement this automatic fstrim activity could be tied to idle filesystem detection (e.g. small transaction size) to avoid impacting running workloads. It might be possible to tie this "recently freed block group" into the new mballoc allocator, so that groups which are seeing a lot of deletes are excluded from new allocations for some time, to maximize the number of freed blocks, rather than doing repeated alloc/free? However, I'm not sure whether this would be important for performance or not.

            Doing the fstrim only on recently-freed groups keeps the incremental cost of fstrim relatively low, while doing the trim often enough to maintain performance and avoid waiting on on-demand FTL erase block cleanup when the filesystem is busy.

            adilger Andreas Dilger added a comment -

            I think Shuichi's testing showed that this still hurt performance even with the upstream discard patches. It would be better to just change the "-o discard" option to track which groups have recently had a large number of blocks freed and call ext4_trim_all_free() on those groups after the commit instead of using the extent status tree to submit individual TRIM requests for each extent.

            Not only is it inefficient to submit many separate TRIM requests, but if the blocks in any group are freed in a non-linear order then many small TRIM requests may be discarded by the device and not merged. IMHO, it is better to use the per-group EXT4_BG_TRIMMED state and call TRIM when there are enough free blocks to make it worthwhile.
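The difference between per-extent and per-group submission can be illustrated with a toy bitmap scan. The function below is a hypothetical userspace sketch (`count_group_trim_ranges()` and the `in_use` array are not kernel names): it counts how many discard ranges a single pass over a group would produce, assuming one range per run of free blocks that is at least `minlen` long.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Count the discard ranges one per-group pass would issue: one for each
 * run of free blocks of at least minlen blocks. in_use[i] is true when
 * block i of the group is allocated.
 */
size_t count_group_trim_ranges(const bool *in_use, size_t nblocks,
			       size_t minlen)
{
	size_t ranges = 0, run = 0;

	assert(minlen > 0);
	for (size_t i = 0; i <= nblocks; i++) {
		if (i < nblocks && !in_use[i]) {
			run++;		/* extend the current free run */
			continue;
		}
		/* Allocated block or end of group closes the current run. */
		if (run >= minlen)
			ranges++;
		run = 0;
	}
	return ranges;
}
```

For example, if blocks 4 to 11 of a group are freed one at a time in scattered order, per-extent tracking would see eight one-block frees, whereas this scan over the final bitmap reports a single eight-block range. Runs shorter than `minlen` are simply skipped until more neighbouring blocks are freed.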

            dongyang Dongyang Li added a comment -

            The patches have been in mainline since 5.14, and they also appear in the RHEL kernel since 9.1, but not in 9.0, which is odd.
            Will create an ldiskfs patch for the RHEL8 series.


            People

              dongyang Dongyang Li
              adilger Andreas Dilger