Details

    • Type: Improvement
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.17.0
    • Labels: None
    • 9223372036854775807

    Description

      There is an upstream patch series that is adding improved mballoc handling for efficiently finding suitable allocation groups in a filesystem. In particular, patch
      https://patchwork.ozlabs.org/project/linux-ext4/patch/20210209202857.4185846-5-harshadshirwadkar@gmail.com/ "ext4: improve cr 0 / cr 1 group scanning" is the important part of the series.

    Attachments

      • pick_mb_last_group.sh

    Issue Links

    Activity

            [LU-14438] backport ldiskfs mballoc patches
            pjones Peter Jones added a comment -

            Merged for 2.17


            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51472/
            Subject: LU-14438 ldiskfs: backport ldiskfs mballoc patches
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 1534c43ccb034048d8ab0a22cb55635116eebe09

            bobijam Zhenyu Xu added a comment -

            https://review.whamcloud.com/c/fs/lustre-release/+/51472 has been split into separate patches for the ldiskfs series.


            There are cases where we may be willing to make empty-filesystem performance worse in exchange for better performance when the filesystem is 90% full. We could use the new mballoc array lists to spread allocations more evenly across the disk.

            I had previously considered that we might split groups into two arrays (as we are doing with IOPS groups in LU-16750): 80% at the start of the disk and 20% at the end (or 90%/10%), so that groups at the end of the filesystem are only used once the first groups are mostly full. However, this would mean that performance would suddenly drop once the filesystem hit 80% full.

            We could instead split the groups into, e.g., 16 separate arrays by offset, and then have a clock that rotates allocations around the regions, e.g. every second, so that groups are not used strictly start-to-end during allocation. We would still want some locality in allocations, so that we are not seeking wildly around the disk for files being written concurrently, but the end of the disk would always be in use some fraction of the time. This would hopefully even out performance over the filesystem lifetime for uses that demand more consistent performance instead of "best possible".
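
            As a rough illustration, here is a minimal user-space model of this rotating-region idea. The constants and function names (NUM_REGIONS, pick_start_group) are invented for the sketch and are not ext4/ldiskfs identifiers:

                /* Model: block groups are partitioned into NUM_REGIONS regions
                 * by offset, and a clock advancing once per second chooses the
                 * region where new allocations start scanning. */
                #include <stdio.h>
                #include <time.h>

                #define NUM_REGIONS  16
                #define TOTAL_GROUPS 4096   /* e.g. 4096 x 128MiB groups = 512GiB */

                /* Map the current second onto a region, rotating around the disk. */
                static unsigned int current_region(void)
                {
                    return (unsigned int)(time(NULL) % NUM_REGIONS);
                }

                /* First group to try: the start of the region the clock points
                 * at. A real allocator would scan forward from here (wrapping)
                 * so concurrent writers still get some locality. */
                static unsigned int pick_start_group(void)
                {
                    return current_region() * (TOTAL_GROUPS / NUM_REGIONS);
                }

                int main(void)
                {
                    printf("allocations this second start at group %u (region %u)\n",
                           pick_start_group(), current_region());
                    return 0;
                }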

            We could even hint via "lfs ladvise" and/or "ionice" that all allocations for a file or process should be forced to the slow part of the disk, for cases like archiving old files. I don't think it makes sense to allow "improving" allocations, because everyone would want that and it would be no different from today.

            gerrit Gerrit Updater added a comment - edited

            I've tried to port some of the upstream mballoc patches in this change, but it looks too big for a single patch.

            "Zhenyu Xu <bobijam@hotmail.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51472
            Subject: LU-14438 ldiskfs: backport ldiskfs mballoc patches
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 2439579001a928714a640ec469a2d833ea5e8337


            adilger Andreas Dilger added a comment -

            I've filed LU-16155 to enhance debugfs to allow "importing" the block and inode allocation maps into a newly-formatted filesystem to simplify testing of this problem. We could collect the debugfs information from real filesystems that are having allocation performance issues as needed in order to test changes to mballoc.


            adilger Andreas Dilger added a comment -

            There are additional patches to fix the mballoc mb_optimized_scan=1 use case:
            https://patchwork.ozlabs.org/project/linux-ext4/list/?series=317391

            These fix a number of sub-optimal allocation decisions in the earlier patches.

            dauchy Nathan Dauchy (Inactive) added a comment - edited

            Andreas,

            Regarding the comment in the code patch: "the groups may not get traversed linearly. That may result in subsequent allocations being not close to each other. And so, the underlying device may get filled up in a non-linear fashion."... rather than using a fixed MB_DEFAULT_LINEAR_LIMIT, what do you think about using something like the following algorithm? (I have been playing with externally setting mb_last_group to an "optimal" value, but incorporating the idea into ext4 would be much cleaner.)

            The general idea is to work backwards through mb_groups info and use a "decay" algorithm to determine an adjusted value for free block count for each group. Then set mb_last_group based on the largest adjusted block count value. Attached is a simple script pick_mb_last_group.sh to demonstrate the approach. In my limited testing, it does seem to pick a group number that not only has a large "bfree" value but is also followed by other groups that generally have large-ish bfree values as well.
            Obviously more cleanup of the script would be needed to make it "production ready", some theory and testing would be needed to pick a good $decay value, and there is also the question of how often to run the tool and change the value... but hopefully the script at least clarifies the approach. Further enhancements could include a check of "/sys/block/${dev}/queue/rotational"; if a device is spinning rust, then adjust the weighted score further with a penalty for higher group numbers.

            Thanks,
            Nathan
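
            For reference, here is a minimal C rendering of the backward "decay" scoring described above, assuming the per-group free block counts have already been parsed out of the mb_groups file; the sample bfree[] values and the decay constant are made up for illustration:

                #include <stdio.h>

                #define NGROUPS 8

                int main(void)
                {
                    /* Toy per-group free block counts, standing in for the
                     * bfree column parsed from mb_groups. */
                    double bfree[NGROUPS] = { 100, 8000, 7000, 50, 9000, 200, 100, 50 };
                    double score[NGROUPS];
                    double decay = 0.5;   /* tunable weight for following groups */
                    int g, best = NGROUPS - 1;

                    /* Walk backwards so each group's score reflects both its
                     * own free space and (decayed) the free space in the
                     * groups after it on the device. */
                    score[NGROUPS - 1] = bfree[NGROUPS - 1];
                    for (g = NGROUPS - 2; g >= 0; g--) {
                        score[g] = bfree[g] + decay * score[g + 1];
                        if (score[g] > score[best])
                            best = g;
                    }

                    /* 'best' is the value the tool would write to mb_last_group. */
                    printf("suggested mb_last_group: %d (score %.1f)\n",
                           best, score[best]);
                    return 0;
                }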

            adilger Andreas Dilger added a comment - edited

            The new mballoc patch from the upstream kernel keeps an array of lists indexed by the 2^n order of the largest range of free blocks in each group (between 2^0 = 1 free block and 2^15 = 32768 free blocks), and puts each group onto the appropriate list after each alloc/free. It uses round-robin selection for groups in the per-order list, so it is still possible to get into situations similar to mb_last_group being very large, where the allocations are done at the end of the filesystem (lower bandwidth) even though there are many groups with free space available at the start of the filesystem (higher bandwidth).
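
            A small stand-alone C model of that classification step; the helper name largest_free_order and the counting array are stand-ins for the real ext4 structures, which this sketch does not reproduce:

                #include <stdio.h>

                #define NUM_ORDERS 17   /* orders 0..16, matching the 17-pointer array mentioned below */

                /* floor(log2(largest_free)), or -1 for a completely full group. */
                static int largest_free_order(unsigned int largest_free)
                {
                    int order = -1;

                    while (largest_free) {
                        largest_free >>= 1;
                        order++;
                    }
                    return order;
                }

                int main(void)
                {
                    unsigned int runs[] = { 1, 7, 4096, 32768, 0 };
                    int per_order[NUM_ORDERS] = { 0 };   /* list lengths, standing in for the lists */
                    unsigned int g;

                    for (g = 0; g < sizeof(runs) / sizeof(runs[0]); g++) {
                        int o = largest_free_order(runs[g]);

                        if (o >= 0)
                            per_order[o]++;   /* "move the group onto list[o]" */
                        printf("group %u (largest free run %u) -> order %d\n",
                               g, runs[g], o);
                    }
                    return 0;
                }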

            It would make sense to enhance the new allocator to have two sets of per-order lists for tracking the free blocks (on HDD OSTs at least, based on the "rotational" parameter of the block device): one list for the groups in the first ~70% of the filesystem that have good performance, and a second list for groups in the last ~30% of the filesystem that have lower performance. Groups in the second list would only be used if there are no free groups of the right order in the first list. That would bias allocations toward the start of the device, avoiding needless slowdowns while the filesystem is not full. Since the amount of memory used for the per-order array itself is small (an array of 17 pointers), and it is easy to decide which array a given group belongs to based on its group number, this would not increase allocation overhead. It would probably be much more efficient than trying to keep the groups within each order totally sorted.
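
            A sketch of the proposed two-tier lookup, under the assumption that tier membership is decided by group number (first ~70% fast, rest slow) and that there are 17 orders as noted above; the types, names, and fallback policy here are illustrative only, not existing ext4/ldiskfs code:

                #include <stdio.h>
                #include <stdbool.h>

                #define NUM_ORDERS 17   /* orders 0..16 */

                /* Nonzero entry = at least one group on that per-order list. */
                struct tier {
                    int ngroups[NUM_ORDERS];
                };

                /* Find a group whose largest free extent is >= 2^order,
                 * preferring the fast tier; returns the order found or -1. */
                static int pick_group(const struct tier *fast, const struct tier *slow,
                                      int order, bool *used_slow)
                {
                    int o;

                    for (o = order; o < NUM_ORDERS; o++)
                        if (fast->ngroups[o]) {
                            *used_slow = false;
                            return o;
                        }
                    for (o = order; o < NUM_ORDERS; o++)
                        if (slow->ngroups[o]) {
                            *used_slow = true;
                            return o;
                        }
                    return -1;   /* nothing large enough anywhere */
                }

                int main(void)
                {
                    struct tier fast = { .ngroups = { [3] = 12, [5] = 4 } };
                    struct tier slow = { .ngroups = { [10] = 40 } };
                    bool used_slow;
                    int o = pick_group(&fast, &slow, 8, &used_slow);

                    if (o >= 0)
                        printf("request of order 8 served from %s tier, order %d\n",
                               used_slow ? "slow" : "fast", o);
                    return 0;
                }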


            gerrit Gerrit Updater added a comment -

            Alex Zhuravlev (bzzz@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43232
            Subject: LU-14438 ldiskfs: improvements to mballoc
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: b7e2d9466f2a45d3c9a687cf06155d4e75b020c9


            People

              Assignee: Artem Blagodarenko (ablagodarenko)
              Reporter: Andreas Dilger (adilger)
              Votes: 2
              Watchers: 20
