[LU-14438] backport ldiskfs mballoc patches Created: 16/Feb/21  Updated: 28/Sep/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Minor
Reporter: Andreas Dilger Assignee: Artem Blagodarenko
Resolution: Unresolved Votes: 2
Labels: ldiskfs

Attachments: ext4-improve-cr-0-cr-1-group-scanning-v2.patch, pick_mb_last_group.sh
Issue Links:
Related
is related to LU-8365 Fix mballoc stream allocator to bette... Open
is related to LU-15319 Weird mballoc behaviour Resolved
is related to LU-17153 Random block allocation policy in ldi... Open
is related to LU-16162 ldiskfs: use low disk tracks for bloc... Open
is related to LU-16750 optimize ldiskfs internal metadata al... Open
is related to LU-12970 improve mballoc for huge filesystems Open
is related to LU-16155 allow importing inode/block allocatio... Open
is related to LU-14305 add persistent tuning for mb_c3_thres... Resolved

 Description   

There is an upstream patch series that adds improved mballoc handling for efficiently finding suitable allocation groups in a filesystem. The key part of the series is the patch
https://patchwork.ozlabs.org/project/linux-ext4/patch/20210209202857.4185846-5-harshadshirwadkar@gmail.com/ "ext4: improve cr 0 / cr 1 group scanning".



 Comments   
Comment by Andreas Dilger [ 16/Feb/21 ]

I've attached v2 of the ext4-improve-cr-0-cr-1-group-scanning-v2.patch from the list (against the current Linux master, not yet ported to any RHEL kernel). While work is still being done to improve this patch, it would be useful to see how much it improves performance on a large fragmented filesystem and/or hurts performance on a large empty filesystem. Early performance feedback would allow the patch to be improved before it is included in the upstream kernel. If it shows good promise, I think it is a better long-term solution than the ext4-simple-blockalloc.patch that we currently carry: that patch only reduces the number of times useless groups are scanned, while this new patch avoids sequential scanning entirely.

Comment by Artem Blagodarenko (Inactive) [ 17/Feb/21 ]

I believe ext4-simple-blockalloc.patch should be dropped completely, both for testing purposes and permanently once ext4-improve-cr-0-cr-1-group-scanning-v2.patch has been tested successfully, because carrying it makes porting difficult.

Comment by Gerrit Updater [ 08/Apr/21 ]

Alex Zhuravlev (bzzz@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43232
Subject: LU-14438 ldiskfs: improvements to mballoc
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: b7e2d9466f2a45d3c9a687cf06155d4e75b020c9

Comment by Andreas Dilger [ 28/Sep/21 ]

The new mballoc patch from the upstream kernel keeps an array of per-order lists indexed by the 2^n order of the largest free extent in each group (from 2^0 = 1 free block up to 2^15 = 32768 free blocks), and moves each group onto the appropriate list after every alloc/free. It uses round-robin selection among the groups on a per-order list, so it is still possible to get into situations similar to mb_last_group being very large, where allocations are done at the end of the filesystem (lower bandwidth) even though many groups with free space are available at the start of the filesystem (higher bandwidth).
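To make the structure concrete, here is a simplified C sketch of the per-order list scheme (field and function names are illustrative, not the literal upstream identifiers):

    /* Simplified sketch of the upstream per-order group lists. */
    #define MB_MAX_ORDER 16            /* orders 2^0 .. 2^15 */

    struct mb_group_info {
        struct list_head gi_order_node;          /* linkage in a per-order list */
        int              gi_largest_free_order;  /* log2 of largest free extent */
        unsigned int     gi_group;               /* group number */
    };

    struct mb_sb_info {
        struct list_head s_order_lists[MB_MAX_ORDER]; /* one list per order */
        spinlock_t       s_order_locks[MB_MAX_ORDER];
    };

    /* After each alloc/free, move the group to the list that matches the
     * order of its (possibly changed) largest free extent. */
    static void mb_update_group_order(struct mb_sb_info *sbi,
                                      struct mb_group_info *gi, int new_order)
    {
        int old_order = gi->gi_largest_free_order;

        if (new_order == old_order)
            return;
        spin_lock(&sbi->s_order_locks[old_order]);
        list_del(&gi->gi_order_node);
        spin_unlock(&sbi->s_order_locks[old_order]);

        gi->gi_largest_free_order = new_order;
        spin_lock(&sbi->s_order_locks[new_order]);
        /* adding at the tail is what produces the round-robin rotation */
        list_add_tail(&gi->gi_order_node, &sbi->s_order_lists[new_order]);
        spin_unlock(&sbi->s_order_locks[new_order]);
    }

    /* To service a request of a given order, take the first group from the
     * smallest-order list that is large enough. */
    static struct mb_group_info *mb_pick_group(struct mb_sb_info *sbi, int order)
    {
        for (int i = order; i < MB_MAX_ORDER; i++)
            if (!list_empty(&sbi->s_order_lists[i]))
                return list_first_entry(&sbi->s_order_lists[i],
                                        struct mb_group_info, gi_order_node);
        return NULL;    /* fall back to a linear scan */
    }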

It would make sense to enhance the new allocator to keep two per-order arrays of lists for tracking free blocks (at least on HDD OSTs, based on the "rotational" parameter of the block device): one for groups in the first ~70% of the filesystem, which have good performance, and a second for groups in the last ~30% of the filesystem, which have lower performance. Groups on the second array would only be used if there are no groups of the right order on the first. That would bias allocations toward the start of the device and avoid needless slowdowns when the filesystem is not full. Since the per-order array itself is small (an array of 16 list heads), and it is trivial to decide which array a group belongs to from its group number, this would not increase allocation overhead. It would probably also be much more efficient than trying to keep the groups within each order fully sorted.
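A minimal sketch of that two-list variant, reusing the structures from the sketch above (the 70% cutoff and all names here are assumptions, not a settled design):

    /* Hypothetical two-zone variant: "fast" = first ~70% of the groups,
     * "slow" = the remainder; each zone has its own per-order lists. */
    enum mb_zone { MB_ZONE_FAST, MB_ZONE_SLOW, MB_NR_ZONES };

    static inline enum mb_zone mb_group_zone(unsigned int group,
                                             unsigned int ngroups)
    {
        /* which zone a group belongs to is a trivial function of its number */
        return (unsigned long long)group * 10 < (unsigned long long)ngroups * 7 ?
               MB_ZONE_FAST : MB_ZONE_SLOW;
    }

    struct mb_sb_info_zoned {
        struct list_head s_order_lists[MB_NR_ZONES][MB_MAX_ORDER];
    };

    static struct mb_group_info *
    mb_pick_group_zoned(struct mb_sb_info_zoned *sbi, int order)
    {
        /* prefer fast-zone groups; fall back to the slow zone only when no
         * fast-zone group of a sufficient order remains */
        for (int z = MB_ZONE_FAST; z < MB_NR_ZONES; z++)
            for (int i = order; i < MB_MAX_ORDER; i++)
                if (!list_empty(&sbi->s_order_lists[z][i]))
                    return list_first_entry(&sbi->s_order_lists[z][i],
                                            struct mb_group_info,
                                            gi_order_node);
        return NULL;
    }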

Comment by Nathan Dauchy (Inactive) [ 06/Oct/21 ]

Andreas,

Regarding the comment in the code patch: "the groups may not get traversed linearly. That may result in subsequent allocations being not close to each other. And so, the underlying device may get filled up in a non-linear fashion."... rather than using a fixed MB_DEFAULT_LINEAR_LIMIT, what do you think about using something like the following algorithm? (I have been experimenting with externally setting mb_last_group to an "optimal" value, but incorporating the idea into ext4 would be much cleaner.)

The general idea is to work backwards through the mb_groups info, using a "decay" algorithm to compute an adjusted free-block count for each group, and then set mb_last_group to the group with the largest adjusted count. The attached script pick_mb_last_group.sh demonstrates the approach. In my limited testing, it picks a group number that not only has a large "bfree" value itself but is also followed by other groups that generally have large-ish bfree values.
Obviously the script would need more cleanup to be production-ready, some theory and testing should be applied to choose a good $decay value, and there is also the question of how often to run the tool and change the value... but hopefully the script at least clarifies the approach. A further enhancement could check "/sys/block/${dev}/queue/rotational": if a device is spinning rust, apply an additional penalty to higher group numbers in the weighted score.
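In C terms, the scoring pass of the script amounts to roughly the following (a sketch only, with hypothetical names; the bfree values would come from parsing the mb_groups output, and decay is whatever tunable value testing suggests):

    #include <stddef.h>

    /*
     * Sketch of the decay-weighted scoring: walk the groups backwards so
     * that each group's score folds in the (decayed) free-block counts of
     * all the groups that follow it, then return the best-scoring group
     * to be written into mb_last_group.
     */
    static size_t pick_mb_last_group(const unsigned int *bfree, size_t ngroups,
                                     double decay /* 0 < decay < 1 */)
    {
        double score = 0.0, best_score = -1.0;
        size_t best = 0;

        for (size_t g = ngroups; g-- > 0; ) {
            score = (double)bfree[g] + decay * score;
            if (score >= best_score) {  /* ">=" prefers earlier groups on ties */
                best_score = score;
                best = g;
            }
        }
        return best;
    }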

Thanks,
Nathan

Comment by Andreas Dilger [ 14/Sep/22 ]

There are additional patches to fix the mballoc mb_optimized_scan=1 use case:
https://patchwork.ozlabs.org/project/linux-ext4/list/?series=317391

These fix a number of sub-optimal allocation decisions in the earlier patches.

Comment by Andreas Dilger [ 14/Sep/22 ]

I've filed LU-16155 to enhance debugfs to allow "importing" the block and inode allocation maps into a newly-formatted filesystem, to simplify testing of this problem. We could collect the debugfs information from real filesystems that are having allocation performance issues, and use it to test changes to mballoc.

Comment by Gerrit Updater [ 27/Jun/23 ]

I've tried to port some of the upstream mballoc patches in this change, but it looks too big for a single patch.

"Zhenyu Xu <bobijam@hotmail.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51472
Subject: LU-14438 ldiskfs: backport ldiskfs mballoc patches
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 2439579001a928714a640ec469a2d833ea5e8337

Comment by Andreas Dilger [ 29/Jul/23 ]

There are cases where we may want to accept worse performance on an empty filesystem in exchange for better performance when it is 90% full. We could use the new mballoc per-order lists to spread allocations more evenly across the disk.

I had previously considered splitting the groups into two arrays (as we are doing with the IOPS groups in LU-16750), 80% at the start of the disk and 20% at the end (or 90%/10%), so that groups at the end of the filesystem are only used when the first groups are mostly full. However, that would mean performance suddenly drops once the filesystem hits 80% full.

We could instead split the groups into, e.g., 16 separate arrays by offset, and have a clock that rotates allocations around those regions, e.g. every second, so that groups are not used strictly start-to-end. We would still want some locality in allocations, so that we are not seeking wildly around the disk for files being written concurrently, but the end of the disk would always be in use some fraction of the time. This would hopefully even out performance over the filesystem's lifetime for uses that demand consistent performance rather than "best possible".
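A rough sketch of that rotating-clock idea (the 16 regions, the one-second tick, and all names here are illustrative assumptions):

    /* Hypothetical sketch: divide the groups into 16 regions by offset and
     * rotate a clock hand over them so allocation pressure moves around
     * the disk rather than filling it strictly start-to-end. */
    #define MB_NR_REGIONS 16

    static atomic_t mb_region_clock = ATOMIC_INIT(0);

    /* advanced periodically, e.g. once per second from a timer */
    static void mb_region_tick(void)
    {
        atomic_inc(&mb_region_clock);
    }

    static inline unsigned int mb_group_region(unsigned int group,
                                               unsigned int ngroups)
    {
        unsigned int per_region = (ngroups + MB_NR_REGIONS - 1) / MB_NR_REGIONS;

        return group / per_region;
    }

    /* Start the search at the region the clock hand points to and wrap
     * around, so writers keep locality within the current region while
     * every region (including the slow end) sees use some of the time. */
    static int mb_pick_region(const bool region_has_space[MB_NR_REGIONS])
    {
        unsigned int hand = (unsigned int)atomic_read(&mb_region_clock) %
                            MB_NR_REGIONS;

        for (unsigned int i = 0; i < MB_NR_REGIONS; i++) {
            unsigned int r = (hand + i) % MB_NR_REGIONS;

            if (region_has_space[r])
                return r;
        }
        return -1;    /* no region has space of the required order */
    }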

We could even use "lfs ladvise" and/or "ionice" hints to force all allocations for a file or process to the slow part of the disk, for cases such as archiving old files. I don't think it makes sense to allow "improving" allocations, because everyone would want that and it would be no different from today.
