[LU-8365] Fix mballoc stream allocator to better use free space at start of drive Created: 04/Jul/16  Updated: 10/Aug/21

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Lokesh Nagappa Jaliminche (Inactive) Assignee: Yang Sheng
Resolution: Unresolved Votes: 0
Labels: ldiskfs

Attachments: Text File 0001-ext4-Fix-bugs-in-mballoc-s-stream-allocation-mode.patch    
Issue Links:
Duplicate
is duplicated by LU-2377 Provide a mechanism to reset the ldis... Resolved
Related
is related to LU-12970 improve mballoc for huge filesystems Open
is related to LU-14438 backport ldiskfs mballoc patches Open
is related to LU-12103 Improve block allocation for large pa... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Provide a mechanism to reset the ldiskfs extents allocation position to near the beginning of a drive



 Comments   
Comment by Gerrit Updater [ 04/Jul/16 ]

lokesh.jaliminche (lokesh.jaliminche@seagate.com) uploaded a new patch: http://review.whamcloud.com/21142
Subject: LU-8365 ldiskfs: procfs entries for mballoc
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 7218c37a694df7b0f057a6078dad24a9166300e7

Comment by Andreas Dilger [ 17/Sep/16 ]

This patch exposes that mballoc is not doing as good a job in group selection for empty HDDs as it might. Biasing allocations to the start of the disk can improve performance, but only if the start of the disk has free space.

Some possibilities to try that may actually fix mballoc, in order of increasing difficulty:
1) When freeing blocks (in the group descriptor update), if the group drops below X allocated blocks and the group is below mb_last_group, reset mb_last_group to that group. In cases where a filesystem is being filled and emptied (e.g. benchmarks) this would automatically produce optimal results, without the need for this tunable at all. It would also work for all users. The threshold X should be larger than the number of bitmap and inode table blocks in a normal group.
2) As an added heuristic to the above, only go back if there are M consecutive groups that meet this criterion, so that we don't keep seeking back to one group that has only 2k allocated blocks and then scanning to the end of the used groups again. The number of groups (M) could be a tunable.
3) As an added heuristic to the above, save the old mb_last_group and return there if the current and next few groups are "full" (i.e. if mb_last_group > current group); otherwise scan forward as usual. That avoids scanning a lot of potentially useless groups that have recently been scanned just to get back to where mb_last_group was previously.
4) As above, but using free chunk size instead of just the free block count (e.g. use the maximum free chunk size in the buddy bitmap). This could add in a few levels of free chunks, perhaps above some threshold like 8MB, with a simple shift+add hweight() loop while scanning the buddy bitmap.
5) Instead of using mb_last_group at all, keep a "sorted" list or tree of struct ext4_group_info that tracks free space in groups. The sorting comparison should prefer groups at the start of the disk in some way (e.g. free_blocks - group_number/128). Populate this list/tree during the mount-time group descriptor scan and keep it "sorted" as blocks are allocated and freed. Sorting can be lazy to avoid lots of rebalancing. Use the list/tree to find a good new target group if the current and next group are "full". This will also improve performance when the filesystem becomes nearly full, by avoiding lots of scanning for groups with free blocks.
6) As above, but use separate lists/trees based on fullness. There is no need to keep groups in sorted lists/trees if they are totally empty, and no need to manage groups that are nearly full until enough blocks are freed to make them interesting allocation targets. This may be more complex than a single list/tree, but it depends on how much the ongoing tree balancing costs. If the cost is high, and groups usually change from "mostly empty" to "mostly full", then having a "full" list to hold groups until they become "nearly empty" again would be useful.

Comment by Lokesh Nagappa Jaliminche (Inactive) [ 21/Sep/16 ]

Thanks for the details, working on it.

Comment by Andreas Dilger [ 15/Sep/18 ]

Hi Yang Sheng,
Would you be able to work on a patch for this issue? I think the first 3-4 steps in the proposed solution shouldn't be too hard to implement. It looks like mb_set_largest_free_order() or its callers might be the right place to check whether s_mb_last_group should be updated. We already track bb_free and bb_largest_free_order for each group, so they can be used to decide whether we should reset s_mb_last_group. It would be good to save the old s_mb_last_group to return to if scanning in ldiskfs_mb_regular_allocator() doesn't quickly find any good groups.

Please feel free to ask if you have questions. I'd like to have something to look at late next week, if possible. We need to run some benchmarks on real hardware to ensure this is doing the right thing. It would be OK to include the patch https://review.whamcloud.com/21142 for testing/debugging, but I don't consider that a real fix for this issue.

In a semi-related area, while looking at this issue I also noticed a bug in the code of the current ext4-prealloc.patch:

        /* don't use group allocation for large files */
        size = max(size, isize);
+       if ((ac->ac_o_ex.fe_len >= sbi->s_mb_small_req) ||
+           (size >= sbi->s_mb_large_req)) {
                ac->ac_flags |= EXT4_MB_STREAM_ALLOC;
                return;
        }
 
+       /*
+        * request is so large that we don't care about
+        * streaming - it overweights any possible seek
+        */
+       if (ac->ac_o_ex.fe_len >= sbi->s_mb_large_req)
+               return;

It looks like we can never reach the second condition, because the fe_len >= s_mb_small_req check will always be true first (given that s_mb_large_req >= s_mb_small_req). This has been true all the way back to the original version of this patch (commit d8d8fd9192a5). It seems the s_mb_large_req check should be moved before EXT4_MB_STREAM_ALLOC is set, so that it allows large allocations to behave differently?

Comment by Yang Sheng [ 18/Sep/18 ]

Hi, Alex,

It looks like the 'stream allocation' logic has changed since the upstream patch (4ba74d00a2025). Could you please review whether it is still correct for the original purpose? Another question: why do we need s_mb_small_req? As I understand it, 'stream allocation' is used while the request size is less than s_mb_large_req, so what is the purpose of s_mb_small_req? Could you give me a pointer on that, please?

Thanks,
YangSheng

Comment by Gerrit Updater [ 18/Sep/18 ]

Yang Sheng (ys@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33195
Subject: LU-8365 ldiskfs: try to alloc block toward lower sectors
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 7e6352641724a26685247a21be488e43401437eb

Comment by Yang Sheng [ 27/Sep/18 ]

Hi, Alex,

Could you please give some advice on this patch? 0001-ext4-Fix-bugs-in-mballoc-s-stream-allocation-mode.patch

Thanks,
YangSheng

Comment by Andreas Dilger [ 28/Sep/18 ]

I'm not sure why you attached the patch here; that is what Gerrit is for.

Comment by Yang Sheng [ 28/Sep/18 ]

Hi, Andreas,

This patch has already landed upstream. I just want to get some input from Alex on whether it is correct for stream allocation, since it changes the stream allocation logic.

Thanks,
YangSheng

Comment by Gerrit Updater [ 01/Nov/18 ]

Yang Sheng (ys@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33548
Subject: LU-8365 ldiskfs: fix wrong logic of stream allocation
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 52116e1483cb24e3f5d0c2a3f20fa989dbd75e63

Comment by Alexander Zarochentsev [ 26/Feb/19 ]

Are there any performance tests for this patch, https://review.whamcloud.com/33195 ?

Comment by Andreas Dilger [ 26/Feb/19 ]

Ihara had started running some tests on the patch, but I don't recall ever seeing the results.

The main goal was to automate, under normal usage, the original "manually reset to the start of the disk during benchmarking" behavior. In particular, jump back to earlier groups when a bunch of free space becomes available, without having to continually scan the earlier groups for free space. The potential drawback is that if this happens too frequently it could cause excessive seeking, but since it should only happen when a large amount of space is freed, any seek overhead should be smaller than the seek rate * IO size.

Comment by Gerrit Updater [ 03/Mar/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/21142/
Subject: LU-8365 ldiskfs: procfs entries for mballoc
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 75703118588f2b23afd8c8815e5ebb768fc7a8ff

Comment by Gerrit Updater [ 10/May/19 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34842
Subject: LU-8365 ldiskfs: procfs entries for mballoc
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 1afff0d0a40ddf3c413c1db5a21d8d46da61e1c2

Comment by Gerrit Updater [ 03/Jul/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34842/
Subject: LU-8365 ldiskfs: procfs entries for mballoc
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: ea7103b0b1c360b0e6d7fe62e275df366bf4e31d

Generated at Sat Feb 10 02:16:55 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.