Lustre / LU-12970

improve mballoc for huge filesystems

Details


    Description

      there are a number of reports demonstrating poor behaviour of mballoc on huge filesystems. in one report it was a 688TB filesystem with 5.3M groups.
      mballoc tries to allocate large chunks of space; for small allocations it tries to preallocate and share large chunks. while this is good in terms of fragmentation and streaming IO, the allocation itself may need to scan many groups to find a good candidate.
      mballoc maintains internal in-memory structures (the buddy cache) to speed up searching, but that cache is built from the regular on-disk bitmaps, meaning IO. and if the cache is cold, populating it may take a lot of time.

      there are a few ideas on how to improve that:

      • skip more groups using less information when possible
      • stop scanning if too many groups have been scanned (loaded) and use the best found so far
      • prefetch bitmaps (use the lazy init thread? prefetch while scanning)

      another option for prefetching would be to skip non-initialized groups but start an async read for the corresponding bitmap.
      also, when mballoc marks blocks used (an allocation has just been made) it could make sense to check/prefetch the subsequent group(s), which are the likely goal for the next allocation - while the caller is writing IO to the just-allocated blocks, the next group(s) will be prefetched and ready to use. a rough sketch of these ideas follows.
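
      below is a minimal user-space sketch of the skip/limit/prefetch ideas above. all names (group_desc, prefetch_bitmap, find_group) and constants are made up for illustration; this is not the actual ext4/mballoc code.

      /*
       * sketch only: scan a limited window of groups, skip groups that the
       * descriptors already rule out, and start async reads for cold bitmaps
       * instead of waiting for them.
       */
      #include <stdbool.h>
      #include <stdio.h>

      #define NGROUPS        16
      #define MAX_SCAN       8    /* stop after this many groups were examined */
      #define MAX_COLD_LOADS 2    /* limit how many cold (uncached) bitmaps we touch */

      struct group_desc {
          int  free_blocks;       /* from the group descriptor, no bitmap IO needed */
          bool bitmap_cached;     /* buddy/bitmap already in memory? */
      };

      /* pretend to submit an async bitmap read; real code would queue IO here */
      static void prefetch_bitmap(int group)
      {
          printf("prefetch bitmap of group %d\n", group);
      }

      /* pick a group for an allocation of 'len' blocks starting near 'goal' */
      static int find_group(struct group_desc *g, int goal, int len)
      {
          int best = -1, best_free = 0, cold_loads = 0;

          for (int i = 0; i < MAX_SCAN; i++) {
              int grp = (goal + i) % NGROUPS;

              /* cheap check first: descriptor says not enough free space */
              if (g[grp].free_blocks < len)
                  continue;

              if (!g[grp].bitmap_cached) {
                  /* cold group: start an async read but don't wait for it */
                  prefetch_bitmap(grp);
                  if (++cold_loads > MAX_COLD_LOADS)
                      break;      /* stop before IO starts to dominate */
                  continue;
              }

              /* cached group: remember the best candidate seen so far */
              if (g[grp].free_blocks > best_free) {
                  best = grp;
                  best_free = g[grp].free_blocks;
              }
          }
          return best;            /* -1 means "fall back to a wider/looser search" */
      }

      int main(void)
      {
          struct group_desc groups[NGROUPS] = {
              [0] = { 10, true }, [1] = { 500, false }, [2] = { 300, true },
              [3] = { 800, false }, [4] = { 700, true },
          };

          printf("chose group %d\n", find_group(groups, 0, 128));
          return 0;
      }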

          Activity

            [LU-12970] improve mballoc for huge filesystems

            partly out of curiosity, I attached an old SATA 7200 500GB drive to my testing box:

            [root@rz /]# time cat /proc/fs/ext4/sda/mb_groups >/dev/null
            
            real	0m24.081s
            user	0m0.000s
            sys	0m0.274s
            

            this is 3726 groups, all initialized by mke2fs, so all of them had to be read during that cat.

            bzzz Alex Zhuravlev added a comment

            hmm, why do you think this is not likely? a few growing files would fill the filesystem group by group.
            "just find" - this is exactly the issue. the allocator is supposed to be generic enough to work with both small and big files, right?
            thus we want to keep some locality: if file A has its last extent in group N, then we should try to write the next extent in the same group N or nearby, not in just any empty group.
            and then searching for the group is what is happening in DDN-923, but the groups weren't considered "best" and that got worse due to the cold cache.
            so the approach I'm trying is to limit the coverage of the search.
            I think that coverage can be expressed as the number of groups to search and/or the number of uninitialized groups causing IO.
            on the first try we can search for exactly the requested chunk in N groups; if that fails, relax the requirement and search for the best chunk in N*m groups, then take just anything.. (a sketch of this multi-pass search is below)

            bzzz Alex Zhuravlev added a comment
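
            a minimal sketch of the multi-pass search described above, with made-up names and limits (scan_groups, find_group, N, m); it is not the actual patch in Gerrit.

            /*
             * sketch only: pass 0 looks for an exact-size chunk in at most N groups,
             * pass 1 accepts the best chunk found within N*m groups, pass 2 takes anything.
             */
            #include <stdio.h>

            #define NGROUPS 32

            struct group { int largest_free_chunk; };

            static int scan_groups(struct group *g, int goal, int want,
                                   int limit, int exact)
            {
                int best = -1, best_chunk = 0;

                for (int i = 0; i < limit && i < NGROUPS; i++) {
                    int grp = (goal + i) % NGROUPS;
                    int chunk = g[grp].largest_free_chunk;

                    if (exact) {
                        if (chunk >= want)
                            return grp;          /* good enough, stop early */
                    } else if (chunk > best_chunk) {
                        best = grp;
                        best_chunk = chunk;
                    }
                }
                return best;
            }

            static int find_group(struct group *g, int goal, int want, int N, int m)
            {
                int grp;

                /* pass 0: exact match within a small window of N groups */
                grp = scan_groups(g, goal, want, N, 1);
                if (grp >= 0)
                    return grp;

                /* pass 1: best candidate within a wider window of N*m groups */
                grp = scan_groups(g, goal, want, N * m, 0);
                if (grp >= 0)
                    return grp;

                /* pass 2: anything at all, whole filesystem */
                return scan_groups(g, goal, 1, NGROUPS, 0);
            }

            int main(void)
            {
                struct group groups[NGROUPS] = { [5] = { 64 }, [20] = { 2048 } };

                printf("group %d\n", find_group(groups, 0, 1024, 8, 4));
                return 0;
            }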

            While it is possible to have the 1/2 full and 1/2 empty groups case you propose, I don't think that this is a likely condition. Even so, in this case, wouldn't the allocator just find the first empty group and allocate linearly from there?

            adilger Andreas Dilger added a comment

            I think it's a bit of a different approach. overall fullness doesn't mean we can't find good chunks, IMO.
            say, a few files have been written very densely so that 1/2 of the groups are full, but the other 1/2 is nearly empty.
            why should we change the algorithm?

            bzzz Alex Zhuravlev added a comment

            Alex, is this complementary with the LU-12103 patch that is already landed?

            adilger Andreas Dilger added a comment

            https://review.whamcloud.com/#/c/36793/ - this patch limits scanning for a good group and adds basic prefetching.
            currently it's more like an RFC, though I tested it manually.

            bzzz Alex Zhuravlev added a comment

            sure, will try to make the script useful for the outer world.

            bzzz Alex Zhuravlev added a comment

            Shilong, could you please post your patch to WC Gerrit so that it can be reviewed. Once the block bitmap is loaded, it makes sense to call mb_regenerate_buddy() to create the buddy bitmap and ext4_group_info as part of the ext4_end_bitmap_read() callback rather than waiting in ext4_wait_block_bitmap() for the bitmaps. That allows submitting IO in batches and letting it complete asynchronously (keep an atomic counter of how many blocks need to be processed and submit more IO when it gets large enough), rather than doing read then wait for all blocks to finish, read/wait, ... (a rough model of this batching is sketched below)

            I've got a script to prepare a fragmented filesystem using debugfs's setb and freeb commands which basically takes a few seconds.

            Alex, it would be very useful to submit this upstream to e2fsprogs, since testing fragmented filesystems is always a problem.
            It also makes sense for you to see if Shilong's current patch helps your test case, and then we can work on optimizing it further.

            adilger Andreas Dilger added a comment
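
            a rough user-space model of the batched, callback-driven loading described above; the names and constants are illustrative, and the real kernel paths (ext4_end_bitmap_read(), mb_regenerate_buddy(), ...) are only referenced in comments, not implemented.

            /*
             * sketch only: submit bitmap reads in batches, do the buddy/group-info
             * generation in the completion callback, and use an atomic counter of
             * outstanding reads to decide when to top up the pipeline.
             */
            #include <stdatomic.h>
            #include <stdio.h>

            #define NGROUPS    64
            #define BATCH_SIZE 16
            #define LOW_WATER   4   /* refill when this few reads remain outstanding */

            static atomic_int in_flight;    /* reads submitted but not yet completed */
            static int next_group;          /* next group whose bitmap needs reading */
            static int pending[NGROUPS];    /* stands in for the block layer's queue */
            static int pend_head, pend_tail;

            /* submit up to BATCH_SIZE more bitmap reads without waiting for them */
            static void submit_batch(void)
            {
                for (int n = 0; n < BATCH_SIZE && next_group < NGROUPS; n++) {
                    atomic_fetch_add(&in_flight, 1);
                    pending[pend_tail++] = next_group++;   /* "queue an async read" */
                }
            }

            /* completion callback: build the buddy data, then top up the pipeline */
            static void on_bitmap_read(int group)
            {
                /* here the real code would generate the buddy/ext4_group_info */
                printf("group %2d: buddy generated in completion path\n", group);

                if (atomic_fetch_sub(&in_flight, 1) - 1 <= LOW_WATER)
                    submit_batch();
            }

            int main(void)
            {
                submit_batch();                       /* prime the pipeline */
                while (pend_head < pend_tail)         /* simulate IO completions */
                    on_bitmap_read(pending[pend_head++]);
                return 0;
            }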

            Alex, I think that setting "RAID stripe size" (sbi->s_stripe) in the superblock may also contribute to the problem. For large RAID systems this is typically 512 blocks (2MB), up to 2048 blocks (8MB) or more, in order to get allocations sized and aligned with the underlying RAID geometry. That in itself is good for large writes, but for small writes at mount time it can be problematic (a small sketch of the alignment arithmetic is below).

            adilger Andreas Dilger added a comment
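
            a tiny sketch of one way stripe alignment can turn a small request into a much larger aligned one; the helper name is an assumption, not the actual mballoc normalization code.

            #include <stdio.h>

            /* round an allocation request up to a whole number of stripes */
            static unsigned int stripe_align(unsigned int len, unsigned int s_stripe)
            {
                if (s_stripe == 0)
                    return len;                  /* no striping configured */
                return ((len + s_stripe - 1) / s_stripe) * s_stripe;
            }

            int main(void)
            {
                /* 2MB stripe with 4KB blocks = 512 blocks, as mentioned above */
                unsigned int s_stripe = 512;

                printf("%u -> %u blocks\n", 10u, stripe_align(10, s_stripe));    /* 10 -> 512 */
                printf("%u -> %u blocks\n", 600u, stripe_align(600, s_stripe));  /* 600 -> 1024 */
                return 0;
            }

            if requests are normalized like this, even a tiny write ends up looking for a stripe-sized free chunk, which fits the "problematic for small writes" observation above.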

            I've got a script to prepare a fragmented filesystem using debugfs's setb and freeb commands, which basically takes a few seconds.
            so now I can reproduce this issue easily - I see a one-by-one bitmap load of a few hundred non-empty groups initiated by a single-block allocation.
            the next step is to add some instrumentation..

            bzzz Alex Zhuravlev added a comment

            People

              wc-triage WC Triage
              bzzz Alex Zhuravlev