[LU-12970] improve mballoc for huge filesystems Created: 14/Nov/19 Updated: 07/Jun/23 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Minor |
| Reporter: | Alex Zhuravlev | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | ldiskfs |
| Description |
|
there are a number of reports demonstrating poor behaviour of mballoc on huge filesystems. in one report it was a 688TB filesystem with 5.3M groups. there are a few ideas how to improve that:
another option for prefetching would be to skip non-initialized groups, but start an async read for the corresponding bitmap. |
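A minimal sketch of that skip-but-prefetch idea, assuming it lives in the group-scanning loop of fs/ext4/mballoc.c: an uninitialized group gets its bitmap read queued asynchronously instead of being waited on. The helper names (ext4_get_group_info(), EXT4_MB_GRP_NEED_INIT(), ext4_mb_good_group(), ext4_mb_prefetch()) are modeled on the ext4 mballoc code, but the loop, arguments and exact signatures are illustrative only, not the actual patch.

```c
/*
 * Illustrative only: scan for a usable group, but never block on an
 * uninitialized one -- queue its bitmap read and move on.  Meant as a
 * fragment of fs/ext4/mballoc.c; signatures are approximate.
 */
static ext4_group_t
ext4_mb_scan_with_prefetch(struct ext4_allocation_context *ac,
			   ext4_group_t start, unsigned int *prefetch_ios)
{
	struct super_block *sb = ac->ac_sb;
	ext4_group_t ngroups = ext4_get_groups_count(sb);
	ext4_group_t group = start;
	ext4_group_t i;

	for (i = 0; i < ngroups; i++, group = (group + 1) % ngroups) {
		struct ext4_group_info *grp = ext4_get_group_info(sb, group);

		if (EXT4_MB_GRP_NEED_INIT(grp)) {
			/* start an async read of the bitmap, don't wait */
			ext4_mb_prefetch(sb, group, 1, prefetch_ios);
			continue;	/* skip this group for now */
		}

		/* group is initialized; check it against the current criterion */
		if (ext4_mb_good_group(ac, group, ac->ac_criteria))
			return group;
	}
	return ngroups;		/* nothing suitable found yet */
}
```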
| Comments |
| Comment by Andreas Dilger [ 15/Nov/19 ] |
|
I think that prefetching the block bitmaps in large chunks should be relatively easy to implement using the lazy_init thread. There is already patch https://review.whamcloud.com/32347 "…". Instead, the block bitmap prefetch should be done a whole flex_bg at a time (256 blocks), asynchronously during mount, with the buddy and group info calculated in the end_io completion handler. It would make sense to keep the same sysfs interface to allow pinning the bitmaps as 32347, to maintain compatibility. |
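A rough sketch of submitting one flex_bg worth of bitmap reads without waiting, e.g. from the lazy-init thread at mount time. The function name and loop are hypothetical; ext4_read_block_bitmap_nowait() is the existing non-blocking bitmap read helper, though its signature has varied between kernel versions.

```c
/*
 * Hedged sketch: submit bitmap reads for a whole flex_bg at once and let
 * the end_io handler do the rest.  Structure is illustrative.
 */
static void ext4_prefetch_flex_bg_bitmaps(struct super_block *sb,
					  ext4_group_t first)
{
	struct ext4_sb_info *sbi = EXT4_SB(sb);
	unsigned int flexbg_size = 1U << sbi->s_log_groups_per_flex;
	ext4_group_t ngroups = ext4_get_groups_count(sb);
	ext4_group_t group;

	for (group = first;
	     group < first + flexbg_size && group < ngroups; group++) {
		struct buffer_head *bh;

		/* queue the read; do not wait for completion here */
		bh = ext4_read_block_bitmap_nowait(sb, group);
		if (!IS_ERR_OR_NULL(bh))
			brelse(bh);	/* bitmap stays cached; end_io finishes it */
	}
}
```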
| Comment by Andreas Dilger [ 15/Nov/19 ] |
|
Reducing size expectations for allocations during mount, and/or limiting scanning should also help. I think for small writes, we should avoid trying to do group preallocation until after the bitmaps have been loaded. That can be handled entirely inside the ldiskfs code and avoids the need to understand what is happening at the Lustre level. The bitmap scanning code can also advance the allocation hints itself until it finds some groups that have suitable free space, instead of waiting for an incoming write to do this. |
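As an illustration of deferring group preallocation for small writes, here is a hypothetical gate that only enables it once enough bitmaps have been read after mount. The s_mb_bitmaps_loaded counter and the threshold are made up for this sketch; in ext4 the per-inode vs. per-group preallocation decision is made in ext4_mb_group_or_file().

```c
/*
 * Hypothetical: skip group preallocation for small writes until a
 * reasonable fraction of block bitmaps has been loaded after mount.
 * s_mb_bitmaps_loaded does not exist in ext4; it would be bumped by
 * the bitmap read completion path.
 */
static bool ext4_mb_group_pa_allowed(struct ext4_allocation_context *ac)
{
	struct super_block *sb = ac->ac_sb;
	struct ext4_sb_info *sbi = EXT4_SB(sb);
	ext4_group_t ngroups = ext4_get_groups_count(sb);

	/* arbitrary threshold: 1/16th of the groups have bitmaps in memory */
	return atomic_read(&sbi->s_mb_bitmaps_loaded) >= ngroups / 16;
}
```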
| Comment by Wang Shilong (Inactive) [ 15/Nov/19 ] |
|
I cooked up a patch before to load block bitmaps asynchronously using a workqueue, and there was an interface to control how many blocks could be prefetched each time, but I haven't got any benchmark numbers for it yet. |
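The patch itself isn't shown here, but a rough sketch of the workqueue approach described might look like the following. The structure and field names are hypothetical; ext4_read_block_bitmap() is the existing blocking bitmap read helper.

```c
#include <linux/workqueue.h>
#include <linux/slab.h>

/*
 * Hypothetical work item: read a batch of block bitmaps in the background.
 * nr_groups would come from a sysfs tunable controlling how much is
 * prefetched per batch.
 */
struct mb_prefetch_work {
	struct work_struct	work;
	struct super_block	*sb;
	ext4_group_t		first_group;
	unsigned int		nr_groups;
};

static void mb_prefetch_workfn(struct work_struct *work)
{
	struct mb_prefetch_work *pw =
		container_of(work, struct mb_prefetch_work, work);
	ext4_group_t ngroups = ext4_get_groups_count(pw->sb);
	ext4_group_t group = pw->first_group;
	ext4_group_t end = group + pw->nr_groups;

	for (; group < end && group < ngroups; group++) {
		struct buffer_head *bh = ext4_read_block_bitmap(pw->sb, group);

		if (!IS_ERR(bh))
			brelse(bh);	/* keep the bitmap cached for the allocator */
	}
	kfree(pw);
}
```

Each batch would then be queued with queue_work() (e.g. on system_unbound_wq), with the next batch submitted once the previous one finishes.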
| Comment by Wang Shilong (Inactive) [ 15/Nov/19 ] |
| Comment by Alex Zhuravlev [ 15/Nov/19 ] |
|
I've got a script to prepare a fragmented filesystem using debugfs's setb and freeb commands, which basically takes a few seconds. |
| Comment by Andreas Dilger [ 15/Nov/19 ] |
|
Alex, I think that setting the "RAID stripe size" (sbi->s_stripe) in the superblock may also contribute to the problem. For large RAID systems this is typically 512 blocks (2MB), up to 2048 blocks (8MB) or more, in order to get allocations sized and aligned with the underlying RAID geometry. That in itself is good for large writes, but for small writes at mount time it can be problematic. |
| Comment by Andreas Dilger [ 15/Nov/19 ] |
|
Shilong, could you please post your patch to WC Gerrit so that it can be reviewed? Once the block bitmap is loaded, it makes sense to call mb_regenerate_buddy() to create the buddy bitmap and ext4_group_info as part of the ext4_end_bitmap_read() callback, rather than waiting in ext4_wait_block_bitmap() for the bitmaps. That allows submitting IO in batches and letting it complete asynchronously (keep an atomic counter of how many blocks need to be processed and submit more IO when it gets large enough), rather than doing read then wait for all blocks to finish, read/wait, ... (a rough sketch of this follows below)
Alex, it would be very useful to submit this upstream to e2fsprogs, since testing fragmented filesystems is always a problem. |
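A sketch of the completion-driven variant described above. mb_bitmap_ctx, mb_generate_buddy_for_group() and mb_submit_next_prefetch_batch() are hypothetical names; the existing ext4 callback this would extend is ext4_end_bitmap_read(), and in practice the buddy generation would likely be punted to process context rather than done directly in bio completion context.

```c
/* hypothetical per-bitmap context attached to the buffer_head */
struct mb_bitmap_ctx {
	struct super_block	*sb;
	ext4_group_t		group;		/* group this bitmap belongs to */
	atomic_t		*batch_inflight; /* shared per-batch read counter */
};

/*
 * Hypothetical end_io for a prefetched block bitmap: mark the buffer
 * up to date, arrange for buddy/group info generation, and use the
 * in-flight counter to submit the next batch once this one drains.
 */
static void mb_prefetch_end_read(struct buffer_head *bh, int uptodate)
{
	struct mb_bitmap_ctx *ctx = bh->b_private;	/* hypothetical use of b_private */

	if (uptodate) {
		set_buffer_uptodate(bh);
		/* hypothetical helper: build buddy bitmap + ext4_group_info */
		mb_generate_buddy_for_group(ctx->sb, ctx->group, bh);
	}
	unlock_buffer(bh);

	/* last read of the batch finished: kick off the next batch */
	if (atomic_dec_and_test(ctx->batch_inflight))
		mb_submit_next_prefetch_batch(ctx->sb);
}
```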
| Comment by Alex Zhuravlev [ 18/Nov/19 ] |
|
sure, will try to make the script useful for the outer world. |
| Comment by Alex Zhuravlev [ 20/Nov/19 ] |
|
https://review.whamcloud.com/#/c/36793/ - this patch limits scanning for a good group and adds basic prefetching. |
| Comment by Andreas Dilger [ 20/Nov/19 ] |
|
Alex, is this complementary with the |
| Comment by Alex Zhuravlev [ 20/Nov/19 ] |
|
I think it's a bit of a different approach. overall fullness doesn't mean we can't find good chunks, IMO. |
| Comment by Andreas Dilger [ 21/Nov/19 ] |
|
While it is possible to have the 1/2 full and 1/2 empty groups case you propose, I don't think that this is a likely condition. Even so, in this case, wouldn't the allocator just find the first empty group and allocate linearly from there? |
| Comment by Alex Zhuravlev [ 21/Nov/19 ] |
|
hmm, why do you think this is not likely? a few growing files would fill the filesystem group by group. |
| Comment by Alex Zhuravlev [ 22/Nov/19 ] |
|
partly out of curiosity, I attached an old SATA 7200rpm 500GB drive to my testing box:
[root@rz /]# time cat /proc/fs/ext4/sda/mb_groups >/dev/null
real 0m24.081s
user 0m0.000s
sys 0m0.274s
this is 3726 groups, all initialized by mke2fs, so all of them had to be read during that cat. |
| Comment by Alex Zhuravlev [ 22/Nov/19 ] |
|
with 32-groups-at-once prefetching, the same cat:
real 0m14.150s
user 0m0.000s
sys 0m0.309s
with 64-groups-at-once prefetching:
real 0m13.200s
user 0m0.000s
sys 0m0.277s
but this is a single spindle, for any regular site that would be multiple spindles I guess and a larger prefetch window would help more. |
| Comment by Alex Zhuravlev [ 06/Dec/19 ] |
|
given that in all the cases we do a forward scan, I think it would be relatively simple to add a few lists of groups to be scanned at each criterion. |
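For illustration, one possible shape of such per-criterion lists. Nothing here exists in ext4; the bb_cr_node field in ext4_group_info and the helper are hypothetical.

```c
#include <linux/list.h>
#include <linux/spinlock.h>

#define MB_NR_CRITERIA	4	/* mballoc currently scans with criteria 0..3 */

/* hypothetical per-filesystem bookkeeping */
struct mb_criteria_lists {
	spinlock_t	 lock;
	struct list_head cr_groups[MB_NR_CRITERIA];
};

/*
 * Hypothetical: once a group's buddy data is generated (or updated after
 * an allocation/free), file it on the list for the lowest criterion it
 * can still satisfy, so each forward scan only visits plausible groups.
 */
static void mb_file_group(struct mb_criteria_lists *cl,
			  struct ext4_group_info *grp, int criterion)
{
	spin_lock(&cl->lock);
	list_move_tail(&grp->bb_cr_node, &cl->cr_groups[criterion]);	/* bb_cr_node is hypothetical */
	spin_unlock(&cl->lock);
}
```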
| Comment by Andreas Dilger [ 06/Dec/19 ] |
|
I have thought in the past about something similar to what you describe. However, it is difficult to know in advance what the size requirements are.

One thought was whether it makes sense to have a higher-level buddy bitmap for groups that is generated at the default preallocation unit size (based on the s_mb_large_req size), which allows quickly finding groups that have available 8MB or 16MB chunks, up to the maximum possible allocation size (probably 64MB is enough). At 8-64MB chunks this would mean 15MB of bitmap for a 512TiB filesystem (could use kvmalloc()). This would essentially be a filesystem-wide replacement for the bb_counters array that is tracked on a per-group basis, so it would likely reduce overall memory usage, and would essentially replace "group scanning" with "bitmap scanning". It could be optimized to save the first set bit to avoid repeatedly scanning the blocks at the beginning of the filesystem, assuming they would be preferentially allocated.

This could also be implemented as an array of linked lists (at power-of-two granularity up to 64MB), with groups being put on the list matching their largest aligned free chunk (separate lists for unaligned chunks?). Allocations would first walk the list for the smallest chunk they need, then move up to lists with progressively larger chunks if no groups are available at the smaller size. Once the allocation is done, the group may be demoted to a lower list if the allocation results in a smaller chunk being available. Adding a list_head to each of the 4M groups in a 512TiB filesystem would consume 64MB of memory, but it would be split across all of the ext4_group_info allocations.

Note that using a bigalloc size of e.g. 32KB would reduce the number of groups by a factor of 8 (e.g. 4M -> 512K), so we should also consider fixing the issues with bigalloc so that it is usable. |
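The arithmetic above checks out: with 4KB blocks a 512TiB filesystem holds 2^26 8MB chunks, so the finest-grained bitmap is 8MB, and adding the 16/32/64MB levels (4+2+1 MB) gives roughly 15MB total. Below is a hedged sketch of the linked-list variant; the list structure, mb_refile_group() and the bb_chunk_node field are hypothetical, not existing ext4 code.

```c
#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/minmax.h>

/* power-of-two chunk sizes tracked, expressed in 4KB blocks */
#define MB_MIN_CHUNK_ORDER	11	/* 2^11 blocks = 8MB  */
#define MB_MAX_CHUNK_ORDER	14	/* 2^14 blocks = 64MB */
#define MB_NR_CHUNK_LISTS	(MB_MAX_CHUNK_ORDER - MB_MIN_CHUNK_ORDER + 1)

/* hypothetical per-filesystem lists, one per chunk size class */
struct mb_chunk_lists {
	spinlock_t	 lock;
	struct list_head lists[MB_NR_CHUNK_LISTS];
};

/*
 * Hypothetical: re-file a group according to the order of its largest
 * aligned free chunk.  Allocations would walk lists[] starting from the
 * smallest class that fits the request and move up if a class is empty;
 * after allocating, the group may be demoted via this helper.
 */
static void mb_refile_group(struct mb_chunk_lists *cl,
			    struct ext4_group_info *grp,
			    int largest_free_order)
{
	spin_lock(&cl->lock);
	if (largest_free_order < MB_MIN_CHUNK_ORDER) {
		/* no chunk of interest left: drop the group from the lists */
		list_del_init(&grp->bb_chunk_node);	/* bb_chunk_node is hypothetical */
	} else {
		int idx = min(largest_free_order, MB_MAX_CHUNK_ORDER) -
			  MB_MIN_CHUNK_ORDER;

		list_move_tail(&grp->bb_chunk_node, &cl->lists[idx]);
	}
	spin_unlock(&cl->lock);
}
```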
| Comment by Andreas Dilger [ 11/May/21 ] |
|
Link to backport of upstream mballoc patches in LU-14438, which may be enough to resolve this issue. |