  Lustre / LU-12970

improve mballoc for huge filesystems

Details


    Description

      There are a number of reports demonstrating poor behaviour of mballoc on huge filesystems; in one report it was a 688TB filesystem with 5.3M groups.
      mballoc tries to allocate large chunks of space, and for small allocations it tries to preallocate and share large chunks. While this is good in terms of fragmentation and streaming IO, the allocation itself may need to scan many groups to find a good candidate.
      mballoc maintains internal in-memory structures (the buddy cache) to speed up searching, but that cache is built from the regular on-disk bitmaps, which means IO. If the cache is cold, populating it may take a lot of time.

      There are a few ideas for how to improve this:

      • skip more groups using less information, when possible
      • stop scanning once too many groups have been scanned (loaded) and use the best candidate found so far
      • prefetch bitmaps (use the lazy init thread? prefetch while scanning)

      Another option for prefetching would be to skip non-initialized groups, but start an async read for the corresponding bitmap.
      Also, when mballoc marks blocks used (i.e. an allocation has just been made) it could make sense to check/prefetch the subsequent group(s), which are a likely goal for the next allocation - while the caller is writing IO to the just-allocated blocks, the next group(s) will be prefetched and ready to use.
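
      As a rough userspace illustration of the last two ideas (a bounded scan plus asynchronous bitmap prefetch), here is a minimal sketch; group_is_initialized(), start_async_bitmap_read() and group_has_good_chunk() are hypothetical stand-ins rather than real ext4/ldiskfs functions, and MAX_GROUPS_TO_SCAN / PREFETCH_WINDOW are made-up tunables.

      /* Model of a bounded group scan with a sliding bitmap-prefetch window.
       * All helpers and constants below are illustrative, not real ext4 APIs. */
      #include <stdbool.h>
      #include <stdio.h>

      #define NGROUPS            5300000  /* ~5.3M groups, as in the 688TB report */
      #define MAX_GROUPS_TO_SCAN 1024     /* give up after scanning this many groups */
      #define PREFETCH_WINDOW    32       /* groups to read ahead asynchronously */

      static bool group_is_initialized(unsigned g) { return (g % 7) != 0; }  /* fake data */
      static void start_async_bitmap_read(unsigned g) { (void)g; /* would submit IO */ }
      static bool group_has_good_chunk(unsigned g, unsigned len) { (void)len; return (g % 1013) == 0; }

      static long find_group(unsigned goal, unsigned len)
      {
          /* warm up the initial window before scanning */
          for (unsigned p = 0; p < PREFETCH_WINDOW; p++)
              if (!group_is_initialized((goal + p) % NGROUPS))
                  start_async_bitmap_read((goal + p) % NGROUPS);

          for (unsigned scanned = 0, g = goal; scanned < MAX_GROUPS_TO_SCAN;
               scanned++, g = (g + 1) % NGROUPS) {
              /* keep the window ahead of the scan cursor warm */
              unsigned ahead = (g + PREFETCH_WINDOW) % NGROUPS;
              if (!group_is_initialized(ahead))
                  start_async_bitmap_read(ahead);

              if (!group_is_initialized(g))
                  continue;              /* don't block on cold groups */
              if (group_has_good_chunk(g, len))
                  return g;
          }
          return -1;  /* the real code would fall back to the best group seen so far */
      }

      int main(void)
      {
          printf("picked group %ld\n", find_group(12345, 2048));
          return 0;
      }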

          Activity

            [LU-12970] improve mballoc for huge filesystems
            adilger Andreas Dilger added a comment - - edited

            Link to backport of upstream mballoc patches in LU-14438, which may be enough to resolve this issue.


            adilger Andreas Dilger added a comment -

            I have thought in the past about something similar to what you describe. However, it is difficult to know in advance what the size requirements are.

            One thought was whether it makes sense to have a higher-level buddy bitmap for groups that is generated at the default preallocation unit size (based on the s_mb_large_req size) and that allows quickly finding groups that have 8MB or 16MB chunks available, up to the maximum possible allocation size (probably 64MB is enough). At 8-64MB chunks this would mean 15MB of bitmap for a 512TiB filesystem (could use kvmalloc()). This would essentially be a filesystem-wide replacement for the bb_counters array that is tracked on a per-group basis, so it would likely reduce overall memory usage, and it would essentially replace "group scanning" with "bitmap scanning". It could be optimized to save the first set bit to avoid repeatedly scanning the blocks at the beginning of the filesystem, assuming they would be preferentially allocated.
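
            (As a sanity check on the 15MB figure, assuming a 512TiB filesystem and one bit per aligned chunk at each of the 8/16/32/64MB sizes, the arithmetic works out exactly; the snippet below just reproduces it.)

            /* Reproduce the top-level bitmap size estimate: one bit per aligned chunk
             * for each chunk size from 8MB to 64MB on a 512TiB filesystem. */
            #include <stdio.h>

            int main(void)
            {
                unsigned long long fs_bytes = 512ULL << 40;  /* 512 TiB */
                unsigned long long total = 0;

                for (unsigned long long chunk = 8ULL << 20; chunk <= 64ULL << 20; chunk <<= 1) {
                    unsigned long long bytes = fs_bytes / chunk / 8;  /* bits -> bytes */
                    total += bytes;
                    printf("%2lluMB chunks: %llu KiB of bitmap\n", chunk >> 20, bytes >> 10);
                }
                printf("total: %llu MiB\n", total >> 20);  /* prints 15 */
                return 0;
            }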

            This could also be implemented as an array of linked lists (at power-of-two granularity up to 64MB), with each group being put on the list corresponding to its largest aligned free chunk (separate lists for unaligned chunks?). Allocations would first walk the list for the smallest chunk that they need, then move up to lists with progressively larger chunks if no groups are available at the smaller size. Once the allocation is done, the group may be demoted to a lower list if the allocation leaves only a smaller chunk available. Adding a list_head to each of the 4M groups in a 512TiB filesystem would consume 64MB of memory, but it would be split across all of the ext4_group_info allocations.
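
            A minimal self-contained sketch of that list variant (plain C with an open-coded doubly-linked list standing in for the kernel's list_head; the orders, field names and the pick/requeue helpers are illustrative assumptions, not real mballoc code):

            /* Toy model of per-order group lists: each group sits on the list for
             * the order of its largest aligned free chunk and is requeued when an
             * allocation (or free) changes that order. */
            #include <stdio.h>

            #define MIN_ORDER 11            /* 2^11 4KB blocks = 8MB chunk */
            #define MAX_ORDER 14            /* 2^14 4KB blocks = 64MB chunk */
            #define NORDERS   (MAX_ORDER - MIN_ORDER + 1)

            struct group {
                unsigned      gid;
                int           largest_order;  /* order of largest aligned free chunk */
                struct group *prev, *next;    /* would be a list_head in the kernel */
            };

            static struct group *order_lists[NORDERS];   /* one list head per order */

            static void list_add(struct group **head, struct group *g)
            {
                g->prev = NULL;
                g->next = *head;
                if (*head)
                    (*head)->prev = g;
                *head = g;
            }

            static void list_del(struct group **head, struct group *g)
            {
                if (g->prev)
                    g->prev->next = g->next;
                else
                    *head = g->next;
                if (g->next)
                    g->next->prev = g->prev;
            }

            /* Walk the list for the requested order first, then progressively
             * larger orders, mirroring the "move up to larger chunks" search. */
            static struct group *pick_group(int order)
            {
                for (int o = order; o <= MAX_ORDER; o++)
                    if (order_lists[o - MIN_ORDER])
                        return order_lists[o - MIN_ORDER];
                return NULL;
            }

            /* Requeue a group after its largest free chunk has changed. */
            static void requeue_group(struct group *g, int new_order)
            {
                if (new_order == g->largest_order)
                    return;
                if (g->largest_order >= MIN_ORDER)
                    list_del(&order_lists[g->largest_order - MIN_ORDER], g);
                g->largest_order = new_order;
                if (new_order >= MIN_ORDER)
                    list_add(&order_lists[new_order - MIN_ORDER], g);
            }

            int main(void)
            {
                static struct group groups[] = {
                    { .gid = 0, .largest_order = 13 },
                    { .gid = 1, .largest_order = 11 },
                    { .gid = 2, .largest_order = 14 },
                    { .gid = 3, .largest_order = 11 },
                };
                for (unsigned i = 0; i < sizeof(groups) / sizeof(groups[0]); i++)
                    list_add(&order_lists[groups[i].largest_order - MIN_ORDER], &groups[i]);

                struct group *g = pick_group(12);   /* want at least a 16MB chunk */
                printf("picked group %u\n", g->gid);
                requeue_group(g, 11);               /* its big chunk was just consumed */
                return 0;
            }

            As noted above, the appeal memory-wise is that the prev/next pointers would live inside each ext4_group_info, so the 64MB for 4M groups is spread across allocations that already exist.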

            Note that using a bigalloc size of e.g. 32KB would reduce the number of groups by a factor of 8 (e.g. 4M -> 512K) so we should also consider fixing the issues with bigalloc so that it is usable.


            bzzz Alex Zhuravlev added a comment -

            Given that in all the cases we do a forward scan, I think it would be relatively simple to add a few lists of groups to be scanned at each criterion.
            Each time a group gains/loses free blocks we would move it from one list to another at the cost of a few CPU cycles.
            I'm not sure yet what the crossing points between the lists should be (in terms of free blocks/fragments/etc), but the initial
            implementation could start with empty/non-empty lists.


            bzzz Alex Zhuravlev added a comment -

            With 32-groups-at-once prefetching, the same cat:

            real	0m14.150s
            user	0m0.000s
            sys	0m0.309s
            

            with 64-groups-at-once prefetching:

            real	0m13.200s
            user	0m0.000s
            sys	0m0.277s
            

            But this is a single spindle; any regular site would have multiple spindles, I guess, and a larger prefetch window would help even more.


            bzzz Alex Zhuravlev added a comment -

            Partly out of curiosity, I attached an old SATA 7200 RPM 500GB drive to my testing box:

            [root@rz /]# time cat /proc/fs/ext4/sda/mb_groups >/dev/null
            
            real	0m24.081s
            user	0m0.000s
            sys	0m0.274s
            

            This is 3726 groups, all initialized by mke2fs, so all of them had to be read during that cat.


            bzzz Alex Zhuravlev added a comment -

            Hmm, why do you think this is not likely? A few growing files would fill the filesystem group by group.
            "Just find" - this is exactly the issue. The allocator is supposed to be generic enough to work with small and big files, right?
            Thus we want to keep some locality: if file A has its last extent in group N, then we should try to write the next extent in the same group N or nearby, not in just any empty group.
            And searching for that group is what was happening in DDN-923, but the groups weren't considered "best" and that got worse due to the cold cache.
            So the approach I'm trying is to limit the coverage of the search.
            I think that coverage can be expressed as a number of groups to search and/or a number of uninitialized groups causing IO.
            On the first try we can search for the exactly requested chunk in N groups; if that fails, relax the requirement and search for the best chunk in N*m groups, and then for just anything.
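
            A rough sketch of that escalation (find_exact()/find_best()/find_any() and the N/M values are made-up placeholders, not mballoc's actual allocation criteria):

            /* Staged search: exact chunk in a tight window, then the best chunk in a
             * wider window, then anything at all.  Purely illustrative. */
            #include <stdio.h>

            #define N 64   /* groups searched for an exact-size chunk */
            #define M 8    /* widening factor for the relaxed pass */

            static long find_exact(unsigned goal, unsigned len, unsigned limit) { return -1; }       /* stub */
            static long find_best(unsigned goal, unsigned len, unsigned limit)  { return goal + 3; }  /* stub */
            static long find_any(unsigned goal)                                 { return goal; }      /* stub */

            static long allocate_group(unsigned goal, unsigned len)
            {
                long g;

                g = find_exact(goal, len, N);      /* pass 1: exact chunk, N groups */
                if (g >= 0)
                    return g;
                g = find_best(goal, len, N * M);   /* pass 2: best chunk, N*M groups */
                if (g >= 0)
                    return g;
                return find_any(goal);             /* pass 3: anything, anywhere */
            }

            int main(void)
            {
                printf("picked group %ld\n", allocate_group(1000, 2048));
                return 0;
            }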


            adilger Andreas Dilger added a comment -

            While it is possible to have the 1/2-full and 1/2-empty groups case you propose, I don't think that this is a likely condition. Even so, in this case, wouldn't the allocator just find the first empty group and allocate linearly from there?


            bzzz Alex Zhuravlev added a comment -

            I think that's a bit of a different approach. Overall fullness doesn't mean we can't find good chunks, IMO.
            Say a few files have been written very densely so that 1/2 of the groups are full, but the other 1/2 are nearly empty.
            Why should we change the algorithm?


            adilger Andreas Dilger added a comment -

            Alex, is this complementary with the LU-12103 patch that is already landed?


            bzzz Alex Zhuravlev added a comment -

            https://review.whamcloud.com/#/c/36793/ - this patch limits the scanning for a good group and adds basic prefetching.
            Currently it's more like an RFC, though I have tested it manually.


            People

              wc-triage WC Triage
              bzzz Alex Zhuravlev
              Votes: 0
              Watchers: 10

              Dates

                Created:
                Updated: