  1. Lustre
  2. LU-12970

improve mballoc for huge filesystems


Details

    • Improvement
    • Resolution: Unresolved
    • Minor

    Description

      There are a number of reports demonstrating poor mballoc behaviour on huge filesystems. In one report it was a 688TB filesystem with 5.3M groups.
      mballoc tries to allocate large chunks of space; for small allocations it tries to preallocate and share large chunks. While this is good in terms of fragmentation and streaming IO, the allocation itself may need to scan many groups to find a good candidate.
      mballoc maintains internal in-memory structures (the buddy cache) to speed up searching, but that cache is built from the regular on-disk bitmaps, which means IO. If the cache is cold, populating it may take a lot of time.
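
To get a feel for the scale, here is a back-of-the-envelope sketch. It assumes 4 KiB blocks and one bitmap block per group (so 32768 blocks, or 128 MiB, per group); neither value is stated in the report. Under those assumptions a 688TB filesystem comes out at roughly 5.1M to 5.6M groups depending on whether TB is decimal or binary, the same order as the reported 5.3M, and reading every block bitmap once is about 20 GiB of small reads.

```python
# Back-of-the-envelope model. BLOCK_SIZE is an assumption, not from the report.
BLOCK_SIZE = 4096                            # bytes per block (assumed)
BLOCKS_PER_GROUP = 8 * BLOCK_SIZE            # one bitmap block holds 8 bits per byte
GROUP_BYTES = BLOCK_SIZE * BLOCKS_PER_GROUP  # 128 MiB per group


def group_count(fs_bytes: int) -> int:
    """Number of block groups needed to cover fs_bytes, rounded up."""
    return -(-fs_bytes // GROUP_BYTES)


def bitmap_bytes(groups: int) -> int:
    """Bytes of block bitmaps to read if every group is loaded once."""
    return groups * BLOCK_SIZE


groups_tb = group_count(688 * 10**12)   # 688 TB read as decimal terabytes
groups_tib = group_count(688 * 2**40)   # 688 TB read as binary tebibytes
print(groups_tb, groups_tib)
print(round(bitmap_bytes(5_300_000) / 2**30, 1), "GiB of bitmaps for 5.3M groups")
```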

      There are a few ideas on how to improve that:

      • skip more groups using less information when possible
      • stop scanning if too many groups have been scanned (loaded) and use the best one found
      • prefetch bitmaps (use the lazy init thread? prefetch while scanning)
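
The first two ideas can be sketched as a toy model. This is not mballoc code: `Group`, `pick_group`, `load` and `max_loaded` are hypothetical names, and the sketch assumes a per-group free-block counter is available without reading the bitmap, while the largest free extent is known only after the bitmap has been loaded.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Group:
    index: int
    free_blocks: int   # cheap: assumed known without reading the bitmap
    largest_free: int  # expensive: assumed valid only after load() was called


def pick_group(groups: Iterable[Group], want: int,
               load: Callable[[Group], None],
               max_loaded: int) -> Optional[Group]:
    """Return a group with a free extent of at least `want` blocks, or the
    best group seen once `max_loaded` bitmaps have been loaded."""
    best: Optional[Group] = None
    loaded = 0
    for g in groups:
        if g.free_blocks < want:
            continue               # skipped using the cheap counter only, no IO
        if loaded >= max_loaded:
            break                  # scan budget exhausted, settle for best so far
        load(g)                    # stands in for reading the on-disk bitmap
        loaded += 1
        if g.largest_free >= want:
            return g               # good enough, stop scanning
        if best is None or g.largest_free > best.largest_free:
            best = g
    return best
```

With a budget of two loads, the scan below skips group 0 without IO, loads groups 1 and 2, and settles for group 2 even though group 3 would have been a perfect fit; that is the trade-off the second bullet accepts.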

      Another option for prefetching would be to skip non-initialized groups but start an async read for the corresponding bitmap.
      Also, when mballoc marks blocks as used (an allocation has just been made), it could make sense to check/prefetch the subsequent group(s), which are a likely goal for the next allocation: while the caller is writing IO to the just-allocated blocks, the next group(s) will be prefetched and ready to use.
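
A minimal userspace sketch of the "skip but start an async read" idea follows. A thread pool stands in for asynchronous bitmap IO, and `BuddyCache`, `lookup`, `wait_all` and `find_group` are hypothetical names for illustration only. A group whose read has not completed is treated as non-initialized and skipped; the read started by the first scan lets a later scan use it.

```python
from concurrent.futures import ThreadPoolExecutor


class BuddyCache:
    """Toy cache: the value per group is its largest free extent length."""

    def __init__(self, read_bitmap, workers=4):
        self._read = read_bitmap   # hypothetical reader: group index -> int
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._pending = {}         # group index -> Future

    def lookup(self, group):
        """Return cached info if ready; otherwise make sure a read is in
        flight and return None so the caller can skip the group."""
        fut = self._pending.get(group)
        if fut is None:
            self._pending[group] = self._pool.submit(self._read, group)
            return None
        return fut.result() if fut.done() else None

    def wait_all(self):
        """Block until every read started so far has finished."""
        for fut in list(self._pending.values()):
            fut.result()

    def close(self):
        self._pool.shutdown(wait=True)


def find_group(cache, ngroups, want):
    """Scan groups in order, using only groups whose info is already cached."""
    for g in range(ngroups):
        info = cache.lookup(g)
        if info is not None and info >= want:
            return g
    return None
```

In this model the first scan over a cold cache finds nothing but queues a read for every group it skipped, so a real allocator would still need a fallback (for example a synchronous load) when no initialized group qualifies.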
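
The prefetch-on-allocation idea can be modelled the same way. In this sketch `Allocator`, `mark_used` and `prefetch_ahead` are hypothetical names, the prefetch is recorded synchronously instead of issuing real IO, and wrapping around to group 0 after the last group is an assumption.

```python
class Allocator:
    """Toy model: after an allocation lands in a group, queue the next
    few groups for prefetch so they are warm for the following request."""

    def __init__(self, ngroups, prefetch_ahead=2):
        self.ngroups = ngroups
        self.prefetch_ahead = prefetch_ahead  # how many groups to look ahead
        self.loaded = set()                   # groups whose bitmap is in memory
        self.prefetch_log = []                # order in which prefetches were issued

    def _prefetch(self, group):
        if group not in self.loaded:
            self.loaded.add(group)            # stands in for an async bitmap read
            self.prefetch_log.append(group)

    def mark_used(self, group):
        """Called right after blocks in `group` were allocated."""
        self.loaded.add(group)                # the group was loaded to allocate from it
        for step in range(1, self.prefetch_ahead + 1):
            # Assumption: wrap around to group 0 after the last group.
            self._prefetch((group + step) % self.ngroups)


alloc = Allocator(ngroups=4, prefetch_ahead=2)
alloc.mark_used(2)
alloc.mark_used(3)
print(alloc.prefetch_log)
```

Groups that are already loaded are not queued again, so repeated allocations marching through consecutive groups issue at most one new prefetch each.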


            People

              wc-triage WC Triage
              bzzz Alex Zhuravlev
