  1. Lustre
  2. LU-12970

improve mballoc for huge filesystems


    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None


      There are a number of reports of mballoc behaving poorly on huge filesystems. One report involved a 688TB filesystem with 5.3M groups.
      mballoc tries to allocate large chunks of space; for small allocations it tries to preallocate large chunks and share them. While this is good for fragmentation and streaming IO, the allocation itself may need to scan many groups to find a good candidate.
      mballoc maintains internal in-memory structures (the buddy cache) to speed up searching, but that cache is built from the regular on-disk bitmaps, which means IO. If the cache is cold, populating it may take a lot of time.
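
      To give a feel for the cold-cache cost, here is a rough back-of-envelope sketch. It assumes one block bitmap of one filesystem block per group and a 4 KiB block size; the read rate of 200 bitmap reads per second is a purely hypothetical figure chosen for illustration, not a measurement from any of the reports.

```python
def bitmap_bytes_to_scan(groups: int, block_size: int = 4096) -> int:
    """Bytes read if every group's block bitmap is loaded once.

    Assumes one bitmap block per group (an assumption of this sketch).
    """
    return groups * block_size


def cold_scan_seconds(groups: int, reads_per_second: float) -> float:
    """Time to load every bitmap at a given (hypothetical) random-read rate."""
    return groups / reads_per_second


groups = 5_300_000                      # group count from the report above
total = bitmap_bytes_to_scan(groups)    # 21_708_800_000 bytes, roughly 20 GiB
seconds = cold_scan_seconds(groups, reads_per_second=200)  # 26500.0 s, over 7 hours
print(total, seconds)
```

      The numbers are only a worst case (a scan that touches every group), but they show why bounding or prefetching the scan matters.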

      There are a few ideas for improving this:

      • skip more groups using less information when possible
      • stop scanning if too many groups have been scanned (loaded) and use the best candidate found so far
      • prefetch bitmaps (use lazy init thread? prefetch at scanning)
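
      The first two ideas can be sketched roughly as follows. This is a simplified model, not mballoc code: `GroupSummary`, `pick_group`, `load_largest_free_extent` and `max_loaded` are hypothetical names, and the sketch assumes a per-group free-block count is available without reading the bitmap.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class GroupSummary:
    """Cheap per-group data assumed to be known without bitmap IO."""
    index: int
    free_blocks: int


def pick_group(groups: Iterable[GroupSummary],
               needed: int,
               load_largest_free_extent: Callable[[int], int],
               max_loaded: int = 4) -> Optional[int]:
    """Return a group index for an allocation of `needed` blocks.

    Groups are skipped on the summary alone when they cannot fit the
    request. At most `max_loaded` bitmaps are loaded; if none of them
    fits perfectly, the best group seen so far is returned.
    """
    best = None   # (group index, largest free extent) seen so far
    loaded = 0
    for g in groups:
        if g.free_blocks < needed:
            continue              # skipped without touching the bitmap
        if loaded >= max_loaded:
            break                 # scan budget exhausted
        extent = load_largest_free_extent(g.index)   # this is the IO
        loaded += 1
        if extent >= needed:
            return g.index        # good candidate, stop immediately
        if best is None or extent > best[1]:
            best = (g.index, extent)
    return best[0] if best else None


# Toy data: free counts per group and the largest extent in each bitmap.
summaries = [GroupSummary(0, 10), GroupSummary(1, 500),
             GroupSummary(2, 600), GroupSummary(3, 700)]
extents = {0: 10, 1: 100, 2: 300, 3: 650}
print(pick_group(summaries, 400, extents.__getitem__, max_loaded=2))  # 2
```

      In the toy run, group 0 is skipped on its free count, groups 1 and 2 are loaded, and the budget of two loads stops the scan before group 3, so the best group found (2) is used even though group 3 would have fit.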

      Another option for prefetching would be to skip non-initialized groups but start an async read of the corresponding bitmap.
      Also, when mballoc marks blocks as used (an allocation has just been made), it could make sense to check/prefetch the subsequent group(s), which are a likely goal for subsequent allocations: while the caller is writing IO to the just-allocated blocks, the next group(s) will be prefetched and ready to use.
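
      A minimal user-space sketch of both prefetching ideas, assuming a thread pool stands in for async bitmap reads. `BitmapCache`, `scan`, `after_allocation` and the `ahead` parameter are hypothetical names for illustration only; the cache is meant to be driven from a single caller thread.

```python
from concurrent.futures import ThreadPoolExecutor


class BitmapCache:
    """Toy cache where bitmap reads run in background threads."""

    def __init__(self, read_bitmap, workers=2):
        self._read = read_bitmap
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._pending = {}   # group index -> Future holding the bitmap

    def prefetch(self, group):
        """Start an async read unless one was already started."""
        if group not in self._pending:
            self._pending[group] = self._pool.submit(self._read, group)

    def ready(self, group):
        fut = self._pending.get(group)
        return fut is not None and fut.done()

    def get(self, group):
        """Return the bitmap, blocking until the read completes."""
        self.prefetch(group)
        return self._pending[group].result()


def scan(cache, groups, usable):
    """Skip groups that are not loaded yet, but kick off their reads."""
    for g in groups:
        if not cache.ready(g):
            cache.prefetch(g)    # do not wait, move on to the next group
            continue
        if usable(cache.get(g)):
            return g
    return None


def after_allocation(cache, group, total_groups, ahead=2):
    """After allocating in `group`, prefetch the next `ahead` groups."""
    for nxt in range(group + 1, min(group + 1 + ahead, total_groups)):
        cache.prefetch(nxt)


cache = BitmapCache(lambda g: f"bitmap-{g}")
print(scan(cache, [0, 1], lambda b: True))   # None: everything was cold
cache.get(0)                                 # wait for group 0 to load
print(scan(cache, [0, 1], lambda b: True))   # 0
after_allocation(cache, 0, total_groups=8)   # groups 1 and 2 are prefetched
```

      A real allocator would need a fallback when every candidate is still cold (for example, waiting on one of the pending reads); the sketch simply returns None in that case.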


              • Assignee:
                wc-triage WC Triage
                bzzz Alex Zhuravlev


                • Created: