Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-9

Optimize weighted QOS Round-Robin allocator

    XMLWordPrintable

Details

    • Improvement
    • Resolution: Unresolved
    • Major
    • None
    • None
    • 18,547
    • 10673

    Description

      New bug for old bugzilla b=18547.

      Quoting from the old ticket, since bugzilla.lustre.org is no longer available:

      We want the OSTs to be evenly filled for as much of their lives as possible, to be as evenly loaded (disk and network io) as possible, for optimum filesystem performance. If the weighting was subtle e.g. round-robin, but skipping more-full OSTs
      1/n of the times (where n is proportional to the imbalance) then it could essentially be in use all of the time. If we implemented QOS as a part of the round-robin allocator that is always active, just by skipping OSTs periodically (rather than randomly) it would provide much more uniform loading and would prevent the OSTs from getting far out of balance in the first place.

      Say the OSTs are 100MB in size (1% = 1MB) and OST10-19 have 9x as much free space as OST0-9 then the RR QOS selection would skip OST0-9 about 8/9 times, but never skip OST10-19. If each file created was 1MB in size then the first
      11 files would use 1MB on each of the 10 emptier OSTs, and 1MB on one of the fuller OSTs. Repeat 10x and 10MB is used uniformly on each of the emptier OSTs, but only 1MB on each of the fuller OSTs.

      This is what we tried to do with the random QOS weighting, but just as with the original LOV allocator (which randomly picked OSTs instead of round-robin) the chance of having an imbalance (e.g. average 2 objects/OST but some had 0
      and some had 4) was too high.

      One strong candidate is to use an "accumulated error" mechanism, similar to Bresenham's line drawing algorithm or error diffusion dithering, where each of the OSTs is present in the RR list (in the optimized OST/OSS order it normally is), and each time it is skipped from the RR selection (because the OST available space + accumulated error < threshold/average space) the OST free space is added to a "accumulated error" counter for that OST. When the OST (available space + accumulated error) is large enough the OST will be picked by normal RR selection and then its bonus reset to 0 and the threshold will be again be too low to select it for a number of iterations over the list.

      The MDS can estimate the average file size from the OST statfs data, so we might increment the bonus by the average file size? One thing we might want to do is initialize the bonus with some random amount (proportional to the free space?) at MDS startup so that some of the OSTs with less free space become available at different times instead of all at once.

      It seems we should also include some fraction (maybe 1/2, tunable via qos_prio_free) of the bonus value into the average so that less-used OSTs (with higher bonus) are more likely to be used instead of all being chosen equally. The problem with having equal weights (i.e. not taking the bonus into account for $average_free) is that e.g. the OST immediately following the new one in the list will be picked disproportionately often.

      In the above example, if the file size is 1 then we skip the full OSTs 9 times each increasing their bonus by 9, and filling the new OST by approximately 9, giving it a negative bonus (penalty) of 9. Penalty can be reset at next statfs for this OST, but bonus is not reset. Then $average_free = 10 * 9 + 91 = 18 and all of the OSTs are candidates for selection (modulo some random initial bonus so that more-full OSTs are not all selected at one time).

      For N-striped files, one idea I had was to subtract the just-used OST's space from the current file's "average" so that the remaining OSTs become proportionally more attractive. Continuing in the above example, but with a 2-stripe file, the current 2-stripe file's $average_free is now 10, and any of the OSTs can be selected.

      It is important that the algorithm keep the filesystem total available space updated in an incremental manner as objects are allocated, and only recomputing the OST weight in the statfs interpreter callback every few seconds. It definitely shouldn't have to iterate over 1000 OSTs to compute the weighting each time, or it will add too much overhead on the MDS.

      If one of the full OSTs had been previously selected then its bonus = -1, and the rest have bonus = 1, so the average is 10.9. The just used OST only has weight 9 and will not be selected this round. One OST will be picked and its weight set to -1, and the remainder get bonus = 2, so average = 11.7 for next round, and only OSTs with weight 12 can be selected for the second stripe.

      For 1-stripe files (using 1/2 weighting of the bonus in the average) we would need to allocate 14 files on the free OST before allocating a file on one of the other OSTs.

      We should have a simulator that can just plug in lov_qos.c so that we can both analyze the current code, and secondly validate any changes made. In other projects this has been done by adding unit-test code at the bottom of this file, and it is "#ifdef UNIT_TEST" so it is not in normal compiles, but creates a test executable when compiled with -DUNIT_TEST.

      We would dynamically skip entries while walking the OST index array, based on the current "weight" of each particular OST. The weight is potentially made up of many inputs. Currently the weight is the OST available space, plus/minus any bonus/penalty for recent allocations/skips.

      There is a per-OST penalty and a per-OSS penalty, which are very rough estimates for the amount of space that will be consumed on the OST and the remaining IO bandwidth of the OSS. This also needs to be flexible enough to taking into account other weighting factors into the mix such as network bandwidth/latency (e.g. based on multiple NIDs and/or LNet stats), storage bandwidth (e.g. based on old/new HDDs or impact due to RAID rebuilding, OST-side stats in LU-7880), administrator tuning, etc.

      Attachments

        1. b=18547.html
          106 kB
          Andreas Dilger
        2. LU-9.xlsx
          38 kB
          Shuichi Ihara

        Issue Links

          Activity

            People

              wc-triage WC Triage
              rread Robert Read
              Votes:
              0 Vote for this issue
              Watchers:
              10 Start watching this issue

              Dates

                Created:
                Updated: