Lustre / LU-8515

OSC: Send RPCs with full extents


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.10.0

    Description

      In Lustre 2.7 and newer, single-node, multi-process, single-shared-file write performance is significantly slower than in Lustre 2.5. This is due to a problem in deciding when to send an RPC (i.e., the decisions made in osc_makes_rpc).

      Currently, Lustre decides to send an RPC under a number of
      conditions (such as memory pressure or lock cancellation);
      one of the conditions it looks for is "enough dirty pages
      to fill an RPC". This worked fine when only one process
      could be dirtying pages at a time, but in newer Lustre
      versions, more than one process can write to the same
      file (and the same osc object) at once.

      In this case, the "count dirty pages" method will see there
      are enough dirty pages to fill an RPC, but since the dirty
      pages are being created by multiple writers, they are not
      contiguous and will not fit into one RPC. This results in
      many RPCs of less than full size being sent, despite a
      good I/O pattern. (Earlier versions of Lustre usually
      sent only full RPCs when presented with this pattern.)
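
      To make the failure concrete, here is a small stand-alone model
      (this is not Lustre code, and the page and RPC sizes are made up
      for illustration): two writers each fill their own contiguous
      region of a shared file, and the total dirty-page count reaches a
      full RPC's worth well before any single contiguous run does, so a
      trigger based only on the dirty count fires early and yields a
      partial RPC.

      #include <stdio.h>

      #define MAX_PAGES_PER_RPC 256  /* e.g. 1 MiB RPCs of 4 KiB pages (made-up numbers) */
      #define NWRITERS          2    /* writers dirtying disjoint regions of one file */

      int main(void)
      {
              int dirty[NWRITERS] = { 0 }; /* contiguous dirty pages in each writer's region */
              int total_dirty = 0;

              for (;;) {
                      int w, longest_run = 0;

                      /* Each writer dirties one more page in its own region. */
                      for (w = 0; w < NWRITERS; w++) {
                              dirty[w]++;
                              total_dirty++;
                      }

                      /* Old trigger: "enough dirty pages to fill an RPC". */
                      if (total_dirty < MAX_PAGES_PER_RPC)
                              continue;

                      /* But a single RPC can only carry one contiguous extent. */
                      for (w = 0; w < NWRITERS; w++)
                              if (dirty[w] > longest_run)
                                      longest_run = dirty[w];

                      printf("dirty-count trigger fires: %d dirty pages total, but the "
                             "best single RPC holds only %d of %d pages\n",
                             total_dirty, longest_run, MAX_PAGES_PER_RPC);
                      return 0;
              }
      }

      With more writers the effect is worse: the trigger still fires at
      MAX_PAGES_PER_RPC total dirty pages, but the largest contiguous
      extent is only about 1/NWRITERS of that.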

      Instead, we remove this check and add extents to a special
      full extent list when they reach max pages per RPC, then
      send from that list. (This is similar to how high priority
      and urgent extents are handled.)
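
      In rough outline, the mechanism looks like the sketch below. This
      is an approximation for illustration, not the exact patch: the
      oo_full_exts list head on the osc object is an assumed name
      (chosen by analogy with the existing high priority and urgent
      extent lists), and the surrounding code is elided.

      /* Sketch: when an extent grows to a full RPC's worth of pages,
       * move it to the (assumed) oo_full_exts list so it is sent as
       * soon as possible, rather than waiting on a dirty-page count. */
      if (ext->oe_nr_pages >= cli->cl_max_pages_per_rpc)
              list_move_tail(&ext->oe_link, &obj->oo_full_exts);

      /* ... and in the "should we send an RPC now?" decision
       * (osc_makes_rpc), check that list instead of counting
       * dirty pages: */
      if (!list_empty(&osc->oo_full_exts)) {
              CDEBUG(D_CACHE, "full extent ready, make RPC\n");
              return 1;
      }

      This mirrors how the high priority and urgent extent lists already
      drive RPC generation, so sending from the full extent list can
      reuse the same path.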

      With a good I/O pattern, like the ones typically used in
      benchmarking, it should be possible to send only full-size
      RPCs. This patch achieves that without degrading
      performance in other cases.

      In IOR tests with multiple writers to a single file,
      this patch improves performance severalfold, bringing it
      back to parity with earlier Lustre versions for singly
      striped files, and well above them for very high speed
      OSTs and files with many stripes.

      Here's some specific data:
      On this machine and storage system, the best bandwidth we can get to a single stripe from one node is about 330 MB/s. This occurs with one writer. All tests are run on a newly created, singly striped file, except where a higher stripe count is specified.

      IOR: aprun -n 1 $(IOR) -w -t 4m -b 16g -C -e -E -k -u -v
      (1 thread, 4 MiB transfer size, 16GB per thread.)

      Unmodified:
      write 334.12 334.12 334.12 0.00 83.53 83.53
      write 329.34 329.34 329.34 0.00 82.33 82.33
      write 329.37 329.37 329.37 0.00 82.34 82.34

      Modified (full extent):
      write 329.47 329.47 329.47 0.00 82.37 82.37
      write 339.33 339.33 339.33 0.00 84.83 84.83
      write 323.18 323.18 323.18 0.00 80.80 80.80

      Here's an example of the improvement available. We're using 8 threads and 1 GB of data per thread. (Results are similar with a larger amount of data per thread.)
      IOR: aprun -n 8 $(IOR) -w -t 4m -b 1g -C -e -E -k -u -v
      Unmodified:
      write 87.24 87.24 87.24 0.00 21.81 21.81
      write 89.26 89.26 89.26 0.00 22.31 22.31
      write 90.45 90.45 90.45 0.00 22.61 22.61

      Modified:
      write 345.72 345.72 345.72 0.00 86.43 86.43
      write 334.14 334.14 334.14 0.00 83.53 83.53
      write 351.03 351.03 351.03 0.00 87.76 87.76

      Note that the above is actually a shade higher than the single-thread performance, even though that was already essentially at the limit for this target (from this node, with these settings).

      2 stripes:

      1 thread, unmodified:
      write 614.48 614.48 614.48 0.00 153.62 153.62
      write 626.98 626.98 626.98 0.00 156.75 156.75
      write 610.14 610.14 610.14 0.00 152.53 152.53

      1 thread, modified:
      write 627.86 627.86 627.86 0.00 156.97 156.97
      write 625.68 625.68 625.68 0.00 156.42 156.42
      write 625.47 625.47 625.47 0.00 156.37 156.37

      8 threads, unmodified:
      write 172.24 172.24 172.24 0.00 43.06 43.06
      write 180.02 180.02 180.02 0.00 45.01 45.01
      write 186.17 186.17 186.17 0.00 46.54 46.54

      8 threads, modified:
      write 614.53 614.53 614.53 0.00 153.63 153.63
      write 604.05 604.05 604.05 0.00 151.01 151.01
      write 616.77 616.77 616.77 0.00 154.19 154.19

      8 stripes:
      Note: these tests were run with 4 or 8 GB of data per thread; smaller runs completed too quickly for me to be comfortable with them, although performance was the same across all total amounts of data tested. The numbers below are representative - each test was repeated several times, but I didn't want to paste in that much data.

      1 thread, unmodified:
      write 1270.16 1270.16 1270.16 0.00 317.54 317.54

      1 thread, modified:
      write 1256.26 1256.26 1256.26 0.00 314.06 314.06

      8 threads, unmodified:
      write 712.33 712.33 712.33 0.00 178.08 178.08

      8 threads, modified:
      write 1949.85 1949.85 1949.85 0.00 487.46 487.46

      16 stripes:

      8 threads, unmodified:
      write 1461.83 1461.83 1461.83 0.00 365.46 365.46

      8 threads, modified:
      write 3082.42 3082.42 3082.42 0.00 770.61 770.61

          People

            Assignee: paf Patrick Farrell (Inactive)
            Reporter: paf Patrick Farrell (Inactive)
            Votes: 0
            Watchers: 8
