Details
Type: Bug
Resolution: Fixed
Priority: Major
Description
In Lustre 2.7 and newer, single-node, multi-process, single-shared-file write performance is significantly slower than in Lustre 2.5. This is due to a problem in deciding when to make an RPC (i.e., the decisions made in osc_makes_rpc).

Currently, Lustre decides to send an RPC under a number of conditions (such as memory pressure or lock cancellation); one of the conditions it looks for is "enough dirty pages to fill an RPC". This worked fine when only one process could be dirtying pages at a time, but in newer Lustre versions, more than one process can write to the same file (and the same osc object) at once.
In this case, the "count dirty pages" check will see that there are enough dirty pages to fill an RPC, but since the dirty pages are being created by multiple writers, they are not contiguous and will not fit into one RPC. The result is that many less-than-full-size RPCs are sent despite a good I/O pattern. (Earlier versions of Lustre usually sent only full RPCs when presented with this pattern.)
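To make that failure mode concrete, here is a small toy model in plain C (not Lustre code; the constants are illustrative, and only the idea of a max-pages-per-RPC limit corresponds to a real tunable). Two interleaved writers push the total dirty-page count well past the "enough to fill an RPC" threshold while the longest contiguous dirty run stays far smaller, so any RPC actually built from those pages is partial:

/* Toy illustration only -- hypothetical names, not Lustre code. */
#include <stdio.h>
#include <string.h>

#define OBJ_PAGES         4096   /* pages tracked in this toy object */
#define MAX_PAGES_PER_RPC 1024   /* e.g. 4 MiB RPCs with 4 KiB pages */

static unsigned char dirty[OBJ_PAGES];   /* 1 = page is dirty */

/* Longest run of contiguous dirty pages -- what actually fits in one RPC. */
static int longest_dirty_run(void)
{
    int best = 0, run = 0;

    for (int i = 0; i < OBJ_PAGES; i++) {
        run = dirty[i] ? run + 1 : 0;
        if (run > best)
            best = run;
    }
    return best;
}

int main(void)
{
    int total = 0;

    /* Two writers each take alternating 256-page (1 MiB) chunks; only
     * the first writer's chunks are dirty so far, so the dirty pages
     * are scattered rather than contiguous. */
    for (int chunk = 0; chunk < OBJ_PAGES / 256; chunk += 2) {
        memset(dirty + chunk * 256, 1, 256);
        total += 256;
    }

    printf("total dirty pages:      %d (>= %d, so the old check fires)\n",
           total, MAX_PAGES_PER_RPC);
    printf("longest contiguous run: %d pages, i.e. %d%% of a full RPC\n",
           longest_dirty_run(),
           100 * longest_dirty_run() / MAX_PAGES_PER_RPC);
    return 0;
}

The old check counts "total dirty pages" (2048 here) and decides to send, but the largest RPC that can actually be built covers only 256 contiguous pages, a quarter of full size.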
Instead, we remove this check: extents are added to a special full-extent list when they reach max pages per RPC, and RPCs are sent from that list. (This is similar to the existing handling of high-priority and urgent extents.)
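The sketch below shows the shape of that change under assumed, simplified names (toy_extent, toy_object, extent_add_pages, and object_makes_rpc are invented for illustration; the real patch modifies the osc extent code and osc_makes_rpc). The only ideas taken from the description above are the per-object list of full extents and the max-pages-per-RPC limit: the "should we send?" decision becomes "is anything on the full list?" rather than a global dirty-page count.

/* Sketch with hypothetical names -- not the actual Lustre structures. */
#include <stdbool.h>
#include <stdio.h>

#define MAX_PAGES_PER_RPC 1024

struct toy_extent {
    int                nr_pages;   /* dirty pages covered by this extent */
    struct toy_extent *next;       /* linkage on the object's full list  */
};

struct toy_object {
    struct toy_extent *full;       /* extents ready to go out as one RPC */
};

/* Called as a writer adds pages to an extent it is filling. */
static void extent_add_pages(struct toy_object *obj, struct toy_extent *ext,
                             int pages)
{
    ext->nr_pages += pages;
    if (ext->nr_pages >= MAX_PAGES_PER_RPC) {
        /* Extent can fill an RPC by itself: queue it for sending. */
        ext->next = obj->full;
        obj->full = ext;
    }
}

/* Stand-in for the send decision: no object-wide dirty-page count,
 * just "is any single extent full?" */
static bool object_makes_rpc(struct toy_object *obj)
{
    return obj->full != NULL;
}

int main(void)
{
    struct toy_object obj = { 0 };
    struct toy_extent a = { 0 }, b = { 0 };

    /* Two writers fill two separate (non-contiguous) extents. */
    extent_add_pages(&obj, &a, 512);
    extent_add_pages(&obj, &b, 512);
    printf("1024 dirty pages, no full extent -> send RPC? %d\n",
           object_makes_rpc(&obj));

    extent_add_pages(&obj, &a, 512);   /* extent a reaches 1024 pages */
    printf("extent a now full                -> send RPC? %d\n",
           object_makes_rpc(&obj));
    return 0;
}

Because the full list drives RPC generation the same way the urgent and high-priority lists already do, scattered dirty pages from multiple writers no longer trigger premature, partial RPCs.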
With a good I/O pattern, such as is typically used in benchmarking, it should be possible to send only full-size RPCs. This patch achieves that without degrading performance in other cases.
In IOR tests with multiple writers to a single file, this patch improves performance by several times, returning it to parity with earlier Lustre versions for singly striped files, and well above them for very fast OSTs and files with many stripes.
Here's some specific data:
On this machine and storage system, the best bandwidth we can get to a single stripe from one node is about 330 MB/s, and it is reached with a single writer. All tests are run on a newly created, singly striped file, except where a higher stripe count is specified. The result lines below are IOR summary output: max/min/mean bandwidth in MiB/s, the standard deviation, then the corresponding operation rates.
IOR: aprun -n 1 $(IOR) -w -t 4m -b 16g -C -e -E -k -u -v
(1 thread, 4 MiB transfer size, 16 GiB per thread.)
Unmodified:
write 334.12 334.12 334.12 0.00 83.53 83.53
write 329.34 329.34 329.34 0.00 82.33 82.33
write 329.37 329.37 329.37 0.00 82.34 82.34
Modified (full extent):
write 329.47 329.47 329.47 0.00 82.37 82.37
write 339.33 339.33 339.33 0.00 84.83 84.83
write 323.18 323.18 323.18 0.00 80.80 80.80
Here's an example of the improvement available, using 8 threads and 1 GiB of data per thread. (Results are similar with a larger amount of data per thread.)
IOR: aprun -n 8 $(IOR) -w -t 4m -b 1g -C -e -E -k -u -v
Unmodified:
write 87.24 87.24 87.24 0.00 21.81 21.81
write 89.26 89.26 89.26 0.00 22.31 22.31
write 90.45 90.45 90.45 0.00 22.61 22.61
Modified:
write 345.72 345.72 345.72 0.00 86.43 86.43
write 334.14 334.14 334.14 0.00 83.53 83.53
write 351.03 351.03 351.03 0.00 87.76 87.76
Note that the above is actually a shade higher than the single-thread performance, despite being essentially at the limit for the target (from this node, with these settings).
2 stripes:
1 thread, unmodified:
write 614.48 614.48 614.48 0.00 153.62 153.62
write 626.98 626.98 626.98 0.00 156.75 156.75
write 610.14 610.14 610.14 0.00 152.53 152.53
1 thread, modified:
write 627.86 627.86 627.86 0.00 156.97 156.97
write 625.68 625.68 625.68 0.00 156.42 156.42
write 625.47 625.47 625.47 0.00 156.37 156.37
8 threads, unmodified:
write 172.24 172.24 172.24 0.00 43.06 43.06
write 180.02 180.02 180.02 0.00 45.01 45.01
write 186.17 186.17 186.17 0.00 46.54 46.54
8 threads, modified:
write 614.53 614.53 614.53 0.00 153.63 153.63
write 604.05 604.05 604.05 0.00 151.01 151.01
write 616.77 616.77 616.77 0.00 154.19 154.19
8 stripes:
Note: these tests were run with 4 or 8 GB of data per thread; with less data they completed too quickly for me to be comfortable with the results, though the numbers were similar. Performance was the same across all total data sizes tested. The numbers below are representative; I repeated each test several times, but didn't want to include that much data.
1 thread, unmodified:
write 1270.16 1270.16 1270.16 0.00 317.54 317.54
1 thread, modified:
write 1256.26 1256.26 1256.26 0.00 314.06 314.06
8 threads, unmodified:
write 712.33 712.33 712.33 0.00 178.08 178.08
8 threads, modified:
write 1949.85 1949.85 1949.85 0.00 487.46 487.46
16 stripes:
8 threads, unmodified:
write 1461.83 1461.83 1461.83 0.00 365.46 365.46
8 threads, modified:
write 3082.42 3082.42 3082.42 0.00 770.61 770.61