[LU-1757] Short I/O support    Created: 16/Aug/12  Updated: 02/Nov/19  Resolved: 22/Dec/17
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.11.0, Lustre 2.12.0 |
| Type: | Improvement | Priority: | Major |
| Reporter: | Alexander Boyko | Assignee: | Patrick Farrell (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | patch |
| Issue Links: | |
| Rank (Obsolete): | 8137 |
| Description |

Perform short I/O (requests <= 4kB) without a bulk RPC transfer, sending the data inline in the request instead.
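As a rough illustration of the idea, here is a minimal sketch of the client-side decision. The structure, constant, and function names are hypothetical stand-ins, not the actual osc/ptlrpc code:

/* Illustrative sketch only - not the actual Lustre code. Names such as
 * MAX_SHORT_IO_BYTES, struct io_request and prepare_write() are
 * hypothetical stand-ins for the real osc/ptlrpc interfaces. */
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_SHORT_IO_BYTES 4096   /* the <= 4k threshold discussed in this ticket */

struct io_request {
    void   *inline_buf;   /* spare space inside the RPC message itself */
    size_t  inline_len;
    bool    use_bulk;     /* true => set up an RDMA bulk descriptor */
};

/* Decide how to ship the data for a write of 'len' bytes. */
static void prepare_write(struct io_request *req, const void *data, size_t len,
                          bool server_supports_short_io)
{
    if (server_supports_short_io && len <= MAX_SHORT_IO_BYTES) {
        /* Short I/O: copy the payload inline into the RPC buffer so the
         * server can memcpy it out, avoiding the bulk RDMA round trip. */
        memcpy(req->inline_buf, data, len);
        req->inline_len = len;
        req->use_bulk = false;
    } else {
        /* Fall back to the normal bulk (RDMA) transfer path. */
        req->use_bulk = true;
    }
}
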
| Comments |
| Comment by Alexander Boyko [ 16/Aug/12 ] |

Review request: http://review.whamcloud.com/3690

| Comment by Andreas Dilger [ 24/Aug/12 ] |

Thanks, I was just looking at this bug to see if there were any kind of performance results. The improvement isn't quite as good as I was hoping to see (i.e. only a few percent faster instead of 2-3x faster). Do you have any idea where the other performance bottlenecks are for this use case? What is the performance of these tests on the local OST filesystem?

| Comment by Alexander Boyko [ 24/Aug/12 ] |

(52.1 - 48.1) * 100 / 52.1 = 7.68, i.e. roughly 7.7% of the total dd time. I think this is not so bad. Maybe we need to compare ost_brw_write by timestamps to exclude other Lustre overhead. For short IO the bulk transfer was changed to a memcpy on both the client and server sides, so we are comparing bulk vs. memcpy time.

| Comment by Peter Jones [ 20/Sep/12 ] |

Landed for 2.3 and 2.4

| Comment by Andreas Dilger [ 20/Sep/12 ] |

Peter, only the reservation of the feature flag has landed, not the actual code to implement it.

| Comment by Peter Jones [ 11/Oct/12 ] |

Landed for 2.4

| Comment by Jeremy Filizetti [ 12/Oct/12 ] |

Peter, the cherry-picked patch that was added to b2_1, b2_3, and master was only for reserving the connect flags; the full patch still doesn't appear to have landed. If it has, can you provide the commit? I can't find it.

| Comment by Peter Jones [ 13/Oct/12 ] |

Ah yes, I think that you are right Jeremy - thanks!

| Comment by Andreas Dilger [ 14/Oct/12 ] |

Jeremy, if you (or someone you know) have the ability to do so, it would be great to get some performance benchmarks for this patch over high-latency links. As it stands, getting only a few percent improvement for small IO sizes (7.7 MB/s to 8.3 MB/s) isn't compelling. Alexander, what was the back-end storage used for this test? If it was a disk, then the IOPS rate would be the limiting factor, though 100000 4kB writes in 52s is about 2000 IOPS, so probably a RAID-10 array or SSD? While I think that this could help performance, I suspect that a closer investigation of where the actual overhead lies would help. Is there a need for more RPCs in flight with small IOs? Is the latency in the server stack or in RPC handling?

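(Illustrative arithmetic only, to sanity-check those figures: 100000 writes / 52 s is about 1900 write operations per second, in line with the "about 2000 IOPS" estimate, and 2000 IOPS * 4 kB is roughly 8 MB/s, which matches the 7.7-8.3 MB/s range quoted above.)
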
| Comment by Andreas Dilger [ 15/Oct/12 ] |

Eric, I recall you having some thoughts about this. The current patch limits the bulk request size to <= one page of data (+ overhead), which isn't out of line with MDS requests, which can carry up to 4kB for a symlink or other pathname component. I think it is unavoidable that, if we want low-latency small IOs, they be done without extra round trips, but I would have thought the performance improvement was much better than a few percent... Perhaps testing against a ramdisk OST would give us a better idea of the upper limit of performance for this patch?

| Comment by Alexander Boyko [ 23/Oct/12 ] |

I got new test results from ramfs.

IOR with short IO (1 client, IB, OST and MDS on ramfs, 10 runs, average result)
Command line used: IOR -a POSIX -t 4k -b 1G -B -o /mnt/lustre/mmap/mmap
Machine: Linux mrpcli9
Summary:
api = POSIX
test filename = /mnt/lustre/mmap/mmap
access = single-shared-file
ordering in a file = sequential offsets
ordering inter file= no tasks offsets
clients = 1 (1 per node)
repetitions = 1
xfersize = 4096 bytes
blocksize = 1 GiB
aggregate filesize = 1 GiB
Operation Max (MiB) Min (MiB) Mean (MiB) Std Dev Max (OPs) Min (OPs) Mean (OPs) Std Dev Mean (s)
--------- --------- --------- ---------- ------- --------- --------- ---------- ------- --------
write 10.59 10.59 10.59 0.00 2709.96 2709.96 2709.96 0.00 96.73352 EXCEL
read 14.00 14.00 14.00 0.00 3584.71 3584.71 3584.71 0.00 73.12840 EXCEL
Max Write: 10.59 MiB/sec (11.10 MB/sec)
Max Read: 14.00 MiB/sec (14.68 MB/sec)
Run finished: Mon Oct 22 10:31:36 2012
real 2m49.891s
user 0m0.537s
sys 1m12.616s
IOR without short IO (1 client, IB, OST and MDS on ramfs, 10 runs, average result)
Command line used: IOR -a POSIX -t 4k -b 1G -B -o /mnt/lustre/mmap/mmap
Machine: Linux mrpcli9
Summary:
api = POSIX
test filename = /mnt/lustre/mmap/mmap
access = single-shared-file
ordering in a file = sequential offsets
ordering inter file= no tasks offsets
clients = 1 (1 per node)
repetitions = 1
xfersize = 4096 bytes
blocksize = 1 GiB
aggregate filesize = 1 GiB
Operation Max (MiB) Min (MiB) Mean (MiB) Std Dev Max (OPs) Min (OPs) Mean (OPs) Std Dev Mean (s)
--------- --------- --------- ---------- ------- --------- --------- ---------- ------- --------
write 10.36 10.36 10.36 0.00 2651.19 2651.19 2651.19 0.00 98.87794 EXCEL
read 12.64 12.64 12.64 0.00 3235.79 3235.79 3235.79 0.00 81.01380 EXCEL
Max Write: 10.36 MiB/sec (10.86 MB/sec)
Max Read: 12.64 MiB/sec (13.25 MB/sec)
Run finished: Tue Oct 23 02:12:21 2012
real 2m59.920s
user 0m0.512s
sys 1m9.490s
dd if=/dev/zero of=$FILE bs=4096 count=300000 oflag=direct (1 client, IB, OST and MDS on ramfs)

| Comment by Andreas Dilger [ 31/Oct/12 ] |

In talking with Eric, one concern with using inline bulk data is that this can increase the request size enough to cause the routers to use 1MB buffers for handling the short IO requests, and potentially cause the routers to run out of buffers. With RDMA RPCs the number of in-flight bulk requests is limited by the number of service threads (typically 512 * num_osts), but with inline bulk data the number of in-flight requests is much larger (8 * num_clients * num_osts). In order to avoid consuming all of the large buffers on the routers, either a third pool for 8kB requests is needed (in addition to the 4kB and 1MB pools) or the small request (4kB) pool should be modified to use an 8kB buffer size.

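(To illustrate the scale of that difference with example figures only, not measurements from this ticket: with 100 OSTs and 512 service threads each, the RDMA bound is about 512 * 100 = 51,200 in-flight bulk requests, whereas with 1000 clients the inline-data bound of 8 * num_clients * num_osts becomes 8 * 1000 * 100 = 800,000 requests that could be competing for router buffers at once.)
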
| Comment by Christopher Morrone [ 31/Oct/12 ] |

Alex, if you want to run a test 10 times and get the average, I recommend IOR's "-i" option. Otherwise it's less obvious to others that you did more than eyeball the numbers and pick a pseudo-average. With a write performance difference of only 2%, and overall throughput numbers that are so low, it is hard to tell if the results are statistically significant.

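For example, adapting the command line used earlier in this ticket, the -i option makes IOR repeat the run and report max/min/mean/stddev across the iterations itself:

IOR -a POSIX -t 4k -b 1G -B -i 10 -o /mnt/lustre/mmap/mmap
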
| Comment by Nathan Rutman [ 21/Nov/12 ] |

Xyratex-bug-id: MRP-320

| Comment by Jeremy Filizetti [ 27/Nov/12 ] |

I've tried several times over the past couple of weeks to test this patch with master over the WAN, but every time I do a direct IO read or write I get an LBUG:

[root@test tmp]# dd if=test of=/dev/null bs=4k iflag=direct
Message from syslogd@test at Nov 27 03:44:12 ...

Is this an already known issue with direct IO on master?

| Comment by Alexey Lyashkov [ 07/Feb/13 ] |

> In talking with Eric, one concern with using inline bulk data is that this can increase the request size enough to cause the routers to use 1MB buffers for handling the short IO requests, and potentially cause the routers to run out of buffers.

That is too bad for the routers. Routers should have more than two buffer sizes for requests; in any case, we already send the transfer size as part of the LNet header.

| Comment by Andreas Dilger [ 14/Jun/13 ] |

I was thinking of another potential area where this short IO could improve performance significantly (and give a good reason to land it): when many clients are writing to the same object. Is it possible for you to run a test with multiple IOR clients writing <= 4kB interleaved chunks to the same 1-stripe file? Ideally this would use server-side locking for the writes, so that there is very minimal contention. It might even be that submitting smaller IOs (say 32 bytes) would give even more of a boost to this patch, since the client would not need to do the read-modify-write to fill full pages as it does today. If this feature can show some significant performance improvement (say 3-4 times faster, though I'd expect possibly much more) then I would be happy to work on getting this feature landed.

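One way such a test could be set up (illustrative only; the mount point, task count, and MPI launcher are assumptions, not details from this ticket):

lfs setstripe -c 1 /mnt/lustre/shared
mpirun -np 8 IOR -a POSIX -t 4k -b 4k -s 65536 -o /mnt/lustre/shared

With blockSize equal to transferSize (4k) and many segments, the IOR tasks write interleaved 4kB chunks into the single shared 1-stripe file rather than large contiguous per-task regions.
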
| Comment by Andreas Dilger [ 15/Oct/13 ] |

Alexander, I saw that the patch for this feature was abandoned; however, small writes are definitely an area where Lustre could use a considerable amount of improvement. I'm still hopeful that there may be some workloads on which this feature could show significant performance improvements, or that it at least shows what other work needs to be done in addition to this patch. The Data-on-MDT work is more concerned with small files and is definitely orthogonal to this small-write patch, which is intended to improve small write RPCs to a potentially very large file. It may be that we need to make additional changes in order to see the overall improvement for small files. Some areas to investigate to see why this patch isn't showing the expected improvements:

I suspect that if there were multiple clients doing small writes

| Comment by Alexander Boyko [ 05/Nov/13 ] |

I want to reserve an OBDO flag for short IO. Andreas, right now I have no time or resources to check your suggestion, and the short IO patch is outdated for master and requires rework. Data-on-MDT looks very good for short IO and would need another patch as well.

| Comment by Gerrit Updater [ 21/Jun/17 ] |

Patrick Farrell (paf@cray.com) uploaded a new patch: https://review.whamcloud.com/27767

| Comment by Patrick Farrell (Inactive) [ 21/Jun/17 ] |

I've resurrected this patch and ported it to current master. Some simple testing here suggests A) it's working fine, and B) it gives about a 30% performance improvement for direct I/O of an appropriate size (I upped the limit to 3 pages; I believe that's what fits in the RPC) when I'm reading from a fast storage device (RAM or flash). When I'm doing small I/O to/from a spinning disk, I see no real improvement - but that's probably because network latency is not the primary driver of I/O performance there.

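(For reference, a direct I/O run at that 3-page, i.e. 12kB, limit could look something like the following; the target path is hypothetical and not taken from this ticket: dd if=/dev/zero of=/mnt/lustre/shortio_test bs=12k count=100000 oflag=direct)
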
| Comment by Andreas Dilger [ 22/Jun/17 ] |

Patrick, have you tested aligned or unaligned reads/writes? I expect that with unaligned multi-client writes and server-side locking this could also improve performance significantly. There could also be a big benefit from bypassing the client-side aggregation and caching mechanisms completely in that case: just dump the chunks to the OST as fast as possible and use something like NRS ORR to aggregate the IOs properly on the server, or at least avoid read-modify-write for small writes over the network.

| Comment by Patrick Farrell (Inactive) [ 22/Jun/17 ] |

Andreas,

Aligned, mostly. I can't really test unaligned without getting aggregation or readahead, except for random reads. (Those do well.)

Server-side locking... how is that achieved, other than with the patch from

Also, other than with direct I/O, I'm not sure how to actually achieve small I/Os (write aggregation or readahead will prevent them), except for random reads. (Which do see a benefit - I didn't mention that, but they see the same sort of improvement.)

So, I suppose I would say: is there some easier route I've missed?

| Comment by Gerrit Updater [ 09/Nov/17 ] |

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27767/

| Comment by Minh Diep [ 09/Nov/17 ] |

Landed for 2.11

| Comment by Andreas Dilger [ 28/Nov/17 ] |

This was landed for 2.11, but Data-on-MDT landed at the same time. The MDS connection does not support SHORTIO yet, but it should.

| Comment by Gerrit Updater [ 07/Dec/17 ] |

Patrick Farrell (paf@cray.com) uploaded a new patch: https://review.whamcloud.com/30435

| Comment by Patrick Farrell (Inactive) [ 07/Dec/17 ] |

The original patch did not actually enable this functionality.

| Comment by Gerrit Updater [ 22/Dec/17 ] |

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30435/

| Comment by Peter Jones [ 22/Dec/17 ] |

Second time lucky?

| Comment by Gerrit Updater [ 15/Sep/18 ] |

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33173

| Comment by Gerrit Updater [ 12/Oct/18 ] |

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33173/