[LU-1757] Short I/O support Created: 16/Aug/12  Updated: 02/Nov/19  Resolved: 22/Dec/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.11.0, Lustre 2.12.0

Type: Improvement Priority: Major
Reporter: Alexander Boyko Assignee: Patrick Farrell (Inactive)
Resolution: Fixed Votes: 0
Labels: patch

Issue Links:
Related
is related to LU-10176 Data-on-MDT phase II Open
is related to LU-3285 Data on MDT Resolved
is related to LU-12856 LustreError: 82937:0:(ldlm_lib.c:3268... Resolved
is related to LU-10264 New static analysis issues in v2_10_5... Resolved
is related to LU-10289 DoM: add SHORTIO support for MDS RPCs Resolved
is related to LU-9409 Lustre small IO write performance imp... Resolved
Rank (Obsolete): 8137

 Description   

Perform short I/O (requests <= 4 KB) without a bulk RPC, carrying the data inline in the request instead.
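As a rough userspace sketch of the idea (not the actual Lustre code; the structure layout, threshold, and function names below are illustrative assumptions), a write at or below a small threshold is simply copied into the request buffer and sent in the same RPC, while larger writes still go through a separate bulk transfer:

#include <stdio.h>
#include <string.h>

/* Illustrative threshold: requests at or below this size are sent inline. */
#define SHORT_IO_MAX 4096

struct rpc_request {
    size_t inline_len;              /* bytes of data carried inside the request */
    char   inline_buf[SHORT_IO_MAX];
    int    needs_bulk;              /* 1 if a separate bulk (RDMA) transfer is set up */
};

/* Build a write request: small payloads are memcpy'd into the request itself,
 * so the server can consume them without a second round trip; anything larger
 * still requires a bulk descriptor and the usual bulk GET/PUT. */
static void build_write_request(struct rpc_request *req,
                                const void *data, size_t len)
{
    memset(req, 0, sizeof(*req));
    if (len <= SHORT_IO_MAX) {
        memcpy(req->inline_buf, data, len);
        req->inline_len = len;
    } else {
        req->needs_bulk = 1;        /* server will pull the pages via RDMA */
    }
}

int main(void)
{
    struct rpc_request req;
    char small[1024] = { 0 }, large[65536] = { 0 };

    build_write_request(&req, small, sizeof(small));
    printf("1 KB write:  inline=%zu bulk=%d\n", req.inline_len, req.needs_bulk);

    build_write_request(&req, large, sizeof(large));
    printf("64 KB write: inline=%zu bulk=%d\n", req.inline_len, req.needs_bulk);
    return 0;
}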



 Comments   
Comment by Alexander Boyko [ 16/Aug/12 ]

req http://review.whamcloud.com/3690
Test results (seconds, less is better)

Test case          | Test script essence                              | Short I/O | non Short I/O
Write in each page | dd of=$TARGET bs=4096 count=100000 oflag=direct  | 48.1s     | 52.1s
mmap I/O           | multiop $TARGET OsMRUc                           | 98s       | 99.8s
Non-paged read     | dd if=$TARGET bs=2048 count=100 skip=$offset     | 32.4s     | 34.8s
Comment by Andreas Dilger [ 24/Aug/12 ]

Thanks, I was just looking at this bug to see if there were any performance results.

The improvement isn't quite as good as I was hoping to see (i.e. only a few percent faster instead of 2-3x faster). Do you have any idea where the other performance bottlenecks are for this use case? What is the performance of these tests on the local OST filesystem?

Comment by Alexander Boyko [ 24/Aug/12 ]

(52.1 - 48.1) * 100 / 52.1 = 7.7, so roughly 7% of the total dd run time. I think this is not so bad. Maybe we need to compare ost_brw_write timestamps to exclude the other Lustre overhead. For short IO the bulk transfer was replaced with a memcpy on both the client and server sides, so what we are comparing is bulk transfer time versus memcpy time.

Comment by Peter Jones [ 20/Sep/12 ]

Landed for 2.3 and 2.4

Comment by Andreas Dilger [ 20/Sep/12 ]

Peter, only the reservation of the feature flag has landed, not the actual code to implement it.

Comment by Peter Jones [ 11/Oct/12 ]

Landed for 2.4

Comment by Jeremy Filizetti [ 12/Oct/12 ]

Peter, the cherry-picked patch that was added to b2_1, b2_3, and master was only to reserve the connect flags; the full patch still doesn't appear to have landed. If it has, can you provide the commit? I can't find it.

Comment by Peter Jones [ 13/Oct/12 ]

Ah yes I think that you are right Jeremy - thanks!

Comment by Andreas Dilger [ 14/Oct/12 ]

Jeremy, if you (or someone you know) have the ability to do so, it would be great to get some performance benchmarks on this patch over high-latency links. As it stands, getting only a few percent improvement for small IO sizes (7.7MB/s to 8.3MB/s) isn't compelling.

Alexander, what was the back-end storage used for this test? If it was a disk, then the IOPS rate would be the limiting factor, though 100000 writes in 52s is about 2000 IOPS, so probably a RAID-10 array or SSD? While I think that this could help the performance, I suspect that a closer investigation of where the actual overhead lies would help. Is there a need for more RPCs in flight with small IOs? Is the latency in the server stack or in RPC handling?

Comment by Andreas Dilger [ 15/Oct/12 ]

Eric, I recall you having some thoughts about this. The current patch limits the bulk request size to <= one page of data (plus overhead), which isn't out of line with MDS requests, which can carry up to 4kB for a symlink or other pathname component.

I think it is unavoidable that, if we want low-latency small IOs, they be done without extra round trips, but I would have expected the performance improvement to be much better than a few percent... Perhaps testing against a ramdisk OST would give us a better idea of the upper limit of performance for this patch?

Comment by Alexander Boyko [ 23/Oct/12 ]

I got new test results from ramfs.

IOR shortio (1 client, IB, ost and mds on the ramfs, 10 runs, average result)

Command line used: IOR -a POSIX -t 4k -b 1G -B -o /mnt/lustre/mmap/mmap
Machine: Linux mrpcli9

Summary:
        api                = POSIX
        test filename      = /mnt/lustre/mmap/mmap
        access             = single-shared-file
        ordering in a file = sequential offsets
        ordering inter file= no tasks offsets
        clients            = 1 (1 per node)
        repetitions        = 1
        xfersize           = 4096 bytes
        blocksize          = 1 GiB
        aggregate filesize = 1 GiB

Operation  Max (MiB)  Min (MiB)  Mean (MiB)   Std Dev  Max (OPs)  Min (OPs)  Mean (OPs)   Std Dev  Mean (s)
---------  ---------  ---------  ----------   -------  ---------  ---------  ----------   -------  --------
write          10.59      10.59       10.59      0.00    2709.96    2709.96     2709.96      0.00  96.73352   EXCEL
read           14.00      14.00       14.00      0.00    3584.71    3584.71     3584.71      0.00  73.12840   EXCEL

Max Write: 10.59 MiB/sec (11.10 MB/sec)
Max Read:  14.00 MiB/sec (14.68 MB/sec)

Run finished: Mon Oct 22 10:31:36 2012

real    2m49.891s
user    0m0.537s
sys     1m12.616s

IOR without short IO (1 client, IB, ost and mds on the ramfs, 10 runs, average result)

Command line used: IOR -a POSIX -t 4k -b 1G -B -o /mnt/lustre/mmap/mmap
Machine: Linux mrpcli9

Summary:
        api                = POSIX
        test filename      = /mnt/lustre/mmap/mmap
        access             = single-shared-file
        ordering in a file = sequential offsets
        ordering inter file= no tasks offsets
        clients            = 1 (1 per node)
        repetitions        = 1
        xfersize           = 4096 bytes
        blocksize          = 1 GiB
        aggregate filesize = 1 GiB

Operation  Max (MiB)  Min (MiB)  Mean (MiB)   Std Dev  Max (OPs)  Min (OPs)  Mean (OPs)   Std Dev  Mean (s)
---------  ---------  ---------  ----------   -------  ---------  ---------  ----------   -------  --------
write          10.36      10.36       10.36      0.00    2651.19    2651.19     2651.19      0.00  98.87794   EXCEL
read           12.64      12.64       12.64      0.00    3235.79    3235.79     3235.79      0.00  81.01380   EXCEL

Max Write: 10.36 MiB/sec (10.86 MB/sec)
Max Read:  12.64 MiB/sec (13.25 MB/sec)

Run finished: Tue Oct 23 02:12:21 2012

real    2m59.920s
user    0m0.512s
sys     1m9.490s

dd if=/dev/zero of=$FILE bs=4096 count=300000 oflag=direct (1 client, IB, ost and mds on the ramfs)
short IO: 113.5-116.0 sec
no short IO: 116.5-118.5 sec
multiop $TARGET OsMRUc on 1.2 GB target file (1 client, IB, ost and mds on the ramfs, 10 iterations)
short IO: 195.6 sec
no short IO: 199.2 sec

Comment by Andreas Dilger [ 31/Oct/12 ]

In talking with Eric, one concern with using inline bulk data is that this can increase the request size enough to cause the routers to use 1MB buffers for handling the short IO requests, and potentially cause the routers to run out of buffers.

With RDMA RPCs the number of inflight bulk requests is limited by the number of service threads (typically 512*num_osts), but with the inline bulk data the number of inflight requests is much larger (8*num_clients*num_osts).

In order to avoid consuming all of the large buffers on the routers, either a third pool for 8kB requests is needed (in addition to the 4kB and 1MB pools) or the small request (4kB) pool should be modified to use an 8kB buffer size.
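A small worked example of the scale difference described above, using the formulas from this comment with illustrative (assumed) cluster sizes rather than measured values:

#include <stdio.h>

int main(void)
{
    /* Illustrative cluster: these counts are assumptions for the example. */
    long num_osts = 100;
    long num_clients = 1000;

    /* RDMA bulk: inflight bulk requests bounded by OST service threads. */
    long rdma_inflight = 512 * num_osts;
    /* Inline (short IO) data: bounded by per-client RPCs in flight. */
    long inline_inflight = 8 * num_clients * num_osts;

    printf("RDMA bulk requests in flight: %ld\n", rdma_inflight);
    printf("Inline short-IO requests:     %ld\n", inline_inflight);
    /* At ~8 kB per request buffer, this is roughly the aggregate buffer
     * space routers would need to absorb the worst case. */
    printf("Worst-case 8kB buffer demand: %.1f GiB\n",
           inline_inflight * 8192.0 / (1024 * 1024 * 1024));
    return 0;
}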

Comment by Christopher Morrone [ 31/Oct/12 ]

Alex, if you want to run a test 10 times and get the average, I recommend ior's "-i" option. Otherwise it's less obvious to others that you did more than eyeball the numbers and pick a pseudo-average. With a write performance difference of only 2%, and overall throughput numbers that are so low, it is hard to tell whether the results are statistically significant.

Comment by Nathan Rutman [ 21/Nov/12 ]

Xyratex-bug-id: MRP-320

Comment by Jeremy Filizetti [ 27/Nov/12 ]

I've tried several times over the past couple weeks to test this patch with master over the WAN but every time I do direct IO read or write I get an LBUG:

[root@test tmp]# dd if=test of=/dev/null bs=4k iflag=direct

Message from syslogd@test at Nov 27 03:44:12 ...
kernel:LustreError: 19403:0:(rw26.c:483:ll_direct_IO_26()) ASSERTION( obj->cob_transient_pages == 0 ) failed:

Is this an already known issue with direct IO on master?

Comment by Alexey Lyashkov [ 07/Feb/13 ]

> In talking with Eric, one concern with using inline bulk data is that this can increase the request size enough to cause the routers to use 1MB buffers for handling the short IO requests, and potentially cause the routers to run out of buffers.

That is too bad for the routers. Routers should have more than two buffer sizes for requests; in any case, we already send the transfer size as part of the LNet header.

Comment by Andreas Dilger [ 14/Jun/13 ]

Another potential area where this short IO could improve performance significantly (and give a good reason to land it) is when many clients are writing to the same object. Is it possible for you to run a test with multiple clients using IOR to write <= 4kB interleaved chunks to the same 1-stripe file? Ideally this would use server-side locking for the writes, so that there is very minimal contention. It might even be that submitting smaller IOs (say 32 bytes) would give even more of a boost to this patch, since the client would not need to do the read-modify-write to form full-page writes as it does today.

If this feature can show some significant performance improvements (say 3-4 times faster, though I'd expect possibly much more) then I would be happy to work on getting this feature landed.

Comment by Andreas Dilger [ 15/Oct/13 ]

Alexander, I saw that the patch for this feature was abandoned; however, small writes are definitely an area where Lustre could use a considerable amount of improvement. I'm still hopeful that there may be some workloads on which this feature could show significant performance improvements, or at least show what other work needs to be done in addition to this patch. The Data-on-MDT work is more concerned with small files and is definitely orthogonal to this small write patch, which is intended to improve small write RPCs to a potentially very large file.

It may be that we need to make additional changes in order to see the overall improvement of small files. Some areas to investigate to see why this patch isn't showing the expected improvements:

  • what is the improvement when the writes are smaller than a single disk block?
  • what is the improvement when multiple clients are doing interleaved writes to the same file? This can be tested relatively easily with IOR and multiple client nodes ("ior -w -b 32 -t 32 -s 65536 -N 8 -i 10 -o /mnt/lustre/testfile" runs on 8 clients and does 65536 interleaved 32-byte writes per client); a sketch of this access pattern follows the list below.
  • what impact does NRS object-based round-robin (ORR) have when doing small writes to a single file? This should sort the writes by file offset, but it may be that short writes also need to be cached on the OST so that they can avoid synchronous read-modify-write on the disk. This might be more easily tested with a ZFS OSD, which already does write caching, while the ldiskfs OSD would need changes to the IO path in order to cache small writes.
  • in the single shared-file interleaved write case, are the clients doing lockless writes? If not, the lock contention and overhead of doing an LDLM enqueue/cancel for each write may easily dominate the improvement from the small write patch. For sub-page writes, it might also be that there needs to be some IO fastpath that bypasses the client page cache so that it can avoid read-modify-write for the local page.
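A minimal model of that interleaved shared-file access pattern, using plain POSIX pwrite with one process standing in for each client; the rank count, chunk size, and file path are illustrative assumptions, and real testing would use IOR across client nodes as in the command above:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NCLIENTS 8       /* simulated client ranks (assumption) */
#define CHUNK    32      /* bytes per write, as in the IOR example */
#define NCHUNKS  65536   /* writes per client, as in the IOR example */

int main(int argc, char **argv)
{
    /* rank identifies which "client" this process plays: 0 .. NCLIENTS-1 */
    int rank = argc > 1 ? atoi(argv[1]) : 0;
    const char *path = argc > 2 ? argv[2] : "/mnt/lustre/testfile";
    char buf[CHUNK];
    int fd;

    memset(buf, 'a' + rank, sizeof(buf));
    fd = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Interleaved layout: chunk i of rank r lands at (i * NCLIENTS + r) * CHUNK,
     * so neighbouring 32-byte chunks in the file come from different clients
     * and every write is far smaller than a page. */
    for (long i = 0; i < NCHUNKS; i++) {
        off_t off = (off_t)(i * NCLIENTS + rank) * CHUNK;
        if (pwrite(fd, buf, CHUNK, off) != CHUNK) {
            perror("pwrite");
            close(fd);
            return 1;
        }
    }
    close(fd);
    return 0;
}

Each rank would run on a separate client node against the same single-stripe file; the point is only to show which offsets each client touches, not to replace IOR.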

I suspect that if there were multiple clients doing small writes to the same file, the improvement from this patch would be much more significant.

Comment by Alexander Boyko [ 05/Nov/13 ]

I want to reserve OBDO flag for short io
http://review.whamcloud.com/8182

Andreas, right now I don't have the time or resources to check your suggestions, and the short IO patch is outdated for master and requires rework. Data-on-MDT looks very promising for short IO and would also need another patch.
We did not do the lockless test for shared files, but single-client page writes with oflag=direct did not show a significant improvement.

Comment by Gerrit Updater [ 21/Jun/17 ]

Patrick Farrell (paf@cray.com) uploaded a new patch: https://review.whamcloud.com/27767
Subject: LU-1757 brw: add short io osc/ost transfer.
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 01056e12846a73c041da92d8a4f216f2641ca1cc

Comment by Patrick Farrell (Inactive) [ 21/Jun/17 ]

I've resurrected this patch and ported it to current master. Some simple testing here suggests (a) it's working fine, and (b) it gives about a 30% performance improvement for direct I/O of appropriate size (I upped the limit to 3 pages, which is what fits in the RPC, I believe) when reading from a fast storage device (RAM or flash). When doing small I/O to/from a spinning disk, I see no real improvement, but that's probably because network latency is not the primary driver of I/O performance there.

Comment by Andreas Dilger [ 22/Jun/17 ]

Patrick, have you tested aligned or unaligned reads/writes? I expect with unaligned multi-client writes and server-side locking that this could also improve performance significantly.

There could also be a big benefit in that case from bypassing the client-side aggregation and caching mechanisms completely, just dumping the chunks to the OST as fast as possible, and using something like NRS ORR to aggregate the IOs properly on the server, or at least to avoid read-modify-write for small writes over the network.

Comment by Patrick Farrell (Inactive) [ 22/Jun/17 ]

Andreas,

Aligned, mostly. Can't really test unaligned without getting aggregation or readahead, except for random reads. (Those do well.)

Server side locking... How is that achieved, other than with the patch from LU-4198, and that only for direct i/o? (https://review.whamcloud.com/#/c/8201/20)
And since it's direct i/o, it has to be aligned. (Trying to update LU-247 (unaligned dio) is my next project.)

Also, other than with direct i/o, I'm not sure how to actually achieve small i/os (write aggregation or read ahead will prevent them), except for random reads. (Which do see a benefit - I didn't mention that, but they see the same sort of improvement.)

So, I suppose I would say:
I want to try all of that, and I agree that it would likely benefit enormously (you pointed this out in your earlier comments on this LU), but I believe I need LU-247 to make it really possible, since direct i/o is the only way I know of to A) skip the page cache, B) force small i/o, and C) move the locking to the server (I can fake the effect of that by doing my i/o from one node).

Is there some easier route I've missed?

Comment by Gerrit Updater [ 09/Nov/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27767/
Subject: LU-1757 brw: add short io osc/ost transfer.
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 70f092a0587866662735e1a6eaf27701a576370d

Comment by Minh Diep [ 09/Nov/17 ]

Landed for 2.11

Comment by Andreas Dilger [ 28/Nov/17 ]

This was landed for 2.11, but Data-on-MDT landed at the same time. The MDS connection does not support SHORTIO yet, but it should.

Comment by Gerrit Updater [ 07/Dec/17 ]

Patrick Farrell (paf@cray.com) uploaded a new patch: https://review.whamcloud.com/30435
Subject: LU-1757 brw: Fix short i/o and enable for mdc
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 8efc38861cb224d69c012862f6e8ae453b890d17

Comment by Patrick Farrell (Inactive) [ 07/Dec/17 ]

The original patch did not actually enable this functionality.

Comment by Gerrit Updater [ 22/Dec/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30435/
Subject: LU-1757 brw: Fix short i/o and enable for mdc
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 3483e195314bddb8d72594ebb10307c83a4bb860

Comment by Peter Jones [ 22/Dec/17 ]

Second time lucky?

Comment by Gerrit Updater [ 15/Sep/18 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33173
Subject: LU-1757 osc: clarify short_io_bytes is maximum value
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 8e3e67f0cfdec0bb0a96f9e4fc1793fef7558867

Comment by Gerrit Updater [ 12/Oct/18 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33173/
Subject: LU-1757 osc: clarify short_io_bytes is maximum value
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: b90812a674f6ebaa9de592a4a4d97a35ed38a24e
