Details

    • 8137

    Description

      Perform short I/O (requests <= 4k) without using a bulk RPC transfer.
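
      A minimal sketch of the intended mechanism, using hypothetical names (struct io_request, prepare_write) rather than the actual ptlrpc/OSC code: if the transfer fits in one page, the data travels inside the RPC message itself and no bulk (RDMA) descriptor is set up.

      #include <stddef.h>
      #include <string.h>

      /* Hypothetical names only -- not the actual Lustre structures. */
      #define SHORT_IO_MAX 4096

      struct io_request {
              char   inline_buf[SHORT_IO_MAX]; /* payload carried inside the RPC message */
              size_t inline_len;
              int    use_bulk;                 /* nonzero: fall back to bulk RDMA */
      };

      /* If the transfer fits in one page, embed it in the RPC and skip the
       * bulk descriptor; otherwise take the normal bulk path. */
      static void prepare_write(struct io_request *req, const void *data, size_t len)
      {
              if (len <= SHORT_IO_MAX) {
                      memcpy(req->inline_buf, data, len);
                      req->inline_len = len;
                      req->use_bulk = 0;      /* one request/reply, no bulk transfer */
              } else {
                      req->use_bulk = 1;      /* server moves the pages via RDMA */
              }
      }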

          Activity

            [LU-1757] Short I/O support

            I've tried several times over the past couple of weeks to test this patch with master over the WAN, but every time I do a direct I/O read or write I get an LBUG:

            [root@test tmp]# dd if=test of=/dev/null bs=4k iflag=direct

            Message from syslogd@test at Nov 27 03:44:12 ...
            kernel:LustreError: 19403:0:(rw26.c:483:ll_direct_IO_26()) ASSERTION( obj->cob_transient_pages == 0 ) failed:

            Is this an already known issue with direct IO on master?

            jfilizetti Jeremy Filizetti added a comment

            Xyratex-bug-id: MRP-320

            nrutman Nathan Rutman added a comment

            Alex, if you want to run a test 10 times and get the average, I recommend ior's "-i" option. Otherwise it's less obvious to others that you did more than eyeball the numbers and pick a pseudo-average. With a write performance difference of only 2%, and overall throughput numbers that are so low, it is hard to tell whether the results are statistically significant.

            morrone Christopher Morrone (Inactive) added a comment
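
            For reference, adding the repetitions flag to the command line used in the results below would look like this; with ten repetitions the Mean and Std Dev columns then reflect actual run-to-run variation rather than a single measurement:

            IOR -a POSIX -t 4k -b 1G -B -i 10 -o /mnt/lustre/mmap/mmap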

            In talking with Eric, one concern with using inline bulk data is that this can increase the request size enough to cause the routers to use 1MB buffers for handling the short IO requests, and potentially cause the routers to run out of buffers.

            With RDMA RPCs the number of in-flight bulk requests is limited by the number of service threads (typically 512*num_osts), but with inline bulk data the number of in-flight requests is much larger (8*num_clients*num_osts).

            In order to avoid consuming all of the large buffers on the routers, either a third pool for 8kB requests is needed (in addition to the 4kB and 1MB pools) or the small request (4kB) pool should be modified to use an 8kB buffer size.

            adilger Andreas Dilger added a comment
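
            A back-of-envelope sketch of the pool-selection concern (hypothetical code, not the LNet router implementation): a short-IO write request carries up to ~4kB of inline data on top of the normal request header, so it no longer fits a 4kB buffer and, absent an 8kB pool, would have to be served from the 1MB pool.

            enum rtr_pool { POOL_4K, POOL_8K, POOL_1M };

            /* Hypothetical router-side buffer selection by incoming message size.
             * Worst-case in-flight short-IO requests ~ 8 * num_clients * num_osts,
             * versus bulk RPCs bounded by service threads (~512 * num_osts). */
            static enum rtr_pool pick_pool(unsigned long msg_size)
            {
                    if (msg_size <= 4096)
                            return POOL_4K;
                    if (msg_size <= 8192)           /* proposed third pool for short IO */
                            return POOL_8K;
                    return POOL_1M;                 /* otherwise a large (1MB) buffer */
            }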

            I got new test results from ramfs.

            IOR with short IO (1 client, IB, ost and mds on the ramfs, 10 runs, average result)

            Command line used: IOR -a POSIX -t 4k -b 1G -B -o /mnt/lustre/mmap/mmap
            Machine: Linux mrpcli9
            
            Summary:
                    api                = POSIX
                    test filename      = /mnt/lustre/mmap/mmap
                    access             = single-shared-file
                    ordering in a file = sequential offsets
                    ordering inter file= no tasks offsets
                    clients            = 1 (1 per node)
                    repetitions        = 1
                    xfersize           = 4096 bytes
                    blocksize          = 1 GiB
                    aggregate filesize = 1 GiB
            
            Operation  Max (MiB)  Min (MiB)  Mean (MiB)   Std Dev  Max (OPs)  Min (OPs)  Mean (OPs)   Std Dev  Mean (s)
            ---------  ---------  ---------  ----------   -------  ---------  ---------  ----------   -------  --------
            write          10.59      10.59       10.59      0.00    2709.96    2709.96     2709.96      0.00  96.73352   EXCEL
            read           14.00      14.00       14.00      0.00    3584.71    3584.71     3584.71      0.00  73.12840   EXCEL
            
            Max Write: 10.59 MiB/sec (11.10 MB/sec)
            Max Read:  14.00 MiB/sec (14.68 MB/sec)
            
            Run finished: Mon Oct 22 10:31:36 2012
            
            real    2m49.891s
            user    0m0.537s
            sys     1m12.616s
            

            IOR without short IO (1 client, IB, ost and mds on the ramfs, 10 runs, average result)

            Command line used: IOR -a POSIX -t 4k -b 1G -B -o /mnt/lustre/mmap/mmap
            Machine: Linux mrpcli9
            
            Summary:
                    api                = POSIX
                    test filename      = /mnt/lustre/mmap/mmap
                    access             = single-shared-file
                    ordering in a file = sequential offsets
                    ordering inter file= no tasks offsets
                    clients            = 1 (1 per node)
                    repetitions        = 1
                    xfersize           = 4096 bytes
                    blocksize          = 1 GiB
                    aggregate filesize = 1 GiB
            
            Operation  Max (MiB)  Min (MiB)  Mean (MiB)   Std Dev  Max (OPs)  Min (OPs)  Mean (OPs)   Std Dev  Mean (s)
            ---------  ---------  ---------  ----------   -------  ---------  ---------  ----------   -------  --------
            write          10.36      10.36       10.36      0.00    2651.19    2651.19     2651.19      0.00  98.87794   EXCEL
            read           12.64      12.64       12.64      0.00    3235.79    3235.79     3235.79      0.00  81.01380   EXCEL
            
            Max Write: 10.36 MiB/sec (10.86 MB/sec)
            Max Read:  12.64 MiB/sec (13.25 MB/sec)
            
            Run finished: Tue Oct 23 02:12:21 2012
            
            real    2m59.920s
            user    0m0.512s
            sys     1m9.490s
            

            dd if=/dev/zero of=$FILE bs=4096 count=300000 oflag=direct (1 client, IB, ost and mds on the ramfs)
            short IO: 113.5-116.0 sec
            no short IO: 116.5-118.5 sec
            multiop $TARGET OsMRUc on 1.2 GB target file (1 client, IB, ost and mds on the ramfs, 10 iterations)
            short IO: 195.6 sec
            no short IO: 199.2 sec

            aboyko Alexander Boyko added a comment

            Eric, I recall you having some thoughts about this. The current patch limits the bulk request size to be <= one page of data (+ overhead), which isn't out of line with MDS requests, which can carry up to 4kB for a symlink or other pathname component.

            I think it is unavoidable that, if we want low-latency small IOs, they must be done without extra round trips, but I would have thought the performance improvement would be much better than a few percent... Perhaps testing against a ramdisk OST would give us a better idea of the upper limit of performance for this patch?

            adilger Andreas Dilger added a comment
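
            As a rough, assumed illustration of the round-trip argument (the RTT figure below is made up, not measured here): a bulk 4kB write costs roughly two network round trips (request, server-initiated bulk transfer, reply), while an inline short-IO write costs one, so per stream on a 50ms RTT WAN link:

            bulk:   ~2 x 50 ms = 100 ms per IO  ->  ~10 IO/s  ->  ~0.04 MB/s
            inline: ~1 x 50 ms =  50 ms per IO  ->  ~20 IO/s  ->  ~0.08 MB/s

            This is why a high-latency link should show the benefit far more clearly than the low-latency IB setup used for the ramfs numbers.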

            Jeremy, if you (or someone you know) have the ability to do so, it would be great to get some performance benchmarks on this patch over high-latency links. As it stands, getting only a few percent improvement for small IO sizes (7.7MB/s to 8.3MB/s) isn't compelling.

            Alexander, what was the back-end storage used for this test? If it was a disk, then the IOPS rate would be the limiting factor, though 100000 writes in 52s is about 2000 IOPS, so probably a RAID-10 array or SSD? While I think that this could help the performance, I suspect that a closer investigation of where the actual overhead lies would help. Is there a need for more RPCs in flight with small IOs? Is the latency in the server stack or in RPC handling?

            adilger Andreas Dilger added a comment
            pjones Peter Jones added a comment -

            Ah yes I think that you are right Jeremy - thanks!


            Peter, the cherry-picked patch that was added to b2_1, b2_3 and master was only for the connect flags, to reserve them; the full patch still doesn't appear to have landed. If it has, can you provide the commit, because I can't find it?

            jfilizetti Jeremy Filizetti added a comment
            pjones Peter Jones added a comment -

            Landed for 2.4


            Peter, only the reservation of the feature flag has landed, not the actual code to implement it.

            adilger Andreas Dilger added a comment

            People

              Assignee: paf Patrick Farrell (Inactive)
              Reporter: aboyko Alexander Boyko