Details

    • 8137

    Description

      Perform short I/O (requests <= 4kB) without a bulk RPC.

      Attachments

        Issue Links

          Activity

            [LU-1757] Short I/O support

            Patrick Farrell (paf@cray.com) uploaded a new patch: https://review.whamcloud.com/27767
            Subject: LU-1757 brw: add short io osc/ost transfer.
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 01056e12846a73c041da92d8a4f216f2641ca1cc

            gerrit Gerrit Updater added a comment

            I want to reserve OBDO flag for short io
            http://review.whamcloud.com/8182

            Andreas, right now I don't have the time or resources to check your suggestion, and the short IO patch is outdated for master and requires rework. Data-on-MDT looks very good for short IO and would also need another patch.
            We did not do the lockless test for shared files, but single-client page writes with oflag=direct did not show a significant improvement.

            aboyko Alexander Boyko added a comment

            Alexander, I saw that the patch for this feature was abandoned; however, small writes are definitely an area where Lustre could use a considerable amount of improvement. I'm still hopeful that there may be some workloads on which this feature could show significant performance improvements, or that it could at least show what other work needs to be done in addition to this patch. The Data-on-MDT work is more concerned with small files and is definitely orthogonal to this small-write patch, which is intended to improve small write RPCs to a potentially very large file.

            It may be that we need to make additional changes in order to see the overall improvement of small files. Some areas to investigate to see why this patch isn't showing the expected improvements:

            • what is the improvement when the writes are smaller than a single disk block?
            • what is the improvement when multiple clients are doing interleaved writes to the same file? This can be tested relatively easily with IOR and multiple client nodes ("ior -w -b 32 -t 32 -s 65536 -N 8 -i 10 -o /mnt/lustre/testfile" runs on 8 clients and does 65536 interleaved 32-byte writes per client; see the launch sketch after this list).
            • what impact does NRS object-based round-robin (ORR) have when doing small writes to a single file? This should sort the writes by file offset, but it may be that short writes also need to be cached on the OST so that they can avoid a synchronous read-modify-write on the disk. This might be more easily tested with a ZFS OSD, which already does write caching, while the ldiskfs OSD would need changes to the IO path in order to cache small writes.
            • in the shared single-file interleaved-write case, are the clients doing lockless writes? If not, the lock contention and overhead of doing an LDLM enqueue/cancel for each write may easily dominate over the improvement from the small-write patch. For sub-page writes, it might also be that there needs to be some IO fastpath that bypasses the client page cache so that it can avoid a read-modify-write for the local page.
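
            As a minimal sketch (not part of the original suggestion), the interleaved-write test above could be launched with MPI. The OpenMPI launcher and the "clients" hostfile are assumptions; the IOR options simply repeat the command quoted in the list:

            # assumes OpenMPI and a hostfile "clients" listing the 8 client nodes (hypothetical)
            # -w write-only, -b 32 block size, -t 32 transfer size, -s 65536 segments,
            # -N 8 tasks, -i 10 repetitions, single shared file on Lustre
            mpirun -np 8 --hostfile clients \
                ior -w -b 32 -t 32 -s 65536 -N 8 -i 10 -o /mnt/lustre/testfile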

            I suspect that if there were multiple clients doing small writes

            adilger Andreas Dilger added a comment

            I was thinking of another potential area where this short IO could improve performance significantly (and give a good reason to land it): when many clients are writing to the same object. Is it possible for you to run a test with multiple clients running IOR, writing <= 4kB interleaved chunks to the same 1-stripe file? Ideally this would use server-side locking for the writes, so that there is minimal contention. It might even be that submitting smaller IOs (say 32 bytes) would give even more of a boost to this patch, since the client would not need to do the read-modify-write for full-page writes that it does today.
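
            A sketch of how such a run could be set up (the file name, client count, hostfile, and 32-byte transfer size are illustrative assumptions, not prescriptions from this ticket):

            # sketch only: 1-stripe shared file, several clients writing interleaved 32-byte chunks
            lfs setstripe -c 1 /mnt/lustre/shared_1stripe
            mpirun -np 8 --hostfile clients \
                ior -w -b 32 -t 32 -s 65536 -o /mnt/lustre/shared_1stripe
            # enabling server-side/lockless locking for the writes is a separate step not shown here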

            If this feature can show some significant performance improvements (say 3-4 times faster, though I'd expect possibly much more) then I would be happy to work on getting this feature landed.

            adilger Andreas Dilger added a comment

            > In talking with Eric, one concern with using inline bulk data is that this can increase the request size enough to cause the routers to use 1MB buffers for handling the short IO requests, and potentially cause the routers to run out of buffers.

            That is too bad for the routers. Routers should have more than two buffer sizes for requests; in any case, we send the transfer size as part of the LNet header.

            shadow Alexey Lyashkov added a comment

            I've tried several times over the past couple of weeks to test this patch with master over the WAN, but every time I do a direct IO read or write I get an LBUG:

            [root@test tmp]# dd if=test of=/dev/null bs=4k iflag=direct

            Message from syslogd@test at Nov 27 03:44:12 ...
            kernel:LustreError: 19403:0:(rw26.c:483:ll_direct_IO_26()) ASSERTION( obj->cob_transient_pages == 0 ) failed:

            Is this an already known issue with direct IO on master?

            jfilizetti Jeremy Filizetti added a comment

            Xyratex-bug-id: MRP-320

            nrutman Nathan Rutman added a comment

            Alex, if you want to run a test 10 times and get the average, I recommend IOR's "-i" option. Otherwise it's less obvious to others that you did more than eyeball the numbers and pick a pseudo-average. With a write performance difference of only 2%, and overall throughput numbers that are so low, it is hard to tell whether the results are statistically significant.
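
            For example (a sketch reusing the IOR command line from the posted results), folding the 10 runs into one invocation lets IOR report the mean and standard deviation itself:

            # same command as in the results, plus "-i 10" for 10 built-in repetitions
            ior -a POSIX -t 4k -b 1G -B -i 10 -o /mnt/lustre/mmap/mmap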

            morrone Christopher Morrone (Inactive) added a comment

            In talking with Eric, one concern with using inline bulk data is that this can increase the request size enough to cause the routers to use 1MB buffers for handling the short IO requests, and potentially cause the routers to run out of buffers.

            With RDMA RPCs the number of inflight bulk requests is limited by the number of service threads (typically 512*num_osts), but with inline bulk data the number of inflight requests is much larger (8*num_clients*num_osts).

            In order to avoid consuming all of the large buffers on the routers, either a third pool for 8kB requests is needed (in addition to the 4kB and 1MB pools) or the small request (4kB) pool should be modified to use an 8kB buffer size.
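
            To put rough, purely illustrative numbers on that (the client and OST counts here are hypothetical): with 16 OSTs the RDMA path caps inflight bulk requests at about 512 * 16 = 8192, while with 1000 clients the inline path allows up to 8 * 1000 * 16 = 128000 inflight requests, roughly 15 times the potential buffer demand on the routers.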

            adilger Andreas Dilger added a comment

            I got new test results from ramfs.

            IOR with short IO (1 client, IB, ost and mds on the ramfs, 10 runs, average result)

            Command line used: IOR -a POSIX -t 4k -b 1G -B -o /mnt/lustre/mmap/mmap
            Machine: Linux mrpcli9
            
            Summary:
                    api                = POSIX
                    test filename      = /mnt/lustre/mmap/mmap
                    access             = single-shared-file
                    ordering in a file = sequential offsets
                    ordering inter file= no tasks offsets
                    clients            = 1 (1 per node)
                    repetitions        = 1
                    xfersize           = 4096 bytes
                    blocksize          = 1 GiB
                    aggregate filesize = 1 GiB
            
            Operation  Max (MiB)  Min (MiB)  Mean (MiB)   Std Dev  Max (OPs)  Min (OPs)  Mean (OPs)   Std Dev  Mean (s)
            ---------  ---------  ---------  ----------   -------  ---------  ---------  ----------   -------  --------
            write          10.59      10.59       10.59      0.00    2709.96    2709.96     2709.96      0.00  96.73352   EXCEL
            read           14.00      14.00       14.00      0.00    3584.71    3584.71     3584.71      0.00  73.12840   EXCEL
            
            Max Write: 10.59 MiB/sec (11.10 MB/sec)
            Max Read:  14.00 MiB/sec (14.68 MB/sec)
            
            Run finished: Mon Oct 22 10:31:36 2012
            
            real    2m49.891s
            user    0m0.537s
            sys     1m12.616s
            

            IOR without short IO (1 client, IB, ost and mds on the ramfs, 10 runs, average result)

            Command line used: IOR -a POSIX -t 4k -b 1G -B -o /mnt/lustre/mmap/mmap
            Machine: Linux mrpcli9
            
            Summary:
                    api                = POSIX
                    test filename      = /mnt/lustre/mmap/mmap
                    access             = single-shared-file
                    ordering in a file = sequential offsets
                    ordering inter file= no tasks offsets
                    clients            = 1 (1 per node)
                    repetitions        = 1
                    xfersize           = 4096 bytes
                    blocksize          = 1 GiB
                    aggregate filesize = 1 GiB
            
            Operation  Max (MiB)  Min (MiB)  Mean (MiB)   Std Dev  Max (OPs)  Min (OPs)  Mean (OPs)   Std Dev  Mean (s)
            ---------  ---------  ---------  ----------   -------  ---------  ---------  ----------   -------  --------
            write          10.36      10.36       10.36      0.00    2651.19    2651.19     2651.19      0.00  98.87794   EXCEL
            read           12.64      12.64       12.64      0.00    3235.79    3235.79     3235.79      0.00  81.01380   EXCEL
            
            Max Write: 10.36 MiB/sec (10.86 MB/sec)
            Max Read:  12.64 MiB/sec (13.25 MB/sec)
            
            Run finished: Tue Oct 23 02:12:21 2012
            
            real    2m59.920s
            user    0m0.512s
            sys     1m9.490s
            

            dd if=/dev/zero of=$FILE bs=4096 count=300000 oflag=direct (1 client, IB, ost and mds on the ramfs)
            short IO: 113.5-116.0 sec
            no short IO: 116.5-118.5 sec
            multiop $TARGET OsMRUc on 1.2 GB target file (1 client, IB, ost and mds on the ramfs, 10 iterations)
            short IO: 195.6 sec
            no short IO: 199.2 sec

            aboyko Alexander Boyko added a comment

            Eric, I recall you having some thoughts about this. The current patch limits the bulk request size to be <= one page of data (+ overhead), which isn't out of line with MDS requests which can have up to 4kB for a symlink or other pathname component.

            I think it is unavoidable that, if we want low-latency small IOs, they be done without extra round trips, but I would have expected the performance improvement to be much better than a few percent... Perhaps testing against a ramdisk OST would give us a better idea of the upper limit of performance for this patch?
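
            One rough way to stand up a ram-backed OST for such a test, as a sketch (the device size, fsname, MGS NID, index, and mount point are all hypothetical; the ramfs-backed setup used in the results elsewhere in this ticket would serve the same purpose):

            modprobe brd rd_nr=1 rd_size=4194304          # 4 GB ram block device at /dev/ram0 (size in KB)
            mkfs.lustre --ost --fsname=lustre --index=1 --mgsnode=10.0.0.1@tcp /dev/ram0
            mkdir -p /mnt/ost1
            mount -t lustre /dev/ram0 /mnt/ost1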

            adilger Andreas Dilger added a comment

            People

              paf Patrick Farrell (Inactive)
              aboyko Alexander Boyko
              Votes: 0
              Watchers: 16
