Details

    • 8137

    Description

      Perform short I/O (requests <= 4kB) without a bulk RPC.

      Attachments

        Issue Links

          Activity

            [LU-1757] Short I/O support

            Patrick Farrell (paf@cray.com) uploaded a new patch: https://review.whamcloud.com/27767
            Subject: LU-1757 brw: add short io osc/ost transfer.
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 01056e12846a73c041da92d8a4f216f2641ca1cc

            gerrit Gerrit Updater added a comment

            I want to reserve OBDO flag for short io
            http://review.whamcloud.com/8182

            Andreas, right now I don't have the time or resources to check your suggestion, and the short IO patch is outdated for master and requires rework. Data-on-MDT looks very good for short IO and would also need another patch.
            We did not do the lockless test for shared files, but single-client page writes with oflag=direct did not show a significant improvement.

            aboyko Alexander Boyko added a comment

            Alexander, I saw that the patch for this feature was abandoned; however, small writes are definitely an area where Lustre could use a considerable amount of improvement. I'm still hopeful that there may be some workloads on which this feature could show significant performance improvements, or that it could at least show what other work needs to be done in addition to this patch. The Data-on-MDT work is more concerned with small files and is definitely orthogonal to this small-write patch, which is intended to improve small write RPCs to a potentially very large file.

            It may be that we need to make additional changes in order to see the overall improvement of small files. Some areas to investigate to see why this patch isn't showing the expected improvements:

            • what is the improvement when the writes are smaller than a single disk block?
            • what is the improvement when multiple clients are doing interleaved writes to the same file? This can be tested relatively easily with IOR and multiple client nodes ("ior -w -b 32 -t 32 -s 65536 -N 8 -i 10 -o /mnt/lustre/testfile" runs on 8 clients and does 65536 interleaved 32-byte writes per client; see the launch sketch after this list).
            • what impact does NRS object-based round-robin (ORR) have when doing small writes to a single file? This should sort the writes by file offset, but it may be that short writes also need to be cached on the OST so that they can avoid a synchronous read-modify-write on the disk. This might be more easily tested with a ZFS OSD, which already does write caching, while the ldiskfs OSD would need changes to the IO path in order to cache small writes.
            • in the shared single-file interleaved-write case, are the clients doing lockless writes? If not, the lock contention and overhead of doing an LDLM enqueue/cancel for each write may easily dominate over the improvement from the small-write patch. For sub-page writes, it might also be that there needs to be some IO fastpath that bypasses the client page cache so that it can avoid a read-modify-write for the local page.
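
            As a minimal sketch (not part of the original suggestion), the interleaved-write test above could be launched with MPI. The OpenMPI launcher and the "clients" hostfile are assumptions; the IOR options simply repeat the command quoted in the list:

            # assumes OpenMPI and a hostfile "clients" listing the 8 client nodes (hypothetical)
            # -w write-only, -b 32 block size, -t 32 transfer size, -s 65536 segments,
            # -N 8 tasks, -i 10 repetitions, single shared file on Lustre
            mpirun -np 8 --hostfile clients \
                ior -w -b 32 -t 32 -s 65536 -N 8 -i 10 -o /mnt/lustre/testfile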

            I suspect that if there were multiple clients doing small writes

            adilger Andreas Dilger added a comment

            I was thinking of another potential area where this short IO could improve performance significantly (and give a good reason to land it): when many clients are writing to the same object. Is it possible for you to run a test with multiple clients running IOR, writing <= 4kB interleaved chunks to the same 1-stripe file? Ideally this would use server-side locking for the writes, so that there is minimal contention. It might even be that submitting smaller IOs (say 32 bytes) would give even more of a boost to this patch, since the client would not need to do the read-modify-write for full-page writes that it does today.
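
            A sketch of how such a run could be set up (the file name, client count, hostfile, and 32-byte transfer size are illustrative assumptions, not prescriptions from this ticket):

            # sketch only: 1-stripe shared file, several clients writing interleaved 32-byte chunks
            lfs setstripe -c 1 /mnt/lustre/shared_1stripe
            mpirun -np 8 --hostfile clients \
                ior -w -b 32 -t 32 -s 65536 -o /mnt/lustre/shared_1stripe
            # enabling server-side/lockless locking for the writes is a separate step not shown here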

            If this feature can show some significant performance improvements (say 3-4 times faster, though I'd expect possibly much more) then I would be happy to work on getting this feature landed.

            adilger Andreas Dilger added a comment

            > In talking with Eric, one concern with using inline bulk data is that this can increase the request size enough to cause the routers to use 1MB buffers for handling the short IO requests, and potentially cause the routers to run out of buffers.

            That is too bad for the routers. Routers should have more than two buffer sizes for requests; in any case, we send the transfer size as part of the LNet header.

            shadow Alexey Lyashkov added a comment

            I've tried several times over the past couple of weeks to test this patch with master over the WAN, but every time I do a direct IO read or write I get an LBUG:

            [root@test tmp]# dd if=test of=/dev/null bs=4k iflag=direct

            Message from syslogd@test at Nov 27 03:44:12 ...
            kernel:LustreError: 19403:0:(rw26.c:483:ll_direct_IO_26()) ASSERTION( obj->cob_transient_pages == 0 ) failed:

            Is this an already known issue with direct IO on master?

            jfilizetti Jeremy Filizetti added a comment

            Xyratex-bug-id: MRP-320

            nrutman Nathan Rutman added a comment

            Alex, if you want to run a test 10 times and get the average, I recommend IOR's "-i" option. Otherwise it's less obvious to others that you did more than eyeball the numbers and pick a pseudo-average. With a write performance difference of only 2%, and overall throughput numbers that are so low, it is hard to tell whether the results are statistically significant.
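
            For example (a sketch reusing the IOR command line from the posted results), folding the 10 runs into one invocation lets IOR report the mean and standard deviation itself:

            # same command as in the results, plus "-i 10" for 10 built-in repetitions
            ior -a POSIX -t 4k -b 1G -B -i 10 -o /mnt/lustre/mmap/mmap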

            morrone Christopher Morrone (Inactive) added a comment

            In talking with Eric, one concern with using inline bulk data is that this can increase the request size enough to cause the routers to use 1MB buffers for handling the short IO requests, and potentially cause the routers to run out of buffers.

            With RDMA RPCs the number of inflight bulk requests is limited by the number of service threads (typically 512*num_osts), but with inline bulk data the number of inflight requests is much larger (8*num_clients*num_osts).

            In order to avoid consuming all of the large buffers on the routers, either a third pool for 8kB requests is needed (in addition to the 4kB and 1MB pools) or the small request (4kB) pool should be modified to use an 8kB buffer size.
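
            To put rough, purely illustrative numbers on that (the client and OST counts here are hypothetical): with 16 OSTs the RDMA path caps inflight bulk requests at about 512 * 16 = 8192, while with 1000 clients the inline path allows up to 8 * 1000 * 16 = 128000 inflight requests, roughly 15 times the potential buffer demand on the routers.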

            adilger Andreas Dilger added a comment

            I got new test results from ramfs.

            IOR with short IO (1 client, IB, ost and mds on the ramfs, 10 runs, average result)

            Command line used: IOR -a POSIX -t 4k -b 1G -B -o /mnt/lustre/mmap/mmap
            Machine: Linux mrpcli9
            
            Summary:
                    api                = POSIX
                    test filename      = /mnt/lustre/mmap/mmap
                    access             = single-shared-file
                    ordering in a file = sequential offsets
                    ordering inter file= no tasks offsets
                    clients            = 1 (1 per node)
                    repetitions        = 1
                    xfersize           = 4096 bytes
                    blocksize          = 1 GiB
                    aggregate filesize = 1 GiB
            
            Operation  Max (MiB)  Min (MiB)  Mean (MiB)   Std Dev  Max (OPs)  Min (OPs)  Mean (OPs)   Std Dev  Mean (s)
            ---------  ---------  ---------  ----------   -------  ---------  ---------  ----------   -------  --------
            write          10.59      10.59       10.59      0.00    2709.96    2709.96     2709.96      0.00  96.73352   EXCEL
            read           14.00      14.00       14.00      0.00    3584.71    3584.71     3584.71      0.00  73.12840   EXCEL
            
            Max Write: 10.59 MiB/sec (11.10 MB/sec)
            Max Read:  14.00 MiB/sec (14.68 MB/sec)
            
            Run finished: Mon Oct 22 10:31:36 2012
            
            real    2m49.891s
            user    0m0.537s
            sys     1m12.616s
            

            IOR without short IO (1 client, IB, ost and mds on the ramfs, 10 runs, average result)

            Command line used: IOR -a POSIX -t 4k -b 1G -B -o /mnt/lustre/mmap/mmap
            Machine: Linux mrpcli9
            
            Summary:
                    api                = POSIX
                    test filename      = /mnt/lustre/mmap/mmap
                    access             = single-shared-file
                    ordering in a file = sequential offsets
                    ordering inter file= no tasks offsets
                    clients            = 1 (1 per node)
                    repetitions        = 1
                    xfersize           = 4096 bytes
                    blocksize          = 1 GiB
                    aggregate filesize = 1 GiB
            
            Operation  Max (MiB)  Min (MiB)  Mean (MiB)   Std Dev  Max (OPs)  Min (OPs)  Mean (OPs)   Std Dev  Mean (s)
            ---------  ---------  ---------  ----------   -------  ---------  ---------  ----------   -------  --------
            write          10.36      10.36       10.36      0.00    2651.19    2651.19     2651.19      0.00  98.87794   EXCEL
            read           12.64      12.64       12.64      0.00    3235.79    3235.79     3235.79      0.00  81.01380   EXCEL
            
            Max Write: 10.36 MiB/sec (10.86 MB/sec)
            Max Read:  12.64 MiB/sec (13.25 MB/sec)
            
            Run finished: Tue Oct 23 02:12:21 2012
            
            real    2m59.920s
            user    0m0.512s
            sys     1m9.490s
            

            dd if=/dev/zero of=$FILE bs=4096 count=300000 oflag=direct (1 client, IB, ost and mds on the ramfs)
            short IO: 113.5-116.0 sec
            no short IO: 116.5-118.5 sec
            multiop $TARGET OsMRUc on 1.2 GB target file (1 client, IB, ost and mds on the ramfs, 10 iterations)
            short IO: 195.6 sec
            no short IO: 199.2 sec

            aboyko Alexander Boyko added a comment

            Eric, I recall you having some thoughts about this. The current patch limits the bulk request size to be <= one page of data (+ overhead), which isn't out of line with MDS requests which can have up to 4kB for a symlink or other pathname component.

            I think it is unavoidable that, if we want low-latency small IOs, they be done without extra round trips, but I would have expected the performance improvement to be much better than a few percent... Perhaps testing against a ramdisk OST would give us a better idea of the upper limit of performance for this patch?
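
            One rough way to stand up a ram-backed OST for such a test, as a sketch (the device size, fsname, MGS NID, index, and mount point are all hypothetical; the ramfs-backed setup used in the results elsewhere in this ticket would serve the same purpose):

            modprobe brd rd_nr=1 rd_size=4194304          # 4 GB ram block device at /dev/ram0 (size in KB)
            mkfs.lustre --ost --fsname=lustre --index=1 --mgsnode=10.0.0.1@tcp /dev/ram0
            mkdir -p /mnt/ost1
            mount -t lustre /dev/ram0 /mnt/ost1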

            adilger Andreas Dilger added a comment

            People

              paf Patrick Farrell (Inactive)
              aboyko Alexander Boyko
              Votes: 0
              Watchers: 16
