Details

    • 8137

    Description

      Perform short I/O (requests <= 4k) without using a bulk RPC transfer.
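
      A minimal sketch of the intended mechanism, using hypothetical names (struct io_request, prepare_write) rather than the actual ptlrpc/OSC code: if the transfer fits in one page, the data travels inside the RPC message itself and no bulk (RDMA) descriptor is set up.

      #include <stddef.h>
      #include <string.h>

      /* Hypothetical names only -- not the actual Lustre structures. */
      #define SHORT_IO_MAX 4096

      struct io_request {
              char   inline_buf[SHORT_IO_MAX]; /* payload carried inside the RPC message */
              size_t inline_len;
              int    use_bulk;                 /* nonzero: fall back to bulk RDMA */
      };

      /* If the transfer fits in one page, embed it in the RPC and skip the
       * bulk descriptor; otherwise take the normal bulk path. */
      static void prepare_write(struct io_request *req, const void *data, size_t len)
      {
              if (len <= SHORT_IO_MAX) {
                      memcpy(req->inline_buf, data, len);
                      req->inline_len = len;
                      req->use_bulk = 0;      /* one request/reply, no bulk transfer */
              } else {
                      req->use_bulk = 1;      /* server moves the pages via RDMA */
              }
      }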

          Activity

            [LU-1757] Short I/O support

            I've tried several times over the past couple of weeks to test this patch with master over the WAN, but every time I do a direct I/O read or write I get an LBUG:

            [root@test tmp]# dd if=test of=/dev/null bs=4k iflag=direct

            Message from syslogd@test at Nov 27 03:44:12 ...
            kernel:LustreError: 19403:0:(rw26.c:483:ll_direct_IO_26()) ASSERTION( obj->cob_transient_pages == 0 ) failed:

            Is this an already known issue with direct IO on master?

            jfilizetti Jeremy Filizetti added a comment

            Xyratex-bug-id: MRP-320

            nrutman Nathan Rutman added a comment

            Alex, if you want to run a test 10 times and get the average, I recommend ior's "-i" option. Otherwise it's less obvious to others that you did more than eyeball the numbers and pick a pseudo-average. With a write performance difference of only 2%, and overall throughput numbers that are so low, it is hard to tell whether the results are statistically significant.

            morrone Christopher Morrone (Inactive) added a comment
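
            For reference, adding the repetitions flag to the command line used in the results below would look like this; with ten repetitions the Mean and Std Dev columns then reflect actual run-to-run variation rather than a single measurement:

            IOR -a POSIX -t 4k -b 1G -B -i 10 -o /mnt/lustre/mmap/mmap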

            In talking with Eric, one concern with using inline bulk data is that this can increase the request size enough to cause the routers to use 1MB buffers for handling the short IO requests, and potentially cause the routers to run out of buffers.

            With RDMA RPCs the number of in-flight bulk requests is limited by the number of service threads (typically 512*num_osts), but with inline bulk data the number of in-flight requests is much larger (8*num_clients*num_osts).

            In order to avoid consuming all of the large buffers on the routers, either a third pool for 8kB requests is needed (in addition to the 4kB and 1MB pools) or the small request (4kB) pool should be modified to use an 8kB buffer size.

            adilger Andreas Dilger added a comment
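
            A back-of-envelope sketch of the pool-selection concern (hypothetical code, not the LNet router implementation): a short-IO write request carries up to ~4kB of inline data on top of the normal request header, so it no longer fits a 4kB buffer and, absent an 8kB pool, would have to be served from the 1MB pool.

            enum rtr_pool { POOL_4K, POOL_8K, POOL_1M };

            /* Hypothetical router-side buffer selection by incoming message size.
             * Worst-case in-flight short-IO requests ~ 8 * num_clients * num_osts,
             * versus bulk RPCs bounded by service threads (~512 * num_osts). */
            static enum rtr_pool pick_pool(unsigned long msg_size)
            {
                    if (msg_size <= 4096)
                            return POOL_4K;
                    if (msg_size <= 8192)           /* proposed third pool for short IO */
                            return POOL_8K;
                    return POOL_1M;                 /* otherwise a large (1MB) buffer */
            }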

            I got new test results from ramfs.

            IOR with short IO (1 client, IB, ost and mds on the ramfs, 10 runs, average result)

            Command line used: IOR -a POSIX -t 4k -b 1G -B -o /mnt/lustre/mmap/mmap
            Machine: Linux mrpcli9
            
            Summary:
                    api                = POSIX
                    test filename      = /mnt/lustre/mmap/mmap
                    access             = single-shared-file
                    ordering in a file = sequential offsets
                    ordering inter file= no tasks offsets
                    clients            = 1 (1 per node)
                    repetitions        = 1
                    xfersize           = 4096 bytes
                    blocksize          = 1 GiB
                    aggregate filesize = 1 GiB
            
            Operation  Max (MiB)  Min (MiB)  Mean (MiB)   Std Dev  Max (OPs)  Min (OPs)  Mean (OPs)   Std Dev  Mean (s)
            ---------  ---------  ---------  ----------   -------  ---------  ---------  ----------   -------  --------
            write          10.59      10.59       10.59      0.00    2709.96    2709.96     2709.96      0.00  96.73352   EXCEL
            read           14.00      14.00       14.00      0.00    3584.71    3584.71     3584.71      0.00  73.12840   EXCEL
            
            Max Write: 10.59 MiB/sec (11.10 MB/sec)
            Max Read:  14.00 MiB/sec (14.68 MB/sec)
            
            Run finished: Mon Oct 22 10:31:36 2012
            
            real    2m49.891s
            user    0m0.537s
            sys     1m12.616s
            

            IOR without short IO (1 client, IB, ost and mds on the ramfs, 10 runs, average result)

            Command line used: IOR -a POSIX -t 4k -b 1G -B -o /mnt/lustre/mmap/mmap
            Machine: Linux mrpcli9
            
            Summary:
                    api                = POSIX
                    test filename      = /mnt/lustre/mmap/mmap
                    access             = single-shared-file
                    ordering in a file = sequential offsets
                    ordering inter file= no tasks offsets
                    clients            = 1 (1 per node)
                    repetitions        = 1
                    xfersize           = 4096 bytes
                    blocksize          = 1 GiB
                    aggregate filesize = 1 GiB
            
            Operation  Max (MiB)  Min (MiB)  Mean (MiB)   Std Dev  Max (OPs)  Min (OPs)  Mean (OPs)   Std Dev  Mean (s)
            ---------  ---------  ---------  ----------   -------  ---------  ---------  ----------   -------  --------
            write          10.36      10.36       10.36      0.00    2651.19    2651.19     2651.19      0.00  98.87794   EXCEL
            read           12.64      12.64       12.64      0.00    3235.79    3235.79     3235.79      0.00  81.01380   EXCEL
            
            Max Write: 10.36 MiB/sec (10.86 MB/sec)
            Max Read:  12.64 MiB/sec (13.25 MB/sec)
            
            Run finished: Tue Oct 23 02:12:21 2012
            
            real    2m59.920s
            user    0m0.512s
            sys     1m9.490s
            

            dd if=/dev/zero of=$FILE bs=4096 count=300000 oflag=direct (1 client, IB, ost and mds on the ramfs)
            short IO: 113.5-116.0 sec
            no short IO: 116.5-118.5 sec
            multiop $TARGET OsMRUc on 1.2 GB target file (1 client, IB, ost and mds on the ramfs, 10 iterations)
            short IO: 195.6 sec
            no short IO: 199.2 sec

            aboyko Alexander Boyko added a comment

            Eric, I recall you having some thoughts about this. The current patch limits the bulk request size to be <= one page of data (+ overhead), which isn't out of line with MDS requests, which can carry up to 4kB for a symlink or other pathname component.

            I think it is unavoidable that, if we want low-latency small IOs, they must be done without extra round trips, but I would have thought the performance improvement would be much better than a few percent... Perhaps testing against a ramdisk OST would give us a better idea of the upper limit of performance for this patch?

            adilger Andreas Dilger added a comment
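
            As a rough, assumed illustration of the round-trip argument (the RTT figure below is made up, not measured here): a bulk 4kB write costs roughly two network round trips (request, server-initiated bulk transfer, reply), while an inline short-IO write costs one, so per stream on a 50ms RTT WAN link:

            bulk:   ~2 x 50 ms = 100 ms per IO  ->  ~10 IO/s  ->  ~0.04 MB/s
            inline: ~1 x 50 ms =  50 ms per IO  ->  ~20 IO/s  ->  ~0.08 MB/s

            This is why a high-latency link should show the benefit far more clearly than the low-latency IB setup used for the ramfs numbers.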

            Jeremy, if you (or someone you know) have the ability to do so, it would be great to get some performance benchmarks on this patch over high-latency links. As it stands, getting only a few percent improvement for small IO sizes (7.7MB/s to 8.3MB/s) isn't compelling.

            Alexander, what was the back-end storage used for this test? If it was a disk, then the IOPS rate would be the limiting factor, though 100000 writes in 52s is about 2000 IOPS, so probably a RAID-10 array or SSD? While I think that this could help the performance, I suspect that a closer investigation of where the actual overhead lies would help. Is there a need for more RPCs in flight with small IOs? Is the latency in the server stack or in RPC handling?

            adilger Andreas Dilger added a comment
            pjones Peter Jones added a comment -

            Ah yes I think that you are right Jeremy - thanks!


            Peter, the cherry-picked patch that was added to b2_1, b2_3 and master was only for the connect flags, to reserve them; the full patch still doesn't appear to have landed. If it has, can you provide the commit, because I can't find it?

            jfilizetti Jeremy Filizetti added a comment
            pjones Peter Jones added a comment -

            Landed for 2.4


            Peter, only the reservation of the feature flag has landed, not the actual code to implement it.

            adilger Andreas Dilger added a comment

            People

              Assignee: paf Patrick Farrell (Inactive)
              Reporter: aboyko Alexander Boyko