Lustre / LU-8964

use parallel I/O to improve performance on machines with slow single thread performance

Details

    • Type: New Feature
    • Resolution: Duplicate
    • Priority: Major

    Description

      On machines with slow single-thread performance, such as KNL, the I/O performance bottleneck has moved into the code that simply copies memory from one buffer to another (from user space to kernel space or vice versa). In the current Lustre implementation all I/O is performed in a single thread, and this has become an issue on KNL. Performance can be improved significantly by a solution that performs the memory transfer of large buffers in parallel.
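
      As a rough illustration of the idea (a minimal sketch, not taken from the actual patches), a large copy can be cut into fixed-size chunks, each chunk queued to a kernel workqueue, and the submitter made to wait for all chunks to finish. The names pio_chunk, pio_copy_work, pio_copy and the 1 MiB chunk size are hypothetical, and plain memcpy() stands in for the real copy_to_user()/copy_from_user() path:

      #include <linux/kernel.h>
      #include <linux/atomic.h>
      #include <linux/completion.h>
      #include <linux/slab.h>
      #include <linux/string.h>
      #include <linux/workqueue.h>

      #define PIO_CHUNK_SIZE (1UL << 20)      /* 1 MiB per work item (assumed) */

      struct pio_chunk {
              struct work_struct work;
              void *dst;
              const void *src;
              size_t len;
              struct completion *done;
              atomic_t *pending;
      };

      static void pio_copy_work(struct work_struct *work)
      {
              struct pio_chunk *c = container_of(work, struct pio_chunk, work);

              /* each worker copies its own chunk; real code would use copy_{to,from}_user() */
              memcpy(c->dst, c->src, c->len);
              if (atomic_dec_and_test(c->pending))
                      complete(c->done);
              kfree(c);
      }

      /* split one large copy across the CPUs serving @wq */
      static int pio_copy(struct workqueue_struct *wq, void *dst, const void *src, size_t len)
      {
              DECLARE_COMPLETION_ONSTACK(done);
              atomic_t pending = ATOMIC_INIT(1);      /* one reference held by the submitter */
              size_t off;
              int rc = 0;

              for (off = 0; off < len && rc == 0; off += PIO_CHUNK_SIZE) {
                      struct pio_chunk *c = kmalloc(sizeof(*c), GFP_KERNEL);

                      if (!c) {
                              rc = -ENOMEM;
                              break;
                      }
                      c->dst = (char *)dst + off;
                      c->src = (const char *)src + off;
                      c->len = min_t(size_t, PIO_CHUNK_SIZE, len - off);
                      c->done = &done;
                      c->pending = &pending;
                      atomic_inc(&pending);
                      INIT_WORK(&c->work, pio_copy_work);
                      queue_work(wq, &c->work);
              }

              /* drop the submitter's reference and wait for any chunks still queued */
              if (!atomic_dec_and_test(&pending))
                      wait_for_completion(&done);
              return rc;
      }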

       

      Attachments

        Issue Links

          Activity


            LU-12403 will do this work correctly.

            simmonsja James A Simmons added a comment

            Thanks Patrick for the heads up on ktask. I will be watching it closely and give it a spin under this ticket.

            simmonsja James A Simmons added a comment

            Thanks for the slides. I will look at them carefully. But for now I disagree that the padata API has a big overhead; it's mostly negligible compared with the other overhead of handing work to a different thread. Having many threads, however, will lead to scheduler delays when switching under heavy load. So I think padata will behave more stably and predictably in this case.

            dmiter Dmitry Eremin (Inactive) added a comment

            Also, apologies for not posting these last year.

            paf Patrick Farrell (Inactive) added a comment

            https://www.eofs.eu/_media/events/devsummit17/patrick_farrell_laddevsummit_pio.pdf

            This is old and out of date, but I wanted to make sure these slides were seen.  I think the performance of the readahead code would probably be helped a lot by changes to the parallelization framework (as would the performance of pio itself).

            So slides 8, 9, and 10 would probably be of particular interest here.  There are significant performance improvements available for PIO just by going from padata to something simpler.  Also, the CPU binding behavior of padata is pretty bad: binding explicitly to one CPU is problematic.  Padata seems to assume the whole machine is dedicated, which is not a friendly assumption.  (I discovered its CPU binding behavior because I saw performance problems: a particular CPU would be busy and the work assigned to that CPU would be delayed, which delays the completion of the whole I/O.  At that time, other CPUs were idle, and not binding to a specific CPU would have allowed one of them to be used.)

            paf Patrick Farrell (Inactive) added a comment
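
            As a minimal sketch of the CPU-binding point above (assumed names; not code from the slides or from any Lustre patch): with plain workqueues, padata-style fixed-CPU placement corresponds to queue_work_on(), while an unbound workqueue leaves CPU selection to the scheduler, so an idle CPU can pick the work up.

            #include <linux/errno.h>
            #include <linux/workqueue.h>

            static struct workqueue_struct *pio_wq;   /* hypothetical name */

            static int pio_wq_init(void)
            {
                    /*
                     * WQ_UNBOUND: work items are not pinned to the submitting CPU,
                     * so the scheduler may run them on any allowed (idle) CPU.
                     */
                    pio_wq = alloc_workqueue("pio", WQ_UNBOUND, 0);
                    return pio_wq ? 0 : -ENOMEM;
            }

            static void pio_submit(struct work_struct *work)
            {
                    /* padata-style behaviour would pin the item to one chosen CPU: */
                    /*      queue_work_on(chosen_cpu, pio_wq, work);                */

                    /* ... versus letting the scheduler pick an idle CPU: */
                    queue_work(pio_wq, work);
            }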

            The latest version of the patch doesn't have an issue with RPC splitting. For reads on my VM I see the following:

            with PIO disabled:

                                    read                    write
            pages per rpc         rpcs   % cum % |       rpcs   % cum %
            1:                       3   4   4   |          0   0   0
            2:                       0   0   4   |          0   0   0
            4:                       0   0   4   |          0   0   0
            8:                       0   0   4   |          0   0   0
            16:                      0   0   4   |          0   0   0
            32:                      0   0   4   |          0   0   0
            64:                      0   0   4   |          0   0   0
            128:                     0   0   4   |          0   0   0
            256:                     0   0   4   |          0   0   0
            512:                     1   1   6   |          0   0   0
            1024:                   62  93 100   |          0   0   0
            

            with PIO enabled:

                                    read                    write
            pages per rpc         rpcs   % cum % |       rpcs   % cum %
            1:                       2   2   2   |          0   0   0
            2:                       0   0   2   |          0   0   0
            4:                       0   0   2   |          0   0   0
            8:                       0   0   2   |          0   0   0
            16:                      0   0   2   |          0   0   0
            32:                      0   0   2   |          0   0   0
            64:                      0   0   2   |          0   0   0
            128:                     0   0   2   |          0   0   0
            256:                     1   1   4   |          0   0   0
            512:                     4   5  10   |          0   0   0
            1024:                   61  89 100   |          0   0   0
            
            dmiter Dmitry Eremin (Inactive) added a comment

            This is a regression in the latest version of my patch, where I turned off async readahead when the PIO flag is not enabled. I'm going to fix it. But in any case I cannot avoid several 1K requests, and I think we should not try to fix those, because they give requests low latency. They happen when a user request misses a page and async readahead is initiated; meanwhile the request fetches only the single page and then unblocks the user application. On the next read iteration, the next page is already available from async readahead.

            dmiter Dmitry Eremin (Inactive) added a comment

            Looking at when these smaller reads happen, they're clustered in the middle of the job, to the point where there are no 1024 page reads for a while (I used the D_INODE debug in osc_build_rpc for this).

            This is for the 4000 MiB case.

            This is the first RPC, which is naturally enough 1 page:
            00000008:00000002:0.0F:1512684358.524801:0:5975:0:(osc_request.c:2073:osc_build_rpc()) @@@ 1 pages, aa ffff880034753170. now 1r/0w in flight req@ffff8

            That's followed by a large # of 1024 page RPCs, though with 1 page RPCs mixed in every so often, which seems weird. It looks like there is no point at which we hit a steady state of only 1024 page RPCs.

            Then, here's the first of the set of only 1 page RPCs:
            00000008:00000002:2.0:1512684366.335774:0:5973:0:(osc_request.c:2073:osc_build_rpc()) @@@ 1 pages, aa ffff88023602c170. now 1r/0w in flight req@ffff88

            There are then NO 1024 page RPCs for some thousands of RPCs. Weirdly, at the end it seems to recover and we do some 1024 page RPCs again. Here's the first of those:
            00000008:00000002:0.0:1512684370.524090:0:5974:0:(osc_request.c:2073:osc_build_rpc()) @@@ 1024 pages, aa ffff8800a506c770. now 1r/0w in flight req@fff

            And here's the last RPC period:
            00000008:00000002:1.0:1512684371.129494:0:5972:0:(osc_request.c:2073:osc_build_rpc()) @@@ 1022 pages, aa ffff88021f8c1c70. now 2r/0w in flight req@fff

            So we spend 8 seconds sending mostly large RPCs, then 4 seconds sending only 4 KiB RPCs, then another ~1 second sending large RPCs again.

            That means that, as you said, we're only sending a few % of the data in those RPCs - about 3% in this case.
            But they're taking about 30% of the total time, and it's all in one big lump.

            Something's wrong.

            paf Patrick Farrell (Inactive) added a comment (edited)

            I see what you're getting at, but then I would expect the % of small RPCs to drop as we go up in file size, and in fact I see the opposite. Something goes badly wrong for me around 4 GiB in size, which is 50% of the memory on this VM node (which is unlikely to be a coincidence).

            Examples - 1000 MiB:
            [root@cent7c01 cent7s02]# ls -la ost0; cat ost0 > /dev/null; cat /proc/fs/lustre/osc/cent7s02-OST0000-osc-ffff88022437c800/rpc_stats
            -rw-r--r-- 1 root root 1048576000 Dec  7 16:52 ost0
            snapshot_time: 1512683598.957290341 (secs.nsecs)
            read RPCs in flight: 0
            write RPCs in flight: 0
            pending write pages: 0
            pending read pages: 0

                                    read                    write
            pages per rpc         rpcs   % cum % |       rpcs   % cum %
            1:                      22   8   8   |          0   0   0
            2:                       0   0   8   |          0   0   0
            4:                       0   0   8   |          0   0   0
            8:                       0   0   8   |          0   0   0
            16:                      0   0   8   |          0   0   0
            32:                      0   0   8   |          0   0   0
            64:                      0   0   8   |          0   0   0
            128:                     0   0   8   |          0   0   0
            256:                     0   0   8   |          0   0   0
            512:                     0   0   8   |          0   0   0
            1024:                  250  91 100   |          0   0   0

            2000 MiB:
            [root@cent7c01 cent7s02]# dd if=/dev/zero bs=1M count=2000 of=./ost0; echo 3 > /proc/sys/vm/drop_caches ; echo clear > /sys/fs/lustre/ldlm/namespaces/cent7s02-OST0000-osc-ffff88022437c800/lru_size; echo c > /proc/fs/lustre/osc/cent7s02-OST0000-osc-ffff88022437c800/rpc_stats; cat ost0 > /dev/null; cat /proc/fs/lustre/osc/cent7s02-OST0000-osc-ffff88022437c800/rpc_stats
            2000+0 records in
            2000+0 records out
            2097152000 bytes (2.1 GB) copied, 7.99026 s, 262 MB/s
            snapshot_time: 1512683716.923342172 (secs.nsecs)
            read RPCs in flight: 0
            write RPCs in flight: 0
            pending write pages: 0
            pending read pages: 0

                                    read                    write
            pages per rpc         rpcs   % cum % |       rpcs   % cum %
            1:                      37   6   6   |          0   0   0
            2:                       0   0   6   |          0   0   0
            4:                       0   0   6   |          0   0   0
            8:                       0   0   6   |          0   0   0
            16:                      0   0   6   |          0   0   0
            32:                      0   0   6   |          0   0   0
            64:                      0   0   6   |          0   0   0
            128:                     0   0   6   |          0   0   0
            256:                     0   0   6   |          0   0   0
            512:                     0   0   6   |          0   0   0
            1024:                  500  93 100   |          0   0   0

            3000 MiB:
            [root@cent7c01 cent7s02]# dd if=/dev/zero bs=1M count=3000 of=./ost0; echo 3 > /proc/sys/vm/drop_caches ; echo clear > /sys/fs/lustre/ldlm/namespaces/cent7s02-OST0000-osc-ffff88022437c800/lru_size; echo c > /proc/fs/lustre/osc/cent7s02-OST0000-osc-ffff88022437c800/rpc_stats; cat ost0 > /dev/null; cat /proc/fs/lustre/osc/cent7s02-OST0000-osc-ffff88022437c800/rpc_stats
            3000+0 records in
            3000+0 records out
            3145728000 bytes (3.1 GB) copied, 10.6623 s, 295 MB/s
            snapshot_time: 1512683893.046695412 (secs.nsecs)
            read RPCs in flight: 0
            write RPCs in flight: 0
            pending write pages: 0
            pending read pages: 0

                                    read                    write
            pages per rpc         rpcs   % cum % |       rpcs   % cum %
            1:                      53   6   6   |          0   0   0
            2:                       0   0   6   |          0   0   0
            4:                       0   0   6   |          0   0   0
            8:                       0   0   6   |          0   0   0
            16:                      0   0   6   |          0   0   0
            32:                      0   0   6   |          0   0   0
            64:                      0   0   6   |          0   0   0
            128:                     0   0   6   |          0   0   0
            256:                     0   0   6   |          0   0   0
            512:                     0   0   6   |          0   0   0
            1024:                  750  93 100   |          0   0   0

            4000 MiB:
            [root@cent7c01 cent7s02]# dd if=/dev/zero bs=1M count=4000 of=./ost0; echo 3 > /proc/sys/vm/drop_caches ; echo clear > /sys/fs/lustre/ldlm/namespaces/cent7s02-OST0000-osc-ffff88022437c800/lru_size; echo c > /proc/fs/lustre/osc/cent7s02-OST0000-osc-ffff88022437c800/rpc_stats; cat ost0 > /dev/null; cat /proc/fs/lustre/osc/cent7s02-OST0000-osc-ffff88022437c800/rpc_stats
            4000+0 records in
            4000+0 records out
            4194304000 bytes (4.2 GB) copied, 13.1352 s, 319 MB/s
            snapshot_time: 1512683761.337612432 (secs.nsecs)
            read RPCs in flight: 0
            write RPCs in flight: 0
            pending write pages: 0
            pending read pages: 0

                                    read                    write
            pages per rpc         rpcs   % cum % |       rpcs   % cum %
            1:                    8074  89  89   |          0   0   0
            2:                       0   0  89   |          0   0   0
            4:                       0   0  89   |          0   0   0
            8:                       0   0  89   |          0   0   0
            16:                      0   0  89   |          0   0   0
            32:                      0   0  89   |          0   0   0
            64:                      0   0  89   |          0   0   0
            128:                     0   0  89   |          0   0   0
            256:                     1   0  89   |          0   0   0
            512:                     0   0  89   |          0   0   0
            1024:                  992  10 100   |          0   0   0

            Larger sizes remain problematic. So when I hit the cached mb limit on the node, something goes totally off the rails, I think. Perhaps we're getting that from the kernel, but it's still a major degradation.

            paf Patrick Farrell (Inactive) added a comment (edited)

            Not to downplay the significance of the 4KB RPCs, since there are a lot of them, but it is worthwhile to note that the bulk of the IO is still being transferred by 4MB RPCs - 1729/1731 in the patched case vs 1751 in the unpatched case. So 98% of the data is still using 4MB RPCs, vs 2% of the data is using 4KB RPCs.

            My first guess is that the small RPC sizes are caused by a smaller readahead window at the start being split across multiple threads/RPCs, and then as the readahead window grows it starts using larger reads? It may also be that a small number of 4KB RPCs are being handled separately for each 4MB RPC, but we don't see this because of binning in the histogram. That would have to be checked in the client debug logs with lctl set_param debug=+reada.

            adilger Andreas Dilger added a comment

            People

              simmonsja James A Simmons
              dmiter Dmitry Eremin (Inactive)
              Votes: 0
              Watchers: 26
