  Lustre / LU-8964

use parallel I/O to improve performance on machines with slow single thread performance

Details

    • Type: New Feature
    • Resolution: Duplicate
    • Priority: Major

    Description

      On machines with slow single-thread performance, such as KNL, the I/O performance bottleneck has moved into the code that simply copies memory from one buffer to another (from user space to the kernel, or vice versa). In the current Lustre implementation all I/O is performed in a single thread, which has become a problem on KNL. Performance can be improved significantly by a solution that performs the memory transfer of large buffers in parallel.
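
      To make the idea concrete, here is a minimal sketch of a parallel user-to-kernel copy built on plain kernel workqueues. This is not the actual PIO patch and is not taken from the ticket; the names (pio_copy_chunk, pio_copy_from_user) and the workqueue-based approach are illustrative assumptions only.

      /*
       * Hypothetical sketch, not Lustre code: split one large copy_from_user()
       * into per-chunk work items so several cores share the memory-copy cost
       * of a big buffer.
       */
      #include <linux/kernel.h>
      #include <linux/kthread.h>
      #include <linux/sched.h>
      #include <linux/slab.h>
      #include <linux/uaccess.h>
      #include <linux/workqueue.h>

      struct pio_copy_chunk {
              struct work_struct      work;
              struct mm_struct        *mm;    /* submitter's address space */
              void                    *dst;   /* kernel destination */
              const void __user       *src;   /* user-space source */
              size_t                  len;
              int                     rc;
      };

      static void pio_copy_worker(struct work_struct *work)
      {
              struct pio_copy_chunk *c = container_of(work, struct pio_copy_chunk, work);

              /*
               * Workqueue workers are kernel threads, so borrow the submitter's
               * mm for the copy (use_mm()/unuse_mm() on kernels before 5.8).
               */
              kthread_use_mm(c->mm);
              c->rc = copy_from_user(c->dst, c->src, c->len) ? -EFAULT : 0;
              kthread_unuse_mm(c->mm);
      }

      /* Split one large user-to-kernel copy into up to nchunks parallel pieces. */
      static int pio_copy_from_user(void *dst, const void __user *src,
                                    size_t len, int nchunks)
      {
              struct pio_copy_chunk *chunks;
              size_t stride;
              int i, n = 0, rc = 0;

              if (nchunks < 1)
                      nchunks = 1;
              stride = DIV_ROUND_UP(len, nchunks);

              chunks = kcalloc(nchunks, sizeof(*chunks), GFP_KERNEL);
              if (!chunks)
                      return -ENOMEM;

              while (len) {
                      struct pio_copy_chunk *c = &chunks[n++];

                      c->mm = current->mm;    /* stays valid: we wait below */
                      c->dst = dst;
                      c->src = src;
                      c->len = min(stride, len);
                      INIT_WORK(&c->work, pio_copy_worker);
                      /* Unbound queue: any idle CPU may pick the work up. */
                      queue_work(system_unbound_wq, &c->work);

                      dst += c->len;
                      src += c->len;
                      len -= c->len;
              }

              for (i = 0; i < n; i++) {
                      flush_work(&chunks[i].work);
                      if (!rc)
                              rc = chunks[i].rc;
              }

              kfree(chunks);
              return rc;
      }

      Whether such a split actually wins depends on the chunk size: for small buffers the cost of queueing and waking workers outweighs the copy itself, which is why the ticket is specifically about large buffers.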

       


          Activity

            [LU-8964] use parallel I/O to improve performance on machines with slow single thread performance
            spitzcor Cory Spitz added a comment -

            simmonsja, for the record, you mean LU-12043. LU-12403 is "add e2fsprog support for RHEL-8".


            simmonsja James A Simmons added a comment -

            LU-12403 will do this work correctly.

            simmonsja James A Simmons added a comment -

            Thanks Patrick for the heads up on ktask. I will be watching it closely and give it a spin under this ticket.

            dmiter Dmitry Eremin (Inactive) added a comment -

            Thanks for the slides. I will look at them carefully. For now, though, I disagree that the padata API has a big overhead. It is mostly negligible compared with the other overhead of passing work to a different thread. Having many threads, on the other hand, will lead to scheduler delays when switching under heavy load. So I think padata will work more stably and predictably in this case.

            paf Patrick Farrell (Inactive) added a comment -

            Also, apologies for not posting these last year.

            paf Patrick Farrell (Inactive) added a comment -

            https://www.eofs.eu/_media/events/devsummit17/patrick_farrell_laddevsummit_pio.pdf

            This is old and out of date, but I wanted to make sure these slides were seen.  I think the performance of the readahead code would probably be helped a lot by changes to the parallelization framework (as would the performance of pio itself).

            So slides 8, 9, and 10 would probably be of particular interest here.  There are significant performance improvements available for PIO just by going from padata to something simpler.  Also, the CPU binding behavior of padata is pretty bad - Binding explicitly to one CPU is problematic.  Padata seems to assume the whole machine is dedicated, which is not a friendly assumption.  (I discovered its CPU binding behavior because I saw performance problems - A particular CPU would be busy and the work assigned to that CPU would be delayed, which delays the completion of the whole i/o.  At this time, other CPUs were idle, and not binding to a specific CPU would have allowed one of them to be used.)

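            To make the binding point concrete, here is a hypothetical contrast using plain workqueue calls (this is not padata's internal code): bound dispatch pins each work item to one CPU, while unbound dispatch lets any idle CPU take it.

            #include <linux/workqueue.h>

            /*
             * Bound dispatch (the behavior criticized above): the item is
             * pinned to the chosen CPU, so if that CPU is busy the whole I/O
             * waits behind it even though other CPUs are idle.
             */
            static void dispatch_bound(struct work_struct *work, int cpu)
            {
                    queue_work_on(cpu, system_wq, work);
            }

            /*
             * Unbound dispatch: the workqueue may run the item on any
             * available CPU, avoiding the stall described above.
             */
            static void dispatch_unbound(struct work_struct *work)
            {
                    queue_work(system_unbound_wq, work);
            }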

            dmiter Dmitry Eremin (Inactive) added a comment -

            The last version of the patch doesn't have an issue with RPC splitting. For reads on my VM I see the following:

            with PIO disabled:

                                    read                    write
            pages per rpc         rpcs   % cum % |       rpcs   % cum %
            1:                       3   4   4   |          0   0   0
            2:                       0   0   4   |          0   0   0
            4:                       0   0   4   |          0   0   0
            8:                       0   0   4   |          0   0   0
            16:                      0   0   4   |          0   0   0
            32:                      0   0   4   |          0   0   0
            64:                      0   0   4   |          0   0   0
            128:                     0   0   4   |          0   0   0
            256:                     0   0   4   |          0   0   0
            512:                     1   1   6   |          0   0   0
            1024:                   62  93 100   |          0   0   0
            

            with PIO enabled:

                                    read                    write
            pages per rpc         rpcs   % cum % |       rpcs   % cum %
            1:                       2   2   2   |          0   0   0
            2:                       0   0   2   |          0   0   0
            4:                       0   0   2   |          0   0   0
            8:                       0   0   2   |          0   0   0
            16:                      0   0   2   |          0   0   0
            32:                      0   0   2   |          0   0   0
            64:                      0   0   2   |          0   0   0
            128:                     0   0   2   |          0   0   0
            256:                     1   1   4   |          0   0   0
            512:                     4   5  10   |          0   0   0
            1024:                   61  89 100   |          0   0   0
            

            dmiter Dmitry Eremin (Inactive) added a comment -

            This is a regression in the last version of my patch, where I turned off async readahead when the PIO flag is not enabled. I'm going to fix this. But in any case I cannot avoid a few single-page requests, and I don't think we should try to fix those, because they keep request latency low. They happen when a user read misses a page and async readahead is initiated, but meanwhile only the single requested page is read so that the user application can be unblocked. On the next read iteration the next page is already available from async readahead.

            paf Patrick Farrell (Inactive) added a comment - edited

            Looking at when these smaller reads happen, they're clustered in the middle of the job, to the point where there are no 1024-page reads for a while (I used the D_INODE debug in osc_build_rpc for this).

            This is for the 4000 MiB case.

            This is the first RPC, which is naturally enough 1 page:
            00000008:00000002:0.0F:1512684358.524801:0:5975:0:(osc_request.c:2073:osc_build_rpc()) @@@ 1 pages, aa ffff880034753170. now 1r/0w in flight req@ffff8

            That's followed by a large # of 1024 page RPCs, though with 1 page RPCs mixed in every so often, which seems weird. It looks like there is no point at which we hit a steady state of only 1024 page RPCs.

            Then, here's the first of the set of only 1 page RPCs:
            00000008:00000002:2.0:1512684366.335774:0:5973:0:(osc_request.c:2073:osc_build_rpc()) @@@ 1 pages, aa ffff88023602c170. now 1r/0w in flight req@ffff88

            There are then NO 1024 page RPCs for some thousands of RPCs. Weirdly, at the end it seems to recover and we do some 1024 page RPCs again. Here's the first of those:
            00000008:00000002:0.0:1512684370.524090:0:5974:0:(osc_request.c:2073:osc_build_rpc()) @@@ 1024 pages, aa ffff8800a506c770. now 1r/0w in flight req@fff

            And here's the last RPC period:
            00000008:00000002:1.0:1512684371.129494:0:5972:0:(osc_request.c:2073:osc_build_rpc()) @@@ 1022 pages, aa ffff88021f8c1c70. now 2r/0w in flight req@fff

            So we spend 8 seconds sending mostly large RPCs, then 4 seconds sending only 4 KiB RPCs, then another ~1 second sending large RPCs again.

            That means that, as you said, we're only sending a few % of the data in those RPCs - about 3% in this case.
            But they're taking about 30% of the total time, and it's all in one big lump.

            Something's wrong.

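            Rough arithmetic behind those percentages, assuming 4 KiB pages and taking the ~13 s spanned by the timestamps above:

            3% of 4000 MiB ≈ 120 MiB, i.e. roughly 120 MiB / 4 KiB ≈ 30,000 single-page RPCs
            4 s / (8 s + 4 s + 1 s) ≈ 31% of the elapsed time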

            People

              Assignee: simmonsja James A Simmons
              Reporter: dmiter Dmitry Eremin (Inactive)
              Votes: 0
              Watchers: 26
