LU-8964: use parallel I/O to improve performance on machines with slow single thread performance

Details

    • Type: New Feature
    • Resolution: Duplicate
    • Priority: Major

    Description

      On machines with slow single-thread performance, such as KNL, the I/O performance bottleneck has moved into the code that simply copies memory from one buffer to another (from user space to the kernel or vice versa). In the current Lustre implementation all I/O is performed in a single thread, and this has become an issue on KNL. Performance could be improved significantly by a solution that performs the memory transfer of large buffers in parallel.
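
      To make the idea concrete, here is a minimal user-space sketch (not Lustre code) that splits one large buffer copy across several POSIX threads. The thread count and chunking are arbitrary, and in the real client the work to parallelize would be the copy between user pages and kernel buffers rather than a plain memcpy():

      /* Illustrative sketch only: split one large copy across N threads. */
      #include <pthread.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>

      #define NTHREADS 4   /* arbitrary; a real implementation would size this from the CPU topology */

      struct copy_chunk {
              char *dst;
              const char *src;
              size_t len;
      };

      static void *copy_worker(void *arg)
      {
              struct copy_chunk *c = arg;

              memcpy(c->dst, c->src, c->len);   /* each thread copies its own slice */
              return NULL;
      }

      static void parallel_copy(char *dst, const char *src, size_t len)
      {
              pthread_t tid[NTHREADS];
              struct copy_chunk chunk[NTHREADS];
              size_t per = len / NTHREADS;
              int i;

              for (i = 0; i < NTHREADS; i++) {
                      size_t off = (size_t)i * per;

                      chunk[i].dst = dst + off;
                      chunk[i].src = src + off;
                      chunk[i].len = (i == NTHREADS - 1) ? len - off : per;  /* last chunk takes the remainder */
                      pthread_create(&tid[i], NULL, copy_worker, &chunk[i]);
              }
              for (i = 0; i < NTHREADS; i++)
                      pthread_join(tid[i], NULL);
      }

      int main(void)
      {
              size_t len = 256UL << 20;   /* 256 MiB */
              char *src = malloc(len), *dst = malloc(len);

              if (!src || !dst)
                      return 1;
              memset(src, 0xab, len);
              parallel_copy(dst, src, len);
              printf("copied %zu bytes, dst[0]=0x%02x\n", len, (unsigned char)dst[0]);
              free(src);
              free(dst);
              return 0;
      }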

       


          Activity

            spitzcor Cory Spitz added a comment -

            simmonsja, for the record, you mean LU-12043. LU-12403 is "add e2fsprog support for RHEL-8".


            simmonsja James A Simmons added a comment -

            LU-12403 will do this work correctly.

            simmonsja James A Simmons added a comment -

            Thanks Patrick for the heads up on ktask. I will be watching it closely and give it a spin under this ticket.

            dmiter Dmitry Eremin (Inactive) added a comment -

            Thanks for the slides. I will look at them carefully. But for now I disagree that the padata API has a big overhead; it is mostly negligible compared with the other overhead of passing work to a different thread. Having many threads, on the other hand, leads to scheduler delays when switching under heavy load. So I think padata will behave more stably and predictably in this case.
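
            For context, this is roughly the padata submission pattern under discussion, written against the 4.x-era kernel API (padata_alloc_possible()/padata_do_parallel() taking the instance and a callback CPU directly; the API changed in later kernels). The pio_chunk/pio_* names are made up for illustration and are not the actual patch code:

            #include <linux/kernel.h>
            #include <linux/padata.h>
            #include <linux/slab.h>
            #include <linux/string.h>
            #include <linux/workqueue.h>

            /* hypothetical per-chunk work item, embedding the padata descriptor */
            struct pio_chunk {
                    struct padata_priv padata;
                    void *dst;
                    const void *src;
                    size_t len;
            };

            static void pio_chunk_parallel(struct padata_priv *padata)
            {
                    struct pio_chunk *c = container_of(padata, struct pio_chunk, padata);

                    memcpy(c->dst, c->src, c->len);   /* stand-in for the real per-chunk copy */
                    padata_do_serial(padata);         /* hand back for in-order completion */
            }

            static void pio_chunk_serial(struct padata_priv *padata)
            {
                    /* runs in submission order on the callback CPU */
                    kfree(container_of(padata, struct pio_chunk, padata));
            }

            static int pio_submit(struct padata_instance *pinst, struct pio_chunk *c,
                                  int cb_cpu)
            {
                    c->padata.parallel = pio_chunk_parallel;
                    c->padata.serial   = pio_chunk_serial;
                    return padata_do_parallel(pinst, &c->padata, cb_cpu);
            }

            /*
             * One-time setup, roughly:
             *     wq    = alloc_workqueue("pio", WQ_MEM_RECLAIM | WQ_CPU_INTENSIVE, 1);
             *     pinst = padata_alloc_possible(wq);
             *     padata_start(pinst);
             */

            The overhead question above is mostly about the worker wakeups and ordering this pattern implies under load, not about the API calls themselves.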

            paf Patrick Farrell (Inactive) added a comment -

            Also, apologies for not posting these last year.

            paf Patrick Farrell (Inactive) added a comment -

            https://www.eofs.eu/_media/events/devsummit17/patrick_farrell_laddevsummit_pio.pdf

            This is old and out of date, but I wanted to make sure these slides were seen. I think the performance of the readahead code would probably be helped a lot by changes to the parallelization framework (as would the performance of PIO itself).

            Slides 8, 9, and 10 would probably be of particular interest here. There are significant performance improvements available for PIO just by going from padata to something simpler. Also, the CPU binding behavior of padata is pretty bad: binding explicitly to one CPU is problematic, and padata seems to assume the whole machine is dedicated, which is not a friendly assumption. (I discovered its CPU binding behavior because I saw performance problems: a particular CPU would be busy and the work assigned to it would be delayed, which delays completion of the whole I/O. At that time other CPUs were idle, and not binding to a specific CPU would have allowed one of them to be used.)
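
            As a sketch of the "going from padata to something simpler" direction (again with hypothetical names, not the actual replacement code): chunks submitted to an unbound workqueue are not pinned to a particular CPU, so a busy core does not hold up one chunk while other cores sit idle:

            #include <linux/completion.h>
            #include <linux/errno.h>
            #include <linux/kernel.h>
            #include <linux/string.h>
            #include <linux/workqueue.h>

            /* hypothetical chunk descriptor for illustration */
            struct pio_work {
                    struct work_struct work;
                    struct completion *done;
                    void *dst;
                    const void *src;
                    size_t len;
            };

            static struct workqueue_struct *pio_wq;

            static void pio_work_fn(struct work_struct *work)
            {
                    struct pio_work *w = container_of(work, struct pio_work, work);

                    memcpy(w->dst, w->src, w->len);   /* stand-in for the per-chunk copy */
                    complete(w->done);
            }

            static int pio_wq_init(void)
            {
                    /* WQ_UNBOUND: items may run on any allowed CPU instead of a fixed one */
                    pio_wq = alloc_workqueue("pio", WQ_UNBOUND | WQ_SYSFS, 0);
                    return pio_wq ? 0 : -ENOMEM;
            }

            static void pio_queue_chunk(struct pio_work *w)
            {
                    INIT_WORK(&w->work, pio_work_fn);
                    queue_work(pio_wq, &w->work);     /* any idle worker may pick this up */
            }

            Ordering and completion tracking then fall to the submitter (for example, waiting on the completions), which is the main service padata's serial callback was providing.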

            dmiter Dmitry Eremin (Inactive) added a comment -

            The latest version of the patch does not have an issue with RPC splitting. For reads on my VM I see the following:

            with PIO disabled:

                                    read                    write
            pages per rpc         rpcs   % cum % |       rpcs   % cum %
            1:                       3   4   4   |          0   0   0
            2:                       0   0   4   |          0   0   0
            4:                       0   0   4   |          0   0   0
            8:                       0   0   4   |          0   0   0
            16:                      0   0   4   |          0   0   0
            32:                      0   0   4   |          0   0   0
            64:                      0   0   4   |          0   0   0
            128:                     0   0   4   |          0   0   0
            256:                     0   0   4   |          0   0   0
            512:                     1   1   6   |          0   0   0
            1024:                   62  93 100   |          0   0   0
            

            with PIO enabled:

                                    read                    write
            pages per rpc         rpcs   % cum % |       rpcs   % cum %
            1:                       2   2   2   |          0   0   0
            2:                       0   0   2   |          0   0   0
            4:                       0   0   2   |          0   0   0
            8:                       0   0   2   |          0   0   0
            16:                      0   0   2   |          0   0   0
            32:                      0   0   2   |          0   0   0
            64:                      0   0   2   |          0   0   0
            128:                     0   0   2   |          0   0   0
            256:                     1   1   4   |          0   0   0
            512:                     4   5  10   |          0   0   0
            1024:                   61  89 100   |          0   0   0
            
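
            (For anyone reproducing this: the histograms above are the standard per-OSC client RPC statistics, typically read with "lctl get_param osc.*.rpc_stats" after clearing them with "lctl set_param osc.*.rpc_stats=clear" before the test run.)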

            People

              simmonsja James A Simmons
              dmiter Dmitry Eremin (Inactive)
              Votes: 0
              Watchers: 26
