
Enable Multiple IB/OPA Endpoints Between Nodes

Details

    • Type: Improvement
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.10.0

    Description

      OPA driver optimizations are based on the MPI model, where multiple endpoints between any two given nodes are expected. To enable this optimization for Lustre, we need to make it possible, via an LND-specific tuneable, to create multiple endpoints and to balance the traffic over them.

      I have already created an experimental patch to test this theory out. I was able to push OPA performance to 12.4 GB/s just by having 2 QPs between the nodes and round-robining messages between them.

      This Jira ticket is for productizing my patch and testing it out thoroughly for OPA and IB. Test results will be posted to this ticket.
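      As a rough sketch of how such an LND tuneable is applied in practice (the conns_per_peer name comes from the patch discussed in the comments below, and the file name here is only illustrative), a single module option in a modprobe.d fragment is all that is required:

      # /etc/modprobe.d/ko2iblnd.conf  (illustrative file name)
      # Create 2 QPs per peer and round-robin messages across them,
      # matching the 2-QP experiment described above.
      options ko2iblnd conns_per_peer=2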

      Attachments

        Issue Links

          Activity

            [LU-8943] Enable Multiple IB/OPA Endpoints Between Nodes

             No, as I mentioned before, only a reboot helps.

            # lustre_rmmod                                                                  
            rmmod: ERROR: Module ko2iblnd is in use
            
            # lsmod|less                                                                    
            Module                  Size  Used by
            ko2iblnd              233790  1 
            ptlrpc               1343928  0 
            obdclass             1744518  1 ptlrpc
            lnet                  483843  3 ko2iblnd,obdclass,ptlrpc
            libcfs                416336  4 lnet,ko2iblnd,obdclass,ptlrpc
            [...]
            
            # lctl network down                                                             
            LNET busy
            
            lnetctl > lnet unconfigure
            unconfigure:
                - lnet:
                      errno: -16
                      descr: "LNet unconfigure error: Device or resource busy"
            lnetctl > lnet unconfigure --all
            unconfigure:
                - lnet:
                      errno: -16
                      descr: "LNet unconfigure error: Device or resource busy"
            
            # lustre_rmmod                                                                  
            rmmod: ERROR: Module ko2iblnd is in use
            
            
            
            
             dmiter Dmitry Eremin (Inactive) added a comment -
            adilger Andreas Dilger added a comment - - edited

            Does "lctl network down" or "lnetctl lnet unconfigure" help?


             I observed strange behavior. It looks like after this commit I cannot unload the ko2iblnd module. LNet stays busy even though everything unmounted successfully. Only a reboot helps.

             dmiter Dmitry Eremin (Inactive) added a comment -
            pjones Peter Jones added a comment -

            Landed for 2.10


            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/25168/
            Subject: LU-8943 lnd: Enable Multiple OPA Endpoints between Nodes
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 7241e68f37962991ef43a6c01b3a83ff67282d88

             gerrit Gerrit Updater added a comment -

            I did update the OPA defaults to set conns_per_peer to 4 when OPA is detected.  I'll also update the manual under LUDOC-374.

             I bumped conns_per_peer to 4 from 3 because the OPA team is going to start recommending a krcvqs default of 4, especially for a low number of cores (e.g. VMs).  Having a conns_per_peer of 4 helps to compensate for the lower krcvqs number, so we should work well out of the box whether krcvqs is 4 or 8.
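             As a quick sanity check, the value the loaded module is actually using can normally be read back through sysfs (this assumes the parameter is exported read-only, as most ko2iblnd tunables are):

             # value in effect for the currently loaded ko2iblnd module
             cat /sys/module/ko2iblnd/parameters/conns_per_peer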

             doug Doug Oucharek (Inactive) added a comment -

            I recommend OPA systems with many cores use conns_per_peer = 3 and these HFI1 parameters:

            options hfi1 krcvqs=8 piothreshold=0 sge_copy_mode=2 wss_threshold=70
            

             Are you going to add these to the /usr/sbin/ko2iblnd-probe script, or will they be set by default in some other manner, or will this be left up to the user to discover and set? At a very minimum there should be an update to the Lustre User Manual (see https://wiki.hpdd.intel.com/display/PUB/Making+changes+to+the+Lustre+Manual), but providing good performance out of the box is preferred.
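             For reference, a minimal modprobe.d sketch that combines both recommendations on a many-core OPA node might look like the following (the file name is only illustrative, and until better defaults ship the ko2iblnd option has to be set explicitly):

             # /etc/modprobe.d/lustre-opa.conf  (illustrative file name)
             # many-core OPA node: 3 connections per peer plus the recommended HFI1 tunings
             options ko2iblnd conns_per_peer=3
             options hfi1 krcvqs=8 piothreshold=0 sge_copy_mode=2 wss_threshold=70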

             adilger Andreas Dilger added a comment -

             Backwards compatibility testing looks good.  An upgraded node that initiates connections will create conns_per_peer connections, and the non-upgraded receiving node will allow that many connections to be created.  However, the non-upgraded node will not "use" all the connections to send messages, only the first one, so performance will not improve.

             If things are reversed (non-upgraded initiator to upgraded receiver), the connection will work as if neither side is upgraded, because it is the initiator that decides how many connections to create, and in this case it will just be one.

            So, to get the performance benefit, both sides of a connection need to be upgraded with this patch and the initiator needs to have conns_per_peer set > 1.

            Based on the attached spreadsheet, I recommend OPA systems with many cores use conns_per_peer = 3 and these HFI1 parameters:

            options hfi1 krcvqs=8 piothreshold=0 sge_copy_mode=2 wss_threshold=70

             However, if you are on a VM or have a limited number of cores, set conns_per_peer = 4 and krcvqs = 4 in the HFI1 parameters.
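             As a sketch, the corresponding modprobe.d fragment for a VM or low-core node would be (file name again only illustrative):

             # /etc/modprobe.d/lustre-opa.conf  (illustrative file name)
             # low-core or virtualized OPA node: extra QPs compensate for the lower krcvqs
             options ko2iblnd conns_per_peer=4
             options hfi1 krcvqs=4 piothreshold=0 sge_copy_mode=2 wss_threshold=70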

             doug Doug Oucharek (Inactive) added a comment -

            I have attached an Excel spreadsheet showing the performance changes with different conns_per_peer settings for both OPA and MLX-QDR.  For OPA, there is a tab showing the change without any HFI1 tunings (i.e. just the defaults) and with the recommended HFI1 tunings.

            Summary: Using this patch with conns_per_peer of 3 and the recommended HFI1 tunings provides good and consistent performance.

            Still to be done: Testing this patch for backwards compatibility.  

             doug Doug Oucharek (Inactive) added a comment -

            The patch for this ticket is showing a lot of promise.  To productize it so we can land it to master, I need to do the following:

             • Rework the code so that only the active side of a connection needs to have conns_per_peer set.  The passive side should just adapt automatically.
            • Make sure backwards compatibility is not broken when this feature is turned on in either the active side or passive side.
            • The code which implements the round-robin behaviour has a potential infinite loop when things go wrong.  I need to add protection against that happening.
            • There is no code to recover a downed connection to get us back to the conns_per_peer level.  I'm not sure I will add that but need to evaluate the situation more.
            • Right now, there is no easy way to see if this feature is active and how well it is working.  I need to add some connection-based stats to be queried by lnetctl so we have a way to validate this feature and monitor it.

             In addition, testing needs to be done to see how much more CPU this feature consumes when it is activated.  We need to measure the costs as well as the benefits.  This all needs to be done with MLX hardware as well as OPA, just to see what happens if this is activated on MLX-based networks.
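             One simple way to capture that cost (a sketch only; the benchmark setup itself is not shown) is to sample CPU utilization with mpstat from the sysstat package while the identical bulk transfer test runs with the feature off and then on, and compare the two logs:

             # baseline run with conns_per_peer=1 (feature effectively off)
             mpstat -P ALL 1 60 > cpu_conns1.log
             # repeat the same test with conns_per_peer=4 and compare
             mpstat -P ALL 1 60 > cpu_conns4.log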

             doug Doug Oucharek (Inactive) added a comment -

            To activate this patch, you need to use the following option:

            options ko2iblnd conns_per_peer=<n>

             Where <n> is the number of QPs you want per peer connection.  At the moment, both sides of the connection must have the same setting (I need to fix this in the patch... only the client side should need this).

            I found that setting <n> to 6 gave me amazing performance.  Note: I have not tried this patch yet with the recommended hfi tunings.  They "will" interfere with this patch and should initially be avoided.

            Another note: I believe there is a race condition in the hfi driver we trigger when there is too much parallelism.  A couple of times running this patch, I found the hfi driver "missed" an event.  I am talking to the OPA developers about this.

             doug Doug Oucharek (Inactive) added a comment -

            People

              Assignee: doug Doug Oucharek (Inactive)
              Reporter: doug Doug Oucharek (Inactive)
              Votes: 0
              Watchers: 21
