  Lustre / LU-5995

Apparent scale issue with 2.5.2 clients to 2.5.3 servers

Details

    • Type: Question/Request
    • Resolution: Unresolved
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.4.2, Lustre 2.5.3
    • Labels: None

    Description

      I am running into a significant performance issue when running IOR, primarily against the file system mentioned in the environment, but it is being observed on other global Lustre file systems as well.

      On the smaller of the two systems, with a single router, I am getting near wire speed when running IOR on 32 nodes with 8 threads each. On the larger system I am getting only approximately 10% of what I expect to see going through the routers: where I expect ~10GB/sec to the file system, I am seeing only roughly 6% of that performance.
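      For reference, an IOR run of this shape (hypothetical paths and launcher flags, which vary by MPI; 32 nodes x 8 tasks = 256 ranks) would look something like:

        # hypothetical: 256 ranks, 1 MiB transfers, 4 GiB per task, file per process
        mpirun -np 256 --map-by ppr:8:node ior -w -r -t 1m -b 4g -F -o /lustre/scratch/ior_test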

      I have run both netperf and lnet_selftest from the routers to the servers, and in the case of netperf I am seeing basically wire speed. lnet_selftest shows approximately the same result when using a concurrency of 8, but with a concurrency of 1 I am seeing about half that.
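      For reference, the two checks described above can be reproduced roughly as follows (host names and NIDs are hypothetical placeholders):

        # baseline point-to-point bandwidth from a router to a server
        netperf -H oss01 -l 30 -t TCP_STREAM

        # minimal lnet_selftest session at concurrency 8
        export LST_SESSION=$$
        lst new_session brw_test
        lst add_group routers 10.10.0.[1-12]@o2ib    # hypothetical NIDs
        lst add_group servers 10.10.1.[1-12]@o2ib
        lst add_batch bulk
        lst add_test --batch bulk --concurrency 8 --from routers --to servers brw write size=1M
        lst run bulk
        lst stat servers    # interrupt after a while to stop sampling
        lst end_session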

      Loads on the servers and routers are insignificant. I have increased the credits on both the servers and gateways with no observable impact (although the changes have been implemented on only 2 of the gateways, not on all of the routers, due to this being a production environment). Credit changes have not been made on the client side (again, because this is a production system).
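      For what it's worth, one quick way to check whether credits are actually the limiting factor (assuming the standard LNet /proc files on 2.x) is to look for negative minimums on the routers:

        # router buffer usage; a negative "min" means messages queued waiting for a buffer
        cat /proc/sys/lnet/buffers
        # per-peer credits; negative "min" values indicate credit starvation
        cat /proc/sys/lnet/peers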

      I am unsure what to try next, nor do I know whether there is a known compatibility issue between these client and server versions.

      Any help would be greatly appreciated.

      Attachments

        Activity


          liang Liang Zhen (Inactive) added a comment -

          Hi Joe, did you try lnet_selftest with all 32 clients and all servers (OSSs), or at least 12 servers? What was the average server performance in that case? Also, when you ran IOR on the large system, were the IO requests from the clients evenly spread over all the servers?

          ashehata Amir Shehata (Inactive) added a comment -

          Here are the tunings that might affect performance:
          1. router buffers: increasing these will increase the number of messages a router can handle
          2. peer_buffer_credits: the number of router buffer credits per peer. These are receive credits and apply to the router only. Increasing them will increase the number of messages a router can handle simultaneously
          3. credits (default = 256): the number of concurrent sends to all peers
          4. peer_credits (default = 8): the number of concurrent sends to one peer (this overrides peer_buffer_credits)

          You can try manipulating these on the clients and servers to see if they increase performance; a sketch of how they are set follows this list.
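          As an illustration (the values here are hypothetical placeholders, not recommendations), these tunables are set as module options, e.g. in /etc/modprobe.d/lustre.conf:

            # router buffers (lnet module; relevant on the routers)
            options lnet tiny_router_buffers=8192 small_router_buffers=131072 large_router_buffers=8192
            # LND credits (o2iblnd module)
            options ko2iblnd credits=512 peer_credits=16 peer_buffer_credits=32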
          jamervi Joe Mervini added a comment -

          The data through the (12) routers was relatively balanced; I was monitoring the router stats during all of my testing. The lnet_selftest performance was consistent between the 2 systems, in the sense that on a single router the results were similar, and the average went up on the larger system when additional routers were added to the test (with lst running only between the router nodes themselves and the Lustre servers, as opposed to running lst on clients through the routers).

          One thing we were wondering about is credits and peer_credits, and whether their settings might be a factor. To be honest, we haven't really monkeyed with the module load options in years, with the exception of adding new networks and routers. The load-time options we are using are shown below:

          # Device aliases
          alias ib0 ib_ipoib
          alias ib1 ib_ipoib

          ###############################################################################

          # LNET options
          options lnet tiny_router_buffers=4096
          options lnet small_router_buffers=65536
          options lnet large_router_buffers=4096
          options lnet live_router_check_interval=60
          options lnet dead_router_check_interval=60
          options lnet check_routers_before_use=1

          # TCP LND options
          # OpenIB LND options
          options ko2iblnd timeout=100
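          For reference, the values the modules are actually running with can be confirmed at runtime through the standard module parameter files (a sketch, assuming ko2iblnd is the LND in use):

            # effective values, independent of what modprobe.conf says
            cat /sys/module/lnet/parameters/small_router_buffers
            cat /sys/module/ko2iblnd/parameters/credits
            cat /sys/module/ko2iblnd/parameters/peer_credits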
          green Oleg Drokin added a comment -

          There should not be any compatibility issues between 2.4 and 2.5 versions.

          I guess you have already looked into the router statistics to ensure the load is evenly distributed across all the routers, rather than going through only one or two of them and making those the bottleneck of the entire thing (the same goes for your file striping, but I guess you have checked that too already?). If you are getting roughly 10% of the expected performance in the large config with 12 routers, but the expected performance out of the small system with just one router, I think this is the first thing to double-check and rule out.

          When you say "lnet_selftest shows approximately the same result", do you mean that you get wire speed in such a test as well?

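          For reference, striping can be checked with lfs (hypothetical paths):

            # stripe count and layout of the IOR output files
            lfs getstripe /lustre/scratch/ior_test
            # default striping on the target directory
            lfs getstripe -d /lustre/scratch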
          pjones Peter Jones added a comment -

          Oleg

          What do you advise here?

          Peter


          People

            Assignee: green Oleg Drokin
            Reporter: jamervi Joe Mervini
            Votes: 0
            Watchers: 5
