LU-16028

LNet tunings for large, inter-building SAN


Details

    • Type: Question/Request
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.7
    • Environment:
      OS: Mostly RHEL 7.9, some RHEL 8.4+
      Lustre: Mostly 2.12.8_7.llnl-1.ch6.x86_64, but some other 2.12.x, 2.10 and 2.14 also
      Networks: Omnipath, Infiniband, 10/40/100 Gb Ethernet

    Description

      Topology:

      We have various Lustre client clusters spread across three buildings on our campus. Most of the clients and servers on the relevant network (the CZ network) are in two of those buildings, which are connected by twin 100G links. There are three Lustre clusters on the CZ network. In each building there is an EDR IB SAN that Lustre servers and Lustre routers connect to, and inside each compute cluster there is a local Lustre network based on either EDR IB (with few exceptions) or Omnipath.

      Clients may access Lustre servers in their local building or the remote one. To reach the remote building, a client's traffic may pass through a local router (if the cluster is on its own Lustre network) to get to the building SAN, then go through the inter-building routers (a set on each side), where the transmission crosses ksocklnd, then back onto ko2iblnd into the other building's SAN.

      network topology:
      (each name represents a cluster)
      
                o2ib100      /    o2ib600
      syrah----+            /         +---quartz
      ruby-----+           /          +---pascal
      corona---+-orelic--------zrelic-+---copper(lustre1)
      catalyst-+         /            +---zinc(lustre2)
      ...------+        /             +---... 
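
      To make the hop sequence above concrete, here is a minimal sketch of how such routes could be expressed with lnetctl. The network names match the diagram, but the gateway NIDs and the client-side net (o2ib0) are invented for illustration and are not our actual configuration:

      ```
      # Illustrative only: NIDs are made up. o2ib100/o2ib600 are the two
      # building SANs from the diagram; the inter-building relics carry
      # traffic between them over ksocklnd (tcp).

      # On a client in a cluster with its own Lustre network (o2ib0 here):
      # route to the local building SAN through a local router.
      lnetctl route add --net o2ib100 --gateway 10.1.1.1@o2ib0

      # On a node attached to the o2ib100 SAN: reach the remote building
      # SAN (o2ib600) via an orelic inter-building router.
      lnetctl route add --net o2ib600 --gateway 10.1.2.1@o2ib100
      ```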

      Summary:

      We've had routing issues on our CZ Lustre network for a couple of years (see LU-14026), roughly since we updated from Lustre 2.10 to 2.12. After the upgrade, our inter-building routers (which we call "relics") would seemingly jam up and stop forwarding messages, bringing things to a halt. We eventually downgraded them to Lustre 2.10 and the problem went away. Since then we have tried various tunings and modifications, but the problem persists.

      Interestingly, on a parallel but significantly smaller inter-building network, the Lustre clients and servers do not exhibit these problems. All servers on that other network are running Lustre 2.12.

      Tunings:

      In a case with Serguei a few months ago where we suspected routing issues, I asked him whether there was a way to verify that our settings are sane given our topology; he suggested opening a ticket with Whamcloud to review them. I've copied below what I think are the most relevant settings, but can provide additional information if needed. Can you help us understand what the appropriate tunings/settings are for our large Lustre network?

      Note, too, that we have discovery turned off.
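
      For context on the columns in the table below: these values correspond to LNet/LND module parameters. The following is a sketch of the relevant /etc/modprobe.d entries, with illustrative values rather than our production settings; attaching peer_buffer_credits to ko2iblnd is an assumption on our part:

      ```
      # Illustrative sketch, not our production config.

      # LND credits (per NI) and per-peer credits; ko2iblnd covers IB/OPA,
      # ksocklnd covers the inter-building Ethernet hop.
      options ko2iblnd credits=1024 peer_credits=8 peer_buffer_credits=0
      options ksocklnd credits=1024 peer_credits=8

      # Router buffer pools (routers only): tiny/small/large
      options lnet tiny_router_buffers=2048 small_router_buffers=16384 large_router_buffers=2048

      # We run with LNet discovery disabled.
      options lnet lnet_peer_discovery_disabled=1
      ```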

      Cluster   Node type            Network(s)  credits (IB/TCP)  peer_credits  peer_buffer_credits  router_buffers (tiny/small/large)  buffers (routers only)
      borax     compute              OPA         256               8             0                    0/0/0
      borax     router               OPA/IB      1024              8             0                    2048/16384/2048                    pool A
      boraxo    compute/login        OPA         256               8             0                    0/0/0
      boraxo    router               IB          1024              8             0                    2048/16384/2048                    pool A
      Catalyst  compute/login        IB          256               8             0                    0/0/0
      Catalyst  router               IB/IB       1024              8             0                    2048/16384/2048                    pool A
      Copper    Lustre server        IB          256               8             0                    0/0/0
      Corona    compute/login        IB          256               8             0                    0/0/0
      Corona    router               IB/IB       1024              8             0                    2048/16384/2048                    variable: pool A or pool B
      czvnc     login                IB          256               8             0                    0/0/0
      flash     compute/login        IB          256               8             0                    0/0/0
      flash     router               IB/IB       1024              8             0                    2048/16384/2048                    pool A
      gopher    Lustre server        IB          256               8             0                    0/0/0
      jet       Lustre server        IB          256               8             0                    0/0/0
      lead      Lustre server        IB          256               8             0                    0/0/0
      mammoth   compute/login        OPA         256               8             0                    0/0/0
      mammoth   router               OPA/IB      1024              8             0                    2048/16384/2048                    pool A
      oslic     compute/login        IB          256               8             0                    0/0/0
      pascal    compute/login        IB          256               8             0                    0/0/0
      pascal    router               IB/IB       1024              8             0                    2048/16384/2048                    pool A
      quartz    compute/login        OPA         256               8             0                    0/0/0
      quartz    router               OPA/IB      1024              8             0                    2048/16384/2048                    pool A
      ruby      compute/login        OPA         256               8             0                    0/0/0
      ruby      router               OPA/IB      1024              8             0                    2048/16384/2048                    variable: pool A or pool C, with negative "min" values on some nodes
      shell     compute/login        IB          256               8             0                    0/0/0
      solfish   indexer              IB          256               8             0                    0/0/0
      syrah     compute/login        IB          256               8             0                    0/0/0
      syrah     router               IB/IB       256               8             0                    0/0/0                              pool C
      tin       Lustre server        IB (QDR)    256               8             0                    0/0/0
      zinc      Lustre server        IB          256               8             0                    0/0/0
      orelic *  inter-building rtr   IB/Eth      1024/1024         8             0                    4096/32768/4096                    pool D
      zrelic    inter-building rtr   IB/Eth      1024/1024         8             0                    4096/32768/4096                    pool D

      * except orelic2, which has temporary settings

      Router buffer pools:

      Pool A:
          pages  count  credits
              0   1024     1024
              0   1024     1024
              1   8192     8192
              1   8192     8192
            256   1024     1024
            256   1024     1024

      Pool B (eight identical rows per pool size):
          pages  count  credits
              0    512      512   (x8)
              1   4096     4096   (x8)
            256    256      256   (x8)

      Pool C:
          pages  count  credits
              0   1024     1024
              0   1024     1024
              1   8192     8192
              1   8192     8192
            256    512      512
            256    512      512

      Pool D:
          pages  count  credits
              0   2048     2048
              0   2048     2048
              1  16384    16384
              1  16384    16384
            256   2048     2048
            256   2048     2048
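
      The pages/count/credits listings above come from the routers' buffer pool state; we believe the repeated rows per pool size reflect per-CPT pools. A sketch of how that state can be read on our systems (the proc file also reports a "min" column, which is where the negative values on the ruby routers appear):

      ```
      # Router buffer pools; columns: pages count credits min
      cat /proc/sys/lnet/buffers

      # Roughly equivalent view via lnetctl, where available
      lnetctl routing show
      ```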


          People

            Assignee: Serguei Smirnov (ssmirnov)
            Reporter: Cameron Harr (charr)