[LU-16028] LNet tunings for large, interbuilding SAN Created: 19/Jul/22  Updated: 20/Jul/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.7
Fix Version/s: None

Type: Question/Request Priority: Minor
Reporter: Cameron Harr Assignee: Serguei Smirnov
Resolution: Unresolved Votes: 0
Labels: llnl
Environment:

OS: Mostly RHEL 7.9, some RHEL 8.4+
Lustre: mostly 2.12.8_7.llnl-1.ch6.x86_64, but some nodes run other 2.12.x, 2.10, and 2.14 versions as well
Networks: Omnipath, Infiniband, 10/40/100 Gb Ethernet


Epic/Theme: lnet

 Description   

Topology:

We have various Lustre client clusters spread across three buildings on our campus. Most of the clients and servers on the relevant network (the CZ network) are in two of those buildings, which are connected by twin 100G links. There are three Lustre clusters on the CZ network. In each building there is an EDR IB SAN that Lustre servers and Lustre routers connect to. Inside each compute cluster there is a local Lustre network, based on either EDR IB (with a few exceptions) or Omnipath.

Clients may access Lustre servers in their local building or in the remote one. To reach the remote building, a client's traffic may first pass through a local router (if the cluster is on its own Lustre network) onto the building SAN, then through the inter-building routers (a set on each side), where the transmission crosses ksocklnd, and finally back onto ko2iblnd into the other building's SAN. A sketch of how such a route is expressed follows the diagram below.

network topology:
(each name represents a cluster)

          o2ib100      /    o2ib600
syrah----+            /         +---quartz
ruby-----+           /          +---pascal
corona---+-orelic--------zrelic-+---copper(lustre1)
catalyst-+         /            +---zinc(lustre2)
...------+        /             +---... 
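For reference, each hop between networks is expressed as a static LNet route on the sending nodes. A minimal sketch with lnetctl, where the gateway NID is a hypothetical placeholder rather than our real configuration:

    # On a client in building 1 (o2ib100), send traffic for the remote
    # building's SAN (o2ib600) through a local orelic gateway:
    lnetctl route add --net o2ib600 --gateway 192.168.100.11@o2ib100

    # On the orelic/zrelic routers themselves, forwarding is enabled:
    lnetctl set routing 1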

Summary:

We've had routing issues on our CZ Lustre network for a couple of years (see LU-14026), starting roughly when we upgraded from Lustre 2.10 to 2.12. After the upgrade, our inter-building routers (which we call relics) would seemingly jam up and stop passing messages, bringing things to a halt. We eventually downgraded them to Lustre 2.10 and the problem went away. Since then we have tried various tunings and modifications, but the problem persists.

Interestingly, the Lustre clients and servers on a parallel but significantly smaller inter-building network do not exhibit these problems. All servers on that network run Lustre 2.12.

Tunings:

In a case with Serguei a few months ago in which we suspected routing issues, I asked whether there was a way to verify that our settings are sane given our topology; he suggested opening a ticket with Whamcloud to review them. I've copied below what I think are the most relevant settings, but can provide additional information if needed. Can you help us understand what the appropriate tunings/settings are for our large Lustre network?

Note, too, that we have discovery turned off (a sketch of how that is set follows).
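By "discovery turned off" we mean the 2.12 peer-discovery feature is disabled. A sketch of the typical way to do that, at runtime or persistently via the lnet module option:

    # Runtime:
    lnetctl set discovery 0

    # Persistent, in a modprobe configuration file:
    options lnet lnet_peer_discovery_disabled=1

    # The current value is visible with:
    lnetctl global show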

Cluster   Node type              Network(s)  credits    peer_    peer_buffer_  router_buffers
                                             (IB/TCP)   credits  credits       (tiny/small/large)
borax     compute                OPA         256        8        0             0/0/0
borax     router                 OPA/IB      1024       8        0             2048/16384/2048
boraxo    compute/login          OPA         256        8        0             0/0/0
boraxo    router                 IB          1024       8        0             2048/16384/2048
catalyst  compute/login          IB          256        8        0             0/0/0
catalyst  router                 IB/IB       1024       8        0             2048/16384/2048
copper    Lustre server          IB          256        8        0             0/0/0
corona    compute/login          IB          256        8        0             0/0/0
corona    router                 IB/IB       1024       8        0             2048/16384/2048
czvnc     login                  IB          256        8        0             0/0/0
flash     compute/login          IB          256        8        0             0/0/0
flash     router                 IB/IB       1024       8        0             2048/16384/2048
gopher    Lustre server          IB          256        8        0             0/0/0
jet       Lustre server          IB          256        8        0             0/0/0
lead      Lustre server          IB          256        8        0             0/0/0
mammoth   compute/login          OPA         256        8        0             0/0/0
mammoth   router                 OPA/IB      1024       8        0             2048/16384/2048
oslic     compute/login          IB          256        8        0             0/0/0
pascal    compute/login          IB          256        8        0             0/0/0
pascal    router                 IB/IB       1024       8        0             2048/16384/2048
quartz    compute/login          OPA         256        8        0             0/0/0
quartz    router                 OPA/IB      1024       8        0             2048/16384/2048
ruby      compute/login          OPA         256        8        0             0/0/0
ruby      router                 OPA/IB      1024       8        0             2048/16384/2048
shell     compute/login          IB          256        8        0             0/0/0
solfish   indexer                IB          256        8        0             0/0/0
syrah     compute/login          IB          256        8        0             0/0/0
syrah     router                 IB/IB       256        8        0             0/0/0
tin       Lustre server          IB (QDR)    256        8        0             0/0/0
zinc      Lustre server          IB          256        8        0             0/0/0
orelic    inter-building router  IB/Eth      1024/1024  8        0             4096/32768/4096
zrelic    inter-building router  IB/Eth      1024/1024  8        0             4096/32768/4096

(The orelic settings apply to all orelic nodes except orelic2, which has temporary settings.)

Router buffer pools (the "buffers" column, routers only; pages / count / credits, one row per CPT):

Standard cluster-router pools (borax, boraxo, catalyst, flash, mammoth, pascal, quartz, and some corona and ruby routers):

    pages  count  credits
        0   1024     1024
        0   1024     1024
        1   8192     8192
        1   8192     8192
      256   1024     1024
      256   1024     1024

corona routers (variable; the remaining corona routers show 8 CPTs):

    pages  count  credits
        0    512      512   (8 identical rows)
        1   4096     4096   (8 identical rows)
      256    256      256   (8 identical rows)

ruby routers (variable, with negative "min" values on some; the remaining ruby routers show):

    pages  count  credits
        0   1024     1024
        0   1024     1024
        1   8192     8192
        1   8192     8192
      256    512      512
      256    512      512

syrah routers:

    pages  count  credits
        0   1024     1024
        0   1024     1024
        1   8192     8192
        1   8192     8192
      256    512      512
      256    512      512

orelic and zrelic routers:

    pages  count  credits
        0   2048     2048
        0   2048     2048
        1  16384    16384
        1  16384    16384
      256   2048     2048
      256   2048     2048
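For reference, the credit and buffer columns above correspond to LNet/LND module parameters. A sketch of how an orelic/zrelic-style configuration would be expressed (the file path is just an example):

    # /etc/modprobe.d/lnet.conf (example path)
    # Router buffer pools (tiny/small/large):
    options lnet tiny_router_buffers=4096 small_router_buffers=32768 large_router_buffers=4096
    # IB-side credits:
    options ko2iblnd credits=1024 peer_credits=8 peer_buffer_credits=0
    # TCP-side credits for the inter-building hop:
    options ksocklnd credits=1024 peer_credits=8

The per-CPT pages/count/credits listings are what the routers report at runtime; a negative "min" credit value on a pool generally indicates that messages had to queue waiting for buffers at some point.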

 



 Comments   
Comment by Peter Jones [ 20/Jul/22 ]

Serguei

Could you please advise

Thanks

Peter
