Description
Topology:
We have various Lustre client clusters spread across 3 buildings on our campus. Most of the clients and servers on the network relevant here (the CZ network) are in two buildings, connected by twin 100G links. On the CZ network there are 3 Lustre clusters. In each building there is an EDR IB SAN that Lustre servers and Lustre routers connect to. Inside each compute cluster, there is a local Lustre network, based on either EDR IB (with few exceptions) or Omnipath.
Clients may access Lustre servers in their local building or the remote one. For a client to reach the remote cluster, it may pass through a local router (if it is on its own Lustre network) to get to the building SAN, then go through the inter-building routers (a set on each side), where the transmission goes over ksocklnd, then back to ko2iblnd to the other building's SAN.
network topology: (each name represents a cluster)

 o2ib100            /            o2ib600
syrah----+          /           +---quartz
ruby-----+          /           +---pascal
corona---+-orelic--------zrelic-+---copper(lustre1)
catalyst-+          /           +---zinc(lustre2)
...------+          /           +---...
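To make the path described above concrete, here is a minimal Python sketch that lists the hops a message would take to a server in the remote building. It is only an illustration of the description in this ticket: the `Hop` type, the function names, and the node labels ("cluster router", "local relic", "remote relic") are hypothetical and do not come from any Lustre tool or configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hop:
    src: str
    dst: str
    transport: str  # descriptive label only, not an LNet network name


def remote_path(client_has_local_network):
    """Hops from a client to a Lustre server in the other building."""
    hops = []
    src = "client"
    if client_has_local_network:
        # Clients on a cluster-local Lustre network first cross a cluster router.
        hops.append(Hop(src, "cluster router", "local cluster fabric"))
        src = "cluster router"
    hops.append(Hop(src, "local relic", "ko2iblnd (local building SAN)"))
    hops.append(Hop("local relic", "remote relic", "ksocklnd (inter-building links)"))
    hops.append(Hop("remote relic", "Lustre server", "ko2iblnd (remote building SAN)"))
    return hops


def routers_on_path(hops):
    """Every hop destination except the final one is a router."""
    return [hop.dst for hop in hops[:-1]]


if __name__ == "__main__":
    for hop in remote_path(client_has_local_network=True):
        print(f"{hop.src} -> {hop.dst} via {hop.transport}")
```

Under these assumptions, a client on its own Lustre network crosses three routers to reach a remote server, and a client attached directly to the building SAN crosses two.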
Summary:
We've had routing issues on our CZ Lustre network for a couple of years (see LU-14026), roughly since we updated from Lustre 2.10 to 2.12. After upgrading, our inter-building routers (which we call relics) would seemingly jam up and stop sending messages, bringing things to a halt. We eventually downgraded them to Lustre 2.10 and the problem went away. Since then we have tried various tunings and modifications, but the problem persists.
Interestingly, on a parallel, but significantly smaller, inter-building network, the Lustre clients and servers do not exhibit these same problems. All servers on that other network are running Lustre 2.12.
Tunings:
In a case with Serguei a few months ago where we suspected routing issues, I asked him if there was a way to verify that our settings are sane, given our topology; he suggested opening a ticket with Whamcloud to review them. I've copied what I think are the most relevant settings below, but can provide additional information if needed. Can you help us understand what the appropriate tunings/settings are for our large Lustre network?
Note also that we have discovery turned off.
Cluster | Node Type | Network(s) | credits (IB/TCP) | peer_credits | peer_buffer_credits | router_buffers (tiny/small/large) | buffers (routers only) |
---|---|---|---|---|---|---|---|
borax | compute | OPA | 256 | 8 | 0 | 0/0/0 | |
borax | router | OPA/IB | 1024 | 8 | 0 | 2048/16384/2048 | pages count credits: 0 1024 1024; 0 1024 1024; 1 8192 8192; 1 8192 8192; 256 1024 1024; 256 1024 1024 |
boraxo | compute/login | OPA | 256 | 8 | 0 | 0/0/0 | |
boraxo | router | IB | 1024 | 8 | 0 | 2048/16384/2048 | pages count credits: 0 1024 1024; 0 1024 1024; 1 8192 8192; 1 8192 8192; 256 1024 1024; 256 1024 1024 |
Catalyst | compute/login | IB | 256 | 8 | 0 | 0/0/0 | |
Catalyst | router | IB/IB | 1024 | 8 | 0 | 2048/16384/2048 | pages count credits: 0 1024 1024; 0 1024 1024; 1 8192 8192; 1 8192 8192; 256 1024 1024; 256 1024 1024 |
Copper | Lustre server | IB | 256 | 8 | 0 | 0/0/0 | |
Corona | compute/login | IB | 256 | 8 | 0 | 0/0/0 | |
Corona | router | IB/IB | 1024 | 8 | 0 | 2048/16384/2048 | Variable. Either pages count credits: 0 1024 1024; 0 1024 1024; 1 8192 8192; 1 8192 8192; 256 1024 1024; 256 1024 1024. Or pages count credits: 0 512 512; 0 512 512; 0 512 512; 0 512 512; 0 512 512; 0 512 512; 0 512 512; 0 512 512; 1 4096 4096; 1 4096 4096; 1 4096 4096; 1 4096 4096; 1 4096 4096; 1 4096 4096; 1 4096 4096; 1 4096 4096; 256 256 256; 256 256 256; 256 256 256; 256 256 256; 256 256 256; 256 256 256; 256 256 256; 256 256 256 |
czvnc | login | IB | 256 | 8 | 0 | 0/0/0 | |
flash | compute/login | IB | 256 | 8 | 0 | 0/0/0 | |
flash | router | IB/IB | 1024 | 8 | 0 | 2048/16384/2048 | pages count credits: 0 1024 1024; 0 1024 1024; 1 8192 8192; 1 8192 8192; 256 1024 1024; 256 1024 1024 |
gopher | Lustre server | IB | 256 | 8 | 0 | 0/0/0 | |
jet | Lustre server | IB | 256 | 8 | 0 | 0/0/0 | |
lead | Lustre server | IB | 256 | 8 | 0 | 0/0/0 | |
mammoth | compute/login | OPA | 256 | 8 | 0 | 0/0/0 | |
mammoth | router | OPA/IB | 1024 | 8 | 0 | 2048/16384/2048 | pages count credits: 0 1024 1024; 0 1024 1024; 1 8192 8192; 1 8192 8192; 256 1024 1024; 256 1024 1024 |
oslic | compute/login | IB | 256 | 8 | 0 | 0/0/0 | |
pascal | compute/login | IB | 256 | 8 | 0 | 0/0/0 | |
pascal | router | IB/IB | 1024 | 8 | 0 | 2048/16384/2048 | pages count credits: 0 1024 1024; 0 1024 1024; 1 8192 8192; 1 8192 8192; 256 1024 1024; 256 1024 1024 |
quartz | compute/login | OPA | 256 | 8 | 0 | 0/0/0 | |
quartz | router | OPA/IB | 1024 | 8 | 0 | 2048/16384/2048 | pages count credits: 0 1024 1024; 0 1024 1024; 1 8192 8192; 1 8192 8192; 256 1024 1024; 256 1024 1024 |
ruby | compute/login | OPA | 256 | 8 | 0 | 0/0/0 | |
ruby | router | OPA/IB | 1024 | 8 | 0 | 2048/16384/2048 | Variable, with negative "min" values on some. Either pages count credits: 0 1024 1024; 0 1024 1024; 1 8192 8192; 1 8192 8192; 256 1024 1024; 256 1024 1024. Or pages count credits: 0 1024 1024; 0 1024 1024; 1 8192 8192; 1 8192 8192; 256 512 512; 256 512 512 |
shell | compute/login | IB | 256 | 8 | 0 | 0/0/0 | |
solfish | Indexer | IB | 256 | 8 | 0 | 0/0/0 | |
syrah | compute/login | IB | 256 | 8 | 0 | 0/0/0 | |
syrah | router | IB/IB | 256 | 8 | 0 | 0/0/0 | pages count credits: 0 1024 1024; 0 1024 1024; 1 8192 8192; 1 8192 8192; 256 512 512; 256 512 512 |
tin | Lustre server | IB (QDR) | 256 | 8 | 0 | 0/0/0 | |
zinc | Lustre server | IB | 256 | 8 | 0 | 0/0/0 | |
orelic (except orelic2, which has temporary settings) | Inter-building router | IB/Eth | 1024/1024 | 8 | 0 | 4096/32768/4096 | pages count credits: 0 2048 2048; 0 2048 2048; 1 16384 16384; 1 16384 16384; 256 2048 2048; 256 2048 2048 |
zrelic | Inter-building router | IB/Eth | 1024/1024 | 8 | 0 | 4096/32768/4096 | pages count credits: 0 2048 2048; 0 2048 2048; 1 16384 16384; 1 16384 16384; 256 2048 2048; 256 2048 2048 |
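One way to sanity-check the table is to compare each router's "buffers" listing against its configured router_buffers values. The Python sketch below sums the "count" column per "pages" value and compares the totals with the tiny/small/large setting. It rests on two assumptions that this ticket does not state: that the repeated rows are separate pools whose counts add up, and that pages values 0, 1, and 256 correspond to the tiny, small, and large pools. The function names are hypothetical helpers, not part of any Lustre tool.

```python
import re
from collections import defaultdict


def pool_totals(buffers_text):
    """Sum the 'count' column per 'pages' value over all (pages, count, credits) rows."""
    values = [int(n) for n in re.findall(r"\d+", buffers_text)]
    if not values or len(values) % 3:
        raise ValueError("expected whole (pages, count, credits) triples")
    totals = defaultdict(int)
    for i in range(0, len(values), 3):
        pages, count, _credits = values[i:i + 3]
        totals[pages] += count
    return dict(totals)


def matches_router_buffers(buffers_text, router_buffers):
    """Compare summed pool counts with a 'tiny/small/large' setting.

    Assumption: pages 0 -> tiny, 1 -> small, 256 -> large.
    """
    configured = tuple(int(x) for x in router_buffers.split("/"))
    totals = pool_totals(buffers_text)
    observed = (totals.get(0, 0), totals.get(1, 0), totals.get(256, 0))
    return observed == configured


if __name__ == "__main__":
    borax_router = ("pages count credits 0 1024 1024 0 1024 1024 "
                    "1 8192 8192 1 8192 8192 256 1024 1024 256 1024 1024")
    syrah_router = ("pages count credits 0 1024 1024 0 1024 1024 "
                    "1 8192 8192 1 8192 8192 256 512 512 256 512 512")
    print(matches_router_buffers(borax_router, "2048/16384/2048"))  # True
    print(matches_router_buffers(syrah_router, "0/0/0"))            # False
```

Under these assumptions, the borax router row is self-consistent, while the syrah router row lists non-zero buffers despite a 0/0/0 setting, which may be worth a closer look.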