Description
We recently experienced major issues switching Lustre routing clusters to 2.12 and ended up reverting them to 2.10. In trying to better understand LNet, I read through various documentation pages, but was left with several questions. Can you help answer the following questions and perhaps update the LNet docs as well?
Questions:
- What is the data flow, or order of operations, for sending LNet messages from client to server, through routers? For example, a mock (and incorrect?) model might be:
1) Client determines the next-hop router
2) Client checks the available routing buffer credits (rtr) on that router
3) Client checks its own available send credits (tx) and peer_credits toward that router
4) Client sends <= peer_credits messages, decrementing tx for each message
5) Router receives the messages into routing buffers, chosen by message size, and decrements its routing buffer credits (rtr) for each message
6) Router then acts as the client, repeating steps 1-5 above toward the next hop as well as back to the original client (as data is received)
- Is there any need to manually add peers in a non-MR config? My understanding is no.
- Should a router have a peer entry for every node in the expanded SAN, including nodes on the other networks it routes to?
- The manual states "The number of credits currently in flight (number of transmit credits) is shown in the tx column.... Therefore, rtr – tx is the number of transmits in flight." It seems "in flight" for the "tx" description should be "available" so that rtr-tx would be "in flight", right?
- Should a NID ever show for the wrong interface (e.g. tcp instead of o2ibXX)? We will sometimes see messages in logs from <addr>@tcp when it should be <addr>@o2ibX.
- Do the older mlx4 LNet settings (https://wiki.lustre.org/LNet_Router_Config_Guide#Configure_Lustre_Servers) need to be updated for mlx5, or are they still applicable?
1) In some cases, we're routing packets from a cluster compute node that may be OPA or IB, to a cluster router node, to a data center router node, across a campus WAN, back to another data center router which sits on the IB SAN the Lustre cluster sits on. So: ko2iblnd <> ko2iblnd <> ko2iblnd/ksocklnd <> ksocklnd/ko2iblnd <> ko2iblnd. Does the {{credits}} parameter for the ksocklnd module need to match the {{credits}} parameter for ko2iblnd on router nodes with both interfaces?
2) Given the context in 1), does the number of ko2iblnd credits need to match on servers along the entire path, or is it appropriate for router nodes to have a larger number of credits set?
The peer_credits parameter determines how many concurrent messages can be in flight to the same peer. Since o2iblnd is generally more performant than socklnd, it would make sense to have a larger number of peer_credits for the socklnd network. The o2iblnd negotiates the peer credits per connection, so even if the peer_credits on two nodes are different, they'll be negotiated down to the smaller value. I would therefore recommend keeping the peer_credits the same across homogeneous networks.

That said, we know of a limitation with socklnd where the performance isn't great per interface. One workaround we currently have is to create multiple virtual interfaces on the same ethernet interface and then configure those in an MR config; this increases the performance. We're tracking this under: https://jira.whamcloud.com/browse/LU-12815. So this might be a way for you to increase the performance on the socklnd side.
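To make that workaround concrete, a minimal sketch might look like the following. The interface names, addresses, and network name are placeholders, and exact lnetctl behavior (e.g. whether it accepts alias names) can vary by Lustre version, so treat this as an illustration rather than a recipe.
{code}
# Hypothetical sketch of the LU-12815 workaround: alias one physical
# ethernet port several times, then add each alias as its own socklnd NI
# so Multi-Rail can spread traffic across them.  All names/addresses are
# placeholders.
ip addr add 192.168.1.11/24 dev eth0 label eth0:1
ip addr add 192.168.1.12/24 dev eth0 label eth0:2

lnetctl net add --net tcp --if eth0
lnetctl net add --net tcp --if eth0:1
lnetctl net add --net tcp --if eth0:2

# Confirm the NIs were created and are carrying traffic.
lnetctl net show -v
{code}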
I would also differentiate between the credits module parameter and the peer_credits module parameter. The former determines the limit on the total concurrent sends to all peers on a particular network interface, while the latter limits the number of concurrent sends per peer. So if you increase the number of peer_credits, you'd want to increase the number of global credits for the NI as well.
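For example, both parameters are usually set together in the LND module options; the values below are placeholders, not tuning recommendations:
{code}
# Illustrative modprobe.d entries only (e.g. /etc/modprobe.d/lustre.conf);
# the values are placeholders, not recommendations.
# peer_credits: concurrent sends allowed to a single peer.
# credits:      total concurrent sends allowed on the NI across all peers,
#               so it should be scaled up along with peer_credits.
options ko2iblnd peer_credits=32 credits=1024
options ksocklnd peer_credits=32 credits=1024
{code}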
The credits are calculated per CPT. You can take a look at {{lnet_ni_tq_credits()}} for more details.
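If you want to see how that plays out on a live node, the configured tunables and CPT layout can be inspected as below; the exact fields and parameter names may differ slightly between Lustre versions.
{code}
# Show per-NI tunables (peer_credits, credits, etc.); higher verbosity
# also shows which CPTs each NI is bound to.
lnetctl net show -v

# CPT-to-CPU mapping, to see how many CPTs the NI credits are divided
# across (parameter location/name may vary by release).
lctl get_param cpu_partition_table
{code}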
3) Should there be a relation between the number of credits defined for a node's LND driver and its buffer (or other relevant) settings?
For a router, I think you need to look at the total number of large/small/tiny buffers you've specified. {{lnetctl routing show}} shows you stats on these buffers, including the minimum credits for each. If the minimum credits dip into the negative, that means you have instances where you're queuing due to a lack of buffers. In that case you can increase the buffers allocated for that size. This can be done dynamically via: {{lnetctl set [tiny_buffers|small_buffers|large_buffers] <value>}}.
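As a quick illustration (the pool chosen and the value are arbitrary):
{code}
# Check the router buffer pools; a negative "min" for a pool means
# messages have had to queue because that pool ran dry.
lnetctl routing show

# Grow whichever pool is running short (value is arbitrary).
lnetctl set large_buffers 8192
{code}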
My above comments explain the peer_credits/credits relationship.
Let me know if you have other questions.