Description
We recently experienced major issues switching Lustre routing clusters to 2.12 and ended up reverting them to 2.10. In trying to better understand LNet, I read through various documentation pages, but was left with several questions. Can you help answer the following questions and perhaps update the LNet docs as well?
Questions:
- What is the data flow, or order of operations, for sending LNet messages from client to server, through routers? For example, a mock (and incorrect?) model might be:
1. Client determines the next hop router
2. Client checks available routing buffer credits (rtr) on the router
3. Client checks available send credits (tx) and peer_credits to that router on self
4. Client sends <= #peer_credits messages, decrementing tx for each message
5. Router receives messages in routing buffers, depending on message size, and decrements its routing buffer credits (rtr) for each message
6. Router then acts as the client, repeating steps 1-5 above to the next hop as well as back to the original client (as data is received)
- Is there any need to manually add peers in a non-MR config? My understanding is no.
- Should a router have a peer entry for every node in the expanded SAN, including nodes on the other networks it routes to?
- The manual states "The number of credits currently in flight (number of transmit credits) is shown in the tx column.... Therefore, rtr – tx is the number of transmits in flight." It seems "in flight" in the "tx" description should be "available", so that rtr - tx would then be "in flight", right? (See the sketch of how I've been reading these counters, after this list.)
- Should a NID ever show for the wrong interface (e.g. tcp instead of o2ibXX)? We will sometimes see messages in logs from <addr>@tcp when it should be <addr>@o2ibX.
- Do the older mlx4 LNet settings (https://wiki.lustre.org/LNet_Router_Config_Guide#Configure_Lustre_Servers) need to be updated for mlx5, or are they still applicable?
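For context, here is how I have been inspecting the credit counters while debugging. This is only a sketch; the debugfs paths and column layouts can differ between Lustre and kernel versions (older systems expose the same files under /proc/sys/lnet/):

    # Per-peer credit columns (rtr, tx) referenced in the manual
    cat /sys/kernel/debug/lnet/peers
    # Router forwarding state and buffer pools
    lnetctl routing show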
Hi Cameron,
I was thinking from the router perspective. If a router is forwarding between o2iblnd <-> socklnd and you configure the socklnd with more credits than the o2iblnd, you effectively widen the pipe on the socklnd side. So you allow more messages on the socklnd side while throttling the rate on the o2iblnd side. Wouldn't that have the desired effect? And now that I think about it, you can also attempt to throttle down the rate by adjusting the max_rpcs_in_flight parameter to reduce the number of RPCs on the o2iblnd side. You'd do that on the clients, I believe. Take a look at "39.3.1. Monitoring the Client RPC Stream" in the manual.
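To make that concrete, here is a minimal sketch of what I mean. The numbers are placeholders rather than recommendations, and the module options only take effect when the LNDs are (re)loaded:

    # Router /etc/modprobe.d/lustre.conf: widen the socklnd side relative to o2iblnd
    options ksocklnd credits=1024 peer_credits=128
    options ko2iblnd credits=256 peer_credits=16

    # Clients: reduce the number of RPCs each OSC keeps in flight
    lctl set_param osc.*.max_rpcs_in_flight=4
    # and watch the effect on the RPC stream (see "39.3.1. Monitoring the Client RPC Stream")
    lctl get_param osc.*.rpc_stats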
RE the credits: when you see credits going negative, that means you're starting to queue internally because you're going into throttling mode. That in itself is okay, since the credits are used to limit the rate to peers. However, if the MIN credit value keeps getting larger and larger in the negative direction, that means the system could be getting overwhelmed and is unable to cope with the traffic. In that case I don't think just increasing the credits will be enough; some other measures will need to be taken. Increasing the number of routers or servers is an option.
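One way to watch for that (again a sketch; the min columns record the lowest credit value seen since the interface came up):

    # Per-NI send credits: compare the tx and min columns against max
    cat /sys/kernel/debug/lnet/nis
    # msgs_alloc (currently allocated messages) staying high is another sign of internal queuing
    lnetctl stats show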
Also note that when you start seeing a lot of queuing, that can eventually lead to timeouts as well, since a message is tagged with a deadline when it is initially sent. The timeout value covers the time a message spends on the queue plus the time it spends on the wire before a completion event is received (speaking from the o2iblnd perspective). It's a bit of a compound effect: as you get more queuing, the time messages wait before they get sent gets longer, which could result in more timeouts. So adjusting the timeout values in this scenario might be needed as well. I'm talking about lnet_transaction_timeout.
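For reference, these are the knobs I mean (a sketch; check the defaults on your version before changing anything):

    # Show the current LNet timeout/health settings (2.12+)
    lnetctl global show
    # Raise the transaction timeout at runtime (seconds; placeholder value)
    lnetctl set transaction_timeout 100
    # or persistently via module options:
    #   options lnet lnet_transaction_timeout=100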