Description
We recently experienced major issues switching Lustre routing clusters to 2.12 and ended up reverting them to 2.10. In trying to better understand LNet, I read through various documentation pages, but was left with several questions. Can you help answer the following questions and perhaps update the LNet docs as well?
Questions:
- What is the data flow, or order of operations, for sending LNet messages from client to server, through routers? For example, a mock (and incorrect?) model might be:
1. Client determines the next hop router
2. Client checks available routing buffer credits (rtr) on the router
3. Client checks available send credits (tx) and peer_credits to that router on self
4. Client sends <= #peer_credits messages, decrementing tx for each message
5. Router receives messages in routing buffers, depending on message size, and decrements its routing buffer credits (rtr) for each message
6. Router then acts as the client, repeating steps 1-5 above to the next hop as well as back to the original client (as data is received)
- Is there any need to manually add peers in a non-MR config? My understanding is no.
- Should a router have a peer entry for every node in the expanded SAN, including nodes on the other networks it routes to?
- The manual states "The number of credits currently in flight (number of transmit credits) is shown in the tx column.... Therefore, rtr – tx is the number of transmits in flight." It seems "in flight" in the "tx" description should be "available", so that rtr - tx would then be "in flight", right? (See the sketch of how I've been reading these counters, after this list.)
- Should a NID ever show for the wrong interface (e.g. tcp instead of o2ibXX)? We will sometimes see messages in logs from <addr>@tcp when it should be <addr>@o2ibX.
- Do the older mlx4 LNet settings (https://wiki.lustre.org/LNet_Router_Config_Guide#Configure_Lustre_Servers) need to be updated for mlx5, or are they still applicable?
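For context, here is how I have been inspecting the credit counters while debugging. This is only a sketch; the debugfs paths and column layouts can differ between Lustre and kernel versions (older systems expose the same files under /proc/sys/lnet/):

    # Per-peer credit columns (rtr, tx) referenced in the manual
    cat /sys/kernel/debug/lnet/peers
    # Router forwarding state and buffer pools
    lnetctl routing show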
Hi Cameron,
I was thinking from the router perspective. If a router is forwarding between o2iblnd <-> socklnd and you configure the socklnd with more credits than the o2iblnd, you effectively widen the pipe on the socklnd side. So you allow more messages on the socklnd side while throttling the rate on the o2iblnd side. Wouldn't that have the desired effect? And now that I think about it, you can also attempt to throttle down the rate by adjusting the max_rpcs_in_flight parameter to reduce the number of RPCs on the o2iblnd side. You'd do that on the clients, I believe. Take a look at "39.3.1. Monitoring the Client RPC Stream" in the manual.
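To make that concrete, here is a minimal sketch of what I mean. The numbers are placeholders rather than recommendations, and the module options only take effect when the LNDs are (re)loaded:

    # Router /etc/modprobe.d/lustre.conf: widen the socklnd side relative to o2iblnd
    options ksocklnd credits=1024 peer_credits=128
    options ko2iblnd credits=256 peer_credits=16

    # Clients: reduce the number of RPCs each OSC keeps in flight
    lctl set_param osc.*.max_rpcs_in_flight=4
    # and watch the effect on the RPC stream (see "39.3.1. Monitoring the Client RPC Stream")
    lctl get_param osc.*.rpc_stats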
RE the credits: when you see credits going negative, that means you're starting to queue internally because you're going into throttling mode. That in itself is okay, since the credits are used to limit the rate to peers. However, if the MIN credit value keeps getting larger and larger in the negative direction, that means the system could be getting overwhelmed and is unable to cope with the traffic. In that case I don't think just increasing the credits will be enough; some other measures will need to be taken. Increasing the number of routers or servers is an option.
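One way to watch for that (again a sketch; the min columns record the lowest credit value seen since the interface came up):

    # Per-NI send credits: compare the tx and min columns against max
    cat /sys/kernel/debug/lnet/nis
    # msgs_alloc (currently allocated messages) staying high is another sign of internal queuing
    lnetctl stats show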
Also note that when you start seeing a lot of queuing, that can eventually lead to timeouts as well, since a message is tagged with a deadline when it is initially sent. The timeout value covers the time a message spends on the queue plus the time it spends on the wire before a completion event is received (speaking from the o2iblnd perspective). It's a bit of a compound effect: as you get more queuing, the time messages wait before they get sent gets longer, which could result in more timeouts. So adjusting the timeout values in this scenario might be needed as well. I'm talking about lnet_transaction_timeout.
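For reference, these are the knobs I mean (a sketch; check the defaults on your version before changing anything):

    # Show the current LNet timeout/health settings (2.12+)
    lnetctl global show
    # Raise the transaction timeout at runtime (seconds; placeholder value)
    lnetctl set transaction_timeout 100
    # or persistently via module options:
    #   options lnet lnet_transaction_timeout=100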