Description
We recently experienced major issues switching Lustre routing clusters to 2.12 and ended up reverting them to 2.10. In trying to better understand LNet, I read through various documentation pages, but was left with several questions. Can you help answer the following questions and perhaps update the LNet docs as well?
Questions:
- What is the data flow, or order of operations, for sending LNet messages from client to server, through routers? For example, a mock (and incorrect?) model might be:
1) Client determines the next-hop router
2) Client checks the available routing buffer credits (rtr) on that router
3) Client checks its own available send credits (tx) and peer_credits toward that router
4) Client sends <= peer_credits messages, decrementing tx for each message
5) Router receives the messages into routing buffers, chosen by message size, and decrements its routing buffer credits (rtr) for each message
6) Router then acts as the client, repeating steps 1-5 above toward the next hop as well as back to the original client (as data is received)
- Is there any need to manually add peers in a non-MR config? My understanding is no.
- Should a router have a peer entry for every node in the expanded SAN, including nodes on the other networks it routes to?
- The manual states "The number of credits currently in flight (number of transmit credits) is shown in the tx column.... Therefore, rtr – tx is the number of transmits in flight." It seems "in flight" for the "tx" description should be "available" so that rtr-tx would be "in flight", right?
- Should a NID ever show for the wrong interface (e.g. tcp instead of o2ibXX)? We will sometimes see messages in logs from <addr>@tcp when it should be <addr>@o2ibX.
- Do the older mlx4 LNet settings (https://wiki.lustre.org/LNet_Router_Config_Guide#Configure_Lustre_Servers) need to be updated for mlx5, or are they still applicable?
1) In some cases, we're routing packets from a cluster compute node that may be OPA or IB, to a cluster router node, to a data center router node, across a campus WAN, back to another data center router which sits on the IB SAN the Lustre cluster sits on. So: ko2iblnd <> ko2iblnd <> ko2iblnd/ksocklnd <> ksocklnd/ko2iblnd <> ko2iblnd. Does the {{credits}} parameter for the ksocklnd module need to match the {{credits}} parameter for ko2iblnd on router nodes with both interfaces?
2) Given the context in 1), does the number of ko2iblnd credits need to match on servers along the entire path, or is it appropriate for router nodes to have a larger number of credits set?
The peer_credits parameter determines how many concurrent messages can be in flight to the same peer. Since o2iblnd is generally more performant than socklnd, it would make sense to have a larger number of peer_credits for the socklnd network. The o2iblnd negotiates the peer credits per connection, so even if the peer_credits on two nodes are different, they'll be negotiated down to the smaller value. I would therefore recommend keeping the peer_credits the same across homogeneous networks.

That said, we know of a limitation with socklnd where the performance isn't great per interface. One workaround we currently have is to create multiple virtual interfaces on the same ethernet interface and then configure those in an MR config; this increases the performance. We're tracking this under: https://jira.whamcloud.com/browse/LU-12815. So this might be a way for you to increase the performance on the socklnd side.
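To make that workaround concrete, a minimal sketch might look like the following. The interface names, addresses, and network name are placeholders, and exact lnetctl behavior (e.g. whether it accepts alias names) can vary by Lustre version, so treat this as an illustration rather than a recipe.
{code}
# Hypothetical sketch of the LU-12815 workaround: alias one physical
# ethernet port several times, then add each alias as its own socklnd NI
# so Multi-Rail can spread traffic across them.  All names/addresses are
# placeholders.
ip addr add 192.168.1.11/24 dev eth0 label eth0:1
ip addr add 192.168.1.12/24 dev eth0 label eth0:2

lnetctl net add --net tcp --if eth0
lnetctl net add --net tcp --if eth0:1
lnetctl net add --net tcp --if eth0:2

# Confirm the NIs were created and are carrying traffic.
lnetctl net show -v
{code}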
I would also differentiate between the credits module parameter and the peer_credits module parameter. The former determines the limit on the total concurrent sends to all peers on a particular network interface, while the latter limits the number of concurrent sends per peer. So if you increase the number of peer_credits, you'd want to increase the number of global credits for the NI as well.
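For example, both parameters are usually set together in the LND module options; the values below are placeholders, not tuning recommendations:
{code}
# Illustrative modprobe.d entries only (e.g. /etc/modprobe.d/lustre.conf);
# the values are placeholders, not recommendations.
# peer_credits: concurrent sends allowed to a single peer.
# credits:      total concurrent sends allowed on the NI across all peers,
#               so it should be scaled up along with peer_credits.
options ko2iblnd peer_credits=32 credits=1024
options ksocklnd peer_credits=32 credits=1024
{code}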
The credits are calculated per CPT. You can take a look at {{lnet_ni_tq_credits()}} for more details.
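If you want to see how that plays out on a live node, the configured tunables and CPT layout can be inspected as below; the exact fields and parameter names may differ slightly between Lustre versions.
{code}
# Show per-NI tunables (peer_credits, credits, etc.); higher verbosity
# also shows which CPTs each NI is bound to.
lnetctl net show -v

# CPT-to-CPU mapping, to see how many CPTs the NI credits are divided
# across (parameter location/name may vary by release).
lctl get_param cpu_partition_table
{code}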
3) Should there be a relation between the number of credits defined for a node's LND driver and its buffer (or other relevant) settings?
For a router, I think you need to look at the total number of large/small/tiny buffers you've specified. {{lnetctl routing show}} shows you stats on these buffers, including the minimum credits for each. If the minimum credits dip into the negative, that means you have instances where you're queuing due to a lack of buffers. In that case you can increase the buffers allocated for that size. This can be done dynamically via: {{lnetctl set [tiny_buffers|small_buffers|large_buffers] <value>}}.
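As a quick illustration (the pool chosen and the value are arbitrary):
{code}
# Check the router buffer pools; a negative "min" for a pool means
# messages have had to queue because that pool ran dry.
lnetctl routing show

# Grow whichever pool is running short (value is arbitrary).
lnetctl set large_buffers 8192
{code}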
My above comments explain the peer_credits/credits relationship.
Let me know if you have other questions.