Details

    • Type: Question/Request
    • Resolution: Fixed
    • Priority: Minor
    • Environment: Lustre 2.12.5
      Infiniband (MLX4, MLX5)
      TCP
      OPA

    Description

      We recently experienced major issues switching Lustre routing clusters to 2.12 and ended up reverting them to 2.10. In trying to better understand LNet, I read through various documentation pages, but was left with several questions. Can you help answer the following questions and perhaps update the LNet docs as well?

      Questions:

      1. What is the data flow, or order of operations, for sending LNet messages from client to server, through routers? For example, a mock (and incorrect?) model might be:
        1. Client determines next hop router
        2. Client checks available routing buffer credits (rtr) on router
        3. Client checks available send credits (tx) and peer_credits to that router on self
        4. Client sends up to peer_credits messages, decrementing tx for each message
        5. Router receives messages into routing buffers, selected by message size, and decrements its routing buffer credits (rtr) for each message
        6. Router then acts as the client, repeating steps 1-5 above toward the next hop, as well as back to the original client as data is received
      2. Is there any need to manually add peers in a non-MR config? My understanding is no.
      3. Should a router have a peer entry for every node in the expanded SAN, including in other networks it needs to be routed to?
      4. The manual states "The number of credits currently in flight (number of transmit credits) is shown in the tx column.... Therefore, rtr – tx is the number of transmits in flight." It seems the "tx" description should say "available" rather than "in flight", so that rtr - tx would then be the number in flight, right? (See the worked illustration after this list.)
      5. Should a NID ever show for the wrong interface (e.g. tcp instead of o2ibXX)? We will sometimes see messages in logs from <addr>@tcp when it should be <addr>@o2ibX.
      6. Do the older mlx4 lnet settings need to be updated for mlx5, or are they still applicable (https://wiki.lustre.org/LNet_Router_Config_Guide#Configure_Lustre_Servers)?
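
      Regarding question 4, here is a worked illustration of how I read the credit columns; this is my interpretation (hedged), with column names taken from the peers file output:

        # Suppose peer_credits for a peer is configured to 8:
        #   max = 8   configured transmit credits for that peer
        #   tx  = 5   credits currently available (not yet consumed)
        # Messages in flight to that peer = max - tx = 8 - 5 = 3.
        # A negative "min" value means tx dropped below zero at some point,
        # i.e. sends had to queue while waiting for credits to be returned.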


        Activity

          [LUDOC-479] Need LNet clarifications

          gerrit Gerrit Updater added a comment -

          Andreas Dilger (adilger@whamcloud.com) merged in patch https://review.whamcloud.com/40143/
          Subject: LUDOC-479 lnet: Clarify transmit and routing credits
          Project: doc/manual
          Branch: master
          Current Patch Set:
          Commit: d2c7df42886ed80cf2e5a82d9a1521c0003dddf8

          charr Cameron Harr added a comment -

          Thanks again. I'll let you know if I have additional questions.


          ashehata Amir Shehata (Inactive) added a comment -

          Hi Cameron,

          I was thinking from the router perspective. If a router is forwarding between o2iblnd <-> socklnd and you configure the socklnd with more credits than the o2iblnd, you effectively widen the pipe on the socklnd side. So you allow more messages on the socklnd side while throttling the rate on the o2iblnd side. Wouldn't that have the desired effect? And now thinking about it, you can also attempt to throttle down the rate by adjusting the max_rpcs_in_flight parameter to reduce the number of RPCs on the o2iblnd side. You'd do that on the clients, I believe. Take a look at "39.3.1. Monitoring the Client RPC Stream" in the manual.
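
          For illustration, a hedged sketch of that client-side throttling (the value 8 is purely illustrative, not a recommendation):

            # On a client: inspect and lower the per-OSC RPC concurrency
            lctl get_param osc.*.max_rpcs_in_flight
            lctl set_param osc.*.max_rpcs_in_flight=8
            # To make the change persistent, run on the MGS:
            lctl set_param -P osc.*.max_rpcs_in_flight=8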

          RE the credits: when you see credits going negative, that means you're starting to queue internally because you're going into throttling mode. That in itself is okay, since the credits are used to limit the rate to peers. However, if the MIN credits value is getting larger and larger (in the negative), that means the system could be getting overwhelmed and is unable to cope with the traffic. In that case I don't think just increasing the credits will be enough; some other measures will need to be taken. Increasing the number of routers or servers is an option.
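
          A hedged sketch of where those minimum credit values can be read (paths are from memory and may vary by release):

            # Per-peer tx/rtr credits; the "min" columns record the lowest values seen
            cat /sys/kernel/debug/lnet/peers     # /proc/sys/lnet/peers on older releases
            # Router buffer pools, including the min credits per pool
            lnetctl routing show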

          Also note that when you start seeing a lot of queuing going on, that could eventually lead to timeouts as well, since a message is tagged with a deadline when you initially send it. The timeout value covers the time messages spend on the queue and the time messages spend on the wire before receiving a completion event (speaking from the o2iblnd perspective). It's a bit of a compound effect: as you get more queuing, the time messages wait before they get sent gets longer, which could result in more timeouts. So adjusting the timeout values in this scenario might be needed as well. I'm talking about lnet_transaction_timeout.
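
          A hedged sketch of adjusting that timeout (100 seconds is purely illustrative; defaults and supported syntax vary by release):

            # Persistently, via the lnet module options, e.g. in /etc/modprobe.d/lnet.conf:
            options lnet lnet_transaction_timeout=100
            # Or dynamically, if your lnetctl supports it:
            lnetctl set transaction_timeout 100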

          charr Cameron Harr added a comment -

          Thanks Amir. Can you clarify your comment of, "Since o2iblnd is generally more performant than the socklnd, it would make sense to have a larger number of peer_credits for the socklnd network"? Are you saying that because socklnd can't send/recv messages as fast as o2iblnd, you want to increase peer credits to allow more messages to its o2iblnd peers? Wouldn't increasing that number also cause more messages to be sent to the socklnd NI, overwhelming it more?

          As for credits/buffers, we do have many cases where the min number of RTR or TX credits (via the peers file) shows negative, but the buffers via lnetctl routing show are not close to negative. So, rather than a buffer issue, does this imply a higher number of peer_credits (and total credits) needs to be specified?


          ashehata Amir Shehata (Inactive) added a comment -

          1) In some cases, we're routing packets from a cluster compute node that may be OPA or IB to a cluster's router node to a data center router node, across a campus WAN, back to another data center router which sits on an IB SAN the Lustre cluster sits on. So, ko2iblnd <> ko2iblnd <> ko2iblnd/ksocklnd <> ksocklnd/ko2iblnd <> ko2iblnd. Does the credits parameter for the ksocklnd modules need to match the credits parameter for ko2iblnd on router nodes with both interfaces?

          2) Given the context in 1), do the number of ko2iblnd credits need to match on servers along the entire path or is it appropriate for router nodes to have a larger number of credits set?

          The peer_credits parameter determines how many concurrent messages can be in flight to the same peer. Since o2iblnd is generally more performant than the socklnd, it would make sense to have a larger number of peer_credits for the socklnd network. The o2iblnd negotiates the peer credits per connection, so even if the peer_credits on two nodes are different, they'll be negotiated down to the least common denominator value. I would recommend, then, keeping the peer_credits the same across homogeneous networks. That said, we know of a limitation with socklnd where the performance isn't great per interface. One workaround we currently have is to create multiple virtual interfaces on the same ethernet interface and then configure those in an MR config. This increases the performance. We're tracking this under: https://jira.whamcloud.com/browse/LU-12815. So this might be a way for you to increase the performance on the socklnd side.
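
          As a hypothetical sketch of that socklnd workaround (interface names and addresses are invented; see LU-12815 for the actual guidance):

            # Create IP aliases on the single physical ethernet port
            ip addr add 192.168.1.11/24 dev eth0 label eth0:1
            ip addr add 192.168.1.12/24 dev eth0 label eth0:2
            # Configure them as interfaces of the tcp network so MR can spread traffic across them
            lnetctl net add --net tcp --if eth0,eth0:1,eth0:2
            lnetctl export > /etc/lnet.conf      # persist the configuration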

          I would also differentiate between the credits module parameter and the peer_credits module parameter. The former determines the limit on the total concurrent sends to all peers on a particular network interface, while the latter limits the number of concurrent sends per peer. So if you increase the number of peer_credits, you'd want to increase the number of global credits for the NI as well.
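
          For example, a hedged sketch of scaling both together via the LND module options (values are illustrative only, e.g. in /etc/modprobe.d/lnet.conf):

            # peer_credits = concurrent sends per peer; credits = total concurrent sends per NI
            options ko2iblnd peer_credits=32 credits=1024
            options ksocklnd peer_credits=32 credits=1024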

          The credits are calculated per CPT. You can take a look at lnet_ni_tq_credits() for more details.

          3) Should there be a relation between the number of credits defined for a node's LND driver and its buffer (or other relevant) settings?

          For a router I think you need to look at the total number of large/small/tiny buffers you've specified. lnetctl routing show shows you stats on these buffers, including the minimum credits for each. If the minimum credits are dipping into the negative, that means you have instances where you're queuing due to lack of buffers. In that case you can increase the buffers allocated for that size. This can be done dynamically via: lnetctl set [tiny_buffers|small_buffers|large_buffers] <value>.
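
          A hedged sketch of that procedure (the buffer counts are illustrative):

            # Check the per-size buffer pools and their min credit values
            lnetctl routing show
            # If a pool's min credits go negative, grow that pool dynamically
            lnetctl set small_buffers 16384
            lnetctl set large_buffers 2048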

          My above comments explain the peer_credits/credits relationship.

          Let me know if you have other questions.

          charr Cameron Harr added a comment - edited

          Amir - I was trying to keep details of our specific issue out of this documentation ticket. I would like to run a few more questions past you, though:

          1) In some cases, we're routing packets from a cluster compute node that may be OPA or IB to a cluster's router node to a data center router node, across a campus WAN, back to another data center router which sits on an IB SAN the Lustre cluster sits on. So, ko2iblnd <> ko2iblnd <> ko2iblnd/ksocklnd <> ksocklnd/ko2iblnd <> ko2iblnd. Does the credits parameter for the ksocklnd modules need to match the credits parameter for ko2iblnd on router nodes with both interfaces?

          2) Given the context in 1), do the number of ko2iblnd credits need to match on servers along the entire path or is it appropriate for router nodes to have a larger number of credits set?

          3) Should there be a relation between the number of credits defined for a node's LND driver and its buffer (or other relevant) settings?


          ofaaland Olaf Faaland added a comment -

          Hi Amir,

          I'll create a separate ticket for our issue. It seems like we've worked through the general questions.


          People

            ashehata Amir Shehata (Inactive)
            charr Cameron Harr
            Votes: 0
            Watchers: 6
