Details

    • Type: Question/Request
    • Resolution: Fixed
    • Priority: Minor
    • None
    • None
    • Environment: Lustre 2.12.5
      Infiniband (MLX4, MLX5)
      TCP
      OPA

    Description

      We recently experienced major issues switching Lustre routing clusters to 2.12 and ended up reverting them to 2.10. In trying to better understand LNet, I read through various documentation pages, but was left with several questions. Can you help answer the following questions and perhaps update the LNet docs as well?

      Questions:

      1. What is the data flow, or order of operations, for sending LNet messages from client to server, through routers? For example, a mock (and incorrect?) model might be:
        1. Client determines next hop router
        2. Client checks available routing buffer credits (rtr) on router
        3. Client checks available send credits (tx) and peer_credits to that router on self
        4. Client sends <= #peer_credits messages, decrementing tx for each message
        5. Router receives messages into routing buffers (selected by message size) and decrements its routing buffer credits (rtr) for each message.
        6. Router then acts as the client, repeating steps 1-5 above to the next hop as well as back to the original client (as data is received)
      2. Is there any need to manually add peers in a non-MR config? My understanding is no.
      3. Should a router have a peer entry for every node in the expanded SAN, including in other networks it needs to be routed to?
      4. The manual states "The number of credits currently in flight (number of transmit credits) is shown in the tx column.... Therefore, rtr – tx is the number of transmits in flight." It seems "in flight" for the "tx" description should be "available" so that rtr-tx would be "in flight", right?
      5. Should a NID ever show for the wrong interface (e.g. tcp instead of o2ibXX)? We will sometimes see messages in logs from <addr>@tcp when it should be <addr>@o2ibX.
      6. Do the older mlx4 lnet settings need to be updated for mlx5, or are they still applicable (https://wiki.lustre.org/LNet_Router_Config_Guide#Configure_Lustre_Servers)?
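
      For context, the LNet state these questions touch on (credits, peers, routes, and router buffer pools) can be inspected with lnetctl. A minimal sketch, assuming a 2.10+ lnetctl; exact output fields vary by version:

        # Local network interfaces, their credits, and tunables
        lnetctl net show -v

        # Known peers and per-NID credit counters
        lnetctl peer show -v

        # Configured routes and, on a router, routing state
        lnetctl route show
        lnetctl routing show

        # Global LNet message and drop counters
        lnetctl stats show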


        Activity

          [LUDOC-479] Need LNet clarifications

          ofaaland Olaf Faaland added a comment -

          Hi Amir,

          I'll create a separate ticket for our issue.  It seems like we've worked through the general questions.


          ashehata Amir Shehata (Inactive) added a comment -

          In your upgrade procedure, do you bring down a router, upgrade it to 2.12, and then bring the router back up? And that's when you start seeing the timeout issues?

          Have you tried disabling discovery on the router as you bring it up?

          Would we be able to set up a debugging session to get to the bottom of this?

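          On the discovery point above, a minimal sketch of disabling LNet peer discovery on a router before bringing it up (the modprobe.d file name is illustrative):

            # Persistently disable peer discovery before the lnet module loads
            echo "options lnet lnet_peer_discovery_disabled=1" > /etc/modprobe.d/lnet-discovery.conf

            # Or toggle it at runtime on an already-configured node
            lnetctl set discovery 0

            # Confirm the current setting
            lnetctl global show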

          charr Cameron Harr added a comment -

          Amir,

          We're still having major timeout issues when we try to bring up routers (one by one) on 2.12. Another question: is it fine to mix 2.10 and 2.12 routers? I would think so, but wanted to verify.

          charr Cameron Harr added a comment -

          Thanks for the clarification. It's helpful.

          ashehata Amir Shehata (Inactive) added a comment -

          "On the second-to-last question, the confusion on our part in some cases was because the node should only have had a single NID (@o2ibXX) but the error message referenced @tcp"

          There is currently an issue where this could happen. We have fixed it on master.

          LU-13477 lnet: Force full discovery cycle

          We're currently trying to port it back to b2_12.

          "can you also clarify the separation of buffers and routing buffers for me? I understand from your flow diagram the regular buffers (divided in bins by page size) are for receiving messages. Are the routing buffers then used once the node (a router) determines the message is not for it, at which point it places the message in a routing buffer to be processed. Is that correct? Are those routing buffers typically larger than the regular buffers?"

          For non-gateway nodes (i.e. clients and servers), Lustre usually sets up the RDMA buffers to receive into or transmit from. For gateways, however, there are no such buffers; in fact, Lustre does not even have to be loaded on the gateways at all. A gateway therefore needs to allocate buffers to accept the RDMAed data and then turn around and forward that data by RDMAing it to the next hop. These are the buffers I was trying to show in the second flow diagram in the link above. In my diagram there are no "regular buffers"; I reference only "routing buffers". I'll update the diagram to clarify.

          There are 3 router buffer pools: tiny, small, and large. These are only allocated when you turn on the routing feature. When a gateway receives a message and determines that it needs to forward it to the next hop, it looks at the size of the data in the message and pulls an appropriately sized buffer (max 1 MB). It receives the RDMAed data into that buffer and then forwards it to the next hop.

          Does that answer your question?
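
          To make the buffer-pool description above concrete, a minimal sketch of enabling routing and sizing the three pools on a gateway with lnetctl (the counts are illustrative, not recommendations):

            # Enable the routing feature; the tiny/small/large pools are allocated at this point
            lnetctl set routing 1

            # Optionally resize the pools (values are buffer counts, not bytes)
            lnetctl set tiny_buffers 2048
            lnetctl set small_buffers 16384
            lnetctl set large_buffers 1024

            # Inspect routing state and buffer/credit usage
            lnetctl routing show
            lnetctl stats show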

          charr Cameron Harr added a comment -

          Amir, can you also clarify the separation of buffers and routing buffers for me? I understand from your flow diagram the regular buffers (divided in bins by page size) are for receiving messages. Are the routing buffers then used once the node (a router) determines the message is not for it, at which point it places the message in a routing buffer to be processed. Is that correct? Are those routing buffers typically larger than the regular buffers?

          charr Cameron Harr added a comment -

          Amir,

          This is very helpful and answers most of my questions. On the second-to-last question, the confusion on our part in some cases was because the node should only have had a single NID (@o2ibXX) but the error message referenced @tcp. I'll have to recreate the issue and dump lnetctl settings to make sure there isn't a TCP network active on it when we get that message.

          Regarding the "Issues" we've been having, we're still trying to characterize them. When we enable Lustre 2.12 on the routers, messages from some clients get jammed up and we start seeing hangs until we take the 2.12 routers offline.


          gerrit Gerrit Updater added a comment -

          Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40143
          Subject: LUDOC-479 lnet: Clarify transmit and routing credits
          Project: doc/manual
          Branch: master
          Current Patch Set: 1
          Commit: 9211ffd0087986733e03bc62b2d9f894a19ac3a3

          ashehata Amir Shehata (Inactive) added a comment -

          1. What is the data flow, or order of operations, for sending LNet messages from client to server, through routers? For example, a mock (and incorrect?) model might be:
            • I wrote a detailed document on how message processing works in LNet: https://wiki.whamcloud.com/x/LQfvC
              • I'm not sure if this level of detail is appropriate for the user manual.
          2. Is there any need to manually add peers in a non-MR config? My understanding is no.
            • We should be careful when we talk about a non-MR config: in 2.11 and later the MR code is always active. Whether there are multiple local or remote interfaces available is a different matter. With discovery enabled, you don't have to do any manual peer addition; the node will automatically discover all of the peer's NIDs, whether the peer has multiple interfaces or not. If you disable discovery and the peer (and the local node) are single rail, you still don't need to manually add peer NIDs. However, if you disable discovery and you do want to use multiple interfaces, then you'll need to manually add the peer NIDs using
              •  lnetctl peer add --prim_nid <prim_nid> --nid <list of nids>
          3. Should a router have a peer entry for every node in the expanded SAN, including in other networks it needs to be routed to?
            • You don't have to manually add any peers. Whenever a router creates a connection to any peer NID, it creates a local representation of that peer. With discovery on, it will attempt to discover that NID and create a local peer representation which includes all of the peer's NIDs. Take a multi-hop setup as an example: a router sends to another router, which sends to the final destination. Because of discovery, the router will attempt to discover both the next hop and the final destination on the first go. So if you run
              lnetctl peer show

          then you'll find that the router has peer representations of both the next hop and the final destination.

          4. The manual states "The number of credits currently in flight (number of transmit credits) is shown in the tx column.... Therefore, rtr – tx is the number of transmits in flight." It seems "in flight" for the "tx" description should be "available" so that rtr-tx would be "in flight", right?
            • I'm not sure the explanation in the manual is correct. The rtr column shows the currently available router credits, and the tx column shows the currently available transmit credits. The rtr credits are decremented when a message is received and a routing buffer is pulled. If the rtr credits go negative, it means received messages are queued because no rtr credits are available; they remain queued until rtr credits are freed (i.e. messages from that peer which are being forwarded to the final destination complete). The tx credits are decremented when we pass a message down to the LND for sending to the peer. To work through an example, say for simplicity that the max rtr credits and max tx credits are both 32. Then max_rtr_credits - available_rtr_credits = active messages from the peer in the process of being routed to their final destination, and max_tx_credits - available_tx_credits = in-flight messages to the peer. I don't think rtr-tx would really give us any meaningful information, would it? (A short inspection sketch follows at the end of this comment.)
          5. Should a NID ever show for the wrong interface (e.g. tcp instead of o2ibXX)? We will sometimes see messages in logs from <addr>@tcp when it should be <addr>@o2ibX.
            • This comes back to discovery. If you have a node with multiple NIDs, the first configured NID becomes the primary NID of the peer, so you have to keep that in mind when looking at the logs. Even though the primary NID can be tcp, the NID actually used can be the o2ib one. For example, suppose you have two nodes: the first has tcp and o2ib NIDs configured, with tcp as the primary NID; the second has only an o2ib NID. When the second node discovers the first, it determines that the primary NID is the tcp one. So you might see the tcp NID in the logs referring to the peer even though the o2ib NID is the one being used. The primary NID is, in effect, the identifying NID of the peer; the peer can have multiple NIDs and only a subset of them may be used.
              That said, there is an issue where discovery might glitch and you might get an incomplete representation of the peer. This has been resolved in master.
          6. Do the older mlx4 lnet settings need to be updated for mlx5, or are they still applicable (https://wiki.lustre.org/LNet_Router_Config_Guide#Configure_Lustre_Servers)?
            • Based on what I have observed at different sites, I usually recommend the following values: peer_credits=32, peer_credits_hiw=16, concurrent_sends=64. These seem to work well for mlx5. The rest of the values can remain at their defaults. (A configuration sketch follows at the end of this comment.)

           It might be helpful if you explain to me the major issues you ran into when you tried to switch to 2.12.5. If I have more specifics I might be able to provide more focused feedback.
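
          To make the credit arithmetic in the tx/rtr answer above easier to follow, a minimal inspection sketch using the worked numbers from that answer (both maxima set to 32; the figures are illustrative):

            # On the router, dump per-peer credit counters
            lnetctl peer show -v

            # Reading the counters, roughly:
            #   32 (max tx credits)  - 20 (available tx credits)  = 12 messages in flight to the peer
            #   32 (max rtr credits) - 25 (available rtr credits) =  7 messages from the peer still being routed
            # A negative available value means messages are queued waiting for credits.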

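          A configuration sketch for the mlx5 recommendation above, using the standard ko2iblnd module options (the file name is illustrative; all other tunables are left at their defaults):

            # /etc/modprobe.d/ko2iblnd.conf
            options ko2iblnd peer_credits=32 peer_credits_hiw=16 concurrent_sends=64

          Depending on version, the same values may also be expressible per-NI in an lnetctl YAML configuration; the module-option form above matches the wiki guide referenced in the question.
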
          pjones Peter Jones added a comment -

          Amir

          Could you please advise?

          Thanks

          Peter


          People

              Assignee: ashehata Amir Shehata (Inactive)
              Reporter: charr Cameron Harr
              Votes: 0
              Watchers: 6
