  Lustre / LU-7646

Infinite CON RACE Condition after rebooting LNet router

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.9.0
    • Labels: None
    • Severity: 3

    Description

      While investigating/working on the fix for LU-7569, we stumbled on another bug when testing on a customer's system. When an LNet router is rebooted and mlx5-based cards are being used, it is possible for a client's attempt to reconnect to the router to get stuck in a permanent connecting state. When the router comes back up and tries to create a connection to the client, that connection is rejected as CON RACE. This becomes an infinite loop because the stuck connection attempt is always present on the client, triggering the rejection.

      This ticket has been opened to create a fix which complements LU-7569. I appreciate that the mlx5 driver should be fixed to prevent stuck connection attempts, but at the same time, we need LNet to be immune to such situations, as the result is pretty severe. We need self-healing code here.
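
      To make the failure mode concrete, here is a minimal sketch of the connection-race tie-break on the passive (incoming) side, assuming the o2iblnd behaviour discussed in the comments below. The structure, helper name, and reject value are simplified stand-ins, not the actual source:

      /* Simplified stand-ins; the real definitions live in the o2iblnd sources. */
      #define SKETCH_REJECT_CONN_RACE 3   /* stands in for IBLND_REJECT_CONN_RACE */

      struct sketch_peer {
          unsigned long long ibp_nid;        /* peer NID */
          int                ibp_connecting; /* active connection attempts in flight */
      };

      /* Tie-break sketch: the higher NID wins, so while our own active connect is
       * outstanding we reject the lower-NID peer with CON RACE.  If that active
       * connect is stuck and never completes or fails, ibp_connecting never drops
       * back to 0 and every reconnect from the rebooted router is rejected again. */
      static int sketch_passive_connect(struct sketch_peer *peer,
                                        unsigned long long remote_nid,
                                        unsigned long long local_nid)
      {
          if (peer->ibp_connecting != 0 && remote_nid < local_nid)
              return SKETCH_REJECT_CONN_RACE;  /* the infinite-loop case when stuck */

          return 0;  /* otherwise accept the incoming connection */
      }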

          Activity


            morrone Christopher Morrone (Inactive) added a comment:

            Annnnd, now I can't get the bug to hit at all. Sigh.

            Well, I've got some debugging in place that might give us some insight into the active connection attempt state... but only if there is a single active connection attempt at a time. If there are multiple, they'll trample the same state variable that I added to lnet_peer_t.


            morrone Christopher Morrone (Inactive) added a comment (edited):

            I suspect that a timeout counts as an error and those events are supposed to be generated.

            It is unfortunate that there is no easily accessible structure that can give us the current state of the active connection attempt when it is stuck.

            doug Doug Oucharek (Inactive) added a comment (edited):

            It is interesting that both rdma_resolve_addr() and rdma_resolve_route() have a timeout parameter. We pass in a default of 50 seconds. The man pages do not say what happens when a timeout occurs. Do we get an RDMA_CM_EVENT_ADDR_ERROR or RDMA_CM_EVENT_ROUTE_ERROR CM event? If not, then we could be missing the timeout notification that would allow us to do what you have indicated above.

            Where can I find some "real" OFED documentation?
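
            For reference, a minimal sketch of the event-handler side of the resolution question above, using the kernel rdma_cm API. The event names and function signatures are real; the handler body, return values, and the 50-second value are illustrative assumptions about what the LND could do if a resolution timeout really is delivered as an *_ERROR event:

            #include <linux/errno.h>
            #include <linux/printk.h>
            #include <rdma/rdma_cm.h>

            /* Illustrative only: if an rdma_resolve_addr()/rdma_resolve_route()
             * timeout is reported as ADDR_ERROR/ROUTE_ERROR, this is where the
             * LND would see it and could fail the connection attempt instead of
             * waiting forever. */
            static int sketch_cm_callback(struct rdma_cm_id *cmid,
                                          struct rdma_cm_event *event)
            {
                switch (event->event) {
                case RDMA_CM_EVENT_ADDR_RESOLVED:
                    /* next phase: resolve the route, again with a timeout */
                    return rdma_resolve_route(cmid, 50 * 1000 /* ms */);

                case RDMA_CM_EVENT_ADDR_ERROR:
                case RDMA_CM_EVENT_ROUTE_ERROR:
                    /* event->status carries the failure reason; treat it as a
                     * failed connect so the peer state is cleaned up and the
                     * connection can be retried. */
                    pr_warn("resolution failed: status %d\n", event->status);
                    return -EHOSTUNREACH;

                case RDMA_CM_EVENT_ROUTE_RESOLVED:
                    /* ready to issue the actual connect request */
                    return 0;

                default:
                    return 0;
                }
            }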


            morrone Christopher Morrone (Inactive) added a comment:

            Since we are seeing the stuck side rejecting connections indefinitely because of reason IBLND_REJECT_CONN_RACE, we know that ibp_connecting is non-zero. There are only two places that make that happen, kiblnd_launch_tx() and kiblnd_reconnect_peer(), both before calling kiblnd_connect_peer().

            Since this problem tends to happen when the node with the lower NID is rebooted, it seems likely we are mostly concerned with the kiblnd_reconnect_peer() path.

            kiblnd_connect_peer() likely didn't fail, because if it did it would have called kiblnd_peer_connect_failed(), which in turn decrements ibp_connecting, so we wouldn't be stuck in a single connection attempt.

            Hmmm.

            Of course, kiblnd_connect_peer() really just starts an asynchronous connection process. It kicks off the address resolution, and then we have to wait for the RDMA_CM_EVENT_ADDR_RESOLVED callback. That in turn starts route resolution, and we wait for RDMA_CM_EVENT_ROUTE_RESOLVED. That callback is where we call kiblnd_active_connect().

            And so on.

            I think it would be good to know exactly which phase the stuck connect is in. I can work on a debug patch to reveal that tomorrow.
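
            A debug patch of that shape could look roughly like this; the enum and the ibp_connect_phase field mentioned below are entirely hypothetical, added only for illustration:

            /* Hypothetical debug aid: record which phase the active connect has
             * reached so a later dump can show where it is stuck.  Neither the
             * enum nor the field exists in the current code. */
            enum sketch_connect_phase {
                SKETCH_CONN_IDLE = 0,
                SKETCH_CONN_RESOLVING_ADDR,   /* rdma_resolve_addr() issued       */
                SKETCH_CONN_RESOLVING_ROUTE,  /* RDMA_CM_EVENT_ADDR_RESOLVED seen */
                SKETCH_CONN_CONNECTING,       /* active connect request issued    */
            };

            /* The peer would carry the phase, updated at each step, e.g.:
             *     peer->ibp_connect_phase = SKETCH_CONN_RESOLVING_ADDR;
             * and the value would be printed when the attempt is found stuck. */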

            It is not clear to me why a better work-around wouldn't be to augment kiblnd_check_conns() to check for, and time out, active connection attempts. We already put an arbitrary time limit on queued lnd tx messages; if any messages have timed out, we explicitly close the connection and reconnect.

            It would seem like we could add logic there to watch for active connection attempts that have taken too long, abort them, and restart the connection attempt.
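
            In sketch form, such a check might look like this. The deadline field and the abort/retry helpers are hypothetical placeholders; only kiblnd_check_conns() and ibp_connecting come from the existing code:

            #include <linux/jiffies.h>

            /* Minimal stand-ins; the real peer structure lives in o2iblnd.h and
             * ibp_connect_deadline would be a new field. */
            struct sketch_peer {
                int           ibp_connecting;        /* active connects in flight */
                unsigned long ibp_connect_deadline;  /* jiffies; hypothetical     */
            };

            static void sketch_abort_active_connect(struct sketch_peer *peer)
            {
                /* placeholder: would tear down the rdma_cm_id and drop ibp_connecting */
                (void)peer;
            }

            static void sketch_retry_connect(struct sketch_peer *peer)
            {
                /* placeholder: would kick off a fresh connect attempt */
                (void)peer;
            }

            /* Hypothetical addition to a kiblnd_check_conns()-style scan: if an
             * active connection attempt has outlived its deadline, abort it and
             * start over instead of leaving ibp_connecting set forever. */
            static void sketch_check_stuck_connect(struct sketch_peer *peer)
            {
                if (peer->ibp_connecting != 0 &&
                    time_after(jiffies, peer->ibp_connect_deadline)) {
                    sketch_abort_active_connect(peer);
                    sketch_retry_connect(peer);
                }
            }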


            morrone Christopher Morrone (Inactive) added a comment:

            Oh, and as for not having a system to test it on... now you do! If you've got debug patches and things to investigate, we can facilitate that on our testbed.


            morrone Christopher Morrone (Inactive) added a comment:

            I'm all for starting new tickets for separate problems. But the connection jam is exactly the problem being dealt with in this ticket. Why would we start a new one?


            doug Doug Oucharek (Inactive) added a comment:

            Interesting. I had hypothesised that this issue is either caused by, or aggravated by, MLX5. We had never seen this until some clusters started using MLX5. I suspect the connection jam is MLX5-related.

            Sadly, I have no access to MLX5 so cannot dig into the nature of the connection lock up. The current patch, though not perfect, allows systems to move forward and work even if there is a potential of a "leaked" connection structure or two.

            I think the connection jam should be a new Jira ticket. We need to get Mellanox involved to help understand the MLX5-specific change which is triggering this.


            morrone Christopher Morrone (Inactive) added a comment:

            Yes, we do. In our testbed we have MDS and OSS nodes on the same mlx5 network. The probability of getting into this connection race is very high even without significant clients or load.


            doug Doug Oucharek (Inactive) added a comment:

            Do you have an easy-to-reproduce scenario for this infinite CON RACE? The original problem involved a router surrounded by thousands of nodes; rebooting it triggered a mass of reconnections. The probability of getting into this infinite CON RACE is very high, especially if MLX5 is involved.


            morrone Christopher Morrone (Inactive) added a comment:

            The IB connection operation is hidden in the o2iblnd below the level of lnet credits. It would not negatively affect any of the current guarantees to abort the IB connection operation (not the ptlrpc level connection operation) and retry.

            Yes, waiting for 20 messages that come in at 1-second intervals is essentially a strange way to implement a 20-second timeout. But that seems to me to be the more complicated solution to understand and maintain in the long run versus an actual timeout.

            After all, the current solution basically just goes "oh, you've tried 20 times, sure, you can connect". That is fine in the normal case of resolving a connection race, because asynchronously elsewhere the other racing connection message is expected to get an error and clean up whatever resources were associated with it. But here we already know that is never going to happen, so aren't we leaking resources every time? Couldn't this potentially cause problems on long-running systems?


            doug Doug Oucharek (Inactive) added a comment:

            That would mean adding something to LNet it currently does not have: a timeout. LNet depends on two things: 1) that we have a Reliable Connection (RC for IB) and that our own QoS mechanism (credits and peer_credits) saves us from packet drops, and 2) that the layers above LNet will let us know when something has taken too long to happen.

            I'm not sure a timer will make this work any better than it does with a counter. Once we bang our head into the CON RACE brick wall 20 times, I think we can be pretty sure the connecting connection which is in our way is stuck and can be abandoned. I originally had that set to just 2 failures as I'm pretty sure that would be good enough to declare a connection stuck. But inspectors convinced me to up it to 20. Simple solutions are usually the best approach.
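
            In sketch form, the counter-based approach amounts to something like this. The ibp_races field, the helper name, and the threshold constant are illustrative stand-ins, not the actual patch; only ibp_connecting and the CON RACE rejection come from the existing code:

            #define SKETCH_CONN_RACE_MAX 20   /* illustrative; the 20 discussed above */

            /* Minimal stand-in for the peer state used by the sketch. */
            struct sketch_peer {
                int ibp_connecting;   /* our own active connect still in flight   */
                int ibp_races;        /* hypothetical: CON RACE rejections so far */
            };

            /* Hypothetical: evaluated where the passive connect would normally be
             * rejected with IBLND_REJECT_CONN_RACE.  After enough consecutive
             * rejections we assume our own active connect is stuck, abandon it,
             * and let the incoming connection through. */
            static int sketch_should_reject_conn_race(struct sketch_peer *peer)
            {
                if (peer->ibp_connecting == 0)
                    return 0;                           /* no race at all */

                if (++peer->ibp_races < SKETCH_CONN_RACE_MAX)
                    return 1;                           /* reject: normal race handling */

                /* Too many collisions: declare the stuck active connect abandoned
                 * (possibly leaking its connection state, as noted above) so the
                 * peer can finally reconnect. */
                peer->ibp_connecting = 0;
                peer->ibp_races = 0;
                return 0;                               /* accept the incoming connect */
            }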


            People

              Assignee: doug Doug Oucharek (Inactive)
              Reporter: doug Doug Oucharek (Inactive)
              Votes: 0
              Watchers: 17
