LU-7646: Infinite CON RACE Condition after rebooting LNet router

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.9.0

    Description

      While investigating/working on the fix for LU-7569, we stumbled on another bug when testing on a customer's system. When an LNet router is rebooted and mlx5-based cards are being used, it is possible for a client's attempt to reconnect to the router to get stuck in a permanent connecting state. When the router comes back up and tries to create a connection back to the client, that connection is rejected as CON RACE. This becomes an infinite loop because the stuck connection attempt is always present on the client, triggering the rejection.

      This ticket has been opened to create a fix which complements LU-7569. I appreciate that the mlx5 driver should be fixed to prevent stuck connection attempts, but at the same time, we need LNet to be immune to such situations as the result is pretty severe. We need self-healing code here.
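
      For reference, the rejection loop described above plays out in kiblnd_passive_connect() in o2iblnd_cb.c. A heavily simplified sketch of the decision (locking, version negotiation, and the exact NID tie-break are omitted; kiblnd_reject_with() is a hypothetical stand-in for the real rejection path, and the return value is illustrative):

      /* Simplified sketch, not the verbatim o2iblnd code: a connection
       * request arrives from a peer we are already actively connecting to. */
      if (peer2 != NULL && peer2->ibp_connecting != 0) {
              /* We expect our own active attempt to win the race.  If that
               * attempt is stuck (the mlx5 case), it never completes, the
               * peer keeps retrying, and every retry is rejected again. */
              kiblnd_reject_with(cmid, IBLND_REJECT_CONN_RACE); /* hypothetical helper */
              return -EALREADY;                                 /* illustrative */
      }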

          Activity

            pjones Peter Jones added a comment -

            Landed for 2.9


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/19430/
            Subject: LU-7646 lnet: Stop Infinite CON RACE Condition
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 94f757bf67d58694201b2434f7879974c7abd622


            morrone Christopher Morrone (Inactive) added a comment -

            It's back!

            [ 8866.277966] LNet: 509:0:(o2iblnd_cb.c:2418:kiblnd_passive_connect()) LU-7646 connection attempt to 172.19.1.52@o2ib100 probably stuck (state=2)
            

            state=2 is almost certainly RDMA_CM_EVENT_ROUTE_RESOLVED (if the compiler is not doing something strange with enums). That would imply that we entered "case RDMA_CM_EVENT_ROUTE_RESOLVED".

            And I don't see any error messages from that section. That would seem to imply that we called rdma_connect successfully and never received an event after that.
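
            For reference, the first entries of enum rdma_cm_event_type from the upstream <rdma/rdma_cm.h> (later entries omitted; these leading values have not changed across the kernel versions in question):

            enum rdma_cm_event_type {
                    RDMA_CM_EVENT_ADDR_RESOLVED,    /* 0 */
                    RDMA_CM_EVENT_ADDR_ERROR,       /* 1 */
                    RDMA_CM_EVENT_ROUTE_RESOLVED,   /* 2  <-- "state=2" above */
                    RDMA_CM_EVENT_ROUTE_ERROR,      /* 3 */
                    RDMA_CM_EVENT_CONNECT_REQUEST,  /* 4 */
                    ...
            };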


            morrone Christopher Morrone (Inactive) added a comment -

            Annnnd, now I can't get the bug to hit at all. Sigh.

            Well, I've got some debugging in place that might give us some insight into the active connection attempt state...but only if there is a single active connection attempt at a time. If there are multiple, they'll trample the same state variable that I added to lnet_peer_t.
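
            Roughly, the debugging looks like this (a sketch only: ibp_dbg_state is a hypothetical field name and the real debug patch may differ; it does illustrate why concurrent attempts trample each other):

            /* In the active-side CM event handler (kiblnd_cm_callback() in
             * o2iblnd_cb.c), remember the last event seen for this peer.
             * One field per peer is the limitation mentioned above. */
            peer->ibp_dbg_state = event->event;     /* hypothetical debug field */

            /* In kiblnd_passive_connect(), when rejecting as CON RACE,
             * report how far the stuck active attempt got: */
            CWARN("LU-7646 connection attempt to %s probably stuck (state=%d)\n",
                  libcfs_nid2str(peer2->ibp_nid), peer2->ibp_dbg_state);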


            morrone Christopher Morrone (Inactive) added a comment - edited

            I suspect that a timeout counts as an error and those events are supposed to be generated.

            It is unfortunate that there is no easily accessible structure that can give us the current state of the active connection attempt when it is stuck.

            doug Doug Oucharek (Inactive) added a comment - edited

            It is interesting that both rdma_resolve_addr() and rdma_resolve_route() have a timeout parameter. We pass in a default of 50 secs. The man pages do not say what happens when a timeout occurs. Do we get an RDMA_CM_EVENT_ADDR_ERROR or RDMA_CM_EVENT_ROUTE_ERROR CM event? If not, then we could be missing out on a timeout which would allow us to do what you have indicated above.

            Where can I find some "real" OFED documentation?
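
            For reference, the kernel prototypes from <rdma/rdma_cm.h> (the user-space librdmacm versions have the same shape; the timeout is in milliseconds, so the 50 sec default mentioned above is passed as 50000):

            int rdma_resolve_addr(struct rdma_cm_id *id, struct sockaddr *src_addr,
                                  struct sockaddr *dst_addr, int timeout_ms);
            int rdma_resolve_route(struct rdma_cm_id *id, int timeout_ms);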


            morrone Christopher Morrone (Inactive) added a comment -

            Since we are seeing the stuck side rejecting connections indefinitely because of reason IBLND_REJECT_CONN_RACE, we know that ibp_connecting is non-zero. There are only two places that make that happen, kiblnd_launch_tx() and kiblnd_reconnect_peer(), both before calling kiblnd_connect_peer().

            Since this problem tends to happen when the node with the lower NID is rebooted, it seems likely we are mostly concerned with the kiblnd_reconnect_peer() path.

            kiblnd_connect_peer() likely didn't fail, because if it did it would have called kiblnd_peer_connect_failed(), which in turn decrements ibp_connecting, so we wouldn't be stuck in a single connection attempt.

            Hmmm.

            Of course, kiblnd_connect_peer() really just starts an asynchronous connection process. It kicks off the address resolution, and then we have to wait for the RDMA_CM_EVENT_ADDR_RESOLVED callback. That in turn starts route resolution, and we wait for RDMA_CM_EVENT_ROUTE_RESOLVED. That callback is where we call kiblnd_active_connect().

            And so on.
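
            Condensed, the active-side flow just described looks like this (a sketch of the switch in kiblnd_cm_callback() in o2iblnd_cb.c; error handling, locking, and reference counting omitted):

            switch (event->event) {
            case RDMA_CM_EVENT_ADDR_RESOLVED:
                    /* address resolved: kick off route resolution, then wait */
                    rc = rdma_resolve_route(cmid, timeout_ms);
                    break;

            case RDMA_CM_EVENT_ROUTE_RESOLVED:
                    /* route resolved: kiblnd_active_connect() builds the conn
                     * and calls rdma_connect(), then we wait again */
                    rc = kiblnd_active_connect(cmid);
                    break;

            case RDMA_CM_EVENT_ESTABLISHED:
                    /* the peer accepted; the connection is now live */
                    ...
            }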

            I think it would be good to know exactly which phase the stuck connect is in. I can work on a debug patch to reveal that tomorrow.

            It is not clear to me why a better work-around wouldn't be to augment kiblnd_check_conns() to check for and time out active connection attempts. We already put an arbitrary time limit on any LND-queued tx messages; if any messages have timed out, we explicitly close the connection and reconnect.

            It would seem like we could add logic there to watch for active connection attempts that have taken too long, abort them, and restart the connection attempt, as sketched below.
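
            A sketch of that idea (ibp_connect_deadline and kiblnd_abort_connect() are hypothetical names, not existing o2iblnd symbols; locking omitted): stamp a deadline on the peer when kiblnd_connect_peer() starts an attempt, then let the periodic connd pass give up on attempts that outlive it so a fresh attempt can be launched.

            /* Run from (or alongside) kiblnd_check_conns() on each pass of
             * the connd thread.  Sketch only. */
            list_for_each_entry(peer, &kiblnd_data.kib_peers[i], ibp_list) {
                    if (peer->ibp_connecting == 0)
                            continue;               /* no active attempt in flight */

                    if (cfs_time_aftereq(cfs_time_current(),
                                         peer->ibp_connect_deadline)) {
                            /* Tear down the stuck attempt (destroy the cm_id,
                             * drop ibp_connecting) so a new one can start. */
                            kiblnd_abort_connect(peer);     /* hypothetical helper */
                    }
            }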


            morrone Christopher Morrone (Inactive) added a comment -

            Oh, and as for not having a system to test it on...now you do! If you've got debug patches and things to investigate, we can facilitate that on our testbed.


            morrone Christopher Morrone (Inactive) added a comment -

            I'm all for starting new tickets for separate problems. But the connection jam is exactly the problem being dealt with in this ticket. Why would we start a new one?


            doug Doug Oucharek (Inactive) added a comment -

            Interesting. I had hypothesised that this issue is either caused by, or exacerbated by, MLX5. We had never seen this until some clusters started using MLX5. I suspect the connection jam is MLX5-related.

            Sadly, I have no access to MLX5 so cannot dig into the nature of the connection lock up. The current patch, though not perfect, allows systems to move forward and work even if there is a potential of a "leaked" connection structure or two.

            I think the connection jam should be a new Jira ticket. We need to get Mellanox involved to help understand the MLX5-specific change which is triggering this.


            People

              Assignee: Doug Oucharek (Inactive)
              Reporter: Doug Oucharek (Inactive)
              Votes: 0
              Watchers: 17
