
LU-7646: Infinite CON RACE Condition after rebooting LNet router

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version: Lustre 2.9.0

    Description

      While investigating and working on the fix for LU-7569, we stumbled on another bug when testing on a customer's system. When an LNet router is rebooted and mlx5-based cards are being used, it is possible for a client's attempt to reconnect to the router to get stuck in a permanent connecting state. When the router comes up and tries to create a connection back to the client, that connection will be rejected as CON RACE. This is an infinite loop because the stuck connection is always present on the client, triggering the rejection.

      This ticket has been opened to create a fix which complements LU-7569. I appreciate that the mlx5 driver should be fixed to prevent stuck connection attempts, but at the same time we need LNet to be immune to such situations, as the result is pretty severe. We need self-healing code here.
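
      A toy user-space model of the loop described above (peer_state and handle_connreq are invented names for illustration only; the real handling lives in the o2iblnd code referenced later in this ticket):

      #include <stdbool.h>
      #include <stdio.h>

      /* Invented stand-in for the client's view of one peer. */
      struct peer_state {
              bool connecting;   /* our own connect attempt is still pending */
      };

      /* What the client does with each incoming connection request. */
      static const char *handle_connreq(const struct peer_state *p)
      {
              if (p->connecting) {
                      /* Both sides look like they are connecting: a connection
                       * race.  Reject the router's request and keep waiting for
                       * our own attempt, which in this failure mode never
                       * completes. */
                      return "reject: CON RACE";
              }
              return "accept";
      }

      int main(void)
      {
              /* The client's own connect attempt is stuck (e.g. lost in the
               * mlx5 driver), so 'connecting' never clears and every retry by
               * the router is rejected. */
              struct peer_state client = { .connecting = true };

              for (int i = 1; i <= 5; i++)
                      printf("router connreq %d -> %s\n", i, handle_connreq(&client));
              return 0;
      }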


          Activity


            morrone Christopher Morrone (Inactive) added a comment:

            Oh, and as for not having a system to test it on... now you do! If you've got debug patches and things to investigate, we can facilitate that on our testbed.

            morrone Christopher Morrone (Inactive) added a comment:

            I'm all for starting new tickets for separate problems. But the connection jam is exactly the problem being dealt with in this ticket. Why would we start a new one?

            doug Doug Oucharek (Inactive) added a comment:

            Interesting. I had hypothesised that this issue is either caused by, or augmented by, MLX5. We had never seen this until some clusters started using MLX5. I suspect the connection jam is MLX5-related.

            Sadly, I have no access to MLX5 so cannot dig into the nature of the connection lock-up. The current patch, though not perfect, allows systems to move forward and work even if there is the potential for a "leaked" connection structure or two.

            I think the connection jam should be a new Jira ticket. We need to get Mellanox involved to help understand the MLX5-specific change which is triggering this.

            morrone Christopher Morrone (Inactive) added a comment:

            Yes, we do. In our testbed we have MDS and OSS nodes on the same mlx5 network. The probability of getting into this connection race is very high even without significant clients or load.

            doug Doug Oucharek (Inactive) added a comment:

            Do you have an easy-to-reproduce scenario for this infinite CON RACE? The original problem involved a router surrounded by thousands of nodes, where a reboot triggered a mass of reconnections. The probability of getting into this infinite CON RACE is very high, especially if MLX5 is involved.

            morrone Christopher Morrone (Inactive) added a comment:

            The IB connection operation is hidden in the o2iblnd below the level of lnet credits. It would not negatively affect any of the current guarantees to abort the IB connection operation (not the ptlrpc-level connection operation) and retry.

            Yes, waiting for 20 messages that come in at one-second intervals is essentially a strange way to implement a 20 second timeout. But that seems to me the more complicated solution to understand and maintain in the long run versus an actual timeout.

            After all, the current solution basically just goes "oh, you've tried 20 times, sure, you can connect". That is fine in the normal case of resolving a connection race, because asynchronously elsewhere the other racing connection message is expected to get an error and clean up whatever resources were associated with it. But here we already know that is never going to happen, so aren't we leaking resources every time? Couldn't this potentially cause problems on long-running systems?

            doug Doug Oucharek (Inactive) added a comment:

            That would mean adding something to LNet it currently does not have: a timeout. LNet depends on two things: (1) that we have a Reliable Connection (RC for IB) and that our own QoS mechanism (credits and peer_credits) saves us from packet drops, and (2) that the layers above LNet will let us know when something has taken too long to happen.

            I'm not sure a timer will make this work any better than it does with a counter. Once we bang our head into the CON RACE brick wall 20 times, I think we can be pretty sure the connecting connection which is in our way is stuck and can be abandoned. I originally had that set to just 2 failures, as I'm pretty sure that would be good enough to declare a connection stuck, but inspectors convinced me to up it to 20. Simple solutions are usually the best approach.
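
            A toy user-space sketch of that counter idea (MAX_CONN_RACE, conn_attempt and resolve_race are invented names; this illustrates the approach, not patch 19430 itself):

            #include <stdbool.h>
            #include <stdio.h>

            #define MAX_CONN_RACE 20   /* inspectors asked for 20 rather than 2 */

            struct conn_attempt {
                    bool connecting;   /* our own connect attempt is pending */
                    int  races;        /* consecutive CON RACE rejections    */
            };

            /* Resolve an incoming connection request while we are connecting. */
            static const char *resolve_race(struct conn_attempt *c)
            {
                    if (!c->connecting)
                            return "accept";

                    if (++c->races <= MAX_CONN_RACE)
                            return "reject: CON RACE";

                    /* We have lost the race MAX_CONN_RACE times in a row; assume
                     * our own attempt is stuck, abandon it and let the peer win. */
                    c->connecting = false;
                    return "accept (local attempt abandoned as stuck)";
            }

            int main(void)
            {
                    struct conn_attempt c = { .connecting = true, .races = 0 };

                    for (int i = 1; i <= MAX_CONN_RACE + 1; i++)
                            printf("connreq %2d: %s\n", i, resolve_race(&c));
                    return 0;
            }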

            morrone Christopher Morrone (Inactive) added a comment:

            What about just starting a timer on the connection message, and aborting the attempt if the timer is exceeded? There isn't anything actually racy about this problem: the connection message never gets a reply, and the one side just sits there waiting forever, right? It should probably time out eventually instead.
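
            A toy user-space sketch of that timer alternative (CONNECT_TIMEOUT, conn_attempt and check_connect_timeout are invented names; the idea is a deadline on the connect attempt rather than a rejection counter):

            #include <stdbool.h>
            #include <stdio.h>
            #include <time.h>

            #define CONNECT_TIMEOUT 20   /* seconds to wait for a connreq reply */

            struct conn_attempt {
                    bool   connecting;   /* connect attempt still outstanding */
                    time_t started;      /* when the connreq was sent         */
            };

            /* Run periodically, e.g. from a connection-daemon check loop. */
            static void check_connect_timeout(struct conn_attempt *c, time_t now)
            {
                    if (c->connecting && difftime(now, c->started) > CONNECT_TIMEOUT) {
                            /* No reply ever arrived: abort this attempt and
                             * retry instead of waiting on it forever. */
                            c->connecting = false;
                            printf("connect attempt timed out after %.0f s\n",
                                   difftime(now, c->started));
                    }
            }

            int main(void)
            {
                    struct conn_attempt c = {
                            .connecting = true,
                            .started    = time(NULL),
                    };

                    /* Pretend CONNECT_TIMEOUT + 1 seconds have gone by. */
                    check_connect_timeout(&c, c.started + CONNECT_TIMEOUT + 1);
                    return 0;
            }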

            morrone Christopher Morrone (Inactive) added a comment:

            Change 17892 landed before Lustre 2.8.0. So, yes, we have that.

            doug Doug Oucharek (Inactive) added a comment:

            My mistake. The first patch, which slows down reconnections on CON RACE, was done under another ticket: LU-7569, patch http://review.whamcloud.com/#/c/17892.

            This ticket was opened as a follow-up to abort what we consider to be a stuck connection. Originally, Liang wanted that to be done via messages (a change to the protocol). Inspectors did not favour changing the protocol for this. So I did a simple counter fix to act as a shield against an infinite looping situation. That is why this ticket has a reverted patch and then patch 19430.

            morrone Christopher Morrone (Inactive) added a comment (edited):

            We don't have either of the patches currently. And which two do you mean? 18037 was landed on master but then was reverted by 18541 before 2.8.0 was tagged because it was faulty. Are you counting that as one of the two? Then there is 19430, which is the current workaround patch. That appears to be the only live patch under way at the moment. Am I missing anything?

            With the MDT and only one message queued for send to that peer, the lnet reconnect rate is much, much slower. It looks like it is pretty much once per second. Here is an excerpt:

            00000800:00000100:8.0:1466735508.628339:0:434:0:(o2iblnd_cb.c:2621:kiblnd_check_reconnect()) 172.19.1.130@o2ib100: reconnect (conn race), 12, 12, msg_size: 4096, queue_depth: 8/8, max_frags: 256/256
            00000800:00000100:8.0:1466735509.628793:0:434:0:(o2iblnd_cb.c:2621:kiblnd_check_reconnect()) 172.19.1.130@o2ib100: reconnect (conn race), 12, 12, msg_size: 4096, queue_depth: 8/8, max_frags: 256/256
            00000800:00000100:8.0:1466735510.628463:0:434:0:(o2iblnd_cb.c:2621:kiblnd_check_reconnect()) 172.19.1.130@o2ib100: reconnect (conn race), 12, 12, msg_size: 4096, queue_depth: 8/8, max_frags: 256/256
            00000800:00000100:8.0:1466735511.628345:0:434:0:(o2iblnd_cb.c:2621:kiblnd_check_reconnect()) 172.19.1.130@o2ib100: reconnect (conn race), 12, 12, msg_size: 4096, queue_depth: 8/8, max_frags: 256/256
            00000800:00000100:8.0:1466735512.628332:0:434:0:(o2iblnd_cb.c:2621:kiblnd_check_reconnect()) 172.19.1.130@o2ib100: reconnect (conn race), 12, 12, msg_size: 4096, queue_depth: 8/8, max_frags: 256/256
            

            I believe you when you say the router is trying to connect more rapidly, but it looks to me like the rate of reconnect is a factor of load in some way. With the MDT, there is only a single higher-level ptlrpc connect message (I assume) sitting in the queue for that peer. A router under use will probably have a full lnet tx queue and more messages queuing up behind that all the time. Perhaps a reconnect happens on every new message arrival. I didn't look into that yet.

            But OOMs and reconnect rates are somewhat orthogonal to the problem of one node sitting on a lost connect message indefinitely.


            People

              Assignee: doug Doug Oucharek (Inactive)
              Reporter: doug Doug Oucharek (Inactive)
              Votes: 0
              Watchers: 17
