
LU-7646: Infinite CON RACE Condition after rebooting LNet router

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version: Lustre 2.9.0

Description

While investigating/working on the fix for LU-7569, we stumbled on another bug when testing on a customer's system. When an LNet router is rebooted and mlx5-based cards are in use, it is possible for a client's attempt to reconnect to the router to get stuck in a permanent connecting state. When the router comes up and tries to create a connection back to the client, that connection will be rejected as CON RACE. This becomes an infinite loop because the stuck connection is always present on the client, triggering the rejection.

This ticket has been opened to create a fix which complements LU-7569. I appreciate that the mlx5 driver should be fixed to prevent stuck connection attempts, but at the same time we need LNet to be immune to such situations, as the result is pretty severe. We need self-healing code here.
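For illustration only, here is a minimal userspace C sketch (not the actual o2iblnd code; all names here are made up) of why a peer entry stuck in a "connecting" state rejects every incoming connection request as a connection race, with no exit condition:

/* Illustrative sketch only (not the real o2iblnd code): models why a peer
 * entry stuck in a "connecting" state causes every incoming connect request
 * to be rejected as a connection race, with no exit condition. */
#include <stdbool.h>
#include <stdio.h>

struct peer {
    /* Our own outstanding connect attempt to this peer. In the failure
     * scenario described above, the attempt never completes or fails,
     * so this flag never clears. */
    bool active_connect_in_flight;
};

/* Decision taken when a connect request arrives from a peer we are also
 * trying to connect to: while our own attempt is outstanding, reject the
 * incoming one as a race and assume ours will finish (or fail) shortly. */
static const char *handle_incoming_connect(const struct peer *p)
{
    return p->active_connect_in_flight ? "REJECT: CONN RACE" : "ACCEPT";
}

int main(void)
{
    /* Stuck attempt left over from before the router rebooted. */
    struct peer client = { .active_connect_in_flight = true };

    /* The rebooted router retries forever; every attempt gets the same answer. */
    for (int attempt = 1; attempt <= 5; attempt++)
        printf("attempt %d -> %s\n", attempt, handle_incoming_connect(&client));
    return 0;
}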


Activity


Doug Oucharek (Inactive) added a comment:

Do you have an easy-to-reproduce scenario for this infinite CON RACE? The original problem involved a router surrounded by thousands of nodes, with the reboot triggering a mass of reconnections. The probability of getting into this infinite CON RACE is very high, especially if MLX5 is involved.

Christopher Morrone (Inactive) added a comment:

The IB connection operation is hidden in the o2iblnd below the level of LNet credits. It would not negatively affect any of the current guarantees to abort the IB connection operation (not the ptlrpc-level connection operation) and retry.

Yes, waiting for 20 messages that come in at 1-second intervals is essentially a strange way to implement a 20-second timeout. But that would seem to me to be the more complicated solution to understand and maintain in the long run versus an actual timeout.

After all, the current solution basically just goes "oh, you've tried 20 times, sure, you can connect". That is fine in the normal case of resolving a connection race, because asynchronously elsewhere the other racing connection message is expected to get an error and clean up whatever resources were associated with it. But here we already know that is never going to happen, so aren't we leaking resources every time? Couldn't this potentially cause problems on long-running systems?

Doug Oucharek (Inactive) added a comment:

That would mean adding something to LNet it currently does not have: a timeout. LNet depends on two things: 1) that we have a Reliable Connection (RC for IB) and that our own QoS mechanism (credits and peer_credits) saves us from packet drops, and 2) that the layers above LNet will let us know when something has taken too long to happen.

I'm not sure a timer will make this work any better than it does with a counter. Once we bang our head into the CON RACE brick wall 20 times, I think we can be pretty sure that the connecting connection which is in our way is stuck and can be abandoned. I originally had that set to just 2 failures, as I'm pretty sure that would be good enough to declare a connection stuck, but inspectors convinced me to up it to 20. Simple solutions are usually the best approach.
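To make the counter idea concrete, here is a minimal userspace C sketch of the approach described above. It is not the actual patch 19430 code; the threshold of 20 is taken from the discussion, and everything else is hypothetical.

/* Illustrative sketch of the counter approach (not the actual patch 19430):
 * after a fixed number of consecutive conn-race rejections while our own
 * connect attempt has made no progress, declare that attempt stuck and
 * abandon it so the peer's incoming connect can be accepted. */
#include <stdbool.h>
#include <stdio.h>

#define MAX_CONN_RACE 20   /* threshold discussed in the comments above */

struct peer {
    bool active_connect_in_flight;  /* our own (possibly stuck) attempt */
    int  conn_race_count;           /* consecutive CONN RACE rejections sent */
};

static const char *handle_incoming_connect(struct peer *p)
{
    if (!p->active_connect_in_flight)
        return "ACCEPT";

    if (++p->conn_race_count >= MAX_CONN_RACE) {
        /* Our own attempt is presumed stuck: tear it down and take the
         * peer's connection instead of rejecting it forever. */
        p->active_connect_in_flight = false;
        p->conn_race_count = 0;
        return "ABANDON own attempt, ACCEPT peer";
    }
    return "REJECT: CONN RACE";
}

int main(void)
{
    struct peer client = { .active_connect_in_flight = true };

    /* Attempts 1..19 are rejected; attempt 20 abandons the stuck attempt. */
    for (int attempt = 1; attempt <= 21; attempt++)
        printf("attempt %2d -> %s\n", attempt, handle_incoming_connect(&client));
    return 0;
}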

Christopher Morrone (Inactive) added a comment:

What about just starting a timer on the connection message, and aborting the attempt if the timer is exceeded? There isn't anything actually racy about this problem: the connection message never gets a reply, and the one side just sits there waiting forever, right? It should probably time out eventually instead.
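As a sketch of the timer alternative (again hypothetical, not existing LNet code): stamp the connect attempt when it is issued and abort it if it has seen no reply within a deadline. The 50-second value only echoes the old no-progress timeout mentioned elsewhere in this ticket.

/* Illustrative sketch of the timeout alternative (not existing LNet code):
 * record when the connect attempt was issued and abort it if it has seen
 * no reply within a fixed deadline. */
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

#define CONNECT_DEADLINE_SEC 50   /* echoes the old no-progress timeout */

struct connect_attempt {
    bool   in_flight;
    time_t issued_at;    /* when the connect request was sent */
};

/* Called periodically, e.g. from a checker thread: abort the attempt if it
 * has been outstanding longer than the deadline. */
static bool check_connect_timeout(struct connect_attempt *c, time_t now)
{
    if (c->in_flight && difftime(now, c->issued_at) > CONNECT_DEADLINE_SEC) {
        c->in_flight = false;   /* tear it down; a retry or the peer's connect wins */
        return true;
    }
    return false;
}

int main(void)
{
    time_t now = time(NULL);
    struct connect_attempt stuck = { .in_flight = true, .issued_at = now - 60 };

    printf("timed out: %s\n", check_connect_timeout(&stuck, now) ? "yes" : "no");
    return 0;
}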

Christopher Morrone (Inactive) added a comment:

Change 17892 landed before Lustre 2.8.0. So, yes, we have that.

Doug Oucharek (Inactive) added a comment:

My mistake. The first patch, which slows down reconnections on CON RACE, was done under another ticket: LU-7569, patch http://review.whamcloud.com/#/c/17892.

This ticket was opened as a follow-up to abort what we consider to be a stuck connection. Originally, Liang wanted that to be done via messages (a change to the protocol). Inspectors did not favour changing the protocol for this, so I did a simple counter fix to act as a shield against an infinite looping situation. That is why this ticket has a reverted patch and then patch 19430.

Christopher Morrone (Inactive) added a comment (edited):

We don't have either of the patches currently. And which two do you mean? 18037 was landed on master but then was reverted by 18541 before 2.8.0 was tagged because it was faulty. Are you counting that as one of the two? Then there is 19430, which is the current workaround patch. That appears to be the only live patch under way at the moment. Am I missing anything?

With the MDT and only one message queued for send to that peer, the LNet reconnect rate is much, much slower. It looks like it is pretty much once per second. Here is an excerpt:

00000800:00000100:8.0:1466735508.628339:0:434:0:(o2iblnd_cb.c:2621:kiblnd_check_reconnect()) 172.19.1.130@o2ib100: reconnect (conn race), 12, 12, msg_size: 4096, queue_depth: 8/8, max_frags: 256/256
00000800:00000100:8.0:1466735509.628793:0:434:0:(o2iblnd_cb.c:2621:kiblnd_check_reconnect()) 172.19.1.130@o2ib100: reconnect (conn race), 12, 12, msg_size: 4096, queue_depth: 8/8, max_frags: 256/256
00000800:00000100:8.0:1466735510.628463:0:434:0:(o2iblnd_cb.c:2621:kiblnd_check_reconnect()) 172.19.1.130@o2ib100: reconnect (conn race), 12, 12, msg_size: 4096, queue_depth: 8/8, max_frags: 256/256
00000800:00000100:8.0:1466735511.628345:0:434:0:(o2iblnd_cb.c:2621:kiblnd_check_reconnect()) 172.19.1.130@o2ib100: reconnect (conn race), 12, 12, msg_size: 4096, queue_depth: 8/8, max_frags: 256/256
00000800:00000100:8.0:1466735512.628332:0:434:0:(o2iblnd_cb.c:2621:kiblnd_check_reconnect()) 172.19.1.130@o2ib100: reconnect (conn race), 12, 12, msg_size: 4096, queue_depth: 8/8, max_frags: 256/256

I believe you when you say the router is trying to connect more rapidly, but it looks to me like the rate of reconnect is a factor of load in some way. With the MDT, there is only a single higher-level ptlrpc connect message (I assume) sitting in the queue for that peer. A router under use will probably have a full LNet tx queue and more messages queuing up behind that all the time. Perhaps a reconnect happens on every new message arrival. I haven't looked into that yet.

But OOMs and reconnect rates are somewhat orthogonal to the problem of one node sitting on a lost connect message indefinitely.

Doug Oucharek (Inactive) added a comment:

This ticket has two patches to it. Is it possible your system has the first and not the second? The first patch slows down the rate of reconnections so we have time to clean up resources, thereby preventing the OOM. The second patch, 19430, addresses the fact that we can't seem to ever get out of the infinite loop of reconnections.

If you are missing the first patch, then you should be seeing hundreds or even thousands of reconnect attempts per second: a rate too fast for the connd daemon to clean up resources. OOM happens in seconds.
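For illustration of the throttling idea described above (not the actual LU-7569 patch; the 1-second interval is an assumption), here is a minimal sketch that enforces a minimum interval between reconnect attempts to the same peer so cleanup can keep pace:

/* Illustrative sketch only: enforce a minimum interval between reconnect
 * attempts to the same peer so that failed-connection cleanup can keep
 * pace and memory does not pile up. Not the actual LU-7569 patch. */
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

#define MIN_RECONNECT_INTERVAL_SEC 1   /* assumed value, for illustration */

struct peer {
    time_t last_reconnect;   /* 0 if no attempt has been made yet */
};

static bool may_reconnect(struct peer *p, time_t now)
{
    if (p->last_reconnect != 0 &&
        difftime(now, p->last_reconnect) < MIN_RECONNECT_INTERVAL_SEC)
        return false;        /* too soon: defer this attempt */
    p->last_reconnect = now;
    return true;
}

int main(void)
{
    struct peer router = { 0 };
    time_t now = time(NULL);

    /* Three attempts within the same second: only the first is allowed. */
    for (int i = 1; i <= 3; i++)
        printf("attempt %d allowed: %s\n", i,
               may_reconnect(&router, now) ? "yes" : "no");
    return 0;
}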

Christopher Morrone (Inactive) added a comment:

Connecting state in higher-level services like the OSP just means that the connect RPC has been sent down to LNet, and the higher levels are waiting for something to happen, right? It doesn't say much at all about the state of the LND connections. You can be connected at the LND level and not be connected at the ptlrpc level. For instance, an lctl ping would create an LND connection without any higher-level services showing connected to services on that node.

The reconnects for us are happening more slowly. I didn't look too closely at the times, but probably just a few a second. There was no OOM after hours sitting there. It is not clear why an OOM would be a likely side effect of this condition. The node attempting the connection gets an error code back and should clean up memory just fine and try again.

Maybe there is something wrong in the LNet router code that is allowing an OOM under that situation? Or the LNet buffer settings are too large on the router nodes you have?

The more I think about it, the more it seems like the OOM should be treated as an additional, separate bug.

Doug Oucharek (Inactive) added a comment:

I'm not sure if the connection is timing out. When investigating this, I know that the active connection was in a permanent "connecting" state (I believe this is associated with one side having been rebooted and the other not). In just a few seconds (far less than the 50-second timeout), we ended up in an OOM situation. A high rate of reconnections can quickly use up memory resources: failed connections are cleaned up via a zombie list and a background process, so they are created at a much faster rate than they are cleaned up.

Restricting the reconnections to a specific number and then aborting the connection we consider stuck is in lieu of using time to time out the stuck connection. The logic goes like this: if both sides are able to participate in rejecting the CON RACE connection multiple times, then there is no reason the other connection should not complete unless it is somehow stuck. Assuming it is stuck, we need to abandon it and let the racing connection succeed so we can get on with things.
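To illustrate why a high rejection rate overwhelms deferred cleanup (all numbers below are made up, purely to show the shape of the problem): if failed connections are queued on a zombie list and freed by a single background thread, the backlog grows whenever the failure rate exceeds the cleanup rate.

/* Back-of-the-envelope illustration only: with deferred cleanup, the zombie
 * backlog grows by (failure rate - cleanup rate) per second. All numbers
 * here are hypothetical. */
#include <stdio.h>

int main(void)
{
    const double fail_per_sec    = 1000.0;       /* unthrottled reconnect failures */
    const double cleanup_per_sec = 50.0;         /* what the background thread frees */
    const double conn_bytes      = 32.0 * 1024;  /* rough per-connection footprint */

    double backlog = 0.0;
    for (int sec = 1; sec <= 10; sec++) {
        backlog += fail_per_sec - cleanup_per_sec;
        printf("after %2ds: %6.0f zombie conns, ~%5.1f MiB pinned\n",
               sec, backlog, backlog * conn_bytes / (1024.0 * 1024.0));
    }
    return 0;
}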

Christopher Morrone (Inactive) added a comment:

I think that we are seeing this in testing 2.8 between an MDT and an OST. The MDT node is freshly booted, and tries to connect to the OST over and over again. The MDT has the lower NID. The OST thinks it has a connection outstanding.

In change 19430 you are aborting the connection when the other side connects 20 times. That seems a little odd to me. Why isn't the higher NID timing out on its connection attempt at some point? Wouldn't it make more sense to time out and abort the connection attempt at some point? LNet used to abort and tear down the connection after 50 seconds with no progress. Why isn't that happening here?

People

    Assignee: Doug Oucharek (Inactive)
    Reporter: Doug Oucharek (Inactive)
    Votes: 0
    Watchers: 17
