
LU-7646: Infinite CON RACE Condition after rebooting LNet router

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.9.0
    • Labels: None
    • Severity: 3

    Description

      While investigating/working on the fix for LU-7569, we stumbled on another bug when testing on a customer's system. When an LNet router is rebooted and mlx5-based cards are being used, it is possible for a client's attempt to reconnect to the router to get stuck in a permanent connecting state. When the router comes up and tries to create a connection back to the client, that connection is rejected as CON RACE. This becomes an infinite loop because the stuck connection is always present on the client, triggering the rejection every time.

      This ticket has been opened to create a fix which complements LU-7569. I appreciate that the mlx5 driver should be fixed to prevent stuck connection attempts, but at the same time, we need LNet to be immune to such situations as the result is pretty severe. We need self-healing code here.
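
      To make the failure mode concrete, below is a minimal standalone C sketch of that loop. It is illustrative only: the structure, field names and NID values are invented and this is not the o2iblnd code. The tie-break shown (the higher NID wins while it believes it has a connect outstanding) follows the behaviour described in the comments on this ticket; the sketch simply shows why the rejections never stop while the stuck attempt keeps the client's "connecting" state set.

      /* Illustrative only; not o2iblnd code. Build: cc conrace.c -o conrace */
      #include <stdio.h>
      #include <stdbool.h>

      struct node {
              unsigned long nid;      /* node identifier; higher NID wins the race */
              bool connecting;        /* an active connect attempt is outstanding */
              bool connect_stuck;     /* simulates the attempt that never completes */
      };

      /* What the client does when the rebooted router (lower NID) asks to connect. */
      static bool accept_incoming(struct node *self, unsigned long from_nid)
      {
              if (self->connecting && from_nid < self->nid) {
                      printf("reject connect from %lu: CON RACE\n", from_nid);
                      return false;
              }
              printf("accept connect from %lu\n", from_nid);
              return true;
      }

      int main(void)
      {
              /* Hypothetical NIDs: client = 200, router = 100. */
              struct node client = { .nid = 200, .connecting = true, .connect_stuck = true };
              int attempt;

              /* The router retries forever; because the client's own attempt is stuck,
               * "connecting" never clears and every retry is rejected. */
              for (attempt = 1; attempt <= 5; attempt++) {
                      if (!client.connect_stuck)
                              client.connecting = false;
                      accept_incoming(&client, 100);
              }
              return 0;
      }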

          Activity


            What about just starting a timer on the connection message and aborting the attempt if the timer is exceeded? There isn't anything actually racy about this problem: the connection message never gets a reply, and one side just sits there waiting forever, right? It should probably time out eventually instead.

            morrone Christopher Morrone (Inactive) added a comment
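
            A minimal standalone C sketch of that idea (hypothetical names; not the o2iblnd implementation): stamp the attempt when the connect message is sent and abandon it once a deadline passes with no reply, so the racing incoming connection can be accepted instead. The 50-second value only echoes the LNet timeout figure mentioned later in the thread.

            /* Illustrative only; not o2iblnd code. */
            #include <stdio.h>
            #include <stdbool.h>
            #include <time.h>

            #define CONNECT_TIMEOUT_SECS 50   /* hypothetical deadline */

            struct conn_attempt {
                    time_t started;           /* when the connect message was sent */
                    bool in_progress;
            };

            /* Abandon an attempt that has made no progress within the deadline. */
            static bool check_connect_timeout(struct conn_attempt *c, time_t now)
            {
                    if (c->in_progress && now - c->started >= CONNECT_TIMEOUT_SECS) {
                            c->in_progress = false;
                            printf("connect attempt aborted after %lds with no reply\n",
                                   (long)(now - c->started));
                            return true;
                    }
                    return false;
            }

            int main(void)
            {
                    /* Simulate an attempt issued 60 seconds ago and never answered. */
                    struct conn_attempt stuck = { .started = time(NULL) - 60, .in_progress = true };

                    check_connect_timeout(&stuck, time(NULL));
                    return 0;
            }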

            Change 17892 landed before Lustre 2.8.0. So, yes, we have that.

            morrone Christopher Morrone (Inactive) added a comment

            My mistake. The first patch, which slows down reconnections on CON RACE, was done under another ticket: LU-7569, patch http://review.whamcloud.com/#/c/17892.

            This ticket was opened as a follow-up to abort what we consider to be a stuck connection. Originally, Liang wanted that to be done via messages (a change to the protocol). Inspectors did not favour changing the protocol for this. So I did a simple counter fix to act as a shield against an infinite looping situation. That is why this ticket has a reverted patch and then patch 19430.

            doug Doug Oucharek (Inactive) added a comment

            We don't have either of the patches currently. And which two do you mean? 18037 landed on master but was then reverted by 18541 before 2.8.0 was tagged because it was faulty. Are you counting that as one of the two? Then there is 19430, which is the current workaround patch. That appears to be the only live patch under way at the moment. Am I missing anything?

            With the MDT and only one message queued for send to that peer, the lnet reconnect rate is much, much slower. It looks like it is pretty much once per second. Here is an excerpt:

            00000800:00000100:8.0:1466735508.628339:0:434:0:(o2iblnd_cb.c:2621:kiblnd_check_reconnect()) 172.19.1.130@o2ib100: reconnect (conn race), 12, 12, msg_size: 4096, queue_depth: 8/8, max_frags: 256/256
            00000800:00000100:8.0:1466735509.628793:0:434:0:(o2iblnd_cb.c:2621:kiblnd_check_reconnect()) 172.19.1.130@o2ib100: reconnect (conn race), 12, 12, msg_size: 4096, queue_depth: 8/8, max_frags: 256/256
            00000800:00000100:8.0:1466735510.628463:0:434:0:(o2iblnd_cb.c:2621:kiblnd_check_reconnect()) 172.19.1.130@o2ib100: reconnect (conn race), 12, 12, msg_size: 4096, queue_depth: 8/8, max_frags: 256/256
            00000800:00000100:8.0:1466735511.628345:0:434:0:(o2iblnd_cb.c:2621:kiblnd_check_reconnect()) 172.19.1.130@o2ib100: reconnect (conn race), 12, 12, msg_size: 4096, queue_depth: 8/8, max_frags: 256/256
            00000800:00000100:8.0:1466735512.628332:0:434:0:(o2iblnd_cb.c:2621:kiblnd_check_reconnect()) 172.19.1.130@o2ib100: reconnect (conn race), 12, 12, msg_size: 4096, queue_depth: 8/8, max_frags: 256/256
            

            I believe you when you say the router is trying to connect more rapidly, but it looks to me like the rate of reconnects depends on load in some way. With the MDT, there is only a single higher-level ptlrpc connect message (I assume) sitting in the queue for that peer. A router under use will probably have a full lnet tx queue and more messages queuing up behind that all the time. Perhaps a reconnect happens on every new message arrival. I didn't look into that yet.

            But OOMs and reconnect rates are somewhat orthogonal to the problem of one node sitting on a lost connect message indefinitely.

            morrone Christopher Morrone (Inactive) added a comment - edited

            This ticket has two patches on it. Is it possible your system has the first and not the second? The first patch slows down the rate of reconnections so we have time to clean up resources, thereby preventing the OOM. The second patch, 19430, addresses the fact that we can't seem to ever get out of the infinite loop of reconnections.

            If you are missing the first patch, then you should be seeing hundreds or even thousands of reconnect attempts per second, a rate too fast for the connd daemon to clean up resources. OOM happens in seconds.

            doug Doug Oucharek (Inactive) added a comment
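
            A minimal standalone C sketch of that rate-limiting idea (hypothetical names; not the actual change 17892 code): refuse to issue another connect until a minimum interval has elapsed, which gives the cleanup thread time to drain the resources left behind by failed attempts. The one-second interval only mirrors the roughly once-per-second rate visible in the log excerpt above.

            /* Illustrative only; not o2iblnd code. */
            #include <stdio.h>
            #include <stdbool.h>
            #include <time.h>

            #define MIN_RECONNECT_INTERVAL 1   /* seconds between attempts (hypothetical) */

            struct reconnect_state {
                    time_t last_attempt;
            };

            /* Gate reconnect attempts so cleanup can keep pace with failures. */
            static bool reconnect_allowed(struct reconnect_state *rs, time_t now)
            {
                    if (now - rs->last_attempt < MIN_RECONNECT_INTERVAL)
                            return false;
                    rs->last_attempt = now;
                    return true;
            }

            int main(void)
            {
                    struct reconnect_state rs = { .last_attempt = 0 };
                    time_t now = time(NULL);

                    printf("first attempt allowed: %d\n", reconnect_allowed(&rs, now));
                    printf("immediate retry allowed: %d\n", reconnect_allowed(&rs, now));
                    return 0;
            }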

            Connecting state in higher-level services like the OSP just means that the connect RPC has been sent down to lnet, and the higher levels are waiting for something to happen, right? It doesn't say much at all about the state of the LND connections. You can be connected at the LND level and not be connected at the ptlrpc level. For instance, an lctl ping would create an LND connection without any higher-level services showing connected to services on that node.

            The reconnects for us are happening slower. I didn't look too closely at the times, but probably just a few a second. There was no OOM after hours sitting there. It is not clear why an OOM would be a likely side effect of this condition. The node attempting the connection gets an error code back and should clean up memory just fine and try again.

            Maybe there is something wrong in lnet router code that is allowing an OOM under that situation? Or lnet buffer settings are too large on the router nodes you have?

            The more that I think about it, the more it seems like the OOM should be treated as an additional separate bug.

            morrone Christopher Morrone (Inactive) added a comment

            Not sure if the connection is timing out. When investigating this, I know that the active connection was in a permanent "connecting" state (I believe this is associated with one side having been rebooted and the other not). In just a few seconds (far less than the 50 second timeout), we ended up in an OOM situation. A high rate of reconnections can quickly use up memory resources because failed connections are cleaned up via a zombie list and a background process, so they are created at a much faster rate than they are cleaned up.

            Restricting the reconnections to a specific number and then aborting the connection we consider stuck is in lieu of using time to time out the stuck connection. The logic goes like this: if both sides are able to participate in rejecting the CON RACE connection multiple times, then there is no reason the other connection should not complete unless it is somehow stuck. Assuming it is stuck, we need to abandon it and let the racing connection succeed so we can get on with things.

            doug Doug Oucharek (Inactive) added a comment
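
            A minimal standalone C sketch of that counter-based shield (hypothetical names and limit; not the patch 19430 code): count the CON RACE rejections issued for a peer and, once the count passes a limit, treat the local attempt as stuck, abandon it, and let the racing connection win. The limit of 20 only echoes the figure discussed in the comment below.

            /* Illustrative only; not o2iblnd code. */
            #include <stdio.h>
            #include <stdbool.h>

            #define MAX_CONN_RACES 20   /* hypothetical limit */

            struct race_state {
                    int races;          /* CON RACE rejections issued so far */
                    bool connecting;    /* a local (possibly stuck) connect is outstanding */
            };

            /* Returns true when the incoming (racing) connection should be accepted. */
            static bool handle_incoming(struct race_state *rs)
            {
                    if (rs->connecting && rs->races < MAX_CONN_RACES) {
                            rs->races++;
                            printf("reject as CON RACE (%d/%d)\n", rs->races, MAX_CONN_RACES);
                            return false;
                    }
                    /* Either no local attempt is pending, or it is presumed stuck: abandon it. */
                    rs->connecting = false;
                    rs->races = 0;
                    printf("stuck attempt abandoned; accepting the racing connection\n");
                    return true;
            }

            int main(void)
            {
                    struct race_state rs = { .races = 0, .connecting = true };
                    int i;

                    for (i = 0; i <= MAX_CONN_RACES; i++)
                            handle_incoming(&rs);
                    return 0;
            }

            Unlike a pure timer, the counter only advances when the remote side actually manages to reach us, which is the "both sides are able to participate" condition described above.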

            I think that we are seeing this in testing 2.8 between an MDT and an OST. The MDT node is freshly booted and tries to connect to the OST over and over again. The MDT has the lower NID. The OST thinks it has a connection outstanding.

            In change 19430 you are aborting the connection when the other side connects 20 times. That seems a little odd to me. Why isn't the higher NID timing out on its connection attempt at some point? Wouldn't it make more sense to time out and abort the connection attempt at some point? LNET used to abort and tear down the connection after 50 seconds with no progress. Why isn't that happening here?

            morrone Christopher Morrone (Inactive) added a comment

            Doug Oucharek (doug.s.oucharek@intel.com) uploaded a new patch: http://review.whamcloud.com/19430
            Subject: LU-7646 o2iblnd: connrace protocol improvement
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 246d7596a09ad2972b04cafd3f4c06bab1e46657

            gerrit Gerrit Updater added a comment

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/18541/
            Subject: Revert "LU-7646 o2iblnd: connrace protocol improvement"
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 8b2753109a7f284b4aea96cda0207828a0c68515

            gerrit Gerrit Updater added a comment

            Confirmed that this is causing a lot of problems if you ping yourself. Reverted the patch (as you see above). Will figure out how to fix the patch.

            doug Doug Oucharek (Inactive) added a comment

            People

              Assignee: doug Doug Oucharek (Inactive)
              Reporter: doug Doug Oucharek (Inactive)
              Votes: 0
              Watchers: 17
