LU-7646: Infinite CON RACE Condition after rebooting LNet router

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.9.0

    Description

      While investigating/working on the fix for LU-7569, we stumbled on another bug when testing on a customer's system. When an LNet router is rebooted and mlx5-based cards are being used, it is possible for a client's attempt to reconnect to the router to get stuck in a permanent connecting state. When the router comes back up and tries to create a connection back to the client, that connection is rejected as CON RACE. This becomes an infinite loop because the stuck connection attempt is always present on the client, triggering the rejection.

      This ticket has been opened to create a fix which complements LU-7569. I appreciate that the mlx5 driver should be fixed to prevent stuck connection attempts, but at the same time, we need LNet to be immune to such situations as the result is pretty severe. We need self-healing code here.
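
      For reference, the rejection loop described above plays out in kiblnd_passive_connect() in o2iblnd_cb.c. A heavily simplified sketch of the decision (locking, version negotiation, and the exact NID tie-break are omitted; kiblnd_reject_with() is a hypothetical stand-in for the real rejection path, and the return value is illustrative):

      /* Simplified sketch, not the verbatim o2iblnd code: a connection
       * request arrives from a peer we are already actively connecting to. */
      if (peer2 != NULL && peer2->ibp_connecting != 0) {
              /* We expect our own active attempt to win the race.  If that
               * attempt is stuck (the mlx5 case), it never completes, the
               * peer keeps retrying, and every retry is rejected again. */
              kiblnd_reject_with(cmid, IBLND_REJECT_CONN_RACE); /* hypothetical helper */
              return -EALREADY;                                 /* illustrative */
      }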

          Activity

            pjones Peter Jones added a comment -

            Landed for 2.9


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/19430/
            Subject: LU-7646 lnet: Stop Infinite CON RACE Condition
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 94f757bf67d58694201b2434f7879974c7abd622


            morrone Christopher Morrone (Inactive) added a comment -

            It's back!

            [ 8866.277966] LNet: 509:0:(o2iblnd_cb.c:2418:kiblnd_passive_connect()) LU-7646 connection attempt to 172.19.1.52@o2ib100 probably stuck (state=2)
            

            state=2 is almost certainly RDMA_CM_EVENT_ROUTE_RESOLVED (if the compiler is not doing something strange with enums). That would imply that we entered "case RDMA_CM_EVENT_ROUTE_RESOLVED".

            And I don't see any error messages from that section. That would seem to imply that we called rdma_connect successfully and never received an event after that.
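
            For reference, the first entries of enum rdma_cm_event_type from the upstream <rdma/rdma_cm.h> (later entries omitted; these leading values have not changed across the kernel versions in question):

            enum rdma_cm_event_type {
                    RDMA_CM_EVENT_ADDR_RESOLVED,    /* 0 */
                    RDMA_CM_EVENT_ADDR_ERROR,       /* 1 */
                    RDMA_CM_EVENT_ROUTE_RESOLVED,   /* 2  <-- "state=2" above */
                    RDMA_CM_EVENT_ROUTE_ERROR,      /* 3 */
                    RDMA_CM_EVENT_CONNECT_REQUEST,  /* 4 */
                    ...
            };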


            morrone Christopher Morrone (Inactive) added a comment -

            Annnnd, now I can't get the bug to hit at all. Sigh.

            Well, I've got some debugging in place that might give us some insight into the active connection attempt state...but only if there is a single active connection attempt at a time. If there are multiple, they'll trample the same state variable that I added to lnet_peer_t.
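
            Roughly, the debugging looks like this (a sketch only: ibp_dbg_state is a hypothetical field name and the real debug patch may differ; it does illustrate why concurrent attempts trample each other):

            /* In the active-side CM event handler (kiblnd_cm_callback() in
             * o2iblnd_cb.c), remember the last event seen for this peer.
             * One field per peer is the limitation mentioned above. */
            peer->ibp_dbg_state = event->event;     /* hypothetical debug field */

            /* In kiblnd_passive_connect(), when rejecting as CON RACE,
             * report how far the stuck active attempt got: */
            CWARN("LU-7646 connection attempt to %s probably stuck (state=%d)\n",
                  libcfs_nid2str(peer2->ibp_nid), peer2->ibp_dbg_state);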


            morrone Christopher Morrone (Inactive) added a comment - edited

            I suspect that a timeout counts as an error and those events are supposed to be generated.

            It is unfortunate that there is no easily accessible structure that can give us the current state of the active connection attempt when it is stuck.

            doug Doug Oucharek (Inactive) added a comment - edited

            It is interesting that both rdma_resolve_addr() and rdma_resolve_route() have a timeout parameter. We pass in a default of 50 secs. The man pages do not say what happens when a timeout occurs. Do we get an RDMA_CM_EVENT_ADDR_ERROR or RDMA_CM_EVENT_ROUTE_ERROR CM event? If not, then we could be missing out on a timeout which would allow us to do what you have indicated above.

            Where can I find some "real" OFED documentation?
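
            For reference, the kernel prototypes from <rdma/rdma_cm.h> (the user-space librdmacm versions have the same shape; the timeout is in milliseconds, so the 50 sec default mentioned above is passed as 50000):

            int rdma_resolve_addr(struct rdma_cm_id *id, struct sockaddr *src_addr,
                                  struct sockaddr *dst_addr, int timeout_ms);
            int rdma_resolve_route(struct rdma_cm_id *id, int timeout_ms);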


            morrone Christopher Morrone (Inactive) added a comment -

            Since we are seeing the stuck side rejecting connections indefinitely because of reason IBLND_REJECT_CONN_RACE, we know that ibp_connecting is non-zero. There are only two places that make that happen, kiblnd_launch_tx() and kiblnd_reconnect_peer(), both before calling kiblnd_connect_peer().

            Since this problem tends to happen when the node with the lower NID is rebooted, it seems likely we are mostly concerned with the kiblnd_reconnect_peer() path.

            kiblnd_connect_peer() likely didn't fail, because if it did it would have called kiblnd_peer_connect_failed(), which in turn decrements ibp_connecting, so we wouldn't be stuck in a single connection attempt.

            Hmmm.

            Of course, kiblnd_connect_peer() really just starts an asynchronous connection process. It kicks off the address resolution, and then we have to wait for the RDMA_CM_EVENT_ADDR_RESOLVED callback. That in turn starts route resolution, and we wait for RDMA_CM_EVENT_ROUTE_RESOLVED. That callback is where we call kiblnd_active_connect().

            And so on.
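
            Condensed, the active-side flow just described looks like this (a sketch of the switch in kiblnd_cm_callback() in o2iblnd_cb.c; error handling, locking, and reference counting omitted):

            switch (event->event) {
            case RDMA_CM_EVENT_ADDR_RESOLVED:
                    /* address resolved: kick off route resolution, then wait */
                    rc = rdma_resolve_route(cmid, timeout_ms);
                    break;

            case RDMA_CM_EVENT_ROUTE_RESOLVED:
                    /* route resolved: kiblnd_active_connect() builds the conn
                     * and calls rdma_connect(), then we wait again */
                    rc = kiblnd_active_connect(cmid);
                    break;

            case RDMA_CM_EVENT_ESTABLISHED:
                    /* the peer accepted; the connection is now live */
                    ...
            }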

            I think it would be good to know exactly which phase the stuck connect is in. I can work on a debug patch to reveal that tomorrow.

            It is not clear to me why a better work-around wouldn't be to augment kiblnd_check_conns() to check for and time out active connection attempts. We already put an arbitrary time limit on any LND-queued tx messages; if any messages have timed out, we explicitly close the connection and reconnect.

            It would seem like we could add logic there to watch for active connection attempts that have taken too long, abort them, and restart the connection attempt, as sketched below.
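
            A sketch of that idea (ibp_connect_deadline and kiblnd_abort_connect() are hypothetical names, not existing o2iblnd symbols; locking omitted): stamp a deadline on the peer when kiblnd_connect_peer() starts an attempt, then let the periodic connd pass give up on attempts that outlive it so a fresh attempt can be launched.

            /* Run from (or alongside) kiblnd_check_conns() on each pass of
             * the connd thread.  Sketch only. */
            list_for_each_entry(peer, &kiblnd_data.kib_peers[i], ibp_list) {
                    if (peer->ibp_connecting == 0)
                            continue;               /* no active attempt in flight */

                    if (cfs_time_aftereq(cfs_time_current(),
                                         peer->ibp_connect_deadline)) {
                            /* Tear down the stuck attempt (destroy the cm_id,
                             * drop ibp_connecting) so a new one can start. */
                            kiblnd_abort_connect(peer);     /* hypothetical helper */
                    }
            }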


            morrone Christopher Morrone (Inactive) added a comment -

            Oh, and as for not having a system to test it on...now you do! If you've got debug patches and things to investigate, we can facilitate that on our testbed.


            morrone Christopher Morrone (Inactive) added a comment -

            I'm all for starting new tickets for separate problems. But the connection jam is exactly the problem being dealt with in this ticket. Why would we start a new one?


            doug Doug Oucharek (Inactive) added a comment -

            Interesting. I had hypothesised that this issue is either caused by, or exacerbated by, MLX5. We had never seen this until some clusters started using MLX5. I suspect the connection jam is MLX5-related.

            Sadly, I have no access to MLX5 so cannot dig into the nature of the connection lock up. The current patch, though not perfect, allows systems to move forward and work even if there is a potential of a "leaked" connection structure or two.

            I think the connection jam should be a new Jira ticket. We need to get Mellanox involved to help understand the MLX5-specific change which is triggering this.


            People

              Assignee: Doug Oucharek (Inactive)
              Reporter: Doug Oucharek (Inactive)
              Votes: 0
              Watchers: 17
