Details
- Bug
- Resolution: Fixed
- Blocker
- None
- 3
Description
While working on the fix for LU-7569, we found another bug during testing on a customer's system. When an LNet router is rebooted and mlx5-based cards are in use, a client's attempt to reconnect to the router can get stuck in a permanent connecting state. When the router comes back up and tries to create a connection to the client, that connection is rejected as CON RACE. This becomes an infinite loop: the stuck connection is always present on the client, so every connection attempt from the router triggers the rejection.
This ticket has been opened to create a fix that complements LU-7569. The mlx5 driver should be fixed to prevent stuck connection attempts, but LNet also needs to be immune to such situations because the result is severe. We need self-healing code here.
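One way to picture the self-healing behaviour asked for above is to bound how many consecutive times an incoming connection may be rejected as a race. The sketch below is illustrative only and is not the LNet code or the actual patch: `struct peer_state`, `handle_incoming()`, and the `MAX_RACE_REJECTS` threshold are hypothetical names and values, and real race resolution involves more than a single "connecting" counter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical threshold; the value is an assumption, not taken from LNet. */
#define MAX_RACE_REJECTS 20

/* Hypothetical per-peer bookkeeping on the client side. */
struct peer_state {
    int connecting;    /* outbound connection attempts still in progress */
    int race_rejects;  /* consecutive inbound requests rejected as a race */
};

enum conn_decision { CONN_ACCEPT, CONN_REJECT_RACE };

/*
 * Decide what to do with an inbound connection request from a peer.
 * While an outbound attempt is pending, the inbound request is rejected
 * as a race, but only up to MAX_RACE_REJECTS times in a row. After that
 * the pending attempt is presumed stuck and the inbound request is
 * accepted, which breaks the otherwise infinite reject loop.
 */
static enum conn_decision handle_incoming(struct peer_state *p)
{
    assert(p != NULL);

    if (p->connecting > 0 && p->race_rejects < MAX_RACE_REJECTS) {
        p->race_rejects++;
        return CONN_REJECT_RACE;
    }

    /* No pending attempt, or it is presumed stuck: accept and reset. */
    p->race_rejects = 0;
    return CONN_ACCEPT;
}
```

With this bound, a client holding a stuck outbound attempt rejects the router's first `MAX_RACE_REJECTS` requests and accepts the next one. Cleaning up the stuck attempt itself is left out of the sketch.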
Issue Links
- is related to
  - LU-7569 IB leaf switch caused LNet routers to crash (Resolved)