  Lustre / LU-7646

Infinite CON RACE Condition after rebooting LNet router

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.9.0
    • Labels: None
    • Severity: 3

    Description

      While investigating/working on the fix for LU-7569, we stumbled on another bug when testing on a customer's system. When an LNet router is rebooted and mlx5-based cards are being used, it is possible for a client's attempt to reconnect to the router to get stuck in a permanent connecting state. When the router comes back up and tries to create a connection to the client, that connection is rejected as CON RACE. This becomes an infinite loop because the stuck connection attempt is always present on the client, triggering the rejection.

      This ticket has been opened to create a fix which complements LU-7569. I appreciate that the mlx5 driver should be fixed to prevent stuck connection attempts, but at the same time, we need LNet to be immune to such situations, as the result is pretty severe. We need self-healing code here.
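
      To make the failure mode concrete, here is a minimal sketch of the connection-race tie-break on the passive (incoming) side, assuming the o2iblnd behaviour discussed in the comments below. The structure, helper name, and reject value are simplified stand-ins, not the actual source:

      /* Simplified stand-ins; the real definitions live in the o2iblnd sources. */
      #define SKETCH_REJECT_CONN_RACE 3   /* stands in for IBLND_REJECT_CONN_RACE */

      struct sketch_peer {
          unsigned long long ibp_nid;        /* peer NID */
          int                ibp_connecting; /* active connection attempts in flight */
      };

      /* Tie-break sketch: the higher NID wins, so while our own active connect is
       * outstanding we reject the lower-NID peer with CON RACE.  If that active
       * connect is stuck and never completes or fails, ibp_connecting never drops
       * back to 0 and every reconnect from the rebooted router is rejected again. */
      static int sketch_passive_connect(struct sketch_peer *peer,
                                        unsigned long long remote_nid,
                                        unsigned long long local_nid)
      {
          if (peer->ibp_connecting != 0 && remote_nid < local_nid)
              return SKETCH_REJECT_CONN_RACE;  /* the infinite-loop case when stuck */

          return 0;  /* otherwise accept the incoming connection */
      }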

          Activity


            morrone Christopher Morrone (Inactive) added a comment:

            Annnnd, now I can't get the bug to hit at all. Sigh.

            Well, I've got some debugging in place that might give us some insight into the active connection attempt state... but only if there is a single active connection attempt at a time. If there are multiple, they'll trample the same state variable that I added to lnet_peer_t.


            morrone Christopher Morrone (Inactive) added a comment (edited):

            I suspect that a timeout counts as an error and those events are supposed to be generated.

            It is unfortunate that there is no easily accessible structure that can give us the current state of the active connection attempt when it is stuck.

            doug Doug Oucharek (Inactive) added a comment (edited):

            It is interesting that both rdma_resolve_addr() and rdma_resolve_route() have a timeout parameter. We pass in a default of 50 seconds. The man pages do not say what happens when a timeout occurs. Do we get an RDMA_CM_EVENT_ADDR_ERROR or RDMA_CM_EVENT_ROUTE_ERROR CM event? If not, then we could be missing the timeout notification that would allow us to do what you have indicated above.

            Where can I find some "real" OFED documentation?
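
            For reference, a minimal sketch of the event-handler side of the resolution question above, using the kernel rdma_cm API. The event names and function signatures are real; the handler body, return values, and the 50-second value are illustrative assumptions about what the LND could do if a resolution timeout really is delivered as an *_ERROR event:

            #include <linux/errno.h>
            #include <linux/printk.h>
            #include <rdma/rdma_cm.h>

            /* Illustrative only: if an rdma_resolve_addr()/rdma_resolve_route()
             * timeout is reported as ADDR_ERROR/ROUTE_ERROR, this is where the
             * LND would see it and could fail the connection attempt instead of
             * waiting forever. */
            static int sketch_cm_callback(struct rdma_cm_id *cmid,
                                          struct rdma_cm_event *event)
            {
                switch (event->event) {
                case RDMA_CM_EVENT_ADDR_RESOLVED:
                    /* next phase: resolve the route, again with a timeout */
                    return rdma_resolve_route(cmid, 50 * 1000 /* ms */);

                case RDMA_CM_EVENT_ADDR_ERROR:
                case RDMA_CM_EVENT_ROUTE_ERROR:
                    /* event->status carries the failure reason; treat it as a
                     * failed connect so the peer state is cleaned up and the
                     * connection can be retried. */
                    pr_warn("resolution failed: status %d\n", event->status);
                    return -EHOSTUNREACH;

                case RDMA_CM_EVENT_ROUTE_RESOLVED:
                    /* ready to issue the actual connect request */
                    return 0;

                default:
                    return 0;
                }
            }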


            morrone Christopher Morrone (Inactive) added a comment:

            Since we are seeing the stuck side rejecting connections indefinitely because of reason IBLND_REJECT_CONN_RACE, we know that ibp_connecting is non-zero. There are only two places that make that happen, kiblnd_launch_tx() and kiblnd_reconnect_peer(), both before calling kiblnd_connect_peer().

            Since this problem tends to happen when the node with the lower NID is rebooted, it seems likely we are mostly concerned with the kiblnd_reconnect_peer() path.

            kiblnd_connect_peer() likely didn't fail, because if it did it would have called kiblnd_peer_connect_failed(), which in turn decrements ibp_connecting, so we wouldn't be stuck in a single connection attempt.

            Hmmm.

            Of course, kiblnd_connect_peer() really just starts an asynchronous connection process. It kicks off the address resolution, and then we have to wait for the RDMA_CM_EVENT_ADDR_RESOLVED callback. That in turn starts route resolution, and we wait for RDMA_CM_EVENT_ROUTE_RESOLVED. That callback is where we call kiblnd_active_connect().

            And so on.

            I think it would be good to know exactly which phase the stuck connect is in. I can work on a debug patch to reveal that tomorrow.
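
            A debug patch of that shape could look roughly like this; the enum and the ibp_connect_phase field mentioned below are entirely hypothetical, added only for illustration:

            /* Hypothetical debug aid: record which phase the active connect has
             * reached so a later dump can show where it is stuck.  Neither the
             * enum nor the field exists in the current code. */
            enum sketch_connect_phase {
                SKETCH_CONN_IDLE = 0,
                SKETCH_CONN_RESOLVING_ADDR,   /* rdma_resolve_addr() issued       */
                SKETCH_CONN_RESOLVING_ROUTE,  /* RDMA_CM_EVENT_ADDR_RESOLVED seen */
                SKETCH_CONN_CONNECTING,       /* active connect request issued    */
            };

            /* The peer would carry the phase, updated at each step, e.g.:
             *     peer->ibp_connect_phase = SKETCH_CONN_RESOLVING_ADDR;
             * and the value would be printed when the attempt is found stuck. */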

            It is not clear to me why a better work-around wouldn't be to augment kiblnd_check_conns() to check for, and time out, active connection attempts. We already put an arbitrary time limit on queued lnd tx messages; if any messages have timed out, we explicitly close the connection and reconnect.

            It would seem like we could add logic there to watch for active connection attempts that have taken too long, abort them, and restart the connection attempt.
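
            In sketch form, such a check might look like this. The deadline field and the abort/retry helpers are hypothetical placeholders; only kiblnd_check_conns() and ibp_connecting come from the existing code:

            #include <linux/jiffies.h>

            /* Minimal stand-ins; the real peer structure lives in o2iblnd.h and
             * ibp_connect_deadline would be a new field. */
            struct sketch_peer {
                int           ibp_connecting;        /* active connects in flight */
                unsigned long ibp_connect_deadline;  /* jiffies; hypothetical     */
            };

            static void sketch_abort_active_connect(struct sketch_peer *peer)
            {
                /* placeholder: would tear down the rdma_cm_id and drop ibp_connecting */
                (void)peer;
            }

            static void sketch_retry_connect(struct sketch_peer *peer)
            {
                /* placeholder: would kick off a fresh connect attempt */
                (void)peer;
            }

            /* Hypothetical addition to a kiblnd_check_conns()-style scan: if an
             * active connection attempt has outlived its deadline, abort it and
             * start over instead of leaving ibp_connecting set forever. */
            static void sketch_check_stuck_connect(struct sketch_peer *peer)
            {
                if (peer->ibp_connecting != 0 &&
                    time_after(jiffies, peer->ibp_connect_deadline)) {
                    sketch_abort_active_connect(peer);
                    sketch_retry_connect(peer);
                }
            }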


            morrone Christopher Morrone (Inactive) added a comment:

            Oh, and as for not having a system to test it on... now you do! If you've got debug patches and things to investigate, we can facilitate that on our testbed.


            morrone Christopher Morrone (Inactive) added a comment:

            I'm all for starting new tickets for separate problems. But the connection jam is exactly the problem being dealt with in this ticket. Why would we start a new one?


            doug Doug Oucharek (Inactive) added a comment:

            Interesting. I had hypothesised that this issue is either caused by, or aggravated by, MLX5. We had never seen this until some clusters started using MLX5. I suspect the connection jam is MLX5-related.

            Sadly, I have no access to MLX5 so cannot dig into the nature of the connection lock up. The current patch, though not perfect, allows systems to move forward and work even if there is a potential of a "leaked" connection structure or two.

            I think the connection jam should be a new Jira ticket. We need to get Mellanox involved to help understand the MLX5-specific change which is triggering this.


            morrone Christopher Morrone (Inactive) added a comment:

            Yes, we do. In our testbed we have MDS and OSS nodes on the same mlx5 network. The probability of getting into this connection race is very high even without significant clients or load.


            doug Doug Oucharek (Inactive) added a comment:

            Do you have an easy-to-reproduce scenario for this infinite CON RACE? The original problem involved a router surrounded by thousands of nodes; rebooting it triggered a mass of reconnections. The probability of getting into this infinite CON RACE is very high, especially if MLX5 is involved.


            morrone Christopher Morrone (Inactive) added a comment:

            The IB connection operation is hidden in the o2iblnd below the level of lnet credits. It would not negatively affect any of the current guarantees to abort the IB connection operation (not the ptlrpc level connection operation) and retry.

            Yes, waiting for 20 messages that come in at 1-second intervals is essentially a strange way to implement a 20-second timeout. But that seems to me to be the more complicated solution to understand and maintain in the long run versus an actual timeout.

            After all, the current solution basically just goes "oh, you've tried 20 times, sure, you can connect". That is fine in the normal case of resolving a connection race, because asynchronously elsewhere the other racing connection message is expected to get an error and clean up whatever resources were associated with it. But here we already know that is never going to happen, so aren't we leaking resources every time? Couldn't this potentially cause problems on long-running systems?


            doug Doug Oucharek (Inactive) added a comment:

            That would mean adding something to LNet it currently does not have: a timeout. LNet depends on two things: 1) that we have a Reliable Connection (RC for IB) and that our own QoS mechanism (credits and peer_credits) saves us from packet drops, and 2) that the layers above LNet will let us know when something has taken too long to happen.

            I'm not sure a timer will make this work any better than it does with a counter. Once we bang our head into the CON RACE brick wall 20 times, I think we can be pretty sure the connecting connection which is in our way is stuck and can be abandoned. I originally had that set to just 2 failures as I'm pretty sure that would be good enough to declare a connection stuck. But inspectors convinced me to up it to 20. Simple solutions are usually the best approach.
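
            In sketch form, the counter-based approach amounts to something like this. The ibp_races field, the helper name, and the threshold constant are illustrative stand-ins, not the actual patch; only ibp_connecting and the CON RACE rejection come from the existing code:

            #define SKETCH_CONN_RACE_MAX 20   /* illustrative; the 20 discussed above */

            /* Minimal stand-in for the peer state used by the sketch. */
            struct sketch_peer {
                int ibp_connecting;   /* our own active connect still in flight   */
                int ibp_races;        /* hypothetical: CON RACE rejections so far */
            };

            /* Hypothetical: evaluated where the passive connect would normally be
             * rejected with IBLND_REJECT_CONN_RACE.  After enough consecutive
             * rejections we assume our own active connect is stuck, abandon it,
             * and let the incoming connection through. */
            static int sketch_should_reject_conn_race(struct sketch_peer *peer)
            {
                if (peer->ibp_connecting == 0)
                    return 0;                           /* no race at all */

                if (++peer->ibp_races < SKETCH_CONN_RACE_MAX)
                    return 1;                           /* reject: normal race handling */

                /* Too many collisions: declare the stuck active connect abandoned
                 * (possibly leaking its connection state, as noted above) so the
                 * peer can finally reconnect. */
                peer->ibp_connecting = 0;
                peer->ibp_races = 0;
                return 0;                               /* accept the incoming connect */
            }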


            People

              Assignee: doug Doug Oucharek (Inactive)
              Reporter: doug Doug Oucharek (Inactive)
              Votes: 0
              Watchers: 17
