LU-7646

Infinite CON RACE Condition after rebooting LNet router

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.9.0
    • None
    • 3

    Description

      While working on the fix for LU-7569, we found another bug when testing on a customer's system. When an LNet router is rebooted and mlx5-based cards are in use, a client's attempt to reconnect to the router can get stuck in a permanent connecting state. When the router comes up and tries to create a connection back to the client, that connection is rejected as CON RACE. This becomes an infinite loop: the stuck connection attempt never goes away on the client, so every connection request from the router triggers the same rejection.

      This ticket has been opened to create a fix which complements LU-7569. I appreciate that the mlx5 driver should be fixed to prevent stuck connection attempts, but at the same time, we need LNet to be immune to such situations, as the result is pretty severe. We need self-healing code here.
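
      To illustrate the kind of self-healing meant here, below is a minimal sketch in C. It is a simplified model, not the LNet code: the names peer_state, handle_incoming, and MAX_CONN_RACES, and the threshold value of 20, are all assumptions made for illustration, and the real tie-break rules that decide which side wins a connection race are left out. The idea is only to bound how many times an incoming request may be rejected as CON RACE before the local outgoing attempt is presumed stuck and abandoned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical threshold: how many consecutive CON RACE rejections are
 * allowed before the local outgoing attempt is presumed stuck. */
#define MAX_CONN_RACES 20

/* Simplified, hypothetical model of per-peer connection state. */
struct peer_state {
    bool     connecting; /* an outgoing connection attempt is in flight */
    unsigned races;      /* consecutive incoming requests rejected as races */
};

enum conn_verdict { CONN_ACCEPT, CONN_REJECT_RACE };

/* Decide what to do with an incoming connection request from a peer. */
enum conn_verdict handle_incoming(struct peer_state *p)
{
    assert(p != NULL);

    /* No outgoing attempt in flight: nothing to race with. */
    if (!p->connecting) {
        p->races = 0;
        return CONN_ACCEPT;
    }

    /* Normal case: an outgoing attempt exists, so reject as a race,
     * but count the rejection. */
    if (p->races < MAX_CONN_RACES) {
        p->races++;
        return CONN_REJECT_RACE;
    }

    /* Too many races in a row: assume the outgoing attempt is stuck,
     * give up on it, and let the incoming connection through. */
    p->connecting = false;
    p->races = 0;
    return CONN_ACCEPT;
}
```

      With this model, a client whose outgoing attempt never completes rejects the router's first 20 requests and accepts the next one, so the loop terminates instead of running forever. The threshold trades recovery time against the risk of abandoning an attempt that was merely slow.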

            People

              doug Doug Oucharek (Inactive)