  1. Lustre
  2. LU-13569

LNet Health should not recover interfaces indefinitely


    • Type: Improvement
    • Resolution: Fixed
    • Priority: Minor
    • Lustre 2.15.0

      We should not try to recover interfaces, remote or local, indefinitely. A remote interface may never come back (for example, the remote peer rebooted with a new LNet configuration). A local interface may be down for an extended period when, say, a NIC needs to be swapped or a cable replaced.

      Let's define criteria for when we should stop trying to recover an interface.

      • After X recovery attempts?
      • After Y amount of time?
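
A minimal sketch of how the two candidate criteria could be combined, assuming either limit alone is enough to stop recovery. `RecoveryState`, `should_stop_recovery`, `max_attempts`, and `max_age` are hypothetical names chosen for illustration; they are not LNet structures or tunables.

```python
from dataclasses import dataclass


@dataclass
class RecoveryState:
    """Hypothetical per-interface bookkeeping; not an actual LNet structure."""
    attempts: int = 0            # recovery attempts made so far
    first_failure: float = 0.0   # time (seconds) the interface was first marked unhealthy


def should_stop_recovery(state, now, max_attempts=None, max_age=None):
    """Return True once either configured limit has been reached.

    max_attempts plays the role of "X recovery attempts" and max_age the role
    of "Y amount of time" (in seconds). A limit of None disables that check.
    """
    if max_attempts is not None and state.attempts >= max_attempts:
        return True
    if max_age is not None and now - state.first_failure >= max_age:
        return True
    return False
```

With both limits left as None the function never returns True, which corresponds to the current behaviour of recovering indefinitely.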

      We also need to decide how to stop recovery.

      • Flag the ni/lpni so that it doesn't get placed back onto the recovery queue?
      • Delete the ni/lpni outright?
      • Rather than stop completely, use an exponential backoff algorithm so that recovery doesn't add any real load to the system.
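
The exponential backoff option could look roughly like the sketch below: the delay before each recovery attempt doubles until it reaches a ceiling, so a long-dead interface is probed only occasionally. The function names, the 1-second base, and the 900-second cap are assumptions made for illustration, not values taken from LNet.

```python
def next_recovery_delay(attempts, base=1, cap=900):
    """Seconds to wait before the next recovery attempt.

    The delay doubles with every failed attempt (base, 2*base, 4*base, ...)
    and never exceeds cap. base and cap are illustrative defaults only.
    """
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    # Clamp the exponent so very large attempt counts stay cheap to compute.
    return min(cap, base * 2 ** min(attempts, 32))


def recovery_schedule(first_failure, count, base=1, cap=900):
    """Times at which the first `count` recovery attempts would be made."""
    times = []
    t = first_failure
    for attempt in range(count):
        t += next_recovery_delay(attempt, base, cap)
        times.append(t)
    return times
```

Under these assumed defaults, the first four attempts after a failure at time 0 fall at 1, 3, 7, and 15 seconds, and from the eleventh attempt onward the interval stays at the 900-second cap.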

            hornc Chris Horn