Lustre / LU-13569

LNet Health should not recover interfaces indefinitely

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: Lustre 2.15.0
    • Labels:
      None

      Description

      We should not try to recover interfaces, remote or local, indefinitely. A remote interface may never come back (for example, the remote peer rebooted with a new LNet configuration). A local interface may be down for an extended period when, say, a NIC needs to be swapped or a cable replaced.

      Let's define criteria for when we should stop trying to recover an interface.

      • After X recovery attempts?
      • After Y amount of time?
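
      A minimal sketch of how the two criteria above could be combined, assuming a per-interface attempt counter and a timestamp of the first failure. The struct names, field names, and limits are hypothetical and are not taken from the LNet source; X and Y remain open questions in this issue.

      ```c
      #include <stdbool.h>
      #include <stdint.h>

      /* Hypothetical limits; the issue leaves X and Y undecided. */
      struct recovery_limits {
      	uint32_t max_attempts;  /* X: 0 means no attempt limit */
      	uint64_t max_age_secs;  /* Y: 0 means no time limit */
      };

      /* Hypothetical per-interface recovery bookkeeping. */
      struct recovery_state {
      	uint32_t attempts;         /* recovery attempts made so far */
      	uint64_t first_fail_secs;  /* when the interface first went unhealthy */
      };

      /* Return true once either configured limit has been reached. */
      static bool recovery_should_stop(const struct recovery_limits *lim,
      				 const struct recovery_state *st,
      				 uint64_t now_secs)
      {
      	if (lim->max_attempts != 0 && st->attempts >= lim->max_attempts)
      		return true;
      	/* Guard against a clock that reads earlier than the first failure. */
      	if (lim->max_age_secs != 0 && now_secs >= st->first_fail_secs &&
      	    now_secs - st->first_fail_secs >= lim->max_age_secs)
      		return true;
      	return false;
      }
      ```

      Treating a limit of 0 as "disabled" would let either criterion be used on its own, or both be turned off to keep the current behavior.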

      We also need to decide how to stop recovery.

      • Flag the ni/lpni so that it doesn't get placed back onto the recovery queue?
      • Delete the ni/lpni outright?
      • Rather than stop completely, use an exponential backoff algorithm so that recovery doesn't add any real load to the system.
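
      The exponential backoff option can be sketched as below, assuming the delay doubles after each failed attempt and is capped at a maximum. The function name and both constants are hypothetical illustrations, not existing LNet tunables.

      ```c
      #include <stdint.h>

      /* Hypothetical tunables, chosen only for illustration. */
      #define BACKOFF_BASE_SECS 1u
      #define BACKOFF_MAX_SECS  900u

      /*
       * Delay before recovery attempt number `attempt` (0-based):
       * BACKOFF_BASE_SECS * 2^attempt, capped at BACKOFF_MAX_SECS.
       */
      static uint32_t backoff_delay_secs(uint32_t attempt)
      {
      	uint32_t delay = BACKOFF_BASE_SECS;

      	/* Stop doubling once the cap is reached, so delay cannot overflow. */
      	while (attempt-- > 0 && delay < BACKOFF_MAX_SECS)
      		delay *= 2;

      	return delay < BACKOFF_MAX_SECS ? delay : BACKOFF_MAX_SECS;
      }
      ```

      With these example values the delays run 1, 2, 4, 8, ... seconds and settle at 900 seconds, so an interface that never returns costs at most one recovery attempt every 15 minutes.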

            People

            Assignee:
            hornc Chris Horn
            Reporter:
            hornc Chris Horn
