  1. Lustre
  2. LU-13569

LNet Health should not recover interfaces indefinitely


Details

    • Improvement
    • Resolution: Fixed
    • Minor
    • Lustre 2.15.0
    • None
    • None

    Description

      We should not try to recover interfaces, remote or local, indefinitely. A remote interface may never come back (for example, the remote peer rebooted with a new LNet configuration). A local interface may be down for an extended period when, say, a NIC needs to be swapped or a cable replaced.

      Let's define criteria for when we should stop trying to recover an interface.

      • After X recovery attempts?
      • After Y amount of time?
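
      The two candidate criteria above can be combined in a single check. The following is only a minimal sketch under stated assumptions: struct recovery_state, struct recovery_limits, and recovery_expired() are hypothetical names invented for illustration and are not existing LNet structures or functions, and the values of X and Y are left to the caller.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-interface recovery bookkeeping; not an LNet structure. */
struct recovery_state {
	uint32_t attempts;     /* recovery attempts made so far */
	uint64_t first_fail_s; /* time (seconds) the interface first went unhealthy */
};

/* Hypothetical limits standing in for X and Y; 0 disables a limit. */
struct recovery_limits {
	uint32_t max_attempts; /* X: stop after this many attempts */
	uint64_t max_age_s;    /* Y: stop after this many seconds */
};

/* Return true when either enabled limit has been reached. */
static bool recovery_expired(const struct recovery_state *st,
			     const struct recovery_limits *lim,
			     uint64_t now_s)
{
	if (lim->max_attempts != 0 && st->attempts >= lim->max_attempts)
		return true;

	/* Guard against a clock that reads earlier than first_fail_s. */
	if (lim->max_age_s != 0 && now_s >= st->first_fail_s &&
	    now_s - st->first_fail_s >= lim->max_age_s)
		return true;

	return false;
}
```

      Treating 0 as "no limit" keeps today's behavior (recover indefinitely) available as a setting in this sketch.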

      We also need to decide how to stop recovery.

      • Flag the ni/lpni so that it doesn't get placed back onto the recovery queue?
      • Delete the ni/lpni outright?
      • Rather than stop completely, use an exponential backoff algorithm so that recovery doesn't add any real load to the system.
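
      As an illustration of the exponential backoff option, the delay before each recovery attempt could double until it reaches a cap. This is a rough sketch only: recovery_backoff_s(), the 1 second base interval, and the 900 second cap are assumptions chosen for the example, not values or functions taken from LNet.

```c
#include <stdint.h>

/* Hypothetical tunables, chosen only for illustration. */
#define RECOVERY_BASE_INTERVAL_S 1u   /* delay before the first retry */
#define RECOVERY_MAX_INTERVAL_S  900u /* upper bound on the delay */

/*
 * Delay in seconds before the next recovery attempt, given the number of
 * attempts that have already failed. The delay doubles with each failed
 * attempt and saturates at RECOVERY_MAX_INTERVAL_S.
 */
static uint32_t recovery_backoff_s(uint32_t attempts)
{
	uint32_t delay = RECOVERY_BASE_INTERVAL_S;

	while (attempts-- > 0 && delay < RECOVERY_MAX_INTERVAL_S) {
		/* Doubling would exceed the cap, so clamp to it. */
		if (delay > RECOVERY_MAX_INTERVAL_S / 2)
			return RECOVERY_MAX_INTERVAL_S;
		delay *= 2;
	}
	return delay;
}
```

      With these assumed values, an interface that stays down is eventually retried once every 15 minutes, so recovery never stops outright but adds little load.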


            People

              hornc Chris Horn