Details
- Improvement
- Resolution: Fixed
- Minor
Description
We should not try to recover interfaces, remote or local, indefinitely. A remote interface may never come back (for example, the remote peer rebooted with a new LNet configuration). A local interface may be down for an extended period, say, while a NIC is swapped or a cable is replaced.
Let's define criteria for when we should stop trying to recover an interface.
- After X recovery attempts?
- After Y amount of time?
We also need to decide how to stop recovery.
- Flag the ni/lpni so that it doesn't get placed back onto the recovery queue?
- Delete the ni/lpni outright?
- Rather than stop completely, use an exponential backoff algorithm so that recovery doesn't add any real load to the system.
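
To make the options above concrete, here is a minimal sketch in C that combines an attempt limit (X), an age limit (Y), and a capped exponential backoff. It is only an illustration of the idea, not the change that landed for this ticket: the struct, the function names, and all tunable values are hypothetical and do not correspond to existing LNet symbols.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical tunables; none of these names or values come from LNet. */
#define RECOVERY_BASE_INTERVAL_SEC 1U      /* delay before the first retry */
#define RECOVERY_MAX_INTERVAL_SEC  900U    /* cap on the backoff delay */
#define RECOVERY_MAX_ATTEMPTS      20U     /* "X": give up after this many attempts */
#define RECOVERY_MAX_AGE_SEC       86400U  /* "Y": give up after this much time */

/* Hypothetical per-interface (ni/lpni) recovery bookkeeping. */
struct recovery_state {
	uint32_t attempts;       /* failed recovery attempts so far */
	uint64_t first_fail_sec; /* time the interface first became unhealthy */
};

/* Delay before the next attempt: base * 2^attempts, capped at the maximum. */
static uint32_t recovery_next_interval(const struct recovery_state *rs)
{
	uint32_t interval = RECOVERY_BASE_INTERVAL_SEC;
	uint32_t i;

	for (i = 0; i < rs->attempts; i++) {
		/* Doubling again would reach or exceed the cap, so stop here. */
		if (interval >= RECOVERY_MAX_INTERVAL_SEC / 2)
			return RECOVERY_MAX_INTERVAL_SEC;
		interval *= 2;
	}
	return interval;
}

/* True once either the attempt limit or the age limit has been reached. */
static bool recovery_should_give_up(const struct recovery_state *rs,
				    uint64_t now_sec)
{
	if (rs->attempts >= RECOVERY_MAX_ATTEMPTS)
		return true;
	return now_sec >= rs->first_fail_sec &&
	       now_sec - rs->first_fail_sec >= RECOVERY_MAX_AGE_SEC;
}
```

With these assumed values the delay doubles from 1 second up to the 900 second cap, so a long-dead interface costs at most one recovery attempt every 15 minutes until one of the two limits is reached. Whether an interface that hits a limit is then flagged or deleted is the separate question raised above.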
Issue Links
- is duplicated by
  - LU-13572 LNet Health should only attempt recovery of remote NIs for which it has successfully communicated with (Closed)
Landed for 2.15