Affects Version/s: None
Fix Version/s: Lustre 2.15.0
We should not keep trying to recover interfaces, remote or local, indefinitely. A remote interface may never come back (for example, the remote peer rebooted with a new LNet configuration). A local interface may be down for an extended period when, say, a NIC needs to be swapped or a cable replaced.
Let's define criteria for when we should stop trying to recover an interface.
- After X recovery attempts?
- After Y amount of time?
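
To make the two candidate criteria concrete, here is a minimal sketch of a check that combines them. It is an illustration only, not LNet code: `RecoveryState`, `should_give_up`, `max_attempts`, and `max_age` are hypothetical names, and the limits stand in for the X and Y above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RecoveryState:
    """Hypothetical per-interface bookkeeping (not an LNet structure)."""
    attempts: int = 0            # recovery attempts made so far
    first_failure: float = 0.0   # time in seconds when the interface went unhealthy


def should_give_up(state: RecoveryState, now: float,
                   max_attempts: Optional[int] = None,
                   max_age: Optional[float] = None) -> bool:
    """Return True once either limit is reached; a limit of None is disabled."""
    if max_attempts is not None and state.attempts >= max_attempts:
        return True   # criterion "after X recovery attempts"
    if max_age is not None and now - state.first_failure >= max_age:
        return True   # criterion "after Y amount of time"
    return False
```

With both limits left as `None` the check never fires, which corresponds to today's behaviour of recovering indefinitely.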
We also need to decide how to stop recovery.
- Flag the ni/lpni so that it doesn't get placed back onto the recovery queue?
- Delete the ni/lpni outright?
- Rather than stop completely, use an exponential backoff algorithm so that recovery doesn't add any real load to the system.
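
The exponential backoff option could look roughly like the sketch below. This is an assumption-laden illustration, not the LNet implementation: `next_recovery_interval`, `base`, and `cap` are hypothetical names, and the 1 second base and 900 second cap are placeholder values.

```python
def next_recovery_interval(attempts: int, base: int = 1, cap: int = 900) -> int:
    """Seconds to wait before the next recovery attempt.

    `attempts` is the number of failed recovery attempts so far. The wait
    doubles with each failure and is clamped to `cap`, so a long-dead
    interface is still probed, just rarely.
    """
    # Clamp the exponent so the intermediate value stays small
    # even after a very large number of failures.
    exponent = min(attempts, 32)
    return min(cap, base * 2 ** exponent)
```

For example, `[next_recovery_interval(n) for n in range(6)]` gives `[1, 2, 4, 8, 16, 32]`. Once the cap is reached, the interface is retried at a fixed, low rate, which addresses the load concern without flagging or deleting the ni/lpni.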