Don't recover indefinitely. There has to be a criterion for when to stop recovering an interface that is consistently down.
When a route is deleted and its gateway is not referenced by any other route, the gateway should be deleted and removed from the recovery queue.
There is a category of errors, such as failure to resolve an address or a route, which shouldn't cause the health of the remote or the local interface to be decremented, or the interface to be queued for recovery. This category of errors indicates that the remote address does not exist or is unreachable.
Recovery should be strictly for NIDs which we have communicated with before, so we're going from a known good state to a known bad state.
Don't resend recovery messages.
Disable health on single-rail deployments, since the concept of resiliency doesn't exist on single-rail deployments. There is no other interface to fail over to.
LNet retries shouldn't delay ptlrpc timeouts.
When an interface fails, stays out of commission for a period of time, and is then brought back into commission, the sequence number of the interface that remained in use will be far larger than that of the newly recommissioned interface. This leads to the recommissioned interface being used exclusively until its sequence number catches up with that of the in-use interface. This is not ideal behavior: the system has two available interfaces, but only one is being used, simply because of the sequence number, which is intended to provide round robin. Ideally, once an interface comes back into service, it should immediately be used alongside the others.
Related to this point: a similar thing happens when there are a lot of source-specified sends. One NI gets a bunch of sequence increments, so it takes a while for the other NIs to "catch up".
When a device is in a fatal state, reflect that in the output of the lnetctl net show command. This makes debugging easier than relying on debug output.
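
The sequence-number items above can be illustrated with a small simulation. This is a minimal sketch in Python, not LNet code: the names Interface, pick, recommission and the resync flag are hypothetical, and resetting a returning interface's counter to the lowest counter among its active peers is only one possible policy, assumed here for illustration. The selection rule (lowest sequence number wins, counter incremented on each send) is the round-robin behavior described above.

```python
from dataclasses import dataclass


@dataclass
class Interface:
    """Hypothetical stand-in for a network interface (not an LNet structure)."""
    name: str
    seq: int = 0      # incremented on every send; the lowest value is picked next
    up: bool = True


def pick(interfaces):
    """Pick the available interface with the lowest sequence number."""
    candidates = [ni for ni in interfaces if ni.up]
    chosen = min(candidates, key=lambda ni: ni.seq)
    chosen.seq += 1
    return chosen


def send_many(interfaces, count):
    """Send `count` messages and return how many each interface carried."""
    usage = {ni.name: 0 for ni in interfaces}
    for _ in range(count):
        usage[pick(interfaces).name] += 1
    return usage


def recommission(ni, interfaces, resync):
    """Bring `ni` back up; optionally resync its counter to its active peers.

    Assumed policy: copy the lowest counter among the other active
    interfaces so the returning interface neither monopolizes traffic
    nor sits idle.
    """
    active = [other.seq for other in interfaces if other.up and other is not ni]
    ni.up = True
    if resync and active:
        ni.seq = min(active)


def demo(resync):
    """Fail ni_b, send 1000 messages on ni_a, bring ni_b back, send 100 more."""
    a, b = Interface("ni_a"), Interface("ni_b")
    nis = [a, b]
    b.up = False
    send_many(nis, 1000)            # only ni_a carries traffic
    recommission(b, nis, resync)
    return send_many(nis, 100)      # distribution after ni_b returns


if __name__ == "__main__":
    print("without resync:", demo(resync=False))
    print("with resync:   ", demo(resync=True))
```

Without the resync, all 100 messages sent after the recovery go to the returning interface, because its stale counter stays below its peer's. With the resync, the two interfaces alternate. The same imbalance would appear if one interface's counter were inflated by source-specified sends rather than by a peer's outage.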