Details
- Type: Improvement
- Resolution: Unresolved
- Priority: Major
Description
When a client is decommissioned, it stays forever in the peer list of the servers and generates a stream of messages like:
lnet_handle_recovery_reply()) peer NI (10.11.12.13@o2ib) recovery failed with -113
lnet_handle_recovery_reply()) Skipped 1234 similar messages
Removing the client from the peer list is cleaner. It avoids having to change lnet_recovery_limit just to silence LNetError messages about the removed client, and changing that parameter can mask a problem on a node that is active but faulty.
However, removing the peer manually can be cumbersome, which is why automatic deletion could be useful.
The feature could be controlled by a parameter that enables it and sets the delay after which an unreachable client is removed.
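The proposal can be illustrated with a small model, written here in Python rather than in LNet's C purely for readability. It is a sketch under assumptions, not LNet code: the `PeerTable` class, its methods, and the `auto_delete_delay` tunable are hypothetical names invented for this example. A value of `None` stands for "feature disabled"; any other value is the number of seconds a peer may keep failing recovery before it is dropped.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class PeerTable:
    """Toy model of a server-side peer list with optional automatic deletion."""

    # Hypothetical tunable: None disables the feature, otherwise the number of
    # seconds a peer may stay unreachable before it is removed.
    auto_delete_delay: Optional[float] = None
    # Maps a peer NID to the time its current failure streak began
    # (None means the peer is healthy).
    failing_since: Dict[str, Optional[float]] = field(default_factory=dict)

    def recovery_succeeded(self, nid: str) -> None:
        # A successful recovery ends the failure streak.
        self.failing_since[nid] = None

    def recovery_failed(self, nid: str, now: float) -> None:
        # Only record the first failure of a streak, so the delay is
        # measured from the moment the peer became unreachable.
        if self.failing_since.get(nid) is None:
            self.failing_since[nid] = now

    def expire(self, now: float) -> List[str]:
        """Remove and return peers failing for at least the configured delay."""
        if self.auto_delete_delay is None:
            return []
        expired = [
            nid
            for nid, since in self.failing_since.items()
            if since is not None and now - since >= self.auto_delete_delay
        ]
        for nid in expired:
            del self.failing_since[nid]
        return expired
```

With `auto_delete_delay=3600.0`, a peer whose recovery first failed at time 0 survives an `expire(now=3599.0)` call and is removed by `expire(now=3600.0)`. A peer that recovers in between is kept, which is what distinguishes a decommissioned client from one that was only temporarily unreachable.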
Issue Links
- is related to: LU-14654 Need to check if lnet_recovery_limit is non-zero in lnet_peer_ni_add_to_recoveryq_locked() (Resolved)
cbordage, I don't think we came to a final decision regarding peer deletion.
The options as I remember were:
I think the simplest would be to "optionally delete peers as soon as lnet_recovery_limit is reached", if we agree to delete inactive peers at all.
Please feel free to review the options and voice your opinion. If you have the cycles it would be great if you take this on.
Thanks,
Serguei