Details

    • Type: Improvement
    • Priority: Major
    • Resolution: Unresolved

    Description

      When a client is decommissioned, it remains indefinitely in the peer list of the servers and generates a stream of console messages like:

      lnet_handle_recovery_reply()) peer NI (10.11.12.13@o2ib) recovery failed with -113
      lnet_handle_recovery_reply()) Skipped 1234 similar messages
      

      It is cleaner to remove the decommissioned client from the list of peers. That avoids having to change the value of lnet_recovery_limit just to silence LNetError messages about a client that no longer exists; moreover, changing that parameter can mask a real problem on an active but faulty node.

      However, removing such peers manually is cumbersome, which is why automatic deletion would be useful.

      This feature could use a parameter that both enables it and sets the delay after which a dead client is considered removable.
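
      As a rough illustration, such a knob could follow the pattern of existing LNet module parameters. The sketch below is hypothetical, not code from any patch: the parameter lnet_peer_del_timeout and the helper are made-up names, while lnet_recovery_limit is the existing parameter.

        #include <linux/module.h>
        #include <linux/time64.h>

        /* Existing LNet parameter: recovery attempts on a peer NI stop after
         * this many seconds (0 = retry forever, the current default). */
        extern unsigned int lnet_recovery_limit;

        /* Hypothetical knob: 0 disables automatic deletion; a non-zero value
         * is the extra delay, after recovery gives up, before the peer NI is
         * removed from the peer list. */
        static unsigned int lnet_peer_del_timeout;
        module_param(lnet_peer_del_timeout, uint, 0644);
        MODULE_PARM_DESC(lnet_peer_del_timeout,
                         "Seconds past the recovery limit before a dead peer is deleted (0 = never delete)");

        /* Hypothetical predicate a monitor thread could evaluate per peer NI. */
        static bool peer_ni_expired(time64_t last_alive, time64_t now)
        {
                return lnet_peer_del_timeout != 0 && lnet_recovery_limit != 0 &&
                       now >= last_alive + lnet_recovery_limit + lnet_peer_del_timeout;
        }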

          Activity

            [LU-17519] Remove dead peers automatically

            cbordage I don't think we came to a final decision regarding peer deletion. 

            The options as I remember were:

            • do not delete peers, just use a reasonable lnet_recovery_limit
            • optionally delete peers as soon as lnet_recovery_limit is reached
            • introduce another module parameter (timeout) which would allow peers to get deleted some time after lnet_recovery_limit is reached
            • hardcode peer deletion timeout to be something after lnet_recovery_limit is reached but make deletion optional
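
            For concreteness, the last two options differ only in the deletion condition. A simplified model follows; all names are hypothetical except lnet_recovery_limit:

              #include <stdbool.h>
              #include <time.h>

              extern unsigned int lnet_recovery_limit;    /* existing knob */
              extern bool peer_del_enabled;               /* hypothetical opt-in */
              extern unsigned int lnet_peer_del_timeout;  /* hypothetical extra delay */

              /* "delete peers as soon as lnet_recovery_limit is reached" */
              static bool delete_at_limit(time_t last_alive, time_t now)
              {
                      return peer_del_enabled && lnet_recovery_limit != 0 &&
                             now >= last_alive + lnet_recovery_limit;
              }

              /* "delete some time after lnet_recovery_limit is reached" */
              static bool delete_after_timeout(time_t last_alive, time_t now)
              {
                      return lnet_peer_del_timeout != 0 && lnet_recovery_limit != 0 &&
                             now >= last_alive + lnet_recovery_limit + lnet_peer_del_timeout;
              }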

            I think the simplest would be to "optionally delete peers as soon as lnet_recovery_limit is reached", if we agree to delete inactive peers at all.

            Please feel free to review the options and voice your opinion. If you have the cycles, it would be great if you could take this on.

            Thanks,

            Serguei

            ssmirnov Serguei Smirnov added a comment

            IIRC, ssmirnov was working on something for this ticket.

            ssmirnov, am I right? If not, I can work on this ticket again.

            cbordage Cyril Bordage added a comment

            What is needed for this ticket to move forward? AFAIK there is still an issue with stale peers not being cleaned up.

            adilger Andreas Dilger added a comment

            adilger I agree with you in general, but I'd like to confirm there are no caveats related to primary NID locking before we decide to erase peer records whose recovery timed out.

            For example, consider a client executing a mount command which lists ":"-separated server NIDs, so that we have servers S1 and S2 with NIDs NS1 and NS2 respectively. In LNet this should result in peer records being created for both S1 and S2, featuring the NS1 and NS2 NIDs. As the peers are created, NS1 and NS2 should get locked as primary NIDs. If S2 then goes down and does not respond to recovery attempts for longer than lnet_recovery_limit, the peer record for S2 will be deleted. Later, S2 may come back online. It is not clear to me whether Lustre trying to talk to S2 in this scenario will result in the same peer record for S2 as was previously created by parsing the mount command, or whether there is a chance that a different S2 NID may get locked as primary.

            Alternatively, we could consider deleting only peers with unlocked primary NIDs when their recovery expires. If I understand the issue correctly, most of the time the recovery errors keep getting reported for clients which went offline. The servers don't lock the client primary NIDs, so this case should be covered.
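
            Assuming the existing LNET_PEER_LOCK_PRIMARY peer state flag is what marks a locked primary NID, that guard could look roughly like the sketch below (both helpers are hypothetical, not actual LNet functions):

              /* Sketch: skip peers whose primary NID is locked (e.g. server
               * NIDs parsed from the mount command line); delete only
               * unlocked peers whose recovery has expired, which covers the
               * typical case of decommissioned clients seen by a server. */
              static void check_peer_for_deletion(struct lnet_peer *lp, time64_t now)
              {
                      if (lp->lp_state & LNET_PEER_LOCK_PRIMARY)
                              return;         /* locked primary NID: keep the record */

                      if (peer_recovery_expired(lp, now)) /* hypothetical predicate */
                              delete_dead_peer(lp);       /* hypothetical helper */
              }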

            ssmirnov Serguei Smirnov added a comment

            ssmirnov is there any benefit to keeping a "dead" peer around after LNet has stopped trying to recover the connection? Is there useful information in the peer connection state that would be lost if it was dropped at this point? Does this trigger RPC errors back to Lustre if there are messages queued for that peer?

            I'm assuming that LNet would retry connecting to the peer if it gets a new or retried Lustre RPC request for that NID, regardless of whether it had given up on reconnecting by itself. So from my POV there is no huge benefit in having LNet retry for a long time; reconnection can be driven by the Lustre-level RPC retries.

            adilger Andreas Dilger added a comment

            adilger routers are supposed to be pinged periodically by the nodes using them, per alive_router_check_interval, regardless of the peer state, so I don't see implications there.

            It is true that the lnet_recovery_limit parameter can be used to halt recovery attempts after the specified timeout. However, the LNet peer itself is not currently deleted when this happens.

            ssmirnov Serguei Smirnov added a comment

            Patch https://review.whamcloud.com/54408 "LU-14654 lnet: Correct peer NI recovery age out calculation" changed peer recovery to be bound by lnet_recovery_limit so that it will not retry indefinitely if this value is non-zero.

            Unfortunately, this value has defaulted to 0 since its introduction in patch https://review.whamcloud.com/39716 "LU-13569 lnet: Introduce lnet_recovery_limit parameter". Is there any reason why it shouldn't be set to some reasonable value (e.g. 300s) so that LNet stops trying to recover an interface after some time? Wouldn't a Lustre-level RPC retry trigger LNet to re-establish a connection to a peer NID if it is still needed?

            Are there any implications for LNet routers no longer retrying connections, or is the expectation that the remote peer establish a new connection after some time (assuming it was restarted)? What if there is a network-level failure longer than lnet_recovery_limit and all the routers stop trying to communicate with each other? Is there something at a higher level that would trigger the establishment of new connections again?
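
            For reference, the bound that patch introduces can be modeled as the check below (a simplified model, not the actual LNet code): a peer NI that has been unreachable for longer than lnet_recovery_limit is no longer re-queued for recovery, and the default of 0 means "retry forever".

              #include <stdbool.h>
              #include <time.h>

              extern unsigned int lnet_recovery_limit; /* 0 = retry forever (default) */

              /* Simplified model of the age-out bound from LU-14654: once a
               * peer NI has been failing for longer than lnet_recovery_limit
               * seconds, recovery stops retrying it. */
              static bool recovery_expired(time_t last_alive, time_t now)
              {
                      return lnet_recovery_limit != 0 &&
                             now - last_alive > lnet_recovery_limit;
              }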

            adilger Andreas Dilger added a comment

            "Cyril Bordage <cbordage@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/54465
            Subject: LU-17519 lnet: remove dead peer nis automatically
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 11cafaf1701d5df4aa8e843d28ffb6a585e171ac

            gerrit Gerrit Updater added a comment

            People

              Assignee: Cyril Bordage (cbordage)
              Reporter: Cyril Bordage (cbordage)
              Votes: 0
              Watchers: 11
