[LU-18681] Histogram of client reconnection times during recovery - Whamcloud Community JIRA

Details

Type: Improvement
Resolution: Unresolved
Priority: Minor
Fix Version/s: None
Affects Version/s: Lustre 2.16.0
Labels:
- debug
- medium

Severity: 3

Description

To better understand how long the MDS and OSS should wait before aborting recovery, it would be useful to keep a histogram of how long clients take to reconnect to the server after recovery starts.

The histogram could be read on demand with "lctl get_param" and collected via sosreport during log collection. It would also be useful to print a one-line summary to the console at the end of recovery with, e.g., the time it took for 50%, 90%, and 95% of clients to reconnect.

This would allow us to better tune the default at_max value. If we know that clients normally reconnect within 45s, then waiting at_max=900s for recovery to time out or abort when some clients are not reconnecting is just a waste of time.
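
As a rough illustration of the idea (not Lustre code), the histogram and the one-line summary could look like the Python sketch below. The power-of-two bucket bounds, the function names, and the wording of the summary line are all assumptions made for this example. A real implementation would live in the server recovery code and be exposed through a parameter readable by "lctl get_param".

```python
from bisect import bisect_left

# Hypothetical bucket upper bounds in seconds (powers of two).
# This layout is an assumption for illustration, not Lustre's format.
BUCKET_BOUNDS = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]


def build_histogram(reconnect_secs):
    """Count reconnect times per bucket; the final slot holds overflow."""
    counts = [0] * (len(BUCKET_BOUNDS) + 1)
    for t in reconnect_secs:
        # Bucket i holds times t with BUCKET_BOUNDS[i-1] < t <= BUCKET_BOUNDS[i].
        counts[bisect_left(BUCKET_BOUNDS, t)] += 1
    return counts


def percentile_bound(counts, pct):
    """Upper bound of the bucket by which pct% of clients had reconnected."""
    total = sum(counts)
    if total == 0:
        return None
    needed = total * pct / 100.0
    seen = 0
    for i, count in enumerate(counts):
        seen += count
        if seen >= needed:
            if i < len(BUCKET_BOUNDS):
                return BUCKET_BOUNDS[i]
            break
    return float("inf")  # percentile falls in the overflow bucket


def summary_line(counts, pcts=(50, 90, 95)):
    """Format a one-line console summary (the wording is hypothetical)."""
    total = sum(counts)
    if total == 0:
        return "recovery: no clients reconnected"
    parts = [f"{p}% <= {percentile_bound(counts, p)}s" for p in pcts]
    return f"recovery: {total} clients reconnected; " + ", ".join(parts)


if __name__ == "__main__":
    # Made-up sample: seconds after recovery start at which each client reconnected.
    sample = [3, 5, 7, 12, 20, 30, 40, 41, 44, 300]
    print(summary_line(build_histogram(sample)))
```

Because only bucket counts are kept, the reported percentiles are upper bounds with bucket granularity. That is likely precise enough to judge whether a 900s wait is far longer than clients normally need.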

People

Assignee: WC Triage

Reporter: Andreas Dilger

Votes: 0

Watchers: 4

Dates

Created: 28/Jan/25 9:17 PM

Updated: 13/Feb/25 9:08 PM