Details
Description
To better understand how long the MDS and OSS should wait before recovery is aborted, it would be useful to keep a histogram of how long it takes clients to reconnect to the server after recovery starts.
The histogram could be read on demand via "lctl get_param" and collected via sosreport during log collection. It would also be useful to print a one-line summary to the console at the end of recovery with, e.g., the time it took for 50%, 90%, and 95% of clients to reconnect.
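As a rough illustration of the data involved, the sketch below (Python, userspace, not Lustre code) builds a reconnect-time histogram and the one-line percentile summary from a list of per-client reconnect times in seconds. The power-of-two bucket layout, the function names, and the wording of the summary line are all assumptions made for this example; the actual parameter name and output format would be decided in the implementation.

```python
def bucket_index(seconds):
    """Assumed power-of-two buckets: 0 -> [0,1)s, 1 -> [1,2)s, 2 -> [2,4)s, 3 -> [4,8)s, ..."""
    return int(seconds).bit_length()


def build_histogram(reconnect_times):
    """Count how many clients reconnected within each bucket."""
    hist = {}
    for t in reconnect_times:
        idx = bucket_index(t)
        hist[idx] = hist.get(idx, 0) + 1
    return hist


def percentile_time(reconnect_times, pct):
    """Nearest-rank percentile: the time by which pct% of clients had reconnected."""
    ordered = sorted(reconnect_times)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceil(pct * n / 100)
    return ordered[rank - 1]


def summary_line(reconnect_times, pcts=(50, 90, 95)):
    """Hypothetical console summary printed at the end of recovery."""
    parts = ["%d%% in %ds" % (p, percentile_time(reconnect_times, p)) for p in pcts]
    return "recovery: %d clients reconnected, %s" % (
        len(reconnect_times), ", ".join(parts))


if __name__ == "__main__":
    # Made-up sample: 20 clients reconnecting 1s..20s after recovery started.
    times = list(range(1, 21))
    for idx, count in sorted(build_histogram(times).items()):
        low = 0 if idx == 0 else 2 ** (idx - 1)
        print("[%3ds, %3ds): %d" % (low, 2 ** idx, count))
    print(summary_line(times))
```

Keeping only bucket counts on the server would make the percentiles approximate (bucket granularity), but avoids storing one timestamp per client; the sketch computes exact percentiles from the raw times purely for clarity.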
This would allow us to better tune the default at_max value. If we know that clients normally reconnect within 45s, then waiting longer than that is just a waste of time, and the cluster should not be waiting at_max=900s for recovery to time out or abort when some clients are not reconnecting.