Lustre / LU-18681

Histogram of client reconnection times during recovery

Details

    • Improvement
    • Resolution: Unresolved
    • Minor
    • None
    • Lustre 2.16.0
    • 3

    Description

      To better understand how long the MDS and OSS should wait before aborting recovery, it would be useful to keep a histogram of how long clients take to reconnect to the server after recovery starts.

      The histogram could be read on demand with "lctl get_param" and collected via sosreport during log collection. It would also be useful to print a one-line summary to the console at the end of recovery with, e.g., the time it took for 50%, 90%, and 95% of clients to reconnect.
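
      As a rough illustration of the bookkeeping described above, here is a minimal Python sketch, not Lustre code. It assumes reconnect times are recorded as whole seconds since recovery started, uses power-of-two buckets, and computes nearest-rank percentiles. All function names and the wording of the summary line are hypothetical.

```python
from collections import Counter


def bucket_index(seconds):
    """Power-of-two bucket: bucket i covers [2**i, 2**(i+1)) seconds.

    Times below 1s are folded into bucket 0 (an assumption of this sketch).
    """
    return 0 if seconds < 1 else int(seconds).bit_length() - 1


def build_histogram(reconnect_times):
    """Count how many clients reconnected within each bucket."""
    return dict(Counter(bucket_index(t) for t in reconnect_times))


def percentile(reconnect_times, pct):
    """Nearest-rank percentile: time by which pct% of clients had reconnected."""
    ordered = sorted(reconnect_times)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]


def summary_line(reconnect_times):
    """Hypothetical one-line console summary printed at the end of recovery."""
    parts = ", ".join(
        f"{pct}% in {percentile(reconnect_times, pct)}s" for pct in (50, 90, 95)
    )
    return f"recovery: {len(reconnect_times)} clients reconnected, {parts}"


if __name__ == "__main__":
    # Made-up sample data: seconds after recovery start for ten clients.
    times = [1, 2, 3, 5, 8, 13, 21, 30, 40, 45]
    print(build_histogram(times))
    print(summary_line(times))
```

      Power-of-two buckets keep the histogram small even when a few clients take hundreds of seconds, at the cost of coarser resolution for slow clients. A real implementation might prefer finer buckets near the expected reconnect time.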

      This would allow us to better tune the default at_max value. If we know that clients normally reconnect within 45s, then waiting longer than that is a waste of time, and the cluster should not wait at_max=900s for recovery to time out or abort when some clients are not reconnecting.
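
      The arithmetic behind this argument can be sketched as follows. This is an illustrative Python calculation only, not how Lustre derives its timeouts: the function name and the 2x safety margin are assumptions, while the 45s and 900s figures are the ones quoted above.

```python
import math


def suggest_recovery_window(observed_seconds, margin=2.0, at_max=900):
    """Hypothetical helper: scale an observed reconnect time by a safety
    margin and never exceed the configured at_max ceiling."""
    return min(at_max, math.ceil(observed_seconds * margin))


if __name__ == "__main__":
    at_max = 900    # seconds, the value quoted in the description
    typical = 45    # seconds within which clients normally reconnect
    window = suggest_recovery_window(typical, at_max=at_max)
    print(f"suggested window: {window}s, saved vs at_max: {at_max - window}s")
```

      With these assumed inputs the window is 90s, so a recovery stalled on missing clients would end 810s sooner than with at_max=900s.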

          People

            wc-triage WC Triage
            adilger Andreas Dilger