Lustre / LU-13932

reduce maximum wait_time for MMP recovery

Details

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Minor

    Description

      When the MMP block is written, mmp_check_interval is computed as EXT4_MMP_CHECK_MULT (= 2) times the actual IO completion time, clamped to the range between EXT4_MMP_MIN_CHECK_INTERVAL and EXT4_MMP_MAX_CHECK_INTERVAL:

              mmp_check_interval = max(min(EXT4_MMP_CHECK_MULT * diff / HZ, 
                                           EXT4_MMP_MAX_CHECK_INTERVAL),
                                       EXT4_MMP_MIN_CHECK_INTERVAL);
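
As a rough illustration of the clamping above, here is a minimal standalone sketch. The helper name mmp_check_interval_from_diff() and the SKETCH_HZ value are hypothetical. The constant values 5 and 300 are assumptions, chosen to match the 20s and 720s figures quoted in this description.

```c
/* Constants assumed to match fs/ext4/ext4.h; verify against your tree. */
#define EXT4_MMP_CHECK_MULT          2UL
#define EXT4_MMP_MIN_CHECK_INTERVAL  5UL
#define EXT4_MMP_MAX_CHECK_INTERVAL  300UL
#define SKETCH_HZ                    1000UL  /* hypothetical jiffies per second */

/*
 * Hypothetical helper: diff is the time the MMP block write took, in
 * jiffies. Returns the check interval in seconds, clamped to [MIN, MAX].
 */
static unsigned long mmp_check_interval_from_diff(unsigned long diff)
{
	unsigned long interval = EXT4_MMP_CHECK_MULT * diff / SKETCH_HZ;

	if (interval > EXT4_MMP_MAX_CHECK_INTERVAL)
		interval = EXT4_MMP_MAX_CHECK_INTERVAL;
	if (interval < EXT4_MMP_MIN_CHECK_INTERVAL)
		interval = EXT4_MMP_MIN_CHECK_INTERVAL;
	return interval;
}
```

Under these assumptions, a write that completes in 0.5s still records the 5s minimum, a 10s write records 20s, and anything slower than 150s records the 300s maximum.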
      

      Later, during MMP recovery after a crash, the wait_time is computed as the smaller of 2x mmp_check_interval plus 1s and mmp_check_interval plus 60s:

              wait_time = min(mmp_check_interval * 2 + 1,
                              mmp_check_interval + 60);
      
              /* Print MMP interval if more than 20 secs. */
              if (wait_time > EXT4_MMP_MIN_CHECK_INTERVAL * 4)
                      ext4_warning(sb, "MMP interval %u higher than expected, please"
                                   " wait.\n", wait_time * 2);
      

      There should be some margin to compensate for nodes that became busier after the MMP block was last updated, but this seems excessive, given that recovery must wait for that interval twice (once to detect whether the MMP block is idle, and once again after writing our own MMP block to detect races with other nodes also trying to mount the filesystem). With mmp_check_interval at its 300s maximum, wait_time is 360s, so the mount may take 12 minutes (720s) once all of the doublings are taken into account.

      We don't really need to increase the wait_time by a factor of two or by 60s. It would be enough to use e.g. min(mmp_check_interval * 2, mmp_check_interval + 20) or similar, given that the second value will take precedence once mmp_check_interval is above 20s. This would reduce the maximum wait interval to 640s (80s less than today).
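
To make the comparison concrete, the sketch below evaluates the current formula and the suggested one side by side. The function names are hypothetical. The +20 margin is only the "e.g." value from the paragraph above, not a settled choice.

```c
/* Current formula: min(interval * 2 + 1, interval + 60), in seconds. */
static unsigned long wait_time_current(unsigned long interval)
{
	unsigned long doubled = interval * 2 + 1;
	unsigned long padded = interval + 60;

	return doubled < padded ? doubled : padded;
}

/* Suggested formula: min(interval * 2, interval + 20), in seconds. */
static unsigned long wait_time_proposed(unsigned long interval)
{
	unsigned long doubled = interval * 2;
	unsigned long padded = interval + 20;

	return doubled < padded ? doubled : padded;
}

/*
 * Recovery waits wait_time twice: once to see whether the MMP block is
 * idle, and once more after writing our own MMP block.
 */
static unsigned long total_recovery_wait(unsigned long wait_time)
{
	return wait_time * 2;
}
```

With mmp_check_interval at 300s this gives 720s for the current formula and 640s for the suggested one. At the 5s minimum the totals are 22s and 20s, so short intervals are barely affected.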

      Also, based on the MMP code in ZFS, it probably makes sense for the mmp_check_interval written by ext4_kmmpd() to use a decaying average time rather than just the most recent interval. That would avoid writing a very short interval during a period of alternating short and long IO submission times (e.g. due to fluctuating load or intermittent IO errors). Something like:

              new_check_interval = EXT4_MMP_CHECK_MULT * diff / HZ;
              /* Increase mmp_check_interval immediately if IO completion time
               * is longer, but decay slowly to minimum if it is shorter.
               */
              if (new_check_interval >= mmp_check_interval)
                      mmp_check_interval = min(new_check_interval,
                                               EXT4_MMP_MAX_CHECK_INTERVAL);
              else
                      mmp_check_interval = (mmp_check_interval * 15 +
                                            max(EXT4_MMP_MIN_CHECK_INTERVAL,
                                                new_check_interval)) / 16;
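
The same update rule can be tried outside the kernel as a pure function. This is only a sketch: the name mmp_decayed_interval() is hypothetical, both arguments are in seconds, and the constants are assumed to be 5 and 300 as elsewhere in this description.

```c
/* Constants assumed to match fs/ext4/ext4.h; verify against your tree. */
#define EXT4_MMP_MIN_CHECK_INTERVAL  5UL
#define EXT4_MMP_MAX_CHECK_INTERVAL  300UL

/*
 * Hypothetical helper: cur is the stored mmp_check_interval and
 * new_interval is EXT4_MMP_CHECK_MULT * diff / HZ for the latest write.
 * A longer interval is adopted immediately (capped at MAX). A shorter
 * one only pulls the stored value down by 1/16 of the difference.
 */
static unsigned long mmp_decayed_interval(unsigned long cur,
					  unsigned long new_interval)
{
	unsigned long target;

	if (new_interval >= cur)
		return new_interval < EXT4_MMP_MAX_CHECK_INTERVAL ?
		       new_interval : EXT4_MMP_MAX_CHECK_INTERVAL;

	target = new_interval > EXT4_MMP_MIN_CHECK_INTERVAL ?
		 new_interval : EXT4_MMP_MIN_CHECK_INTERVAL;
	return (cur * 15 + target) / 16;
}
```

For example, if the stored interval is 100s and the next write completes quickly enough to give a new interval of 2s, the stored value drops only to 94s instead of to the 5s minimum. Because the division truncates, the value still decreases by at least 1s per update and does reach the minimum eventually.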
      

          People

            wc-triage WC Triage
            adilger Andreas Dilger