  1. Lustre
  2. LU-13932

reduce maximum wait_time for MMP recovery



    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Minor


      When the MMP block is written, mmp_check_interval is computed as EXT4_MMP_CHECK_MULT = 2 times the actual IO completion time:

              mmp_check_interval = max(min(EXT4_MMP_CHECK_MULT * diff / HZ,
                                           EXT4_MMP_MAX_CHECK_INTERVAL),
                                       EXT4_MMP_MIN_CHECK_INTERVAL);
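
      As a rough userspace sketch of that computation: the helper name mmp_interval_from_io() is hypothetical, and the constant values (multiplier 2, minimum 5s, maximum 300s) are assumptions that are consistent with the 20s and 720s figures in this ticket but should be checked against the ext4 headers in your tree.

```c
#include <assert.h>

/* Assumed values; verify against fs/ext4/ext4.h in your tree. */
#define EXT4_MMP_CHECK_MULT          2UL
#define EXT4_MMP_MIN_CHECK_INTERVAL  5UL
#define EXT4_MMP_MAX_CHECK_INTERVAL  300UL

/*
 * Hypothetical helper: diff is the time the MMP block write took, in
 * ticks, and hz is the number of ticks per second. Returns the check
 * interval in seconds, clamped to [MIN, MAX].
 */
static unsigned long mmp_interval_from_io(unsigned long diff, unsigned long hz)
{
	unsigned long interval;

	assert(hz > 0);
	interval = EXT4_MMP_CHECK_MULT * diff / hz;
	if (interval > EXT4_MMP_MAX_CHECK_INTERVAL)
		interval = EXT4_MMP_MAX_CHECK_INTERVAL;
	if (interval < EXT4_MMP_MIN_CHECK_INTERVAL)
		interval = EXT4_MMP_MIN_CHECK_INTERVAL;
	return interval;
}
```

      With hz = 1000, a 40s write (diff = 40000) gives an 80s interval, while a fast write is raised to the 5s minimum and a very slow one is capped at 300s.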

      Later, during MMP recovery after a crash, wait_time is computed as the smaller of 2x mmp_check_interval plus 1s and mmp_check_interval plus 60s:

              wait_time = min(mmp_check_interval * 2 + 1,
                              mmp_check_interval + 60);
              /* Print MMP interval if more than 20 secs. */
              if (wait_time > EXT4_MMP_MIN_CHECK_INTERVAL * 4)
                      ext4_warning(sb, "MMP interval %u higher than expected, please"
                                   " wait.\n", wait_time * 2);

      Some margin is needed to compensate for nodes that became busier after the MMP block was last updated, but this seems excessive, given that recovery has to wait for that interval twice: once to detect whether the MMP block is idle, and once more after writing our own MMP block to detect races with other nodes also trying to mount the filesystem. With mmp_check_interval at its 300s maximum, wait_time is 360s, so a mount may take 12 minutes (720s) after all of the doublings are taken into account.

      We don't really need to increase wait_time by a factor of two or by 60s. It would be enough to use e.g. min(mmp_check_interval * 2, mmp_check_interval + 20) or similar, given that the second value already takes precedence once mmp_check_interval is above 20s. This would reduce the maximum wait interval to 640s (-80s).
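
      The difference between the two formulas can be checked with a small userspace sketch. The function names are hypothetical, and the 300s upper bound on mmp_check_interval is an assumption inferred from the 720s figure above.

```c
#include <assert.h>

/* Assumed upper bound on mmp_check_interval, in seconds. */
#define MMP_MAX_CHECK_INTERVAL 300UL

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/* wait_time as currently computed during MMP recovery. */
static unsigned long wait_time_current(unsigned long interval)
{
	assert(interval <= MMP_MAX_CHECK_INTERVAL);
	return min_ul(interval * 2 + 1, interval + 60);
}

/* wait_time with the smaller margin suggested in this ticket. */
static unsigned long wait_time_proposed(unsigned long interval)
{
	assert(interval <= MMP_MAX_CHECK_INTERVAL);
	return min_ul(interval * 2, interval + 20);
}

/* Recovery waits twice: once before and once after writing our MMP block. */
static unsigned long recovery_time(unsigned long wait_time)
{
	return 2 * wait_time;
}
```

      For a 300s interval this gives recovery_time(wait_time_current(300)) = 720s versus recovery_time(wait_time_proposed(300)) = 640s. For a 5s interval the two are 22s and 20s, so short intervals are barely affected.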

      Also, based on the MMP code in ZFS, it probably makes sense for the mmp_check_interval written by ext4_kmmpd() to be a decaying average rather than just the most recent interval. That would avoid writing a very short interval during a period of alternating short and long IO submission times (e.g. due to fluctuating load or intermittent IO errors). Something like:

              new_check_interval = EXT4_MMP_CHECK_MULT * diff / HZ;
              /* Increase mmp_check_interval immediately if IO completion time
               * is longer, but decay slowly to minimum if it is shorter.
               */
              if (new_check_interval >= mmp_check_interval)
                      mmp_check_interval = min(new_check_interval,
                                               EXT4_MMP_MAX_CHECK_INTERVAL);
              else
                      mmp_check_interval = (mmp_check_interval * 15 +
                                            max(new_check_interval,
                                                EXT4_MMP_MIN_CHECK_INTERVAL)) / 16;
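
      A standalone sketch of that update rule, for experimenting with sample sequences outside the kernel. The function name mmp_update_interval() is hypothetical, and the 5s/300s bounds and the clamping of each new sample to those bounds are assumptions, not confirmed behaviour of any existing patch.

```c
#include <assert.h>

/* Assumed bounds on mmp_check_interval, in seconds. */
#define MMP_MIN_CHECK_INTERVAL 5UL
#define MMP_MAX_CHECK_INTERVAL 300UL

/*
 * Hypothetical helper: return the next mmp_check_interval given the
 * current one (cur) and the interval derived from the latest MMP write
 * (sample). Longer samples take effect immediately; shorter samples
 * pull the value down by 1/16 of the difference per update.
 */
static unsigned long mmp_update_interval(unsigned long cur, unsigned long sample)
{
	assert(cur >= MMP_MIN_CHECK_INTERVAL && cur <= MMP_MAX_CHECK_INTERVAL);

	if (sample >= cur)
		return sample < MMP_MAX_CHECK_INTERVAL ?
		       sample : MMP_MAX_CHECK_INTERVAL;

	if (sample < MMP_MIN_CHECK_INTERVAL)
		sample = MMP_MIN_CHECK_INTERVAL;
	return (cur * 15 + sample) / 16;
}
```

      Starting at 5s, a single 100s sample raises the interval to 100s at once, and a following 5s sample only lowers it to 94s. The weight of 15/16 is the one suggested above; a different weight trades recovery latency against robustness to bursts.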




            wc-triage WC Triage
            adilger Andreas Dilger