[LU-13932] reduce maximum wait_time for MMP recovery - Whamcloud Community JIRA

Details

Type: Improvement
Resolution: Unresolved
Priority: Minor
Fix Version/s: None
Affects Version/s: None
Labels:
- e2fsprogs
- easy
- ldiskfs
- lug24dd

Description

When the MMP block is written, mmp_check_interval is computed as EXT4_MMP_CHECK_MULT (= 2) times the actual IO completion time, clamped to the range [EXT4_MMP_MIN_CHECK_INTERVAL, EXT4_MMP_MAX_CHECK_INTERVAL]:

        mmp_check_interval = max(min(EXT4_MMP_CHECK_MULT * diff / HZ, 
                                     EXT4_MMP_MAX_CHECK_INTERVAL),
                                 EXT4_MMP_MIN_CHECK_INTERVAL);
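
As a rough illustration of that clamp, here is a minimal standalone sketch. The helper name mmp_check_interval_from_io() and the explicit hz parameter are hypothetical, and the 5s/300s limits are assumptions inferred from the figures in this ticket rather than copied from the kernel headers.

```c
#include <assert.h>

/* Assumed values in seconds, inferred from this ticket
 * (20s warning threshold = 4 * minimum, 720s worst-case mount). */
#define EXT4_MMP_CHECK_MULT		2UL
#define EXT4_MMP_MIN_CHECK_INTERVAL	5UL
#define EXT4_MMP_MAX_CHECK_INTERVAL	300UL

/* Hypothetical helper: diff_jiffies is how long the MMP block write
 * took, in jiffies; hz is the number of jiffies per second. */
static unsigned long mmp_check_interval_from_io(unsigned long diff_jiffies,
						unsigned long hz)
{
	unsigned long secs;

	assert(hz > 0);
	secs = EXT4_MMP_CHECK_MULT * diff_jiffies / hz;
	if (secs > EXT4_MMP_MAX_CHECK_INTERVAL)
		secs = EXT4_MMP_MAX_CHECK_INTERVAL;
	if (secs < EXT4_MMP_MIN_CHECK_INTERVAL)
		secs = EXT4_MMP_MIN_CHECK_INTERVAL;
	return secs;
}
```

With hz = 1000, a 0.5s write (500 jiffies) is raised to the 5s minimum, a 30s write gives 60s, and a 400s write is capped at 300s.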

Later, during MMP recovery after a crash, wait_time is computed as the smaller of twice mmp_check_interval plus one second, or mmp_check_interval plus 60s:

        wait_time = min(mmp_check_interval * 2 + 1,
                        mmp_check_interval + 60);

        /* Print MMP interval if more than 20 secs. */
        if (wait_time > EXT4_MMP_MIN_CHECK_INTERVAL * 4)
                ext4_warning(sb, "MMP interval %u higher than expected, please"
                             " wait.\n", wait_time * 2);

Some margin is needed to compensate for nodes that became busier after the MMP block was last updated, but this seems excessive, given that recovery has to wait for that interval twice: once to detect whether the MMP block is idle, and once again after writing our own MMP block to detect races with other nodes also trying to mount the filesystem. With mmp_check_interval at its 300s maximum, wait_time is 360s, so a mount may take 12 minutes (720s) once all of the doublings are taken into account.
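
The worst-case figure can be reproduced with a small sketch of the current formula. mmp_recovery_delay() is a hypothetical helper that simply doubles wait_time for the two waits described above, and the 5s/300s limits are assumptions inferred from this ticket.

```c
#include <assert.h>

/* Assumed limits in seconds, inferred from the figures in this ticket. */
#define EXT4_MMP_MIN_CHECK_INTERVAL	5UL
#define EXT4_MMP_MAX_CHECK_INTERVAL	300UL

/* Current wait_time formula, as quoted in this ticket. */
static unsigned long mmp_wait_time(unsigned long check_interval)
{
	unsigned long doubled = check_interval * 2 + 1;
	unsigned long padded = check_interval + 60;

	assert(check_interval >= EXT4_MMP_MIN_CHECK_INTERVAL &&
	       check_interval <= EXT4_MMP_MAX_CHECK_INTERVAL);
	return doubled < padded ? doubled : padded;
}

/* Hypothetical helper: recovery waits wait_time twice, once to see
 * whether the MMP block is idle and once after writing our own block. */
static unsigned long mmp_recovery_delay(unsigned long check_interval)
{
	return 2 * mmp_wait_time(check_interval);
}
```

Under these assumptions mmp_recovery_delay(300) evaluates to 720, while the 5s minimum gives 22.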

We don't really need to increase the wait_time by a factor of two or by 60s. It would be enough to use e.g. min(mmp_check_interval * 2, mmp_check_interval + 20) or similar, in which the second value takes precedence once mmp_check_interval is above 20s. This would reduce the maximum total wait to 640s (-80s).
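
A minimal sketch of the suggested formula, for checking the numbers. The function name mmp_wait_time_proposed() is hypothetical, and the 5s/300s limits are assumptions inferred from this ticket.

```c
#include <assert.h>

/* Assumed limits in seconds, inferred from the figures in this ticket. */
#define EXT4_MMP_MIN_CHECK_INTERVAL	5UL
#define EXT4_MMP_MAX_CHECK_INTERVAL	300UL

/* Suggested formula: min(interval * 2, interval + 20). */
static unsigned long mmp_wait_time_proposed(unsigned long check_interval)
{
	unsigned long doubled = check_interval * 2;
	unsigned long padded = check_interval + 20;

	assert(check_interval >= EXT4_MMP_MIN_CHECK_INTERVAL &&
	       check_interval <= EXT4_MMP_MAX_CHECK_INTERVAL);
	return doubled < padded ? doubled : padded;
}
```

Here the two terms are equal at an interval of 20s and the "+ 20" term is the smaller one for anything longer. At the assumed 300s maximum the result is 320s, so the two recovery waits add up to 640s.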

Also, based on the MMP code in ZFS, it probably makes sense for the mmp_check_interval written by ext4_kmmpd() to use a decaying average rather than just the most recent interval. That would avoid writing a very short interval during a period of alternating short and long IO submission times (e.g. due to fluctuating load or intermittent IO errors). Something like:

        new_check_interval = EXT4_MMP_CHECK_MULT * diff / HZ;
        /* Increase mmp_check_interval immediately if IO completion time
         * is longer, but decay slowly to minimum if it is shorter.
         */
        if (new_check_interval >= mmp_check_interval)
                mmp_check_interval = min(new_check_interval,
                                         EXT4_MMP_MAX_CHECK_INTERVAL);
        else
                mmp_check_interval = (mmp_check_interval * 15 +
                                      max(EXT4_MMP_MIN_CHECK_INTERVAL,
                                          new_check_interval)) / 16;
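
To see how this decay behaves under alternating IO times, the snippet can be wrapped in a standalone function. This is only a sketch: mmp_decay_check_interval() is a hypothetical name, the IO time is passed in whole seconds instead of diff / HZ, and the 5s/300s limits are assumptions inferred from this ticket.

```c
#include <assert.h>

/* Assumed values in seconds, inferred from this ticket. */
#define EXT4_MMP_CHECK_MULT		2UL
#define EXT4_MMP_MIN_CHECK_INTERVAL	5UL
#define EXT4_MMP_MAX_CHECK_INTERVAL	300UL

/* Hypothetical wrapper around the update rule above: cur is the stored
 * mmp_check_interval, io_secs the latest MMP write time in seconds. */
static unsigned long mmp_decay_check_interval(unsigned long cur,
					      unsigned long io_secs)
{
	unsigned long new_check_interval = EXT4_MMP_CHECK_MULT * io_secs;
	unsigned long target;

	assert(cur >= EXT4_MMP_MIN_CHECK_INTERVAL &&
	       cur <= EXT4_MMP_MAX_CHECK_INTERVAL);

	/* Slower IO: raise the interval immediately, capped at the maximum. */
	if (new_check_interval >= cur)
		return new_check_interval < EXT4_MMP_MAX_CHECK_INTERVAL ?
		       new_check_interval : EXT4_MMP_MAX_CHECK_INTERVAL;

	/* Faster IO: move 1/16 of the way toward the new value (integer
	 * division truncates), never aiming below the minimum. */
	target = new_check_interval > EXT4_MMP_MIN_CHECK_INTERVAL ?
		 new_check_interval : EXT4_MMP_MIN_CHECK_INTERVAL;
	return (cur * 15 + target) / 16;
}
```

Starting from 5s, a 100s write raises the interval to 200s at once; a following 1s write only lowers it to 187s, and the next 100s write brings it back to 200s. Using only the most recent write time would instead alternate between 200s and 5s.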

People

Assignee: WC Triage

Reporter: Andreas Dilger

Votes: 0

Watchers: 5

Dates

Created: 28/Aug/20 2:45 AM

Updated: 28/May/25 7:33 PM