[LU-13932] reduce maximum wait_time for MMP recovery Created: 28/Aug/20  Updated: 07/Oct/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Minor
Reporter: Andreas Dilger Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: e2fsprogs, easy, ldiskfs

Issue Links:
Related
Rank (Obsolete): 9223372036854775807

Description

When the MMP block is written, mmp_check_interval is computed as EXT4_MMP_CHECK_MULT (= 2) times the actual IO completion time, clamped between EXT4_MMP_MIN_CHECK_INTERVAL and EXT4_MMP_MAX_CHECK_INTERVAL:

        mmp_check_interval = max(min(EXT4_MMP_CHECK_MULT * diff / HZ, 
                                     EXT4_MMP_MAX_CHECK_INTERVAL),
                                 EXT4_MMP_MIN_CHECK_INTERVAL);
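
For reference, a minimal userspace sketch of the same clamp. The helper name mmp_interval_from_io() and its hz parameter are made up for illustration. The constant values are assumed to match fs/ext4/ext4.h (MIN = 5 agrees with the "20 secs" comment in the recovery code, MAX = 300 agrees with the 720s worst case):

```c
#include <assert.h>

/* Assumed to match fs/ext4/ext4.h; all values in seconds. */
#define EXT4_MMP_CHECK_MULT          2UL
#define EXT4_MMP_MIN_CHECK_INTERVAL  5UL
#define EXT4_MMP_MAX_CHECK_INTERVAL  300UL

/*
 * Hypothetical helper: derive the check interval (seconds) from the time
 * one MMP block write took. "diff" is in ticks and "hz" is ticks per
 * second, standing in for the kernel's jiffies delta and HZ.
 */
static unsigned long mmp_interval_from_io(unsigned long diff, unsigned long hz)
{
        unsigned long secs;

        assert(hz > 0);
        secs = EXT4_MMP_CHECK_MULT * diff / hz;

        /* Same effect as max(min(secs, MAX), MIN) in the snippet above. */
        if (secs > EXT4_MMP_MAX_CHECK_INTERVAL)
                secs = EXT4_MMP_MAX_CHECK_INTERVAL;
        if (secs < EXT4_MMP_MIN_CHECK_INTERVAL)
                secs = EXT4_MMP_MIN_CHECK_INTERVAL;
        return secs;
}
```

With hz = 1000, a 0.5s write stays at the 5s minimum, a 30s write gives 60s, and anything slower than 150s is capped at 300s.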

Later, during MMP recovery after a crash, wait_time is computed as the smaller of twice mmp_check_interval plus 1s, or mmp_check_interval plus 60s:

        wait_time = min(mmp_check_interval * 2 + 1,
                        mmp_check_interval + 60);

        /* Print MMP interval if more than 20 secs. */
        if (wait_time > EXT4_MMP_MIN_CHECK_INTERVAL * 4)
                ext4_warning(sb, "MMP interval %u higher than expected, please"
                             " wait.\n", wait_time * 2);

There should be some margin to compensate for nodes that became busier after the MMP block was last updated, but this seems excessive, given that recovery has to wait for wait_time twice: once to detect whether the MMP block is idle, and once more after writing our own MMP block to detect races with other nodes that are also trying to mount the filesystem. With mmp_check_interval at its maximum (EXT4_MMP_MAX_CHECK_INTERVAL, 300s), wait_time is 360s, so the mount may take 12 minutes (720s) once all of the doublings are taken into account.

We don't really need to increase wait_time by a factor of two or by 60s. It would be enough to use e.g. min(mmp_check_interval * 2, mmp_check_interval + 20) or similar, given that the second value already takes precedence once mmp_check_interval is above 20s. This would reduce the maximum total wait to 640s (-80s).
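
To check the arithmetic, here is a small userspace sketch comparing the two formulas. The function names are made up for illustration, the 20s margin is only the example value suggested above, and EXT4_MMP_MAX_CHECK_INTERVAL = 300 is assumed from the kernel headers:

```c
#include <assert.h>

#define EXT4_MMP_MAX_CHECK_INTERVAL 300U        /* assumed, seconds */

static unsigned int min_u(unsigned int a, unsigned int b)
{
        return a < b ? a : b;
}

/* Current formula, as quoted in the description. */
static unsigned int wait_time_current(unsigned int interval)
{
        assert(interval <= EXT4_MMP_MAX_CHECK_INTERVAL);
        return min_u(interval * 2 + 1, interval + 60);
}

/* Proposed formula, using the example margin of 20s. */
static unsigned int wait_time_proposed(unsigned int interval)
{
        assert(interval <= EXT4_MMP_MAX_CHECK_INTERVAL);
        return min_u(interval * 2, interval + 20);
}

/* Recovery waits twice: once before and once after writing the MMP block. */
static unsigned int total_recovery_wait(unsigned int wait_time)
{
        return 2 * wait_time;
}
```

At the 300s maximum this gives 2 * 360 = 720s for the current formula and 2 * 320 = 640s for the proposed one. At the 5s minimum the two are nearly identical (11s vs. 10s per wait), so the change only matters for slow or heavily loaded storage.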

Also, based on the MMP code in ZFS, it probably makes sense for the mmp_check_interval written by ext4_kmmpd() to use a decaying average time rather than just the most recent interval. That would avoid writing a very short interval during a period of alternating short and long IO submission times (e.g. due to fluctuating load or intermittent IO errors). Something like:

        new_check_interval = EXT4_MMP_CHECK_MULT * diff / HZ;
        /* Increase mmp_check_interval immediately if IO completion time
         * is longer, but decay slowly to minimum if it is shorter.
         */
        if (new_check_interval >= mmp_check_interval)
                mmp_check_interval = min(new_check_interval,
                                         EXT4_MMP_MAX_CHECK_INTERVAL);
        else
                mmp_check_interval = (mmp_check_interval * 15 +
                                      max(EXT4_MMP_MIN_CHECK_INTERVAL,
                                          new_check_interval)) / 16;
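
A userspace sketch of how that update would behave, not tested against the kernel. The names mmp_decay_step() and mmp_steps_to_floor() are made up for illustration, and the MIN/MAX values are assumed to be 5s and 300s as in the kernel headers:

```c
#include <assert.h>

#define EXT4_MMP_MIN_CHECK_INTERVAL 5UL         /* assumed, seconds */
#define EXT4_MMP_MAX_CHECK_INTERVAL 300UL       /* assumed, seconds */

/*
 * One update step of the proposed scheme. "cur" is the stored
 * mmp_check_interval, "new_check_interval" is the value computed from
 * the latest write (EXT4_MMP_CHECK_MULT * diff / HZ in the kernel).
 */
static unsigned long mmp_decay_step(unsigned long cur,
                                    unsigned long new_check_interval)
{
        assert(cur >= EXT4_MMP_MIN_CHECK_INTERVAL &&
               cur <= EXT4_MMP_MAX_CHECK_INTERVAL);

        /* Slower IO: jump to the new value immediately, capped at MAX. */
        if (new_check_interval >= cur)
                return new_check_interval < EXT4_MMP_MAX_CHECK_INTERVAL ?
                       new_check_interval : EXT4_MMP_MAX_CHECK_INTERVAL;

        /* Faster IO: move 1/16 of the way toward the new value. */
        if (new_check_interval < EXT4_MMP_MIN_CHECK_INTERVAL)
                new_check_interval = EXT4_MMP_MIN_CHECK_INTERVAL;
        return (cur * 15 + new_check_interval) / 16;
}

/* Number of fast writes needed to decay from "start" back to MIN. */
static unsigned int mmp_steps_to_floor(unsigned long start)
{
        unsigned int steps = 0;

        while (start > EXT4_MMP_MIN_CHECK_INTERVAL) {
                start = mmp_decay_step(start, EXT4_MMP_MIN_CHECK_INTERVAL);
                steps++;
        }
        return steps;
}
```

With alternating samples of 60s and 5s, the stored interval goes 60, 56, 60, 56 rather than dropping back to 5 after every fast write. The integer division still converges: for any cur above 5 the result is at least 1 lower than cur, so the interval does reach the minimum once IO times stay short.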

Generated at Sat Feb 10 03:05:26 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.