[LU-12785] DOM2: dynamic DoM component size as MDT becomes full Created: 18/Sep/19  Updated: 18/Mar/21  Resolved: 19/Jun/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.14.0

Type: Improvement Priority: Major
Reporter: Mikhail Pershin Assignee: Mikhail Pershin
Resolution: Fixed Votes: 0
Labels: DoM2

Issue Links:
Related
is related to LU-12624 DNE3: striped directory allocate stri... Resolved
is related to LU-13058 Intermediate component removal (PFL/SEL) Open

 Description   

As the MDT becomes full it makes sense to reduce the size of, or completely remove, DoM components from the layout if they were created from a default directory or root filesystem layout. I think a reasonable heuristic would be that if the percentage of free inodes is larger than the percentage of free space, the size of the DoM component can be increased (up to the mdt.*.dom_stripesize maximum). If the percentage of free inodes is smaller than the percentage of free space, or if the MDT is within a configurable threshold (e.g. mdt.*.dom_threshold=10%) of being full, the DoM component size should be cut in half, and within 1/2 of mdt.*.dom_threshold the DoM component should be removed (or similar, see more complex options below).
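The threshold part of this heuristic can be sketched in a few lines of user-space C. This is only an illustration under stated assumptions: the helper name dom_size_by_threshold() and its parameters are hypothetical and not part of Lustre, the free-space percentage is assumed to be computed elsewhere, and the inode/space comparison is left out.

```c
#include <assert.h>
#include <stdint.h>

/* 64KiB minimum stripe size, as stated in this ticket. */
#define LOV_MIN_STRIPE_SIZE (64U * 1024)

/*
 * Hypothetical helper (not a Lustre function): given the MDT free-space
 * percentage, the proposed dom_threshold percentage and the current DoM
 * component size in bytes, return the size to use for new files.
 * A return value of 0 means "create the file without a DoM component".
 */
static uint32_t dom_size_by_threshold(unsigned int free_pct,
				      unsigned int threshold_pct,
				      uint32_t cur_size)
{
	uint32_t size = cur_size;

	assert(free_pct <= 100 && threshold_pct <= 100);

	if (free_pct < threshold_pct / 2)
		return 0;		/* within 1/2 of the threshold: remove */
	if (free_pct < threshold_pct)
		size = cur_size / 2;	/* within the threshold: halve */

	/* Keep the result a multiple of the minimum stripe size. */
	return size & ~(LOV_MIN_STRIPE_SIZE - 1);
}
```

With a 10% threshold and a 1MiB component, 8% free space would give 512KiB and 4% free space would drop the component. A 64KiB component cannot be halved, so it is dropped as soon as the threshold is crossed.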

Note that the DoM component size must be a multiple of LOV_MIN_STRIPE_SIZE (64KiB) so it will not be possible to exactly match the inode ratio with the blocks ratio, but it makes sense to keep them relatively well balanced by default.

It could be proposed to have a policy that each 1/4 reduction in free space below mdt.*.dom_threshold should reduce the DoM component size by 1/2 until it is below the 64KiB minimum component size. That would ensure that the ldiskfs MDT+DoM filesystem is not completely filled with DoM data when it is close to being filled. This is most critical for ldiskfs filesystems, since ZFS has dynamic inode allocation, but can still help ZFS to avoid being totally filled by DoM data. This should also be helped by LU-12624 to balance DNE directory allocations across MDTs, but that is only a coarse-grained balance and will not prevent MDTs filling with DoM data too quickly.
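One possible reading of this stepped policy, as a user-space C sketch. It assumes that "each 1/4 reduction" means each additional quarter of the threshold by which free space has dropped below mdt.*.dom_threshold. The helper dom_size_stepped() and its parameters are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* 64KiB minimum stripe size, as stated in this ticket. */
#define LOV_MIN_STRIPE_SIZE (64U * 1024)

/*
 * Hypothetical helper (not a Lustre function): halve the maximum DoM
 * component size once for every quarter of the threshold that free space
 * has dropped below it. Returns 0 when the result would be smaller than
 * the minimum component size, meaning "no DoM component".
 */
static uint32_t dom_size_stepped(unsigned int free_pct,
				 unsigned int threshold_pct,
				 uint32_t max_size)
{
	unsigned int steps;
	uint32_t size;

	assert(free_pct <= 100);
	assert(threshold_pct > 0 && threshold_pct <= 100);

	if (free_pct >= threshold_pct)
		return max_size & ~(LOV_MIN_STRIPE_SIZE - 1);

	/* Number of whole quarters of the threshold already consumed: 0..4 */
	steps = (threshold_pct - free_pct) * 4 / threshold_pct;
	size = max_size >> steps;

	if (size < LOV_MIN_STRIPE_SIZE)
		return 0;
	/* The size must stay a multiple of the minimum stripe size. */
	return size & ~(LOV_MIN_STRIPE_SIZE - 1);
}
```

With a 10% threshold and a 1MiB maximum this yields 1MiB at 8% free, 512KiB at 7%, 256KiB at 5%, 128KiB at 2% and 64KiB at 0%. A 256KiB maximum would already fall below the 64KiB minimum at 2% free, so the component would be dropped there.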



 Comments   
Comment by Andreas Dilger [ 07/Dec/19 ]

Mike, what is the behavior of DoM today if, say, one were to enable a default "-E 64K -L mdt" component on an MDT that was formatted with only the old default 2.5KB per inode?

1) Would the files get ENOSPC errors when the MDT was totally full, and the filesystem would be unusable?
2) Would the mdt component be automatically dropped when the filesystem was totally full (allowing some limited use, but there would be no free space for directory/changelog block allocations)?
3) Is there some free blocks threshold on the MDT below which DoM will drop the mdt component, but reserve some space for non-DoM allocations so the filesystem can continue to work?

If it is not #3, it seems that this would be relatively easy to implement and backport to a 2.12.x release so that it would be possible to default to enabling DoM on all filesystems, and if the MDT wasn't formatted for it, then it would just revert to non-DoM behavior for most files.

Comment by Mikhail Pershin [ 11/Dec/19 ]

Andreas, currently the DoM size is limited only by lod.*.dom_stripesize, so any of these behaviors could be implemented. Once a DoM threshold is introduced, it will be possible to limit the component size or drop the component.

Comment by Andreas Dilger [ 05/Feb/20 ]

Note that while the internal variable is named "lod_dom_max_stripesize", the userspace tunable parameter name is actually named "dom_stripesize", which is confusing for everyone. This makes the tunable name different from the internal variable name, which I would normally suggest to fix by renaming the internal variable name to match, so that searching for this name finds both the internal variable and the parameter handling functions. In this case, I think the "_max_" part of the name is important for both the code and the user's understanding of what that parameter does.

I think it would be useful to submit a patch to add a second "dom_stripesize_max" tunable for userspace that also sets lod_dom_max_stripesize, then add a warning message in the next release if "dom_stripesize" is used, and eventually deprecate/remove the "dom_stripesize" tunable. We might consider naming the new tunable "dom_stripesize_max_kb", since it doesn't really make sense to store it in units of bytes (currently it must always be a multiple of 64KB).
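As a rough illustration of what a KB-based tunable would have to check, here is a user-space sketch. The helper dom_stripesize_max_kb_to_bytes() and the DOM_MIN_STRIPE_KB macro are hypothetical names, and the 32-bit byte limit is an assumption made for this example rather than something taken from the Lustre code.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* 64KiB minimum stripe size, expressed in KiB (hypothetical macro name). */
#define DOM_MIN_STRIPE_KB 64U

/*
 * Hypothetical helper (not a Lustre function): validate a value written
 * to a "dom_stripesize_max_kb"-style tunable and convert it to bytes.
 * Returns 0 on success, -EINVAL if the value is not a multiple of 64KiB,
 * and -ERANGE if the byte value would not fit in 32 bits.
 */
static int dom_stripesize_max_kb_to_bytes(uint64_t kb, uint32_t *bytes_out)
{
	assert(bytes_out);

	if (kb % DOM_MIN_STRIPE_KB != 0)
		return -EINVAL;
	if (kb > (UINT32_MAX >> 10))
		return -ERANGE;

	*bytes_out = (uint32_t)(kb << 10);
	return 0;
}
```

Under these assumptions a value of 1024 is accepted and stored as 1048576 bytes, while 100 is rejected because it is not a multiple of 64.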

Comment by Andreas Dilger [ 06/Feb/20 ]

I think it would be useful to have a setting like "dom_stripesize=-1" (possibly set as the default) that enables autotuning, together with a basic helper function called by lod_fix_dom_stripe() like the following, which could be improved later if needed:

/* Max files created before dom_max_stripesize is recalculated */
unsigned long lod_dom_max_stripesize_recalc_count = 1048576;

unsigned int lod_dom_stripesize_tune(struct lod_device *lod)
{
        unsigned long avg_free_kb;

        /* autotune is disabled by a specific max_stripesize set by user */
        if (lod->lod_dom_max_stripesize != -1)
                return lod->lod_dom_max_stripesize;

        /* if this has never been set, then block for one thread to finish it */
        if (unlikely(lod->lod_dom_stripesize_tune == 0)) {
                spin_lock(&lod->lod_dom_stripesize_tune_lock);
                if (lod->lod_dom_stripesize_tune)
                        goto out_unlock;
        /* don't really care if this check is racy on SMP if there is _some_ limit set */
        } else if (++lod->lod_dom_stripesize_count < lod->lod_dom_stripesize_recalc ||
                   !spin_trylock(&lod->lod_dom_stripesize_tune_lock)) {
                goto out;
        }
        lod->lod_dom_stripesize_count = 0;

        /* I _think_ statfs is always cached by this point, but that should be checked;
         * "osfs" below stands for that cached statfs data of the MDT
         */
        avg_free_kb = osfs->os_bavail * (osfs->os_bsize >> 10) / (osfs->os_ffree + 1);

        /* This algorithm may need to change for ZFS, since we estimate the value
         *      os_ffree = os_bfree * usedobjs / usedblocks
         *      avg_used_kb = usedblocks / usedobjs
         * so,
         *      avg_free_kb = avg_used_kb * (os_bavail / os_bfree)
         * which means avg_free_kb will always be lower than avg_used_kb so the
         * lod_dom_stripesize_tune will never increase (inode count will just grow),
         * which is bad since ZFS is much more flexible with allocation than ldiskfs...
         */
        if (avg_free_kb < lod->lod_dom_stripesize_tune * 3 / 4 ||
            avg_free_kb >= lod->lod_dom_stripesize_tune * 9 / 4)
                lod->lod_dom_stripesize_tune =
                        min(avg_free_kb & ~(LOV_MIN_STRIPE_SIZE - 1), DT_MAX_BRW_SIZE);

        /* allow at most 10% of the filesystem to be used before recalculating */
        if (lod->lod_dom_stripesize_tune > 0) {
                lod->lod_dom_stripesize_recalc =
                        min(osfs->os_bavail / (lod->lod_dom_stripesize_tune * 10),
                            lod_dom_max_stripesize_recalc_count);
        } else {
                lod->lod_dom_stripesize_recalc = lod_dom_max_stripesize_recalc_count;
        }
out_unlock:
        spin_unlock(&lod->lod_dom_stripesize_tune_lock);
out:
        return lod->lod_dom_stripesize_tune;
}
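
To make the arithmetic in the helper above easier to follow, here is a small user-space model of just the average-free-space calculation and the 3/4 to 9/4 hysteresis band. It is a sketch, not kernel code: the function names are hypothetical, all sizes are kept in KiB so the units are uniform, and cap_kb is an assumed stand-in for the upper bound.

```c
#include <assert.h>
#include <stdint.h>

/* 64KiB minimum stripe size, expressed in KiB (hypothetical macro name). */
#define DOM_TUNE_MIN_KB 64U

/* Average free space per free inode, in KiB (same formula as above). */
static uint64_t dom_avg_free_kb(uint64_t bavail, uint32_t bsize, uint64_t ffree)
{
	return bavail * (bsize >> 10) / (ffree + 1);
}

/*
 * Hypothetical model of the tuning step: keep the current value while the
 * average stays inside [3/4, 9/4) of it, otherwise round the average down
 * to a multiple of 64KiB and clamp it to cap_kb.
 */
static uint32_t dom_tune_kb(uint32_t cur_kb, uint64_t avg_kb, uint32_t cap_kb)
{
	uint64_t next;

	assert(cap_kb % DOM_TUNE_MIN_KB == 0);

	if (avg_kb >= (uint64_t)cur_kb * 3 / 4 &&
	    avg_kb < (uint64_t)cur_kb * 9 / 4)
		return cur_kb;		/* inside the hysteresis band */

	next = avg_kb & ~(uint64_t)(DOM_TUNE_MIN_KB - 1);
	return next < cap_kb ? (uint32_t)next : cap_kb;
}
```

For example, 1000000 available 4KiB blocks and 9999 free inodes give an average of 400KiB per free inode. Starting from 1024KiB, the tuned value drops to 384KiB, and it then stays at 384KiB until the average leaves the 288KiB to 864KiB band. An average below 64KiB rounds down to 0.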
Comment by Andreas Dilger [ 06/Feb/20 ]

"This algorithm may need to change for ZFS, since we estimate the value ..."

One option would be to set an OS_STATE_FILES_EST flag in the osfs->os_state field, so that this code (and clients as well) can see that the os_files field is estimated (and by extension os_ffree as well), and use a different algorithm for deciding the maximum lod_dom_stripesize_tune value (perhaps just limiting it to MD_MAX_BRW_SIZE by default or a static value set by the admin).
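A minimal user-space sketch of that idea follows. The bit value chosen for OS_STATE_FILES_EST, the struct statfs_model type and the helper dom_limit_kb() are all assumptions made for illustration. Only the flag name and the os_* field names come from this ticket.

```c
#include <assert.h>
#include <stdint.h>

/* Flag name from this ticket; the bit value here is made up. */
#define OS_STATE_FILES_EST 0x80000000U

/* Simplified stand-in for the statfs fields used in this ticket. */
struct statfs_model {
	uint32_t os_state;
	uint64_t os_bavail;
	uint32_t os_bsize;
	uint64_t os_ffree;
};

/*
 * Hypothetical helper: choose the DoM size limit in KiB. When the inode
 * counts are only estimated, skip the ratio-based tuning and use the
 * static limit; otherwise use the average free KiB per free inode,
 * rounded down to a multiple of 64KiB and capped by the static limit.
 */
static uint32_t dom_limit_kb(const struct statfs_model *s, uint32_t static_kb)
{
	uint64_t avg_kb;

	assert(s);

	if (s->os_state & OS_STATE_FILES_EST)
		return static_kb;

	avg_kb = s->os_bavail * (s->os_bsize >> 10) / (s->os_ffree + 1);
	avg_kb &= ~(uint64_t)(64 - 1);
	return avg_kb < static_kb ? (uint32_t)avg_kb : static_kb;
}
```

The point of the flag is that the caller can tell the two cases apart from the statfs data alone, without needing to know which backend filesystem the MDT uses.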

Comment by Gerrit Updater [ 12/Mar/20 ]

Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37904
Subject: LU-12785 dom: adjust DOM stripe size by free space
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 9eef5743de88857ff7f3c05d2bb2b7c0d2bd5d41

Comment by Gerrit Updater [ 23/Apr/20 ]

Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38337
Subject: LU-12785 dom: fix DoM component deletion code
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: f4e21c851bd4b0f0cae6b13223ae62e43fb335fc

Comment by Gerrit Updater [ 07/May/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37904/
Subject: LU-12785 dom: adjust DOM stripe size by free space
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: f2a3bbfb3f3fef910201259dd1827bf8c475da06

Comment by Gerrit Updater [ 19/Jun/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38337/
Subject: LU-12785 dom: fix DoM component deletion code
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: b24ba6c6ea1b3cc514241b01968bf31bc8f9cf46

Comment by Peter Jones [ 19/Jun/20 ]

Landed for 2.14

Comment by Gerrit Updater [ 17/Sep/20 ]

Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/39958
Subject: LU-12785 dom: adjust DOM stripe size by free space
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 2a9ebe70d33e6f02cb7db7d3d810c16ef40e587a
