DOM2: dynamic DoM component size as MDT becomes full

Details

    • Improvement
    • Resolution: Fixed
    • Major
    • Lustre 2.14.0

    Description

      As the MDT becomes full, it makes sense to reduce the size of DoM components, or remove them from the layout completely, when they are created from a default directory layout or the root filesystem layout. I think a reasonable heuristic would be that if the percentage of free inodes is larger than the percentage of free space, the size of the DoM component can be increased (up to the mdt.*.dom_stripesize maximum). If the percentage of free inodes is smaller than the percentage of free space, or if the MDT is within a configurable threshold (e.g. mdt.*.dom_threshold=10%) of being full, the DoM component size should be cut in half, and within 1/2 of mdt.*.dom_threshold the DoM component should be removed (or similar, see the more complex options below).

      Note that the DoM component size must be a multiple of LOV_MIN_STRIPE_SIZE (64KiB) so it will not be possible to exactly match the inode ratio with the blocks ratio, but it makes sense to keep them relatively well balanced by default.
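One way to read "relatively well balanced" is to size the DoM component near the free space available per free inode, rounded down to a 64KiB multiple. The following is a minimal sketch under that assumption. The helper name dom_balanced_size() and its parameters are hypothetical and are not part of Lustre.

```c
#include <assert.h>
#include <stdint.h>

#define LOV_MIN_STRIPE_SIZE 65536ULL /* 64KiB minimum, per the description */

/*
 * Hypothetical helper, not Lustre code: DoM size in bytes that roughly
 * balances free space against free inodes. Returns 0 when less than
 * 64KiB is available per free inode.
 */
static uint64_t dom_balanced_size(uint64_t free_bytes, uint64_t free_inodes,
                                  uint64_t max_size)
{
        uint64_t per_inode;

        assert(max_size % LOV_MIN_STRIPE_SIZE == 0);

        /* +1 avoids a division by zero when no inodes are free. */
        per_inode = free_bytes / (free_inodes + 1);

        /* Align down to a multiple of 64KiB, so the match is never exact. */
        per_inode &= ~(LOV_MIN_STRIPE_SIZE - 1);

        return per_inode < max_size ? per_inode : max_size;
}
```

For example, 100GB of free space with about one million free inodes leaves roughly 105KiB per inode, which rounds down to a 64KiB component.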

      One possible policy is that each 1/4 reduction in free space below mdt.*.dom_threshold reduces the DoM component size by 1/2, until it is below the 64KiB minimum component size. That would ensure that the ldiskfs MDT+DoM filesystem is not completely filled with DoM data when it is close to being full. This is most critical for ldiskfs filesystems, since ZFS has dynamic inode allocation, but it can still help ZFS to avoid being totally filled by DoM data. This should also be helped by LU-12624 to balance DNE directory allocations across MDTs, but that is only a coarse-grained balance and will not prevent MDTs filling with DoM data too quickly.
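The stepwise policy can be sketched as follows. This is only an illustration under one reading of the proposal: the range between the threshold and 0% free is split into quarters, and the size is halved once for each quarter already consumed. The function name and the percentage-based interface are assumptions and not existing Lustre code.

```c
#include <assert.h>
#include <stdint.h>

#define LOV_MIN_STRIPE_SIZE 65536ULL /* 64KiB minimum, per the description */

/*
 * Hypothetical helper, not Lustre code: DoM component size in bytes for a
 * given percentage of free MDT space. Returns 0 when the component should
 * be dropped from the layout.
 */
static uint64_t dom_size_for_free_pct(uint64_t max_size, unsigned int free_pct,
                                      unsigned int threshold_pct)
{
        unsigned int quarters_used;
        uint64_t size;

        assert(free_pct <= 100 && threshold_pct <= 100);

        /* Above the threshold (or with no threshold set) use the full size. */
        if (threshold_pct == 0 || free_pct >= threshold_pct)
                return max_size;

        /* Whole quarters of the threshold range already consumed: 0..4. */
        quarters_used = (threshold_pct - free_pct) * 4 / threshold_pct;

        /* Halve once per consumed quarter, then align down to 64KiB. */
        size = max_size >> quarters_used;
        size &= ~(LOV_MIN_STRIPE_SIZE - 1);

        return size; /* 0 means the size fell below the 64KiB minimum */
}
```

With a 1MiB maximum and a 10% threshold, this gives 512KiB at 7% free and 256KiB at 5% free. A 128KiB maximum drops to 0 (component removed) at 2% free.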

          Activity

            [LU-12785] DOM2: dynamic DoM component size as MDT becomes full

            gerrit Gerrit Updater added a comment -

            Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/39958
            Subject: LU-12785 dom: adjust DOM stripe size by free space
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: 2a9ebe70d33e6f02cb7db7d3d810c16ef40e587a

            pjones Peter Jones added a comment -

            Landed for 2.14

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38337/
            Subject: LU-12785 dom: fix DoM component deletion code
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: b24ba6c6ea1b3cc514241b01968bf31bc8f9cf46

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37904/
            Subject: LU-12785 dom: adjust DOM stripe size by free space
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: f2a3bbfb3f3fef910201259dd1827bf8c475da06

            gerrit Gerrit Updater added a comment -

            Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38337
            Subject: LU-12785 dom: fix DoM component deletion code
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: f4e21c851bd4b0f0cae6b13223ae62e43fb335fc

            gerrit Gerrit Updater added a comment -

            Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37904
            Subject: LU-12785 dom: adjust DOM stripe size by free space
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 9eef5743de88857ff7f3c05d2bb2b7c0d2bd5d41

            adilger Andreas Dilger added a comment -

            This algorithm may need to change for ZFS, since we estimate the value

            One option would be to set an OS_STATE_FILES_EST flag in the osfs->os_state field, so that this code (and clients as well) can see that the os_files field is estimated (and by extension os_ffree as well), and use a different algorithm for deciding the maximum lod_dom_stripesize_tune value (perhaps just limiting it to MD_MAX_BRW_SIZE by default or a static value set by the admin).
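A rough sketch of how such a flag could select between the two algorithms is shown below. Everything in it is an assumption made for illustration: struct sketch_statfs is a simplified stand-in for the real statfs structure, the bit value chosen for OS_STATE_FILES_EST is a placeholder, and dom_max_size() is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

#define OS_STATE_FILES_EST  0x1000u  /* placeholder bit, NOT a real Lustre value */
#define LOV_MIN_STRIPE_SIZE 65536ULL /* 64KiB minimum, per the description */

/* Simplified stand-in for the statfs data discussed in the comment. */
struct sketch_statfs {
        uint64_t os_bavail; /* available blocks */
        uint64_t os_ffree;  /* free inodes, possibly estimated */
        uint32_t os_bsize;  /* block size in bytes */
        uint32_t os_state;  /* state flags */
};

/* Hypothetical helper: upper bound for the tuned DoM size, in bytes. */
static uint64_t dom_max_size(const struct sketch_statfs *osfs,
                             uint64_t static_limit)
{
        uint64_t avg_free;

        assert(osfs);

        /* Estimated inode counts: skip the ratio and use the static limit. */
        if (osfs->os_state & OS_STATE_FILES_EST)
                return static_limit;

        /* Real inode counts: free bytes per free inode, aligned to 64KiB. */
        avg_free = osfs->os_bavail * osfs->os_bsize / (osfs->os_ffree + 1);
        avg_free &= ~(LOV_MIN_STRIPE_SIZE - 1);

        return avg_free < static_limit ? avg_free : static_limit;
}
```

The static limit stands in for either MD_MAX_BRW_SIZE or an admin-provided value, as suggested in the comment.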

            adilger Andreas Dilger added a comment -

            I think it would be useful to have a setting like "dom_stripesize=-1" (possibly set as the default), and a basic helper function called by lod_fix_dom_stripe() like the following, which could be improved later if needed:

            /* Max files created before dom_max_stripesize is recalculated */
            unsigned long lod_dom_max_stripesize_recalc_count = 1048576;

            /* NOTE: "osfs" is assumed to point at the cached statfs data for this MDT */
            unsigned int lod_dom_stripesize_tune(struct lod_device *lod)
            {
                    unsigned long avg_free_kb;

                    /* autotune is disabled by a specific max_stripesize set by user */
                    if (lod->lod_dom_max_stripesize != -1)
                            return lod->lod_dom_max_stripesize;

                    /* if this has never been set, then block for one thread to finish it */
                    if (unlikely(lod->lod_dom_stripesize_tune == 0)) {
                            spin_lock(&lod->lod_dom_stripesize_tune_lock);
                            if (lod->lod_dom_stripesize_tune)
                                    goto out_unlock;
                    /* don't really care if this check is racy on SMP if there is _some_ limit set */
                    } else if (++lod->lod_dom_stripesize_count < lod->lod_dom_stripesize_recalc ||
                               !spin_trylock(&lod->lod_dom_stripesize_tune_lock)) {
                            goto out;
                    }
                    lod->lod_dom_stripesize_count = 0;

                    /* I _think_ statfs is always cached by this point, but that should be checked */

                    avg_free_kb = osfs->os_bavail * (osfs->os_bsize >> 10) / (osfs->os_ffree + 1);

                    /* This algorithm may need to change for ZFS, since we estimate the value
                     *      os_ffree = os_bfree * usedobjs / usedblocks
                     *      avg_used_kb = usedblocks / usedobjs
                     * so,
                     *     avg_free_kb = avg_used_kb * (os_bavail / os_bfree)
                     * which means avg_free_kb will always be lower than avg_used_kb so the
                     * lod_dom_stripesize_tune will never increase (inode count will just grow),
                     * which is bad since ZFS is much more flexible with allocation than ldiskfs...
                     */
                    /* NOTE: avg_free_kb is in KiB, while LOV_MIN_STRIPE_SIZE and
                     * DT_MAX_BRW_SIZE are byte values, so the units still need reconciling
                     */
                    if (avg_free_kb < lod->lod_dom_stripesize_tune * 3 / 4 ||
                        avg_free_kb >= lod->lod_dom_stripesize_tune * 9 / 4)
                            lod->lod_dom_stripesize_tune =
                                    min(avg_free_kb & ~(LOV_MIN_STRIPE_SIZE - 1), DT_MAX_BRW_SIZE);

                    /* allow at most 10% of the filesystem to be used before recalculating */
                    if (lod->lod_dom_stripesize_tune > 0) {
                            lod->lod_dom_stripesize_recalc =
                                    min(osfs->os_bavail / (lod->lod_dom_stripesize_tune * 10),
                                        lod_dom_max_stripesize_recalc_count);
                    } else {
                            lod->lod_dom_stripesize_recalc = lod_dom_max_stripesize_recalc_count;
                    }
            out_unlock:
                    spin_unlock(&lod->lod_dom_stripesize_tune_lock);
            out:
                    return lod->lod_dom_stripesize_tune;
            }

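The ZFS concern in the code comment can be spelled out as a short derivation. It assumes blocks are measured in KiB, ignores the +1 in the divisor, and relies on os_bavail never exceeding os_bfree (statfs reports available space net of any reservation).

```latex
\begin{align*}
\text{os\_ffree} &= \text{os\_bfree} \cdot \frac{\text{usedobjs}}{\text{usedblocks}},
\qquad
\text{avg\_used\_kb} = \frac{\text{usedblocks}}{\text{usedobjs}} \\[4pt]
\text{avg\_free\_kb} &= \frac{\text{os\_bavail}}{\text{os\_ffree}}
  = \frac{\text{os\_bavail}}{\text{os\_bfree}} \cdot \frac{\text{usedblocks}}{\text{usedobjs}}
  = \text{avg\_used\_kb} \cdot \frac{\text{os\_bavail}}{\text{os\_bfree}}
  \le \text{avg\_used\_kb}
\end{align*}
```

Because the estimated free inode count is itself derived from the average object size, the ratio can never exceed that average, so a size tuned from it has no way to grow.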
            adilger Andreas Dilger added a comment - - edited

            Note that while the internal variable is named "lod_dom_max_stripesize", the userspace tunable parameter is actually named "dom_stripesize", which is confusing. I would normally suggest fixing this kind of mismatch by renaming the internal variable to match the tunable, so that searching for the name finds both the internal variable and the parameter handling functions. In this case, though, I think the "_max_" part of the name is important for both the code and the user's understanding of what the parameter does.

            I think it would be useful to submit a patch that adds a second "dom_stripesize_max" tunable for userspace that also sets lod_dom_max_stripesize, then add a warning message in the next release if "dom_stripesize" is used, and eventually deprecate/remove the "dom_stripesize" tunable. We might consider naming the new tunable "dom_stripesize_max_kb", since it doesn't really make sense to store it in units of bytes (currently it must always be a multiple of 64KiB).

            tappro Mikhail Pershin added a comment - - edited

            Andreas, the DoM size is currently limited only by lod.*.dom_stripesize, so that can be implemented in any way. Once a DoM threshold is introduced, it will be possible to limit the size or drop the component.

            People

              tappro Mikhail Pershin