Details

    • Improvement
    • Resolution: Unresolved
    • Minor
    • None
    • Lustre 2.12.0

    Description

      Shrinking directories in ldiskfs would be desirable for cases where a directory once held a large number of files that have since been deleted, so that the blocks of the now-empty directory can be deallocated.

      There is a patch submitted to upstream ext4 that is the start of the support for this functionality, but it is not very aggressive about removing directory blocks: https://patchwork.ozlabs.org/project/linux-ext4/list/?series=168937

      Additional work is intended in this area to improve the directory shrinking functionality.

      In addition to the directory shrinking, removal of old OST object directory trees (O/*/d*) is also useful, and could potentially be a substitute for having online directory shrink once the directories are completely empty.

          Activity

            [LU-12051] ldiskfs directory shrink

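            The upstream ext4 directory shrink patches have been refreshed:

                • PATCH v2,1/3 ext4: return lblk from ext4_find_entry
                • PATCH v2,2/3 ext4: shrink directories on dentry delete
                • PATCH v2,3/3 ext4: reimplement ext4_empty_dir() using is_dirent_block_empty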

            Most of the complexity will be in integrating the "shrink directories on dentry delete" patch with ext4-pdirop.patch, especially the locking order as levels of the htree are removed. We will also need to disable the htree dx_root removal in make_unindexed(), in the same way we do for ext4_update_dx_flag(), because it would break htree locking and is of marginal benefit. Once all objects in a {SEQ}/d*/ directory tree have been removed on an OST, we can just delete the whole sequence directory tree rather than worry about the few remaining blocks for dx_root.

            These patches will mostly shrink the directory only when it is almost completely empty, but for LU-11912 this would still help reduce space usage as old objects are removed. A patch that merges adjacent htree blocks when they are nearly empty is still needed. My proposal for a possible implementation of htree leaf block merging was in this linux-ext4 thread, on an earlier version of the patch:

            On Mar 25, 2020, at 3:37 AM, Harshad Shirwadkar <harshadshirwadkar@gmail.com> wrote:
            > But note that most of the shrinking happens during last 1-2% deletions
            > in an average case. Therefore, the next step here is to merge dx nodes
            > when possible. That can be achieved by storing the fullness index in
            > htree nodes. But that's an on-disk format change. We can instead build
            > on tooling added by this patch to perform reverse lookup on a dx
            > node and then reading adjacent nodes to check their fullness.
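
            For reference, "reading adjacent nodes to check their fullness" amounts to a scan along the lines of the sketch below. It assumes the usual ext4 directory entry header (inode, rec_len, name_len, file_type) read in host byte order, ignores the special rec_len encoding used for 64KB blocks, and uses hypothetical names (dirent_hdr, leaf_used_bytes); it is not the code from the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Simplified 8-byte directory entry header, host byte order assumed. */
struct dirent_hdr {
    uint32_t inode;     /* 0 means the entry is unused */
    uint16_t rec_len;   /* distance in bytes to the next entry */
    uint8_t  name_len;
    uint8_t  file_type;
};

/* Bytes a live entry needs: 8-byte header plus name, rounded up to 4. */
static unsigned dirent_used(unsigned name_len)
{
    return (8 + name_len + 3) & ~3u;
}

/*
 * Scan one leaf block and return the bytes needed by its live entries,
 * or -1 if the rec_len chain is inconsistent.
 */
static int leaf_used_bytes(const unsigned char *blk, unsigned blocksize)
{
    unsigned off = 0, used = 0;

    assert(blk != NULL);
    while (off + sizeof(struct dirent_hdr) <= blocksize) {
        struct dirent_hdr h;

        memcpy(&h, blk + off, sizeof(h));
        if (h.rec_len < 8 || (h.rec_len & 3) || off + h.rec_len > blocksize)
            return -1;
        if (h.inode != 0)
            used += dirent_used(h.name_len);
        off += h.rec_len;
    }
    return off == blocksize ? (int)used : -1;
}
```

            Every merge attempt under this model has to read and walk the neighbouring blocks, which is the cost a stored fullness hint would avoid.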

            As for storing the fullness on disk changing the on-disk format... That is
            true, but the original htree implementation anticipated this and reserved
            space in the htree index to store the fullness, so it would not break the
            ability of older kernels to access directories with the fullness information.

            I think if you used just a few bits (maybe just 2) to store:
            0 = unset (every directory today)
            1 = under 20% full
            2 = under 40% full
            3 = under 60% full

            or similar. It doesn't matter if they are more full, since they won't be
            candidates for merging. Lazily updating the htree index fullness as
            entries are removed will simplify the shrinking process, and will
            avoid the need to repeatedly scan the leaf blocks to see if they are empty
            enough for merging. It wouldn't be any worse not to store these values
            on disk after the first time a "0 = unset" entry was found and not merged,
            or setting the fullness on the merged block if it is merged, and running
            "e2fsck -D" can easily update the fullness values.

            The benefit of using 20%, 40%, and 60% as the fullness markers is that it
            is possible to either merge adjacent 60% and 40% blocks or alternately a
            60% and two adjacent 20% blocks. Also, since these values are very coarse
            they would not need to be updated frequently. If the values are slightly
            outdated, then it is again not worse than the "always scan" model (one scan
            and the fullness would be updated), but more efficient than repeat scanning.
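
            The encoding and the merge rule described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the names (dx_fullness, dx_fullness_code, dx_codes_can_merge) are hypothetical, blocks that are 60% full or more are assumed to stay at "0 = unset", and per-block overhead such as a checksum tail is ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical 2-bit fullness codes, using the values from the message. */
enum dx_fullness {
    DX_FULL_UNSET   = 0,  /* unknown, or (assumed here) 60% full or more */
    DX_FULL_UNDER20 = 1,
    DX_FULL_UNDER40 = 2,
    DX_FULL_UNDER60 = 3,
};

/* Map the live bytes in a leaf block to a coarse fullness code. */
static enum dx_fullness dx_fullness_code(unsigned used, unsigned blocksize)
{
    assert(blocksize > 0 && used <= blocksize);
    if (used * 100 < blocksize * 20)
        return DX_FULL_UNDER20;
    if (used * 100 < blocksize * 40)
        return DX_FULL_UNDER40;
    if (used * 100 < blocksize * 60)
        return DX_FULL_UNDER60;
    return DX_FULL_UNSET;
}

/* Upper bound in percent implied by each code; -1 means "must scan". */
static const int dx_upper_pct[4] = { -1, 20, 40, 60 };

/*
 * Return 1 if n adjacent blocks with these codes are guaranteed to fit
 * into one block (sum of upper bounds <= 100%), 0 otherwise. An unset
 * code gives no guarantee, so the caller would fall back to scanning.
 */
static int dx_codes_can_merge(const enum dx_fullness *codes, size_t n)
{
    int total = 0;

    assert(codes != NULL);
    if (n < 2)
        return 0;
    for (size_t i = 0; i < n; i++) {
        int ub = dx_upper_pct[codes[i] & 3];

        if (ub < 0)
            return 0;
        total += ub;
    }
    return total <= 100;
}
```

            With these bounds, a 60% block plus a 40% block, or a 60% block plus two 20% blocks, pass the check, while two 60% blocks do not. Because the codes only change when a block crosses a threshold, the stored value would rarely need rewriting.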

            Using only two bits for fullness also leaves two bits free for future use.
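
            As a rough sketch of the bit layout, assume the reserved space is the top four bits of the 32-bit block field in each htree index entry (as far as I can tell, current ext4 masks the block number with 0x0fffffff when reading it). The macro and function names below are hypothetical, and byte-order conversion is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: low 28 bits = block number, bits 28-29 = fullness code,
 * bits 30-31 = still free for future use. */
#define DX_BLOCK_MASK     0x0fffffffu
#define DX_FULLNESS_SHIFT 28
#define DX_FULLNESS_MASK  (0x3u << DX_FULLNESS_SHIFT)

/* Combine a block number and a 2-bit fullness code into one field. */
static uint32_t dx_pack(uint32_t block, unsigned code)
{
    assert(block <= DX_BLOCK_MASK && code <= 3);
    return block | ((uint32_t)code << DX_FULLNESS_SHIFT);
}

/* What a reader that masks off the reserved bits sees as the block number. */
static uint32_t dx_block(uint32_t raw)
{
    return raw & DX_BLOCK_MASK;
}

/* Extract the 2-bit fullness code. */
static unsigned dx_fullness(uint32_t raw)
{
    return (raw & DX_FULLNESS_MASK) >> DX_FULLNESS_SHIFT;
}
```

            A kernel that already masks the reserved bits would still read the same block number, which is why storing the hint there should not break access from older kernels.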

            adilger Andreas Dilger added the comment above, covering the refreshed v2 series (1/3 return lblk from ext4_find_entry, 2/3 shrink directories on dentry delete, 3/3 reimplement ext4_empty_dir() using is_dirent_block_empty).

            I suspect that there isn't a lot of work we need to do in this area, but some review and testing of the upstream patch linked in the description (with feedback directly to linux-ext4@vger.kernel.org and the author) would probably speed things up. Once the code has landed upstream, or is at least robust and showing good benefits, we could backport it to ldiskfs/kernel_patches for use until we catch up with a newer kernel.

            adilger Andreas Dilger added the comment above.

            People

              wc-triage WC Triage
              adilger Andreas Dilger
              Votes: 0
              Watchers: 6