[LU-12051] ldiskfs directory shrink Created: 07/Mar/19  Updated: 23/Oct/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0
Fix Version/s: None

Type: Improvement Priority: Minor
Reporter: Andreas Dilger Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: ldiskfs

Issue Links:
Related
is related to LU-11912 reduce number of OST objects created ... Resolved
is related to LU-8465 parallel e2fsck performance at scale Resolved

 Description   

Shrinking directories in ldiskfs would be desirable for cases where a large number of files were created in a directory and later deleted, leaving an empty directory whose blocks could be deallocated.

A patch submitted to upstream ext4 is a start on supporting this functionality, but it is not very aggressive about removing directory blocks: https://patchwork.ozlabs.org/patch/1048658/

There is intended to be additional work in this area to improve the directory shrinking functionality.

In addition to the directory shrinking, removal of old OST object directory trees (O/*/d*) is also useful, and could potentially be a substitute for having online directory shrink once the directories are completely empty.



 Comments   
Comment by Andreas Dilger [ 20/Mar/19 ]

I suspect that there isn't a lot of work we need to do in this area, but some review and testing of the upstream patch linked in the description (with feedback sent directly to linux-ext4@vger.kernel.org and the author) would probably speed things up. Once the code has landed upstream, or is at least robust and showing good benefits, we could backport it to ldiskfs/kernel_patches for use until we catch up with a newer kernel.

Comment by Andreas Dilger [ 08/Apr/20 ]

The upstream ext4 directory shrink patches have been refreshed:

Most of the complexity will be in integrating the "shrink directories on dentry delete" patches with ext4-pdirop.patch, especially the locking order as levels of the htree are removed. We will also need to disable the htree dx_root removal in make_unindexed(), in the same way we do for ext4_update_dx_flag(), because removing it would break htree locking and is of marginal benefit. Once all objects in a {SEQ}/d*/ directory tree have been removed on an OST, we can just delete the whole sequence directory tree rather than worry about the few remaining blocks for dx_root.

These patches will mostly shrink the directory only when it is almost completely empty, but for LU-11912 this would still help reduce space usage as old objects are removed. A patch that merges adjacent htree blocks when they are nearly empty is still needed. My proposal for a possible implementation of htree leaf block merging was in this linux-ext4 thread on an earlier version of the patch:

On Mar 25, 2020, at 3:37 AM, Harshad Shirwadkar <harshadshirwadkar@gmail.com> wrote:
> But note that most of the shrinking happens during last 1-2% deletions
> in an average case. Therefore, the next step here is to merge dx nodes
> when possible. That can be achieved by storing the fullness index in
> htree nodes. But that's an on-disk format change. We can instead build
> on tooling added by this patch to perform reverse lookup on a dx
> node and then reading adjacent nodes to check their fullness.

As for storing the fullness on disk changing the on-disk format... That is
true, but the original htree implementation anticipated this and reserved
space in the htree index to store the fullness, so it would not break the
ability of older kernels to access directories with the fullness information.

I think if you used just a few bits (maybe just 2) to store:
0 = unset (every directory today)
1 = under 20% full
2 = under 40% full
3 = under 60% full

or similar. It doesn't matter if they are more full since they won't be
candidates for merging, and then lazily update the htree index fullness
as entries are removed, this will simplify the shrinking process, and will
avoid the need to repeatedly scan the leaf blocks to see if they are empty
enough for merging. It wouldn't be any worse not to store these values
on disk after the first time a "0 = unset" entry was found and not merged,
or setting the fullness on the merged block if it is merged, and running
"e2fsck -D" can easily update the fullness values.

The benefit of using 20%, 40%, and 60% as the fullness markers is that it
is possible to either merge adjacent 60% and 40% blocks or alternately a
60% and two adjacent 20% blocks. Also, since these values are very coarse
they would not need to be updated frequently. If the values are slightly
outdated, then it is again not worse than the "always scan" model (one scan
and the fullness would be updated), but more efficient than repeat scanning.

Using only two bits for fullness also leaves two bits free for future use.
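
The fullness encoding proposed above can be illustrated with a rough sketch. This is not code from the upstream patches or from ldiskfs; the names `fullness_code`, `can_merge`, and `UPPER_BOUND` are hypothetical, and mapping blocks that are 60% full or more back to "0 = unset" is an assumption, since the thread only says such blocks are not merge candidates. It shows how a 2-bit code could be derived from leaf usage and how the coarse upper bounds could select merge candidates without rescanning the leaf blocks.

```python
# Hypothetical 2-bit fullness codes, following the values proposed in the thread.
FULLNESS_UNSET = 0  # no information recorded (every directory today)
FULLNESS_LT_20 = 1  # leaf block is under 20% full
FULLNESS_LT_40 = 2  # leaf block is under 40% full
FULLNESS_LT_60 = 3  # leaf block is under 60% full

# Upper bound (in percent) implied by each code; None means "unknown".
UPPER_BOUND = {
    FULLNESS_UNSET: None,
    FULLNESS_LT_20: 20,
    FULLNESS_LT_40: 40,
    FULLNESS_LT_60: 60,
}


def fullness_code(used_bytes, block_size):
    """Map the bytes used in a leaf block to a 2-bit fullness code.

    Assumption: blocks that are 60% full or more are stored as "unset",
    because they would not be candidates for merging anyway.
    """
    if used_bytes * 100 < 20 * block_size:
        return FULLNESS_LT_20
    if used_bytes * 100 < 40 * block_size:
        return FULLNESS_LT_40
    if used_bytes * 100 < 60 * block_size:
        return FULLNESS_LT_60
    return FULLNESS_UNSET


def can_merge(codes):
    """Return True if adjacent leaf blocks with these codes could share one block.

    Each code is only an upper bound, so a combined bound of at most 100%
    means the entries fit. An unset code gives no bound, so the caller
    would have to scan that leaf instead.
    """
    bounds = [UPPER_BOUND[code] for code in codes]
    if any(bound is None for bound in bounds):
        return False
    return sum(bounds) <= 100


if __name__ == "__main__":
    # A 60% block plus a 40% block, or a 60% block plus two 20% blocks.
    print(can_merge([FULLNESS_LT_60, FULLNESS_LT_40]))
    print(can_merge([FULLNESS_LT_60, FULLNESS_LT_20, FULLNESS_LT_20]))
```

A real implementation would still need to check the actual entry sizes before merging, since the stored values may be slightly stale and the sketch ignores per-block overhead.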
