Description
It would be useful to allow ldiskfs to dynamically adjust the number of inodes in the filesystem. Formatting with too many inodes can consume a lot of space that is not usable for data, while formatting with too few inodes can prevent the free space from being used if the average file size is small.
For example, on the MDT the default is 1KB inodes and a 2.5KB inode ratio, so 40% of the entire filesystem is consumed by inodes. That is fine for traditional usage where the MDT only stores inodes+layouts, xattrs, directories, and logs, but with DoM the average ~1KB of space available per inode is very restrictive. Either the average DoM file size is only 1KB, or only 1 in 64 files uses DoM to allow a minimum of 64KB of data per file.
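The arithmetic behind these figures can be sketched as follows. This is only an illustration: the function names are hypothetical, and the ~1KB average space per inode is taken directly from the estimate above rather than derived.

```python
def inode_space_fraction(inode_size_bytes, bytes_per_inode):
    """Fraction of the filesystem consumed by inode tables."""
    return inode_size_bytes / bytes_per_inode


def dom_file_fraction(avg_space_per_inode_bytes, dom_bytes_per_file):
    """Fraction of files that can use DoM if each DoM file needs
    dom_bytes_per_file and the average budget per inode is fixed."""
    return min(1.0, avg_space_per_inode_bytes / dom_bytes_per_file)


# MDT defaults quoted above: 1KB inodes, 2.5KB inode ratio.
overhead = inode_space_fraction(1024, 2560)
print(f"inode overhead: {overhead:.0%}")

# Assumed ~1KB average space per inode, 64KB minimum per DoM file.
fraction = dom_file_fraction(1024, 64 * 1024)
print(f"files that can use DoM: 1 in {round(1 / fraction)}")
```

Changing either the inode ratio or the per-file DoM minimum shows how quickly the fixed format-time choice limits DoM usage.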
Conversely, on the OST the average inode ratio is 1MB, and with some newer workloads the average file size can be smaller than this, resulting in OSTs running out of inodes before all the space is used.
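As a rough illustration of the OST case, the sketch below estimates how much of the space can be filled before the inodes run out. It is a simplification: it ignores inode table overhead and assumes every object is exactly the average size. The function name and the 256KB example file size are hypothetical.

```python
def usable_space_fraction(avg_file_size_bytes, bytes_per_inode):
    """Fraction of space that can be filled before inodes are exhausted.

    With one inode per bytes_per_inode of space, files smaller than the
    inode ratio use up all inodes while free space remains.
    """
    return min(1.0, avg_file_size_bytes / bytes_per_inode)


MIB = 1024 * 1024

# 1MB inode ratio from the text; 256KB average file size is an assumed workload.
fraction = usable_space_fraction(256 * 1024, 1 * MIB)
print(f"space usable before inodes run out: {fraction:.0%}")
```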
There are two possible approaches to allowing dynamic inode allocation in the filesystem:
- adding new inode tables in addition to those created for the initial filesystem that can be allocated from arbitrary data blocks
- reserving inode table blocks (probably a whole group at a time) to be allocated as file data if the filesystem becomes full and they are not used
The first approach is possibly more flexible, in that it could be applied to existing filesystems (mostly OSTs) where there are not enough inodes available. This would favor formatting filesystems with fewer inodes and adding more later. The drawback is that inode tables might be located anywhere in the filesystem, which makes it difficult for e2fsck to find them if there is corruption. It may also be challenging to find enough contiguous free blocks for a whole inode table, but that seems unlikely to be a problem in practice: the creation of more inodes is driven by having "too much free space" to consume, and on very large OSTs 8MB of contiguous free space is a rounding error.
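The contiguous-space requirement of the first approach amounts to finding a run of free blocks long enough for one inode table. A minimal sketch of that search over a block-usage map follows. It is not ldiskfs code: the list-of-booleans bitmap, the function name, and the 4KB block size used to convert 8MB into a block count are all assumptions for illustration.

```python
def find_free_run(block_in_use, run_length):
    """Return the index of the first run of run_length free blocks, or None.

    block_in_use is a sequence of booleans, True meaning the block is allocated.
    """
    if run_length <= 0:
        raise ValueError("run_length must be positive")
    start = 0
    free_count = 0
    for index, in_use in enumerate(block_in_use):
        if in_use:
            free_count = 0          # run broken, start over
            continue
        if free_count == 0:
            start = index           # first free block of a new run
        free_count += 1
        if free_count == run_length:
            return start
    return None


# Assumed 4KB blocks: an 8MB inode table needs 2048 contiguous blocks.
BLOCKS_PER_ITABLE = (8 * 1024 * 1024) // 4096

# Toy map: 100 used blocks followed by 4096 free blocks.
toy_map = [True] * 100 + [False] * 4096
print(find_free_run(toy_map, BLOCKS_PER_ITABLE))
```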
The second approach would work best for MDTs where whole groups of inode table blocks have been kept unused since formatting (or where the filesystem was extended for space but not inodes). These could be released a whole group at a time (via a flag in the group descriptor) to be allocated to regular files (about 32MB per group == 32k inodes). This may be less useful for older MDTs, since existing filesystems are likely to have a few inodes allocated in each group. It may be possible to allocate part of an inode table for regular data (possibly leveraging the uninit_bg feature), but this would increase complexity and the risk of file/itable corruption if e2fsck doesn't get the split correct, so it is best avoided. That said, even my 13-year-old MDT with 49% of inodes free has 10% of groups totally unused, and 25% of groups are less than 0.5% allocated, so it may not be unreasonable to use this approach on existing filesystems.
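To make the per-group numbers concrete, the sketch below computes the inode table size per group and picks out groups whose inode tables are entirely unused. It is purely illustrative: the per-group usage list is made-up data, the helper names are hypothetical, and it does not model how a real group descriptor flag would be set.

```python
INODE_SIZE = 1024              # 1KB MDT inodes, as in the text
INODES_PER_GROUP = 32 * 1024   # 32k inodes per group, as in the text


def itable_bytes_per_group(inodes_per_group=INODES_PER_GROUP,
                           inode_size=INODE_SIZE):
    """Space held by one group's inode table."""
    return inodes_per_group * inode_size


def wholly_unused_groups(used_inodes_per_group):
    """Indices of groups with no allocated inodes (candidates for release)."""
    return [group for group, used in enumerate(used_inodes_per_group)
            if used == 0]


# Made-up usage counts for six groups.
usage = [11, 0, 150, 0, 0, 32768]
candidates = wholly_unused_groups(usage)
reclaimable = len(candidates) * itable_bytes_per_group()
print(candidates, reclaimable // (1024 * 1024), "MB reclaimable")
```

Only fully unused groups are considered here, matching the preference above for avoiding partial inode tables.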
As yet there are no plans to implement this feature, just some thoughts for tracking.