Details
- New Feature
- Resolution: Unresolved
- Major
Description
With current hardware, clusters are running into trouble creating enough inodes on LDISKFS partitions. The MDS uses 0-size files to store information about Lustre FS files. Current MDS disk sizes could hold a very large number of such files, but EXT4 limits the number of inodes to ~4 billion (2^32).
Lustre FS has features like DNE to distribute the MDS over many targets (disks), but the disks are then not used effectively. It would be great to be able to store more than ~4 billion inodes on one EXT4 file system.
This topic ("64-bit inode number") was recently discussed on the ext4 list. The summary is:
There are two possible solutions:
1. Store the high 32 bits of the inode number in the ext4 dirent
2. Add a new feature flag which defines the use of a 64-bit inode number
Andreas Dilger gave strong reasons to use the first solution:
- this won't use more space for 64-bit inodes than ext4_dir_entry64
- entries with 32-bit inode numbers will have smaller dirents
- significantly more 32-bit dirents can fit into a leaf block (i.e. 10-25%)
- it is backwards compatible with existing directories and can transparently store 64-bit inode numbers into 32-bit directories without a full update
- it avoids duplicate code paths for ext4_dir_entry vs ext4_dir_entry64
- it would be possible to store only the high 16 bits (2^48 inodes), which may be enough for ext4, since ext4_extent can only address 2^48 blocks (2^60 bytes) and there isn't much value in having more inodes than blocks?
This issue is about using dirdata to store the high bits of the 64-bit inode number.
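The idea behind the first solution can be sketched as follows. This is only an illustration of the bit arithmetic, not the ext4 or ldiskfs on-disk format: the function names `split_ino` and `join_ino`, and the convention that a missing high part means "no extra data stored in the entry", are assumptions made for the example.

```python
LOW_BITS = 32                      # width of the existing dirent inode field
LOW_MASK = (1 << LOW_BITS) - 1


def split_ino(ino64, high_bits=32):
    """Split an inode number into (low32, high).

    `high` is None when the number fits in 32 bits, i.e. the entry
    would need no extra data (hypothetical encoding for illustration).
    Use high_bits=16 to model the 2^48-inode variant.
    """
    if not 0 <= ino64 < (1 << (LOW_BITS + high_bits)):
        raise ValueError("inode number out of range")
    low = ino64 & LOW_MASK
    high = ino64 >> LOW_BITS
    return low, (high if high else None)


def join_ino(low, high=None):
    """Rebuild the full inode number; a missing high part means 0."""
    return ((high or 0) << LOW_BITS) | low
```

Under these assumptions an existing entry such as `(5, None)` decodes unchanged through `join_ino`, which is the backward-compatibility property claimed in the list above, while only entries above 2^32 pay for the extra bits.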
This feature has been raised again in discussions for some large clusters.
My preference is still that we scale metadata capacity with DNE, but NVMe devices are starting to become large enough that the 16TB MDT limit (at least for 4B inodes) is becoming a problem.
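One way to see how an inode-count limit turns into a capacity limit is the arithmetic below. It is a rough sketch: the 4096 bytes/inode ratio is an assumption chosen for the example (the actual ratio depends on how the MDT was formatted), and `max_useful_mdt_bytes` is a hypothetical helper name.

```python
TIB = 1 << 40
MAX_INODES_32BIT = 1 << 32         # ~4 billion inodes


def max_useful_mdt_bytes(bytes_per_inode, max_inodes=MAX_INODES_32BIT):
    """Capacity at which the inode count, not the space, runs out."""
    return bytes_per_inode * max_inodes


# Assumed ratio of 4096 bytes per inode with 32-bit inode numbers:
print(max_useful_mdt_bytes(4096) // TIB)            # 16 (TiB)

# Same ratio with 48-bit inode numbers (2^48 inodes):
print(max_useful_mdt_bytes(4096, 1 << 48) == 1 << 60)   # True
```

With the assumed ratio, a device larger than 16 TiB cannot be filled with inodes under a 32-bit limit, whereas 2^48 inodes at the same ratio reaches 2^60 bytes.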
In addition to 64-bit inode numbers, it would really be desirable to have a feature to dynamically instantiate the inode tables for some groups, or leave them as block-only, so that there is more flexibility with the bytes/inode ratio. There were a number of discussions about ways to implement dynamic inode tables on the linux-ext{2,3,4} lists many years ago (e.g. storing them in a file, using the 64-bit inode number to encode a block offset, keeping inode tables uninitialized until absolutely needed, etc.) that could probably be found online.