
Add ldiskfs 64-bit inode number support

Details

    • New Feature
    • Resolution: Unresolved
    • Major

    Description

      On current hardware, clusters have trouble creating enough inodes on LDISKFS partitions. The MDS uses zero-size files to store information about Lustre FS files. Current MDS disk sizes could hold a very large number of such files, but EXT4 limits the count to ~4 billion.
      Lustre FS has features such as DNE to distribute metadata over many targets (disks), but this does not use the disks effectively. It would be great to be able to store more than ~4 billion inodes on one EXT4 file system.
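
The ~4 billion ceiling follows from 32-bit inode numbers (2^32). As a rough illustration, the sketch below computes the device size at which that ceiling is reached for a few bytes-per-inode ratios. The ratios are assumptions chosen for illustration, not values taken from this issue; the 4096 case lines up with the 16TB MDT figure mentioned in the discussion below.

```python
# Upper bound on inode count when inode numbers are 32 bits wide.
MAX_INODES_32BIT = 2**32

TIB = 2**40


def device_bytes_at_inode_limit(bytes_per_inode):
    """Device size at which a filesystem formatted with the given
    bytes-per-inode ratio has allocated all 2**32 inode numbers."""
    return MAX_INODES_32BIT * bytes_per_inode


# Illustrative ratios only (assumed, not taken from the issue).
for ratio in (1024, 2048, 4096):
    size_tib = device_bytes_at_inode_limit(ratio) / TIB
    print(f"{ratio} bytes/inode -> inode numbers run out at {size_tib:.0f} TiB")
```

Any capacity beyond that size can hold file data but no additional inodes, which is the unused space the description refers to.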

      This topic ("64-bit inode number") was recently discussed on the ext4 list. In summary:

      There are two possible solutions:
      1. Store the high 32 bits of the inode number in the ext4 dirent
      2. Add a new feature flag that enables the use of a 64-bit inode number

      Andreas Dilger gave strong reasons to prefer the first solution:

      • it uses no more space for 64-bit inodes than ext4_dir_entry64 would
      • dirents for 32-bit inode numbers stay smaller
      • significantly more 32-bit dirents fit into a leaf block (i.e. 10-25% more)
      • it is backwards compatible with existing directories and can transparently store 64-bit inode numbers into 32-bit directories without a full update
      • it avoids duplicate code paths for ext4_dir_entry vs ext4_dir_entry64
      • it would be possible to store only the high 16 bits (2^48 inodes), which may be enough for ext4, since ext4_extent can only address 2^48 blocks (2^60 bytes) and there isn't much value in having more inodes than blocks?

      This issue is about using dirdata to store the high bits of the 64-bit inode number.
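
To make the first solution concrete, here is a minimal sketch of a directory entry that keeps the classic 32-bit inode field and appends the high 32 bits after the name only when they are non-zero. The 8-byte header (32-bit inode, 16-bit rec_len, 8-bit name_len, 8-bit file_type) mirrors ext4_dir_entry_2. Everything else is an assumption made for illustration: the flag name and value `DIRENT_HAS_INODE_HI`, the length-prefixed trailing field, and the helper names are hypothetical and are not the on-disk format of the patches linked in this issue.

```python
import struct

# Hypothetical flag in the upper bits of file_type meaning
# "the high 32 bits of the inode number follow the name".
# The value is an assumption, not taken from ldiskfs.
DIRENT_HAS_INODE_HI = 0x20
FILE_TYPE_MASK = 0x0F
HEADER = "<IHBB"  # inode (low 32 bits), rec_len, name_len, file_type
HEADER_SIZE = struct.calcsize(HEADER)  # 8 bytes


def pack_dirent(ino, name, file_type):
    """Pack one dirent; the extra field is emitted only for inodes >= 2**32."""
    if not 0 <= ino < 2**64:
        raise ValueError("inode number must fit in 64 bits")
    lo, hi = ino & 0xFFFFFFFF, ino >> 32
    flags = file_type & FILE_TYPE_MASK
    payload = name
    if hi:
        # NUL terminator, 1-byte field length (including itself), high bits.
        payload += b"\0" + struct.pack("<BI", 5, hi)
        flags |= DIRENT_HAS_INODE_HI
    rec_len = (HEADER_SIZE + len(payload) + 3) & ~3  # round up to 4 bytes
    header = struct.pack(HEADER, lo, rec_len, len(name), flags)
    return (header + payload).ljust(rec_len, b"\0")


def unpack_inode(raw):
    """Recover the full inode number from a packed dirent."""
    lo, _rec_len, name_len, flags = struct.unpack_from(HEADER, raw, 0)
    hi = 0
    if flags & DIRENT_HAS_INODE_HI:
        _field_len, hi = struct.unpack_from("<BI", raw, HEADER_SIZE + name_len + 1)
    return (hi << 32) | lo
```

In this sketch an entry for an inode below 2^32 is byte-for-byte a classic dirent, so only entries that actually need the high bits pay for them, which is the backwards-compatibility and space argument made in the list above.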

          Activity

            This feature has been raised again in discussions for some large clusters.

            My preference is still that we scale metadata capacity with DNE, but the NVMe devices are starting to become large enough that the 16TB MDT limit (at least for 4B inodes) is becoming a problem.

            In addition to 64-bit inode numbers, it would really be desirable to have a feature to dynamically instantiate the inode tables for some groups, or leave them as block-only, so that there is more flexibility with the bytes/inode ratio. There were a number of discussions about ways to implement dynamic inode tables in the linux-ext{2,3,4} lists many years ago (e.g. storing them in a file, using the 64-bit inode number to encode a block offset, keeping inode tables uninitialized until absolutely needed, etc.) that could probably be found online.

            adilger Andreas Dilger added a comment

            Link to changes improving DNE usage distribution. More work is still needed to get DNE balance as good as OST space balance.

            adilger Andreas Dilger added a comment

            Artem Blagodarenko (artem.blagodarenko@seagate.com) uploaded a new patch: https://review.whamcloud.com/29200
            Subject: LU-9309 quota: quota 64bit inode number cleanup
            Project: tools/e2fsprogs
            Branch: master-lustre
            Current Patch Set: 1
            Commit: f1a786e9c8da9d75760cbf43005f62db31bac3d5

            gerrit Gerrit Updater added a comment

            Artem Blagodarenko (artem.blagodarenko@seagate.com) uploaded a new patch: https://review.whamcloud.com/29199
            Subject: LU-9309 quota: swaping s_prj_quota_inum superblock field
            Project: tools/e2fsprogs
            Branch: master-lustre
            Current Patch Set: 1
            Commit: 011a538a852f5948bb383b8c82892688f9d78d72

            gerrit Gerrit Updater added a comment

            Artem Blagodarenko (artem.blagodarenko@seagate.com) uploaded a new patch: https://review.whamcloud.com/29198
            Subject: LU-9309 ext2fs: add EXT4_FEATURE_INCOMPAT_64INODE suport
            Project: tools/e2fsprogs
            Branch: master-lustre
            Current Patch Set: 1
            Commit: c6f3dd0b051cebf5ca6d5d4ca6af06a323fd8506

            gerrit Gerrit Updater added a comment

            Artem Blagodarenko (artem.blagodarenko@seagate.com) uploaded a new patch: https://review.whamcloud.com/29197
            Subject: LU-9309 badblocks: bad blocks 64bit inode cleanup
            Project: tools/e2fsprogs
            Branch: master-lustre
            Current Patch Set: 1
            Commit: d59ff228a04446e22ee0630ff73c868e0e349b7d

            gerrit Gerrit Updater added a comment

            Artem Blagodarenko (artem.blagodarenko@seagate.com) uploaded a new patch: https://review.whamcloud.com/29196
            Subject: LU-9309 debugfs: 64bit inode support
            Project: tools/e2fsprogs
            Branch: master-lustre
            Current Patch Set: 1
            Commit: 8b2120300e4d4739afb7e45ad962a645e77430ba

            gerrit Gerrit Updater added a comment

            Artem Blagodarenko (artem.blagodarenko@seagate.com) uploaded a new patch: https://review.whamcloud.com/29195
            Subject: LU-9309 ldiskfs: Add 64-bit inode number support
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 4b8dedeba7b8a3b2f24259e3b3442d20e6d5fc69

            gerrit Gerrit Updater added a comment

            Note that I'm not against adding such a feature to ext4/ldiskfs, but it is worthwhile to consider potential issues as well, compared to distributing the filesystem metadata across multiple MDTs with DNE:

            • if there is a problem with such a large MDT then there will only be a single-threaded e2fsck running to repair the MDT filesystem, which could take many hours/days to repair, vs. running e2fsck on multiple MDTs in parallel
            • e2fsck on such a large filesystem will require a large amount of RAM to manage the recovery state
            • if the LMA xattr holding the Lustre FID is lost, there is no easy fallback to IGIF FIDs with 64-bit inode numbers
            • having a single large MDT does not allow scaling performance (network, CPU, RAM) as cost-efficiently as multiple smaller MDTs

            I agree that the current DNE implementation does not scale metadata load automatically across MDTs/MDS nodes effectively, though this will be improved with DNE2 and striped directories. My thought for enabling DNE to be more "automatic" in its load balancing is to allow automatic directory restriping when a directory grows larger than some number of entries (e.g. 16k), so that users can have the benefit of DNE without having to manually create striped directories.

            If you choose to move forward with MDTs with more than 4B inodes, I'd also encourage you to look at making e2fsck multi-threaded and/or event driven so that it can use multiple CPUs and spindles/SSDs effectively, otherwise the check time may become so long that this is not a practical solution even if the on-disk format supports more than 4B inodes.

            adilger Andreas Dilger added a comment

            People

              wc-triage WC Triage
              artem_blagodarenko Artem Blagodarenko (Inactive)