Lustre / LU-9309

Add ldiskfs 64-bit inode number support

Details

    • Type: New Feature
    • Resolution: Unresolved
    • Priority: Major

    Description

      With current hardware, clusters face the problem of creating enough inodes on LDISKFS partitions. The MDS uses zero-size files to store some information about Lustre FS files. Current MDS disk sizes allow a large number of such files to be stored, but EXT4 limits this number to about 4 billion, because inode numbers are 32 bits wide.
      Lustre FS has features such as DNE to distribute the MDS over many targets (disks), but the disks are not used effectively. It would be useful to be able to store more than ~4 billion inodes on a single EXT4 file system.

      This topic ("64-bit inode number") was recently discussed on the ext4 mailing list. In summary:

      There are two possible solutions:
      1. Store the high 32 bits of the inode number in the ext4 dirent
      2. Add a new feature flag that enables the use of 64-bit inode numbers

      Andreas Dilger gave strong reasons to use the first solution:

      • it uses no more space for 64-bit inode numbers than ext4_dir_entry64 would
      • dirents for 32-bit inode numbers stay smaller than with ext4_dir_entry64
      • significantly more 32-bit dirents fit into a leaf block (e.g. 10-25% more)
      • it is backward compatible with existing directories and can transparently store 64-bit inode numbers in 32-bit directories without a full update
      • it avoids duplicate code paths for ext4_dir_entry vs. ext4_dir_entry64
      • it would be possible to store only the high 16 bits (2^48 inodes), which may be enough for ext4: ext4_extent can address only 2^48 blocks (2^60 bytes with 4 KiB blocks), and there is little value in having more inodes than blocks

      This issue is about using dirdata to store the high bits of a 64-bit inode number.

          Activity

            Artem Blagodarenko (artem.blagodarenko@seagate.com) uploaded a new patch: https://review.whamcloud.com/29200
            Subject: LU-9309 quota: quota 64bit inode number cleanup
            Project: tools/e2fsprogs
            Branch: master-lustre
            Current Patch Set: 1
            Commit: f1a786e9c8da9d75760cbf43005f62db31bac3d5

            Artem Blagodarenko (artem.blagodarenko@seagate.com) uploaded a new patch: https://review.whamcloud.com/29199
            Subject: LU-9309 quota: swaping s_prj_quota_inum superblock field
            Project: tools/e2fsprogs
            Branch: master-lustre
            Current Patch Set: 1
            Commit: 011a538a852f5948bb383b8c82892688f9d78d72

            Artem Blagodarenko (artem.blagodarenko@seagate.com) uploaded a new patch: https://review.whamcloud.com/29198
            Subject: LU-9309 ext2fs: add EXT4_FEATURE_INCOMPAT_64INODE suport
            Project: tools/e2fsprogs
            Branch: master-lustre
            Current Patch Set: 1
            Commit: c6f3dd0b051cebf5ca6d5d4ca6af06a323fd8506

            Artem Blagodarenko (artem.blagodarenko@seagate.com) uploaded a new patch: https://review.whamcloud.com/29197
            Subject: LU-9309 badblocks: bad blocks 64bit inode cleanup
            Project: tools/e2fsprogs
            Branch: master-lustre
            Current Patch Set: 1
            Commit: d59ff228a04446e22ee0630ff73c868e0e349b7d

            Artem Blagodarenko (artem.blagodarenko@seagate.com) uploaded a new patch: https://review.whamcloud.com/29196
            Subject: LU-9309 debugfs: 64bit inode support
            Project: tools/e2fsprogs
            Branch: master-lustre
            Current Patch Set: 1
            Commit: 8b2120300e4d4739afb7e45ad962a645e77430ba

            Artem Blagodarenko (artem.blagodarenko@seagate.com) uploaded a new patch: https://review.whamcloud.com/29195
            Subject: LU-9309 ldiskfs: Add 64-bit inode number support
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 4b8dedeba7b8a3b2f24259e3b3442d20e6d5fc69

            Andreas Dilger (adilger) added a comment:

            Note that I'm not against adding such a feature to ext4/ldiskfs, but it is worthwhile to consider potential issues as well, compared to distributing the filesystem metadata across multiple MDTs with DNE:

            • if there is a problem with such a large MDT then there will only be a single-threaded e2fsck running to repair the MDT filesystem, which could take many hours/days to repair, vs. running e2fsck on multiple MDTs in parallel
            • e2fsck on such a large filesystem will require a large amount of RAM to manage the recovery state
            • if the LMA xattr holding the Lustre FID is lost, there is no easy fallback to IGIF FIDs with 64-bit inode numbers
            • having a single large MDT does not allow scaling performance (network, CPU, RAM) as cost-efficiently as multiple smaller MDTs

            I agree that the current DNE implementation does not scale metadata load automatically across MDTs/MDS nodes effectively, though this will be improved with DNE2 and striped directories. My thought for enabling DNE to be more "automatic" in its load balancing is to allow automatic directory restriping when a directory grows larger than some number of entries (e.g. 16k), so that users can have the benefit of DNE without having to manually create striped directories.

            If you choose to move forward with MDTs with more than 4B inodes, I'd also encourage you to look at making e2fsck multi-threaded and/or event driven so that it can use multiple CPUs and spindles/SSDs effectively, otherwise the check time may become so long that this is not a practical solution even if the on-disk format supports more than 4B inodes.

            People

              Assignee: WC Triage (wc-triage)
              Reporter: Artem Blagodarenko (artem_blagodarenko, Inactive)
              Votes: 0
              Watchers: 3