Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-14345

e2fsck of very large directories is broken

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.14.0
    • Lustre 2.13.0, Lustre 2.14.0
    • 3
    • 9223372036854775807

    Description

      In patch http://review.whamcloud.com/22008 "LU-1365 e2fsprogs: enable large directory support in tools" support was added to e2fsprogs for directories with 3-level htree and larger than 2GB in size. The large_dir feature was enabled by default for all new ldiskfs filesystems in patch https://review.whamcloud.com/36555 "LU-11546 utils: enable large_dir for ldiskfs" (commit v2_13_50-13-gcd1faa0124).

      While working on the e2fsck code for an unrelated issue, I saw that there is still some code in e2fsck that limits the size of a directory to be less than 2^32 bytes in size. Currently, e2fsck will consider a directory file with a large size to be broken and clear the high bytes of the size if there is any problem with the directory size:

              if (pb.is_dir) {
                      :
                      if (err || sz != inode->i_size) {
                              bad_size = 7;
                              pctx->num = sz;
                      } else if (inode->i_size & (fs->blocksize - 1))
                              bad_size = 5;
                      else if (nblock > (pb.last_block + 1))
                              bad_size = 1;
                      :
                      :
                      if (fix_problem(ctx, PR_1_BAD_I_SIZE, pctx)) {
                              if (LINUX_S_ISDIR(inode->i_mode))
                                      pctx->num &= 0xFFFFFFFFULL;
                              ext2fs_inode_size_set(fs, inode, pctx->num);
      

      and in libext2fs this will also report an error and truncate the size if it is set for directories:

      errcode_t ext2fs_inode_size_set(ext2_filsys fs, struct ext2_inode *inode,
                                      ext2_off64_t size)
      {
              /* Only regular files get to be larger than 4GB */
              if (!LINUX_S_ISREG(inode->i_mode) && (size >> 32))
                      return EXT2_ET_FILE_TOO_BIG;
      

      A 4GB directory can have about 80M 32-byte filename entries (11M 256-byte entries) in it before the 32-bit size is exceeded. While we previously reported a limit of approximately 10M entries in a single directory in the past, it is not completely unlikely that we may see large-OST systems with directories over 4GB in size in the near future.

      It is not totally clear, but it may be that in addition to changing the above code to allow large directories, e2fsck may also need to track large directories (a new ctx->large_dirs counter) and set the LARGEDIR feature flag in the superblock, like it does for LARGE_FILE (with the existing ctx->large_files counter).

      On the one hand, unlike the LARGE_FILE feature (which is set by the kernel at runtime), the LARGEDIR feature should always be set before a large directory is allowed, so it should also be set in all of the superblock backups. This would prevent garbage/corrupt directory inodes from getting a "large size" when in fact they are just sparse files. On the other hand, we don't want to truncate real large directories because a flag is missing in the superblock for some reason. It may be that there is enough logic in the directory leaf block handling that a large size should never be set unless the directory legitimately has enough blocks for that size, but I haven't looked into the details of this yet.

      Attachments

        Issue Links

          Activity

            [LU-14345] e2fsck of very large directories is broken

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/41433/
            Subject: LU-14345 misc: update e2fsprogs to 1.45.6.wc5
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 0bf9c2c34805b017a39c473650496f059aeaef73

            gerrit Gerrit Updater added a comment - Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/41433/ Subject: LU-14345 misc: update e2fsprogs to 1.45.6.wc5 Project: fs/lustre-release Branch: master Current Patch Set: Commit: 0bf9c2c34805b017a39c473650496f059aeaef73

            Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41433
            Subject: LU-14345 misc: update e2fsprogs to 1.45.6.wc4
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 5093c63d2eb8ce8fc074cd49a91092643f79f777

            gerrit Gerrit Updater added a comment - Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41433 Subject: LU-14345 misc: update e2fsprogs to 1.45.6.wc4 Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 5093c63d2eb8ce8fc074cd49a91092643f79f777

            Andreas Dilger (adilger@whamcloud.com) merged in patch https://review.whamcloud.com/41385/
            Subject: LU-14345 e2fsck: fix check of directories over 4GB
            Project: tools/e2fsprogs
            Branch: master-lustre
            Current Patch Set:
            Commit: 71b74579b7e6c5cfe6af05b00e45c8a75fce0d61

            gerrit Gerrit Updater added a comment - Andreas Dilger (adilger@whamcloud.com) merged in patch https://review.whamcloud.com/41385/ Subject: LU-14345 e2fsck: fix check of directories over 4GB Project: tools/e2fsprogs Branch: master-lustre Current Patch Set: Commit: 71b74579b7e6c5cfe6af05b00e45c8a75fce0d61

            Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41385
            Subject: LU-14345 e2fsck: fix check of directories over 4GB
            Project: tools/e2fsprogs
            Branch: master-lustre
            Current Patch Set: 1
            Commit: 83d23e3593548735d94934d3d85d16103a86ab0e

            gerrit Gerrit Updater added a comment - Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41385 Subject: LU-14345 e2fsck: fix check of directories over 4GB Project: tools/e2fsprogs Branch: master-lustre Current Patch Set: 1 Commit: 83d23e3593548735d94934d3d85d16103a86ab0e
            gerrit Gerrit Updater added a comment - - edited

            Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41383
            Subject: LU-14345 e2fsck: fix check of directories over 4GB
            Project: tools/e2fsprogs
            Branch: master-lustre
            Current Patch Set: 1
            Commit: 4d38f15676c2f26e0feba4fa9b9f52a474aa4e72

            gerrit Gerrit Updater added a comment - - edited Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41383 Subject: LU-14345 e2fsck: fix check of directories over 4GB Project: tools/e2fsprogs Branch: master-lustre Current Patch Set: 1 Commit: 4d38f15676c2f26e0feba4fa9b9f52a474aa4e72

            If we got LU-11912 implemented, we would never see such large directories on an OST. It looks like 32 dirs x 80M objects/dir would need over 2.5B objects from the same MDT at one time, which is pretty unlikely even for a huge OST, just because there is still a 4B inode limit on each MDT, but it is possible.

            It seems more likely that this would be hit with a single 80M-entry directory (11-20M entries with very long filenames) on the MDT. Due to LU-12406 we are currently not testing large_dir support on master.

            The existing conf_sanity test_111 did not trigger the problem reported here, likely because it limits itself to a 2GB directory size instead of exceeding 4GB, and because e2fsck doesn't find any errors with the large directory inode that would trigger this problem. It may be possible to trigger it with a > 4GB directory by adding "-D" to the run_e2fsck options at the end.

            adilger Andreas Dilger added a comment - If we got LU-11912 implemented, we would never see such large directories on an OST. It looks like 32 dirs x 80M objects/dir would need over 2.5B objects from the same MDT at one time, which is pretty unlikely even for a huge OST, just because there is still a 4B inode limit on each MDT, but it is possible. It seems more likely that this would be hit with a single 80M-entry directory (11-20M entries with very long filenames) on the MDT. Due to LU-12406 we are currently not testing large_dir support on master. The existing conf_sanity test_111 did not trigger the problem reported here, likely because it limits itself to a 2GB directory size instead of exceeding 4GB, and because e2fsck doesn't find any errors with the large directory inode that would trigger this problem. It may be possible to trigger it with a > 4GB directory by adding " -D " to the run_e2fsck options at the end.

            Artem, would you have time to look into this? It would be good to get a fix into e2fsprogs before the 2.14 release is made that enables large_dir by default for all MDT/OST filesystems.

            I think currently that e2fsck will mostly "ignore" a large directory until it has some error in pass1 that sets bad_size, then it would truncate the size to a 32-bit value. If ext2fs_inode_size_set() is called somewhere else, like e2fsck_rehash_dir()->write_directory() called with "e2fsck -fD", then it looks like ext2fs_inode_size_set() will return an error when trying to set the size, but that may cause other problems.

            I think the minimum fix is to check for ext2fs_has_feature_largedir() in ext2fs_inode_size_set() and allow a 64-bit size in that case, and similarly clean up the above code in pass1 that is truncating the size for LINUX_S_ISDIR() inodes.

            adilger Andreas Dilger added a comment - Artem, would you have time to look into this? It would be good to get a fix into e2fsprogs before the 2.14 release is made that enables large_dir by default for all MDT/OST filesystems. I think currently that e2fsck will mostly "ignore" a large directory until it has some error in pass1 that sets bad_size , then it would truncate the size to a 32-bit value. If ext2fs_inode_size_set() is called somewhere else, like e2fsck_rehash_dir()->write_directory() called with " e2fsck -fD ", then it looks like ext2fs_inode_size_set() will return an error when trying to set the size, but that may cause other problems. I think the minimum fix is to check for ext2fs_has_feature_largedir() in ext2fs_inode_size_set() and allow a 64-bit size in that case, and similarly clean up the above code in pass1 that is truncating the size for LINUX_S_ISDIR() inodes.

            People

              adilger Andreas Dilger
              adilger Andreas Dilger
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: