Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-6220

push ext4/ldiskfs patches upstream if possible

Details

    • Improvement
    • Resolution: Unresolved
    • Minor
    • Upstream
    • Lustre 2.8.0
    • None
    • 17396

    Description

      There are a number of ldiskfs patches that we could potentially push upstream, possibly with some cleanups. All patches should be run through checkpatch.pl first.

      • ext4-print-inum-in-htree-warning.patch: could be pushed upstream after moving string to the next line so it fits inside 80 columns
      • ext4-disable-mb-cache.patch: I was involved in some discussion upstream about removing the mbcache (http://lwn.net/Articles/564802/), but that patch was never accepted upstream. We could try pushing this patch again, since it is fairly simple, and it wouldn't be very incompatible with the other mbcache speedup patch if that was ever landed.
      • ext4-inode-version.patch: It might be possible to submit a patch upstream that moves dir->i_version++ into ext4_mark_iloc_dirty() like if (IS_I_VERSION(inode) || S_ISDIR(dir)). Then, we can add a mount flag or patch that makes this check a no-op so we only need a very small patch. More importantly, it will reduce the size of ext4_inode_info because we don't need a separate i_fs_version and reduce the memory usage on the MDS.
      • ext4-journal-path-opt.patch: I think this can be pushed upstream without any significant changes
      • ext4-kill-dx-root.patch: could be submitted upstream as a cleanup with some style fixes
      • ext4-max-dir-size.patch: we should print out a deprecation warning in newer kernels to mount with "-o max_dir_size_kb" instead, and then we can eventually remove it? The mount option has existed since kernel 3.6, maybe it would be better to backport support for the mount option to RHEL6 instead of maintaining this patch forever?
      • ext4-large-eas.patch: this is described in LU-908

      Attachments

        Issue Links

          Activity

            [LU-6220] push ext4/ldiskfs patches upstream if possible

            The data-in-dirent patch was being discussed for upstream inclusion for 64-but inode support, but eventually that was not accepted because the 64-bit inode patch was only partially tested (ie. it built and didn't break 32-bit inodes, but had code and testing gaps for actual 64-bit inode numbers).

            Landing the dirdata patch has now become more complex because the "fscrypt+case-insensitive" features are now storing extra information after the name in the dirent. This would have been a perfect case to use dirdata and get it included, except Google had already started using this in Android and didn't want to update the format, so there needs to be additional compatibility added for this case.

            An agreed-upon proposal to get dirdata accepted without the Lustre kernel dependency is to add a dedicated test ioctl interface to store dirdata in directory entries, and confirm that it is being handled properly (eg. across rename, with e2fsck, etc). For compatibility with Lustre, this could store the "inode|generation" into the Lustre FID slot, since this is equivalent to an IGIF FID, and is also easy to verify later.

            adilger Andreas Dilger added a comment - The data-in-dirent patch was being discussed for upstream inclusion for 64-but inode support, but eventually that was not accepted because the 64-bit inode patch was only partially tested (ie. it built and didn't break 32-bit inodes, but had code and testing gaps for actual 64-bit inode numbers). Landing the dirdata patch has now become more complex because the "fscrypt+case-insensitive" features are now storing extra information after the name in the dirent. This would have been a perfect case to use dirdata and get it included, except Google had already started using this in Android and didn't want to update the format, so there needs to be additional compatibility added for this case. An agreed-upon proposal to get dirdata accepted without the Lustre kernel dependency is to add a dedicated test ioctl interface to store dirdata in directory entries, and confirm that it is being handled properly (eg. across rename, with e2fsck, etc). For compatibility with Lustre, this could store the " inode|generation " into the Lustre FID slot, since this is equivalent to an IGIF FID, and is also easy to verify later.

            Track the parallel VFS + ext4 lookup for upstream: https://patchwork.kernel.org/patch/10695037/

            simmonsja James A Simmons added a comment - Track the parallel VFS + ext4 lookup for upstream: https://patchwork.kernel.org/patch/10695037/

            Patch https://review.whamcloud.com/38372 "LU-13054 ldiskfs: split htree_lock as separate patch" split the ect4-pdirops patch into a separate ext4-htree-lock patch, for a net reduction of 6500 lines of patches.

            adilger Andreas Dilger added a comment - Patch https://review.whamcloud.com/38372 " LU-13054 ldiskfs: split htree_lock as separate patch " split the ect4-pdirops patch into a separate ext4-htree-lock patch, for a net reduction of 6500 lines of patches.
            ys Yang Sheng added a comment - - edited

            Move from https://review.whamcloud.com/38372 as record

            Now you have moved htree locking in a separate patch. I don't think it will be possible to get this merged until there is parallel VFS directory locking in the kernel, but at least it cleans up our own patches.

            I think it may be possible to get the ext4-kill-dx-root.patch merged upstream as a code cleanup, which would avoid the need to carry that patch in the Lustre tree forever.

            It would be possible to get ext4-data-in-dirent.patch upstream, if we added an ioctl interface for being able to get/set the FID for testing, and similar patches for debugfs to get/set the FID for testing.

            ys Yang Sheng added a comment - - edited Move from https://review.whamcloud.com/38372 as record Now you have moved htree locking in a separate patch. I don't think it will be possible to get this merged until there is parallel VFS directory locking in the kernel, but at least it cleans up our own patches. I think it may be possible to get the ext4-kill-dx-root.patch merged upstream as a code cleanup, which would avoid the need to carry that patch in the Lustre tree forever. It would be possible to get ext4-data-in-dirent.patch upstream, if we added an ioctl interface for being able to get/set the FID for testing, and similar patches for debugfs to get/set the FID for testing.

            The current ldiskfs-4.4-sles12sp3 patch series looks like the following (patches marked '*' are/will be merged soon):

            sles12sp2/ext4-inode-version.patch
            sles12sp2/ext4-lookup-dotdot.patch
            sles12sp2/ext4-print-inum-in-htree-warning.patch
            sles12sp2/ext4-prealloc.patch
            sles12sp2/ext4-osd-iop-common.patch
            sles12sp2/ext4-misc.patch
            sles12sp3/ext4-mballoc-extra-checks.patch
            sles12sp2/ext4-hash-indexed-dir-dotdot-update.patch
            sles12sp2/ext4-kill-dx-root.patch *
            rhel7/ext4-mballoc-pa-free-mismatch.patch
            sles12sp2/ext4-data-in-dirent.patch *
            sles12sp2/ext4-large-eas.patch *
            sles12sp2/ext4-disable-mb-cache.patch *
            rhel7/ext4-nocmtime.patch
            sles12sp2/ext4-large-dir.patch *
            sles12sp2/ext4-pdirop.patch
            sles12sp2/ext4-max-dir-size.patch
            rhel7/ext4-remove-truncate-warning.patch
            sles12sp3/ext4-corrupted-inode-block-bitmaps-handling-patches.patch * (just under discussion)
            sles12sp2/ext4-give-warning-with-dir-htree-growing.patch
            sles12sp2/ext4-mmp-brelse.patch *
            rhel7/ext4-jcb-optimization.patch
            sles12sp2/ext4-attach-jinode-in-writepages.patch
            sles12sp2/ext4-dont-check-before-replay.patch
            rhel7.2/ext4-dont-check-in-ro.patch
            sles12sp2/ext4-fix-xattr-shifting-when-expanding-inodes.patch *
            rhel7/ext4-use-GFP_NOFS-in-ext4_inode_attach_jinode.patch
            

            There may be slightly different patches in the RHEL7 kernel series, but it is against a significantly older kernel so I didn't want to use that as a starting point. This shows there are definitely still a lot of patches to be merged and/or worked around in some other way. At least several of the major on-disk features have been landed, and most of these are only related to in-core changes (API, functionality), so the risk of incompatible on-disk formats is significantly reduced. I'm hoping that e2fsprogs-1.45 might even be able to run e2fsck against an OST or MDT, though I don't think it would be advised without more testing.

            adilger Andreas Dilger added a comment - The current ldiskfs-4.4-sles12sp3 patch series looks like the following (patches marked '*' are/will be merged soon): sles12sp2/ext4-inode-version.patch sles12sp2/ext4-lookup-dotdot.patch sles12sp2/ext4-print-inum-in-htree-warning.patch sles12sp2/ext4-prealloc.patch sles12sp2/ext4-osd-iop-common.patch sles12sp2/ext4-misc.patch sles12sp3/ext4-mballoc-extra-checks.patch sles12sp2/ext4-hash-indexed-dir-dotdot-update.patch sles12sp2/ext4-kill-dx-root.patch * rhel7/ext4-mballoc-pa-free-mismatch.patch sles12sp2/ext4-data-in-dirent.patch * sles12sp2/ext4-large-eas.patch * sles12sp2/ext4-disable-mb-cache.patch * rhel7/ext4-nocmtime.patch sles12sp2/ext4-large-dir.patch * sles12sp2/ext4-pdirop.patch sles12sp2/ext4-max-dir-size.patch rhel7/ext4-remove-truncate-warning.patch sles12sp3/ext4-corrupted-inode-block-bitmaps-handling-patches.patch * (just under discussion) sles12sp2/ext4-give-warning-with-dir-htree-growing.patch sles12sp2/ext4-mmp-brelse.patch * rhel7/ext4-jcb-optimization.patch sles12sp2/ext4-attach-jinode-in-writepages.patch sles12sp2/ext4-dont-check-before-replay.patch rhel7.2/ext4-dont-check-in-ro.patch sles12sp2/ext4-fix-xattr-shifting-when-expanding-inodes.patch * rhel7/ext4-use-GFP_NOFS-in-ext4_inode_attach_jinode.patch There may be slightly different patches in the RHEL7 kernel series, but it is against a significantly older kernel so I didn't want to use that as a starting point. This shows there are definitely still a lot of patches to be merged and/or worked around in some other way. At least several of the major on-disk features have been landed, and most of these are only related to in-core changes (API, functionality), so the risk of incompatible on-disk formats is significantly reduced. I'm hoping that e2fsprogs-1.45 might even be able to run e2fsck against an OST or MDT, though I don't think it would be advised without more testing.

            What is left? Its sounds like we are almost complete for this work.

            simmonsja James A Simmons added a comment - What is left? Its sounds like we are almost complete for this work.

            The ext4-kill-dx-root, ext4-large-dir, and ext4-disable-mbcache patches were landed in the upstream 4.12 kernel.

            The ext4-large-eas patch was landed in the upstream 4.13 kernel.

            The ext4-data-in-dirent patch is scheduled to land for the upstream 4.17 kernel.

            In the 4.15 kernel, upstream commit ae5e165d855 and commit ee73f9a52a34 changed how i_version is handled by the VFS and the filesystems. We may be able to leverage this change to remove the ext4-inode-version patch. In particular, if we do not set the SB_I_VERSION flag in the superblock, the VFS will no longer modify the i_version field in the inode, so we may be able to avoid storing the Lustre version in struct ext4_inode.i_fs_version, and can use struct inode.i_version for this (save 8 bytes per ldiskfs inode and an ldiskfs patch).

            The ext4-nocmtime patch took a bit of a step backward in kernel 4.9, since the ext4_current_time() wrapper that we were using so conveniently was removed and replaced with dozen(s) of direct calls to current_time() from the VFS.

            The ext4-mmp-brelse patch equivalent landed upstream in kernel 4.5.

            adilger Andreas Dilger added a comment - The ext4-kill-dx-root , ext4-large-dir , and ext4-disable-mbcache patches were landed in the upstream 4.12 kernel. The ext4-large-eas patch was landed in the upstream 4.13 kernel. The ext4-data-in-dirent patch is scheduled to land for the upstream 4.17 kernel. In the 4.15 kernel, upstream commit ae5e165d855 and commit ee73f9a52a34 changed how i_version is handled by the VFS and the filesystems. We may be able to leverage this change to remove the ext4-inode-version patch. In particular, if we do not set the SB_I_VERSION flag in the superblock, the VFS will no longer modify the i_version field in the inode, so we may be able to avoid storing the Lustre version in struct ext4_inode.i_fs_version , and can use struct inode.i_version for this (save 8 bytes per ldiskfs inode and an ldiskfs patch). The ext4-nocmtime patch took a bit of a step backward in kernel 4.9, since the ext4_current_time() wrapper that we were using so conveniently was removed and replaced with dozen(s) of direct calls to current_time() from the VFS. The ext4-mmp-brelse patch equivalent landed upstream in kernel 4.5.
            ys Yang Sheng added a comment -

            From http://review.whamcloud.com/#/c/15961/ for record

            Andreas Dilger 4:09 AM

            Patch Set 5: Code-Review+2

            This is much better than before, but not 100% safe if there is another flag used by ext4.

            It might be useful to try to push a patch upstream that puts BH_BITMAP_DIRTY into an enum with a "last" value to allow us to ensure that BH_DXLock does not conflict. The reason for the upstream patch could be to verify that BH_BITMAP_DIRTY itself does not overflow the b_state field.

            ys Yang Sheng added a comment - From http://review.whamcloud.com/#/c/15961/ for record Andreas Dilger 4:09 AM Patch Set 5: Code-Review+2 This is much better than before, but not 100% safe if there is another flag used by ext4. It might be useful to try to push a patch upstream that puts BH_BITMAP_DIRTY into an enum with a "last" value to allow us to ensure that BH_DXLock does not conflict. The reason for the upstream patch could be to verify that BH_BITMAP_DIRTY itself does not overflow the b_state field.

            Only 3 more. A lot more patch exist that need to merged upstream to make ldiskfs patchless.

            simmonsja James A Simmons added a comment - Only 3 more. A lot more patch exist that need to merged upstream to make ldiskfs patchless.
            ys Yang Sheng added a comment -

            As latest 4.2 kernel status:
            ext4-print-inum-in-htree-warning.patch: Already merged by commit b03a2f7eb21cc06b541142684abf7eed6aaccf3e
            ext4-journal-path-opt.patch: Already merged by commit ad4eec613536dc7e5ea0c6e59849e6edca634d8b

            So just 3 patches need to push upstream.
            ext4-disable-mb-cache.patch
            ext4-kill-dx-root.patch
            ext4-large-dir.patch
            But they are conflict heavily against latest kernel tree. I'll work out a draft and attached here for review.

            ys Yang Sheng added a comment - As latest 4.2 kernel status: ext4-print-inum-in-htree-warning.patch: Already merged by commit b03a2f7eb21cc06b541142684abf7eed6aaccf3e ext4-journal-path-opt.patch: Already merged by commit ad4eec613536dc7e5ea0c6e59849e6edca634d8b So just 3 patches need to push upstream. ext4-disable-mb-cache.patch ext4-kill-dx-root.patch ext4-large-dir.patch But they are conflict heavily against latest kernel tree. I'll work out a draft and attached here for review.

            People

              ys Yang Sheng
              adilger Andreas Dilger
              Votes:
              1 Vote for this issue
              Watchers:
              13 Start watching this issue

              Dates

                Created:
                Updated: