Details

    • Type: New Feature
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: None
    • Labels: Upstream
    • Affects Version/s: None

    Description

      We're getting close to integrating proper direct IO support for ZFS and I wanted to start a conversation about how Lustre can best take advantage of it for very fast SSD/NVMe devices.

      From a functionality perspective, we've implemented Direct IO such that it entirely bypasses the ARC and avoids as many copies as possible. This includes the copy between user and kernel space (not really an issue for Lustre) as well as any copies in the IO pipeline. Obviously, if features like compression or encryption are enabled, those transforms of the data still need to happen. But if not, then we'll do the IO to disk with the provided user pages or, in Lustre's case, the pages from the loaned ARC buffer.
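
      To make the "loaned ARC buffer" part concrete, here is a rough sketch (not the actual osd-zfs code) of that existing write path; the DMU calls follow the OpenZFS 2.x declarations and may differ slightly between releases:

        #include <sys/zfs_context.h>
        #include <sys/dmu.h>
        #include <sys/arc.h>

        /*
         * Simplified illustration of the loaned-buffer write path: borrow an
         * ARC buffer, fill it with the bulk data, then hand ownership back to
         * the DMU so no additional copy into the ARC is needed.
         */
        static void
        write_via_loaned_arcbuf(dmu_buf_t *db, uint64_t off, int len,
                                const void *src, dmu_tx_t *tx)
        {
                arc_buf_t *abuf = dmu_request_arcbuf(db, len);

                /* osd-zfs fills abuf->b_data from the bulk RPC pages rather
                 * than copying; a memcpy keeps this sketch self-contained. */
                memcpy(abuf->b_data, src, len);

                /* Ownership of abuf passes back to the DMU here. */
                (void) dmu_assign_arcbuf_by_dbuf(db, off, abuf, tx);
        }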

      The code in the OpenZFS Direct IO PR makes no functional changes to the ZFS interfaces Lustre is currently using, so when the PR is merged, Lustre's behavior when using ZFS OSSs shouldn't change at all. What we have done is provide a couple of new interfaces that Lustre can optionally use to request Direct IO on a per-dbuf basis.

      We've done some basic initial performance testing by forcing Lustre to always use the new Direct IO paths and have seen very good results. But I think what we really want is for Lustre to somehow more intelligently control which IOs are submitted as buffered and which are submitted as direct. ZFS will guarantee coherency between buffered and direct IOs, so it's mainly a matter of how best to issue them.

      One idea would be to integrate with Lustre's existing readcache_max_filesize, read_cache_enable and writethrough_cache_enable tunables, but I don't know how practical that would be. In the short term I can propose a small patch which takes the simplest route and lets us enable/disable it for all IOs. That should provide a reasonable starting place to check out the new interfaces, and hopefully we can take it from there.
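
      As a purely illustrative example of that simplest route, the short-term patch could expose a single tunable that switches every osd-zfs IO over to the new direct path. The parameter name below is hypothetical, not taken from the actual patch:

        #include <linux/module.h>
        #include <linux/moduleparam.h>

        /* Hypothetical tunable: 0 = buffered (current behaviour),
         * 1 = submit all OST reads/writes through the new ZFS Direct IO
         * interfaces. */
        static unsigned int osd_zfs_direct_io_enable;
        module_param(osd_zfs_direct_io_enable, uint, 0644);
        MODULE_PARM_DESC(osd_zfs_direct_io_enable,
                         "Submit all osd-zfs IO as ZFS Direct IO (0=off, 1=on)");

      In practice Lustre exposes OSD tunables through lprocfs/lctl rather than raw module parameters, so a real patch would more likely hook this into the existing osd-zfs parameter tree.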

        Activity

          [LU-14407] osd-zfs: Direct IO

          "Timothy Day <timday@amazon.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/57453
          Subject: LU-14407 osd-zfs: per-IO direct IO support
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 4636d31a5458644954472b520afa87993d653a59


          Andreas Dilger added a comment:

          Per discussion at LUG, this still needs to be hooked into the read/write cache tunable parameters that are also available for ldiskfs, so that caching can be tuned on a per-object/per-IO basis:

          osd-ldiskfs.myth-OST0000.read_cache_enable=1
          osd-ldiskfs.myth-OST0000.writethrough_cache_enable=1
          osd-ldiskfs.myth-OST0000.readcache_max_filesize=18446744073709551615
          osd-ldiskfs.myth-OST0000.readcache_max_io_mb=8
          osd-ldiskfs.myth-OST0000.writethrough_max_io_mb=8
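
          For illustration, an osd-zfs version of the same per-IO decision might look roughly like the following; the structure, field names and helper name are assumptions, not existing Lustre code:

            #include <stdbool.h>

            /* Hypothetical mirror of the ldiskfs-style cache tunables above. */
            struct osd_cache_tunables {
                    unsigned int       read_cache_enable;
                    unsigned int       writethrough_cache_enable;
                    unsigned long long readcache_max_filesize;   /* bytes */
                    unsigned int       readcache_max_io_mb;      /* MiB   */
                    unsigned int       writethrough_max_io_mb;   /* MiB   */
            };

            /* Return true if this IO should go through the ARC (buffered),
             * false if it should be submitted via the new Direct IO path. */
            static bool osd_zfs_use_page_cache(const struct osd_cache_tunables *t,
                                               unsigned long long file_size,
                                               unsigned long long io_bytes,
                                               bool is_write)
            {
                    unsigned long long max_io_bytes;

                    if (is_write) {
                            if (!t->writethrough_cache_enable)
                                    return false;
                            max_io_bytes = (unsigned long long)t->writethrough_max_io_mb << 20;
                    } else {
                            if (!t->read_cache_enable)
                                    return false;
                            max_io_bytes = (unsigned long long)t->readcache_max_io_mb << 20;
                    }

                    /* Very large files do not benefit from OSS caching. */
                    if (file_size > t->readcache_max_filesize)
                            return false;

                    /* Large streaming IOs go directly to storage. */
                    if (max_io_bytes != 0 && io_bytes >= max_io_bytes)
                            return false;

                    return true;
            }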
          

          Gerrit Updater added a comment:

          Brian Behlendorf (behlendorf1@llnl.gov) uploaded a new patch: https://review.whamcloud.com/41689
          Subject: LU-14407 osd-zfs: add basic direct IO support
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 6658c213ae9a3a8664c67036efd9526295d6800a


          Brian Behlendorf added a comment:

          That does sound useful. I wasn't aware of those tunables, but I agree that if we can make use of them, we should.

          While there's no existing interface to check whether a pool is built on flash or HDD storage, we do track the non-rotational information internally. Each vdev has a vd->vdev_nonrot flag, which is set if the vdev is a leaf and non-rotational, or if it's an interior vdev and all of its children are non-rotational. Checking the flag on the pool's root vdev would be a quick way to determine whether there are any HDDs in the pool. If that's sufficient, we can add a function to make that check so it's possible to automatically turn off the read/write cache for SSDs, as osd-ldiskfs does.
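
          A minimal sketch of such a check, assuming the helper itself is new (spa_root_vdev and vdev_nonrot are the existing fields mentioned above):

            #include <sys/spa_impl.h>
            #include <sys/vdev_impl.h>

            /*
             * Hypothetical helper: returns B_TRUE when the aggregated
             * non-rotational flag on the pool's root vdev indicates that
             * no HDDs are present in the pool.
             */
            static boolean_t
            spa_pool_is_nonrot(spa_t *spa)
            {
                    vdev_t *rvd = spa->spa_root_vdev;

                    return (rvd != NULL ? rvd->vdev_nonrot : B_FALSE);
            }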


          Andreas Dilger added a comment:

          Brian, were you aware of the readcache_max_io_mb and writethrough_max_io_mb tunables, which allow deciding on a per-RPC basis whether the IO is large enough to submit directly to storage? I think this would be useful for ZFS as well, since turning off all caching is bad for HDDs, but large IOs (> 8MB) can drive the full HDD bandwidth and do not benefit from cache on the OSS.

          Also, osd-ldiskfs automatically turns off the read/write cache completely for SSD devices for best performance. As yet there is no in-kernel mechanism for ZFS to determine whether the underlying dataset is on flash or HDD storage, as we can do in osd-ldiskfs by checking the bdev directly. Having the od_nonrotational flag set at mount would potentially also be useful, because this state is exported to the clients with "lfs df -v" and can be used by tools to decide which OSTs are more suitable for IOPS vs. streaming IO.
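
          Sketching how that could fit together at mount time (the od_nonrotational field on the osd-zfs device and the spa_pool_is_nonrot() helper from the sketch above are assumptions for illustration, not existing code):

            #include <sys/dmu_objset.h>
            #include "osd_internal.h"

            /*
             * Hypothetical mount-time hook: record whether the backing pool is
             * all-flash so the OSD can pick cache defaults and report the
             * state to clients, as osd-ldiskfs does with od_nonrotational.
             */
            static void
            osd_zfs_detect_nonrot(struct osd_device *o)
            {
                    spa_t *spa = dmu_objset_spa(o->od_os);

                    o->od_nonrotational = spa_pool_is_nonrot(spa);
            }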


          People

            Assignee: Brian Behlendorf
            Reporter: Brian Behlendorf
            Votes: 0
            Watchers: 9
