[LU-14407] osd-zfs: Direct IO Created: 09/Feb/21  Updated: 25/May/21

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Upstream
Fix Version/s: None

Type: New Feature Priority: Minor
Reporter: Brian Behlendorf Assignee: Brian Behlendorf
Resolution: Unresolved Votes: 0
Labels: None

Epic/Theme: Performance, zfs
Rank (Obsolete): 9223372036854775807

 Description   

We're getting close to integrating proper direct IO support for ZFS and I wanted to start a conversation about how Lustre can best take advantage of it for very fast SSD/NVMe devices.

From a functionality perspective, we've implemented Direct IO such that it entirely bypasses the ARC and avoids as many copies as possible. This includes the copy between user and kernel space (not really an issue for Lustre) as well as any copies in the IO pipeline. Obviously, if features like compression or encryption are enabled, those transforms of the data still need to happen. But if not, we'll do the IO to disk with the provided user pages, or in Lustre's case, the pages from the loaned ARC buffer.
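For context, the loaned-buffer write path osd-zfs already uses looks roughly like the sketch below (simplified; transaction setup and error handling are omitted, and the copy from the client RPC pages is shown as a plain memcpy). With Direct IO, the pages backing the loaned buffer can then be written to disk without a further copy into the ARC.

    /* Simplified sketch of a loaned-ARC-buffer write; not the exact osd-zfs code. */
    #include <sys/dmu.h>    /* dmu_request_arcbuf(), dmu_assign_arcbuf_by_dbuf() */
    #include <sys/arc.h>    /* arc_buf_t */

    static void
    write_via_loaned_arcbuf(dmu_buf_t *db, uint64_t offset, int len,
                            const void *data, dmu_tx_t *tx)
    {
            /* Borrow a buffer from the ARC rather than allocating our own. */
            arc_buf_t *abuf = dmu_request_arcbuf(db, len);

            /* Fill the loaned buffer (osd-zfs copies from the client pages). */
            memcpy(abuf->b_data, data, len);

            /* Hand it back; ZFS writes it out without another data copy. */
            dmu_assign_arcbuf_by_dbuf(db, offset, abuf, tx);
    }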

The code in the OpenZFS Direct IO PR makes no functional changes to the ZFS interfaces Lustre currently uses, so when the PR is merged, Lustre's behavior on ZFS OSSs shouldn't change at all. What we have done is provide a couple of new interfaces that Lustre can optionally use to request Direct IO on a per-dbuf basis.

We've done some basic initial performance testing by forcing Lustre to always use the new Direct IO paths and have seen very good results. But I think what we really want is for Lustre to more intelligently control which IOs are submitted as buffered and which are direct. ZFS will guarantee coherency between buffered and direct IOs, so it's mainly a matter of how best to issue them.

One idea would be to integrate with Lustre's existing readcache_max_filesize, read_cache_enable and writethrough_cache_enable tunables, but I don't know how practical that would be. In the short term I can propose a small patch which takes the simplest route and lets us enable/disable it for all IOs, as sketched below. That should provide a reasonable starting place to check out the new interfaces, and hopefully we can take it from there.
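As a sketch of what that simplest-route patch could look like, a single module-wide switch would be enough to gate whether osd-zfs asks ZFS for Direct IO at all. The parameter name and plumbing below are hypothetical, not the actual change:

    #include <linux/module.h>

    /* Hypothetical module-wide switch; illustrative only. */
    static unsigned int osd_zfs_directio_enable;
    module_param(osd_zfs_directio_enable, uint, 0644);
    MODULE_PARM_DESC(osd_zfs_directio_enable,
                     "Submit all osd-zfs reads/writes as ZFS Direct IO (0=off, 1=on)");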



 Comments   
Comment by Andreas Dilger [ 09/Feb/21 ]

Brian, were you aware of the readcache_max_io_mb and writethrough_max_io_mb tunables, which allow deciding on a per-RPC basis whether the IO is large enough to submit directly to storage? I think this would be useful for ZFS as well, since turning off all caching is bad for HDDs, but large IOs (> 8MB) can drive the full HDD bandwidth and do not benefit from cache on the OSS.
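For ZFS the same per-RPC check would boil down to a simple size threshold; a simplified sketch (names approximate, not the exact osd-ldiskfs logic) would be:

    #include <linux/types.h>

    /* Simplified sketch: RPCs at or above the *_max_io_mb threshold bypass the
     * OSS cache and are submitted directly to storage. */
    static bool osd_rpc_should_cache(loff_t io_bytes, unsigned int max_io_mb)
    {
            return io_bytes < ((loff_t)max_io_mb << 20);
    }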

Also, osd-ldiskfs automatically turns off the read/write cache completely for SSD devices for best performance. As yet there is no in-kernel mechanism for determining whether the underlying ZFS dataset is on flash or HDD storage, the way osd-ldiskfs can by checking the bdev directly. Having the od_nonrotational flag set at mount would also be useful because this state is exported to the clients with "lfs df -v" and can be used by tools to decide which OSTs are more suitable for IOPS vs. streaming IO.
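For reference, the bdev check osd-ldiskfs relies on is essentially the block layer's non-rotational flag; a sketch of that check (not the exact Lustre code) is:

    #include <linux/blkdev.h>

    /* Sketch: ask the block layer whether the backing device is non-rotational
     * (SSD/NVMe); osd-ldiskfs uses this state to disable the OSS read/write cache. */
    static bool osd_bdev_nonrot(struct block_device *bdev)
    {
            return blk_queue_nonrot(bdev_get_queue(bdev));
    }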

Comment by Brian Behlendorf [ 10/Feb/21 ]

That does sound useful. I wasn't aware of those tunables, but I agree that if we can make use of them, we should.

While there's no existing interface to check whether a pool is built on flash or HDD storage, we do track the non-rotational information internally. Each vdev has a vd->vdev_nonrot flag, which is set if the vdev is a leaf and non-rotational, or if it's an interior vdev and all of its children are non-rotational. Checking the flag on the pool's root vdev would be a quick way to determine whether there are any HDDs in the pool. If that's sufficient, we can add a function to make that check so it's possible to automatically turn off the read/write cache for SSDs, as osd-ldiskfs does.
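A minimal version of such a helper could look like the sketch below (hypothetical function name; config locking is glossed over):

    #include <sys/spa_impl.h>
    #include <sys/vdev_impl.h>

    /* Hypothetical helper: returns B_TRUE only if every vdev in the pool is
     * non-rotational. The root vdev's vdev_nonrot flag is the logical AND of
     * its children's flags, so checking it covers the whole pool. */
    boolean_t
    spa_is_nonrotational(spa_t *spa)
    {
            return (spa->spa_root_vdev->vdev_nonrot);
    }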

Comment by Gerrit Updater [ 18/Feb/21 ]

Brian Behlendorf (behlendorf1@llnl.gov) uploaded a new patch: https://review.whamcloud.com/41689
Subject: LU-14407 osd-zfs: add basic direct IO support
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 6658c213ae9a3a8664c67036efd9526295d6800a

Comment by Andreas Dilger [ 25/May/21 ]

Per discussion at LUG, this still needs to be hooked into the read/write cache tunable parameters that are already available for ldiskfs, so caching can be tuned on a per-object/per-IO basis:

osd-ldiskfs.myth-OST0000.read_cache_enable=1
osd-ldiskfs.myth-OST0000.writethrough_cache_enable=1
osd-ldiskfs.myth-OST0000.readcache_max_filesize=18446744073709551615
osd-ldiskfs.myth-OST0000.readcache_max_io_mb=8
osd-ldiskfs.myth-OST0000.writethrough_max_io_mb=8