<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 03:09:28 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-14407] osd-zfs: Direct IO</title>
                <link>https://jira.whamcloud.com/browse/LU-14407</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;We&apos;re getting close to integrating proper direct IO support for ZFS and I wanted to start a conversation about how Lustre can best take advantage of it for very fast SSD/NVMe devices.&lt;/p&gt;

&lt;p&gt;From a functionality perspective we&apos;ve implemented Direct IO such that it entirely bypasses the ARC and avoids as many copies as possible.  This includes the copy between user and kernel space (not really an issue for Lustre) as well as any copies in the IO pipeline.  Obviously, if features like compression or encryption are enabled those transforms of the data still need to happen.  But if not then we&apos;ll do the IO to disk with the provided user pages, or in Lustre&apos;s case, the pages from the loaned ARC buffer.&lt;/p&gt;

&lt;p&gt;The code in the &lt;a href=&quot;https://github.com/openzfs/zfs/pull/10018&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;OpenZFS Direct IO PR&lt;/a&gt; makes no functional changes to the ZFS interfaces Lustre is currently using.  So when the PR is merged Lustre&apos;s behavior when using ZFS OSSs shouldn&apos;t change at all.  What we have done is provide a couple new interfaces that Lustre can optionally use to request Direct IO on a per dbuf basis.&lt;/p&gt;

&lt;p&gt;We&apos;ve done some initial performance testing by forcing Lustre to always use the new Direct IO paths and have seen very good results.  But I think what we really want is for Lustre to somehow more intelligently control which IOs are submitted as buffered and which are direct.  ZFS will guarantee coherency between buffered and direct IOs, so it&apos;s mainly a matter of how best to issue them.&lt;/p&gt;

&lt;p&gt;One idea would be to integrate with Lustre&apos;s existing &lt;tt&gt;readcache_max_filesize&lt;/tt&gt;, &lt;tt&gt;read_cache_enable&lt;/tt&gt; and &lt;tt&gt;writethrough_cache_enable&lt;/tt&gt; tunables, but I don&apos;t know how practical that would be.  In the short term I can propose a small patch which takes the simplest route and lets us enable/disable it for all IOs.  That should provide a reasonable starting place to check out the new interfaces, and hopefully we can take it from there.&lt;/p&gt;</description>
                <environment></environment>
        <key id="62792">LU-14407</key>
            <summary>osd-zfs: Direct IO</summary>
                <type id="2" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11311&amp;avatarType=issuetype">New Feature</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="behlendorf">Brian Behlendorf</assignee>
                                    <reporter username="behlendorf">Brian Behlendorf</reporter>
                        <labels>
                    </labels>
                <created>Tue, 9 Feb 2021 22:59:20 +0000</created>
                <updated>Tue, 25 May 2021 21:45:10 +0000</updated>
                                            <version>Upstream</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>8</watches>
                                                                            <comments>
                            <comment id="291565" author="adilger" created="Tue, 9 Feb 2021 23:23:14 +0000"  >&lt;p&gt;Brian, were you aware of the &lt;tt&gt;readcache_max_io_mb&lt;/tt&gt; and &lt;tt&gt;writethrough_max_io_mb&lt;/tt&gt; tunables, which allow deciding on a per-RPC basis whether the IO is large enough to submit directly to storage?  I think this would be useful for ZFS as well, since turning off &lt;b&gt;all&lt;/b&gt; caching is bad for HDDs, but large IOs (&amp;gt; 8MB) can drive the full HDD bandwidth and do not benefit from cache on the OSS.&lt;/p&gt;
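
&lt;p&gt;For reference, these thresholds are ordinary &lt;tt&gt;lctl&lt;/tt&gt; parameters; a possible starting point for experimentation (the parameter names are the existing osd-ldiskfs ones, the values only illustrative):&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# Submit RPCs larger than 8 MiB directly to storage; cache smaller ones:
lctl set_param osd-ldiskfs.*.readcache_max_io_mb=8
lctl set_param osd-ldiskfs.*.writethrough_max_io_mb=8
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;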

&lt;p&gt;Also, osd-ldiskfs automatically turns off the read/write cache entirely for SSD devices for best performance.  As yet there is no in-kernel mechanism for determining whether the underlying dataset is on flash or HDD storage, as we can do in osd-ldiskfs by checking the bdev directly.  Having the &lt;tt&gt;od_nonrotational&lt;/tt&gt; flag set at mount would potentially also be useful, because this state is exported to the clients with &quot;&lt;tt&gt;lfs df -v&lt;/tt&gt;&quot; and can be used by tools to decide which OSTs are more suitable for IOPS vs. streaming IO.&lt;/p&gt;</comment>
                            <comment id="291579" author="behlendorf" created="Wed, 10 Feb 2021 00:33:03 +0000"  >&lt;p&gt;That does sound useful.  I wasn&apos;t aware of those tunables, but I agree if we can make use of them we should.&lt;/p&gt;

&lt;p&gt;While there&apos;s no existing interface to check whether a pool is built on flash or HDD storage, we do track the non-rotational state internally.  Each vdev has a &lt;tt&gt;vd-&amp;gt;vdev_nonrot&lt;/tt&gt; flag, which is set if the vdev is a leaf and non-rotational, or if it&apos;s an interior vdev and all of its children are non-rotational.  Checking the flag on the pool&apos;s root vdev would be a quick way to determine whether there are any HDDs in the pool.  If that&apos;s sufficient, we can add a function to make that check so it&apos;s possible to automatically turn off the read/write cache for SSDs, as osd-ldiskfs does.&lt;/p&gt;</comment>
                            <comment id="292352" author="gerrit" created="Thu, 18 Feb 2021 22:23:34 +0000"  >&lt;p&gt;Brian Behlendorf (behlendorf1@llnl.gov) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/41689&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/41689&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14407&quot; title=&quot;osd-zfs: Direct IO&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14407&quot;&gt;LU-14407&lt;/a&gt; osd-zfs: add basic direct IO support&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 6658c213ae9a3a8664c67036efd9526295d6800a&lt;/p&gt;</comment>
                            <comment id="302529" author="adilger" created="Tue, 25 May 2021 21:45:10 +0000"  >&lt;p&gt;Per discussion at LUG, this still needs to be hooked into the read/write cache tunable parameters that are also available for ldiskfs to tune this on a per-object/per-IO basis:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;osd-ldiskfs.myth-OST0000.read_cache_enable=1
osd-ldiskfs.myth-OST0000.writethrough_cache_enable=1
osd-ldiskfs.myth-OST0000.readcache_max_filesize=18446744073709551615
osd-ldiskfs.myth-OST0000.readcache_max_io_mb=8
osd-ldiskfs.myth-OST0000.writethrough_max_io_mb=8
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                    </comments>
                    <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                    <customfield id="customfield_10030" key="com.atlassian.jira.plugin.system.customfieldtypes:labels">
                        <customfieldname>Epic/Theme</customfieldname>
                        <customfieldvalues>
                                        <label>Performance</label>
            <label>zfs</label>
    
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i01m3b:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                </customfields>
    </item>
</channel>
</rss>