<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:24:03 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-9194] single stream read performance with ZFS OSTs</title>
                <link>https://jira.whamcloud.com/browse/LU-9194</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;With ZFS OSTs the single stream read performance for files striped over 1 OST has dropped and depends very strongly on some tunables. The OSTs used here can read a file directly from ZFS at 1.8GB/s and more, thanks to the good performance of the zfetch prefetcher. With Lustre, read performance is often in the range of 300-500MB/s, which is pretty low compared to the solid 1GB/s we see with ldiskfs on a hardware RAID controller storage unit.&lt;/p&gt;

&lt;p&gt;The explanation for the performance problem is that the read-ahead RPCs issued by the Lustre client (up to 256 in flight) are scheduled in more or less random order on the OSS side (by the ll_ost_io* kernel threads) and break the zfetch pattern. The best performance is achieved with a low max_rpcs_in_flight and a large max_read_ahead_per_file_mb. Unfortunately these settings are bad for loads with many streams per client.&lt;/p&gt;
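The scrambling effect can be illustrated outside the kernel. Below is a minimal sketch (plain Python, a toy model rather than Lustre code): a sequential read stream issued as windows of concurrent RPCs loses the back-to-back offset pattern that a strictly sequential prefetch detector keys on.

```python
import random

def sequential_hits(offsets):
    """Count arrivals that directly follow the previous offset,
    i.e. the pattern a strictly sequential prefetcher can detect."""
    hits = 0
    for prev, cur in zip(offsets, offsets[1:]):
        if cur == prev + 1:
            hits += 1
    return hits

random.seed(0)
stream = list(range(64))          # a 64-RPC sequential read stream

# one RPC in flight: the server sees the stream in issue order
in_order = sequential_hits(stream)

# 16 RPCs in flight: each window of 16 may complete in any order,
# modeling ll_ost_io threads picking requests up randomly
window = 16
scrambled = []
for i in range(0, 64, window):
    batch = stream[i:i + window]
    random.shuffle(batch)
    scrambled.extend(batch)
out_of_order = sequential_hits(scrambled)

print(in_order)                   # prints 63
print(out_of_order)               # far fewer sequential pairs survive
```

The numbers are synthetic, but they show why a larger RPC window systematically defeats a prefetcher that only recognizes strictly ascending offsets.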

&lt;p&gt;The effort in &lt;a href=&quot;https://jira.hpdd.intel.com/browse/LU-8964&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;LU-8964&lt;/a&gt; actually makes read performance with ZFS OSTs worse, because the additional parallel tasks are again scheduled in random order.&lt;/p&gt;

&lt;p&gt;Measurements were done with &quot;dd bs=1MB count=100000 ...&quot;; between measurements the caches were dropped on both OSS and client.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;single stream dd bs=1M read bandwidth (MB/s), ZFS prefetch enabled

rpcs in      max_read_ahead_per_file_mb
flight       1     16     32     64    256
------------------------------------------
   1       335    597    800    817    910
   2       379    500    657    705    690
   4       335    444    516    558    615
   8       339    396    439    471    546
  16       378    359    385    404    507
  32       333    360    378    379    429
  64       332    346    377    377    398
 128       375    359    379    381    402
 256       339    351    380    378    409
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Disabling the ZFS prefetcher completely helps when max_read_ahead_per_file_mb is huge and max_rpcs_in_flight is large, because the many (randomly ordered) read requests produce some sort of prefetching effect on their own. Unfortunately multi-stream workloads then perform very badly, so disabling prefetch is not really a solution.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;single stream dd bs=1M read bandwidth (MB/s), ZFS prefetch disabled

rpcs in      max_read_ahead_per_file_mb
flight       1     16     32     64    256
------------------------------------------
   1       155    247    286    288    283
   2       157    292    360    360    358
   4       157    346    461    465    450
   8       155    389    580    602    604
  16       158    384    614    782    791
  32       152    386    600    878    972
  64       158    386    597    858   1100
 128       155    390    603    863    948
 256       160    382    602    859    934
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;There are probably two approaches to the problem:&lt;/p&gt;
&lt;ol&gt;
	&lt;li&gt;make ZFS zfetch smarter, such that it can cope with the pseudo-randomly ordered read requests from Lustre.&lt;/li&gt;
	&lt;li&gt;change the Lustre client such that it has only one RPC in flight to a particular OST object. This would present an acceptable pattern to zfetch and lead to ~1GB/s for the single stream read.&lt;/li&gt;
&lt;/ol&gt;
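Option 2 would be implemented in C in the osc layer; as a language-neutral sketch (Python, a toy model with hypothetical names), a per-object in-flight cap of 1 preserves each object's offset order at the server even when completions within a window happen in random order:

```python
import random

random.seed(1)

def arrivals(num_objects, reads_per_object, per_object_cap):
    """Model: each iteration the client issues up to per_object_cap
    RPCs per object; the server completes the whole window in random
    order (the ll_ost_io thread scheduling). Returns per-object
    arrival order of offsets."""
    streams = {o: list(range(reads_per_object)) for o in range(num_objects)}
    order = {o: [] for o in range(num_objects)}
    while any(streams.values()):
        window = []
        for o in streams:
            take = min(per_object_cap, len(streams[o]))
            window.extend((o, off) for off in streams[o][:take])
            streams[o] = streams[o][take:]
        random.shuffle(window)          # random completion order
        for o, off in window:
            order[o].append(off)
    return order

capped = arrivals(num_objects=4, reads_per_object=32, per_object_cap=1)
loose  = arrivals(num_objects=4, reads_per_object=32, per_object_cap=8)

# with a cap of 1, each object's offsets arrive strictly in order,
# which is the pattern zfetch can follow
print(all(o == sorted(o) for o in capped.values()))   # prints True
print(all(o == sorted(o) for o in loose.values()))
```

Different objects still proceed in parallel, so aggregate throughput is not serialized; only each object's own stream is.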


&lt;p&gt;This ticket is about implementing option 2. I have prepared patches for tracking the number of read requests per osc_object, but have difficulties limiting/enforcing that count in osc_cache.c. I am hoping for some hints...&lt;/p&gt;</description>
                <environment>ZFS based OSTs&lt;br/&gt;
Lustre 2.9.0 or newer, IEEL 3.0, IEEL 3.1</environment>
        <key id="44572">LU-9194</key>
            <summary>single stream read performance with ZFS OSTs</summary>
                <type id="4" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11310&amp;avatarType=issuetype">Improvement</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="jgmitter">Joseph Gmitter</assignee>
                                    <reporter username="efocht">Erich Focht</reporter>
                        <labels>
                            <label>patch</label>
                            <label>performance</label>
                            <label>zfs</label>
                    </labels>
                <created>Tue, 7 Mar 2017 23:47:30 +0000</created>
                <updated>Fri, 6 Jul 2018 14:29:39 +0000</updated>
                            <resolved>Fri, 6 Jul 2018 14:29:39 +0000</resolved>
                                    <version>Lustre 2.9.0</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>6</watches>
                                                                            <comments>
                            <comment id="187421" author="efocht" created="Wed, 8 Mar 2017 00:41:04 +0000"  >&lt;p&gt;Another (simple) way of getting read requests for a particular ost object issued in order would be to schedule requests from a particular client for a particular object to the same ll_ost_io* thread. I wonder if that&apos;s not actually something the NRS is designed to do. The ORR policy sounds a bit like that.&lt;/p&gt;</comment>
                            <comment id="187496" author="adilger" created="Wed, 8 Mar 2017 18:49:03 +0000"  >&lt;p&gt;Erich, it will not be a good long term solution to limit the RPCs in flight to 1 for a single client, since this will mean no pipelining is happening to cover the network RPC latency (e.g. WAN links with high latency). &lt;/p&gt;

&lt;p&gt;The NRS ORR policy is indeed the right way to handle this case. That allows the OSS to order the RPCs based on offset to optimize disk ordering. This also allows the OSS to reorder RPCs submitted from different clients. The difficulty is that ZFS doesn&apos;t expose the disk offset information to upper levels, so the best that ORR can do on osd-zfs is to submit the reads in file offset order, not in disk offset order as is possible when using osd-ldiskfs. &lt;/p&gt;</comment>
                            <comment id="187555" author="efocht" created="Wed, 8 Mar 2017 22:46:29 +0000"  >&lt;p&gt;Hi Andreas, thanks for commenting, I&apos;ll forget about the one rpc in flight. The more I look at NRS/ORR, the more appropriate it seems, though the parallelism is still spoiling the order of read requests. Many/several ll_ost_io* kthreads pick up nicely sorted requests (even when only one rpc is in flight) which probably get issued in slightly different order than they are picked up. Performance with ORR is not different to what I was seeing before. I&apos;d like to try to serialize the requests for one object, to see if the performance with ZFS OSTs actually changes.&lt;/p&gt;</comment>
                            <comment id="187618" author="jay" created="Thu, 9 Mar 2017 07:21:57 +0000"  >&lt;p&gt;NRS ORR with ZFS prefetch should be the right way to go.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The difficulty is that ZFS doesn&apos;t expose the disk offset information to upper levels, so the best that ORR can do on osd-zfs is to submit the reads in file offset order, not in disk offset order when using osd-ldiskfs.&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;Even though it&apos;ll be difficult to sort the requests by disk offset, the requests can still be sorted by file offset and &lt;tt&gt;dmu_prefetch()&lt;/tt&gt; enabled; then we can approach the read speed of native ZFS.&lt;/p&gt;</comment>
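The file-offset sorting suggested in the comment above can be sketched generically. A toy Python model (prefetch and read are hypothetical stand-ins, not the osd-zfs API): queued read RPCs are sorted by (object, offset) before being handed to the backend, each with a dmu_prefetch-style hint for the range that follows.

```python
issued = []   # records backend calls in the order they are made

def prefetch(obj, offset):
    """Stand-in for a dmu_prefetch-style hint on the next range."""
    issued.append(("prefetch", obj, offset))

def read(obj, offset, length):
    """Stand-in for the actual backend read."""
    issued.append(("read", obj, offset))

def issue_reads(requests):
    # sort queued RPCs by (object id, file offset): the best an
    # ORR-style policy can do when disk offsets are not exposed
    for obj, offset, length in sorted(requests):
        prefetch(obj, offset + length)
        read(obj, offset, length)

# RPCs arrive scrambled; they are issued in file-offset order
issue_reads([(7, 2048, 1024), (7, 0, 1024), (7, 1024, 1024)])
print([off for kind, obj, off in issued if kind == "read"])
# prints [0, 1024, 2048]
```

This only restores file-offset order within the queued batch, which is exactly the limitation noted above for osd-zfs versus osd-ldiskfs.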
                            <comment id="187720" author="efocht" created="Thu, 9 Mar 2017 22:29:31 +0000"  >&lt;p&gt;Errr, my measurements were with an OSS running IEEL 3.1 and zfs 0.6.5.7. It turns out that zfs 0.7.0rc3 has a significantly different &lt;em&gt;dmu_prefetch()&lt;/em&gt;, which behaves totally differently and &lt;b&gt;much better&lt;/b&gt;! I measured with lustre-2.9.0 on top of zfs-0.7.0rc3 on the OSS side and get:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;single OST stream dd read bs=1M (MB/s)
OSS, Client: lustre-2.9.0;  OSS: zfs-0.7.0rc3

 rpcs  |      llite.*.max_read_ahead_per_file_mb
  in   |    1      4      8     16     64    256
flight |----------------------------------------
   1   |  394    665    806    724   1300   1200
   2   |  319    735   1000   1100   1800   1700
   4   |  330    700    797    933   1200   1500
   8   |  333    690    628    817   1100   1400
  16   |  382    749    657    638   1100   1300
  32   |  323    703    618    601   1100   1300
  64   |  371    682    625    606   1100   1300
 128   |  320    719    609    603   1100   1200
 256   |  364    671    643    617   1000   1200
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;It is still possible to do smart things to improve performance, given that there is a peak at 2 rpcs in flight. Right now I&apos;d say that limiting the number of worker threads per object that pull from the ORR binheap after the NRS &quot;sorting&quot; could give us an optimum independent of the value of the &lt;em&gt;max_rpcs_in_flight&lt;/em&gt; tunable.&lt;/p&gt;</comment>
                            <comment id="230008" author="jgmitter" created="Fri, 6 Jul 2018 14:29:39 +0000"  >&lt;p&gt;Closing the ticket as we have moved well beyond these versions in performance testing and production usage.&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                    <customfield id="customfield_10030" key="com.atlassian.jira.plugin.system.customfieldtypes:labels">
                        <customfieldname>Epic/Theme</customfieldname>
                        <customfieldvalues>
                                        <label>Performance</label>
            <label>patch</label>
            <label>zfs</label>
    
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzz67j:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                </customfields>
    </item>
</channel>
</rss>