<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:52:31 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-12429] Single client buffered SSF write is slower than O_DIRECT</title>
                <link>https://jira.whamcloud.com/browse/LU-12429</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Single client&apos;s SSF write doesn&apos;t scale with the number of processes&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# mpirun --allow-run-as-root -np X /work/tools/bin/ior -w -t 16m -b $((32/X))g -e -o file

NP     Write(MB/s)
  1     1594
  2     2525
  4     1892
  8     2032
 16     1812
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;A flame graph of ior with NP=16 showed a huge amount of time spent on spin_lock in add_to_page_cache_lru() and set_page_dirty(). As a result, buffered SSF write on a single client is slower than SSF with O_DIRECT. Here are my quick test results of single-client SSF with/without O_DIRECT.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# mpirun -np 16 --allow-run-as-root /work/tools/bin/ior -w -t 16m -b 4g -e -o /scratch0/stripe/file 
Max Write: 1806.31 MiB/sec (1894.06 MB/sec)

# mpirun -np 16 --allow-run-as-root /work/tools/bin/ior -w -t 16m -b 4g -e -o /scratch0/stripe/file -B
Max Write: 5547.13 MiB/sec (5816.58 MB/sec)
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</description>
                <environment></environment>
        <key id="55934">LU-12429</key>
            <summary>Single client buffered SSF write is slower than O_DIRECT</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="dongyang">Dongyang Li</assignee>
                                    <reporter username="sihara">Shuichi Ihara</reporter>
                        <labels>
                    </labels>
                <created>Wed, 12 Jun 2019 08:10:23 +0000</created>
                <updated>Tue, 21 Jan 2020 09:29:44 +0000</updated>
                                            <version>Lustre 2.13.0</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>6</watches>
                                                                            <comments>
                            <comment id="249122" author="pfarrell" created="Wed, 12 Jun 2019 13:54:20 +0000"  >&lt;p&gt;Ihara,&lt;/p&gt;

&lt;p&gt;So we abandoned this patch because it&apos;s not useful for FPP loads, but given where you&apos;re reporting contention, it should be worth a try for SSF loads - which was the original goal.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://review.whamcloud.com/#/c/28711/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/#/c/28711/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The lock you&apos;re contending on here is the mapping-&amp;gt;tree_lock, which is exactly what this patch helps address contention on.&lt;/p&gt;

&lt;p&gt;Back in the past, I reported a 25% improvement with 8 writers in the SSF case.&#160; You would probably see as much or more with more writers.&lt;/p&gt;

&lt;p&gt;I&apos;ll see if I can rebase it right now...&lt;/p&gt;

&lt;p&gt;Note that we rejected it because it requires re-implementing a certain amount of kernel functionality in a way that is not very pleasing...&#160; But if there&apos;s a big benefit, it&apos;s not necessarily off the table.&lt;/p&gt;</comment>
                            <comment id="249124" author="pfarrell" created="Wed, 12 Jun 2019 14:03:26 +0000"  >&lt;p&gt;Please see&lt;br/&gt;
&lt;a href=&quot;https://review.whamcloud.com/35206&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/35206&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For rebased copy.&#160; Rebase was trivial (one line of comment was the only diff), but had to push to a new Gerrit because the old patch was abandoned.&lt;/p&gt;

&lt;p&gt;Let&apos;s see how much benefit this gets you and we can consider reviving it.&lt;/p&gt;

&lt;p&gt;FWIW, full node shared file direct i/o is probably always going to be faster than buffered...&lt;/p&gt;</comment>
                            <comment id="249129" author="pfarrell" created="Wed, 12 Jun 2019 15:04:38 +0000"  >&lt;p&gt;By the way, the contention here is two sided - It&apos;s adding pages to the mapping tree/lru, and it&apos;s marking them dirty.&#160; (For some reason, removing them from the radix tree doesn&apos;t show up in here.&#160; Possibly because it&apos;s already optimized with pagevecs or possibly because the test didn&apos;t run long enough?)&lt;/p&gt;

&lt;p&gt;Anyway, naturally we would like to optimize the adding side as well.&#160; Unfortunately, the way Linux does writing makes that quite hard.&#160; Adding a page to the cache happens in ll_write_begin, which is called on each page as part of generic_file_buffered_write in the kernel.&#160; It is required that after that call the page be inserted into the radix tree for the file being written.&lt;/p&gt;

&lt;p&gt;This means that there&apos;s not really any way to batch this at this step.&lt;/p&gt;

&lt;p&gt;If we wanted, we could potentially try adding the requisite pages in batch &lt;b&gt;before&lt;/b&gt; we got there - I would think in vvp_io_write_start - but that would still require open-coding batch functionality for adding pages to the cache.&lt;/p&gt;

&lt;p&gt;Specifically, we&apos;d have to open code:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;__add_to_page_cache_locked
add_to_page_cache_lru
grab_cache_page_nowait &lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;This would let us just find already added pages in ll_write_begin, which requires no locking and is quite fast.&lt;/p&gt;

&lt;p&gt;Which is a fair bit of kernel internal functionality.&#160; Yuck.&#160; It&apos;s something you&apos;d want to upstream first, ideally...&#160; (As is &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-9920&quot; title=&quot;Use pagevec for marking pages dirty&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-9920&quot;&gt;&lt;del&gt;LU-9920&lt;/del&gt;&lt;/a&gt; to be honest...)&lt;/p&gt;</comment>
                            <comment id="249160" author="sihara" created="Wed, 12 Jun 2019 23:33:38 +0000"  >&lt;p&gt;&lt;a href=&quot;https://review.whamcloud.com/35206&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/35206&lt;/a&gt; improved SSF write by 25%, but there is still a big gap against non-buffered IO.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Max Write: 2278.23 MiB/sec (2388.90 MB/sec)
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Will check with newer linux kernel to compare.&lt;/p&gt;</comment>
                            <comment id="249215" author="pfarrell" created="Thu, 13 Jun 2019 16:24:54 +0000"  >&lt;p&gt;That patch probably doesn&apos;t work with newer kernels - The mapping-&amp;gt;tree_lock has been renamed.&#160; I need to fix that, and will do shortly...&#160; But you shouldn&apos;t expect much benefit, there have not been many changes in that area.&#160; Just some reshuffling.&lt;/p&gt;</comment>
                            <comment id="249216" author="pfarrell" created="Thu, 13 Jun 2019 16:28:24 +0000"  >&lt;p&gt;I&apos;m glad the patch improves things by 25%.&#160; I&apos;m pretty sure a new flame graph would basically show more time shifting to the contention on page allocation rather than page dirtying, but still those two hot spots.&#160; It would be interesting to see, though.&lt;/p&gt;

&lt;p&gt;Backing up:&lt;br/&gt;
Packed node direct i/o with reasonable sizes is always going to be better than buffered i/o.&#160; We&apos;re not going to be able to fix that unless we were to convert direct to buffered in that scenario.&lt;/p&gt;

&lt;p&gt;I also don&apos;t have any other good ideas for improvements - The contention we&apos;re facing is in the page cache itself, and Lustre isn&apos;t contributing to it.&#160; Unless we want to do something radical like try to convert from buffered to direct when we run into trouble, there will always be a gap.&#160; (I don&apos;t like the idea of switching when the node is busy, for a variety of reasons, FYI)&lt;/p&gt;

&lt;p&gt;So I think we have to decide what the goal is for this ticket, as the implied goal of making them the same is, unfortunately, not realistic.&lt;/p&gt;</comment>
                            <comment id="249303" author="sihara" created="Fri, 14 Jun 2019 23:09:01 +0000"  >&lt;p&gt;&lt;a href=&quot;https://review.whamcloud.com/#/c/28711/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/#/c/28711/&lt;/a&gt; (latest patchset 8) doesn&apos;t help very much either.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# mpirun -np 16 --allow-run-as-root /work/tools/bin/ior -w -t 16m -b 4g -e -o /cache1/stripe/file 
Max Write: 2109.99 MiB/sec (2212.49 MB/sec)
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="257045" author="sihara" created="Thu, 24 Oct 2019 23:11:15 +0000"  >&lt;p&gt;DY, attached is a flame graph of the Lustre client during a single-thread IOR write. It might be related, but it is a different workload (e.g. buffered IO vs O_DIRECT, single thread vs single client). I wonder if I should open a new ticket for it?&lt;/p&gt;</comment>
                            <comment id="257047" author="dongyang" created="Fri, 25 Oct 2019 00:03:55 +0000"  >&lt;p&gt;I agree, this ticket is more about the page cache overhead for multi-thread buffered write.&lt;/p&gt;</comment>
                            <comment id="261557" author="adilger" created="Tue, 21 Jan 2020 09:29:44 +0000"  >&lt;p&gt;Does it make sense to just automatically bypass the page cache on the client for &lt;tt&gt;read()&lt;/tt&gt; and/or &lt;tt&gt;write()&lt;/tt&gt; calls that are large enough and aligned (essentially use &lt;tt&gt;O_DIRECT&lt;/tt&gt; automatically)?  For example, read/write over 16MB if single-threaded, or over 4MB if multi-threaded?  That would totally avoid the overhead of the page cache for those syscalls.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="47978">LU-9920</issuekey>
        </issuelink>
                            </outwardlinks>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="10666">LU-247</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="32772" name="lustre-ssf.svg" size="161674" author="sihara" created="Wed, 12 Jun 2019 07:54:39 +0000"/>
                            <attachment id="33712" name="single-thread.svg" size="495221" author="sihara" created="Thu, 24 Oct 2019 23:11:29 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i00i3z:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>