<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:18:14 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-8515] OSC: Send RPCs with full extents</title>
                <link>https://jira.whamcloud.com/browse/LU-8515</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;In Lustre 2.7 and newer, single node multi-process single-shared-file write performance is significantly slower than in Lustre 2.5.  This is due to a problem in deciding when to make an RPC (i.e., the decisions made in osc_makes_rpc).&lt;/p&gt;

&lt;p&gt;Currently, Lustre decides to send an RPC under a number of&lt;br/&gt;
conditions (such as memory pressure or lock cancellation);&lt;br/&gt;
one of the conditions it looks for is &quot;enough dirty pages&lt;br/&gt;
to fill an RPC&quot;. This worked fine when only one process&lt;br/&gt;
could be dirtying pages at a time, but in newer Lustre&lt;br/&gt;
versions, more than one process can write to the same&lt;br/&gt;
file (and the same osc object) at once.&lt;/p&gt;

&lt;p&gt;In this case, the &quot;count dirty pages method&quot; will see there&lt;br/&gt;
are enough dirty pages to fill an RPC, but since the dirty&lt;br/&gt;
pages are being created by multiple writers, they are not&lt;br/&gt;
contiguous and will not fit into one RPC.  This resulted in&lt;br/&gt;
many RPCs of less than full size being sent, despite a&lt;br/&gt;
good I/O pattern.  (Earlier versions of Lustre usually&lt;br/&gt;
send only full RPCs when presented with this pattern.)&lt;/p&gt;

&lt;p&gt;Instead, we remove this check and add extents to a special&lt;br/&gt;
full extent list when they reach max pages per RPC, then&lt;br/&gt;
send from that list. (This is similar to high priority&lt;br/&gt;
and urgent extents.)&lt;/p&gt;
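The mechanism described above can be sketched roughly as follows. This is a hypothetical Python simulation, not actual Lustre code; the names (Extent, OscObject, full_extents) mirror the patch's concepts rather than real kernel identifiers:

```python
# Sketch of the change: extents that reach max_pages_per_rpc move onto a
# dedicated "full extent" list (similar to hp/urgent extents), and the
# send decision checks that list instead of counting all dirty pages.

MAX_PAGES_PER_RPC = 256  # default: 256 x 4 KiB pages = one 1 MiB RPC

class Extent:
    def __init__(self, start_page):
        self.start = start_page
        self.npages = 0

class OscObject:
    def __init__(self):
        self.active_extents = []   # extents still being dirtied
        self.full_extents = []     # extents ready to go out as full RPCs

    def dirty_page(self, extent):
        extent.npages += 1
        if extent.npages >= MAX_PAGES_PER_RPC:
            # extent is full: queue it for sending
            self.active_extents.remove(extent)
            self.full_extents.append(extent)

    def makes_rpc(self):
        # old code fired once total dirty pages in the object reached
        # max_pages_per_rpc, even if those pages were non-contiguous.
        # new code: send only when some single extent is actually full.
        return len(self.full_extents) > 0

# two writers each dirty half an RPC worth of non-contiguous pages:
obj = OscObject()
a, b = Extent(0), Extent(10_000)
obj.active_extents = [a, b]
for _ in range(128):
    obj.dirty_page(a)
    obj.dirty_page(b)
print(obj.makes_rpc())  # False: 256 dirty pages total, but no full extent
```

Under the old "count dirty pages" check, the 256 dirty pages above would already have triggered a send, producing two half-size RPCs.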

&lt;p&gt;With a good I/O pattern, like those usually used in benchmarking,&lt;br/&gt;
it should be possible to send only full size RPCs. This&lt;br/&gt;
patch achieves that without degrading performance in other&lt;br/&gt;
cases.&lt;/p&gt;

&lt;p&gt;In IOR tests with multiple writers to a single file,&lt;br/&gt;
this patch improves performance by several times, and&lt;br/&gt;
returns performance to equal levels (single striped files)&lt;br/&gt;
or much greater levels (very high speed OSTs, files&lt;br/&gt;
with many stripes) vs earlier versions.&lt;/p&gt;

&lt;p&gt;Here&apos;s some specific data:&lt;br/&gt;
On this machine and storage system, the best bandwidth we can get to a single stripe from one node is about 330 MB/s.  This occurs with one writer.  All tests are run on a newly created, singly striped file, except where a higher stripe count is specified.&lt;/p&gt;

&lt;p&gt;IOR: aprun -n 1 $(IOR) -w -t 4m -b 16g -C -e -E -k -u -v&lt;br/&gt;
(1 thread, 4 MiB transfer size, 16GB per thread.)&lt;/p&gt;

&lt;p&gt;Unmodified:&lt;br/&gt;
write         334.12     334.12      334.12      0.00       83.53      83.53    &lt;br/&gt;
write         329.34     329.34      329.34      0.00       82.33      82.33  &lt;br/&gt;
write         329.37     329.37      329.37      0.00       82.34      82.34    &lt;/p&gt;

&lt;p&gt;Modified (full extent):&lt;br/&gt;
write         329.47     329.47      329.47      0.00       82.37      82.37    &lt;br/&gt;
write         339.33     339.33      339.33      0.00       84.83      84.83        &lt;br/&gt;
write         323.18     323.18      323.18      0.00       80.80      80.80    &lt;/p&gt;

&lt;p&gt;Here&apos;s an example of the improvement available.  We&apos;re using 8 threads and 1 GB of data per thread.  (Results are similar with a larger amount of data per thread.)&lt;br/&gt;
IOR: aprun -n 8 $(IOR) -w -t 4m -b 1g -C -e -E -k -u -v&lt;br/&gt;
Unmodified:&lt;br/&gt;
write          87.24      87.24       87.24      0.00       21.81      21.81   &lt;br/&gt;
write          89.26      89.26       89.26      0.00       22.31      22.31     &lt;br/&gt;
write          90.45      90.45       90.45      0.00       22.61      22.61    &lt;/p&gt;

&lt;p&gt;Modified:&lt;br/&gt;
write         345.72     345.72      345.72      0.00       86.43      86.43    &lt;br/&gt;
write         334.14     334.14      334.14      0.00       83.53      83.53    &lt;br/&gt;
write         351.03     351.03      351.03      0.00       87.76      87.76    &lt;/p&gt;

&lt;p&gt;Note the above is actually a shade higher than the single thread performance, despite being at essentially the limit for the target (from this node, with these settings).&lt;/p&gt;

&lt;p&gt;2 stripes:&lt;/p&gt;

&lt;p&gt;1 thread, unmodified:&lt;br/&gt;
write         614.48     614.48      614.48      0.00      153.62     153.62    &lt;br/&gt;
write         626.98     626.98      626.98      0.00      156.75     156.75    &lt;br/&gt;
write         610.14     610.14      610.14      0.00      152.53     152.53    &lt;/p&gt;

&lt;p&gt;1 thread, modified:&lt;br/&gt;
write         627.86     627.86      627.86      0.00      156.97     156.97    &lt;br/&gt;
write         625.68     625.68      625.68      0.00      156.42     156.42    &lt;br/&gt;
write         625.47     625.47      625.47      0.00      156.37     156.37    &lt;/p&gt;

&lt;p&gt;8 threads, unmodified:&lt;br/&gt;
write         172.24     172.24      172.24      0.00       43.06      43.06    &lt;br/&gt;
write         180.02     180.02      180.02      0.00       45.01      45.01    &lt;br/&gt;
write         186.17     186.17      186.17      0.00       46.54      46.54    &lt;/p&gt;

&lt;p&gt;8 threads, modified:&lt;br/&gt;
write         614.53     614.53      614.53      0.00      153.63     153.63    &lt;br/&gt;
write         604.05     604.05      604.05      0.00      151.01     151.01    &lt;br/&gt;
write         616.77     616.77      616.77      0.00      154.19     154.19    &lt;/p&gt;

&lt;p&gt;8 stripes:&lt;br/&gt;
Note - These tests were run with 4 or 8 GB of data per thread, otherwise they completed too quickly for me to be comfortable (though the numbers were similar).  Performance numbers were the same across all total amounts of data tested.  Numbers given below are representative - I repeated each test several times, but didn&apos;t want to put in that much data.&lt;/p&gt;

&lt;p&gt;1 thread, unmodified:&lt;br/&gt;
write        1270.16    1270.16     1270.16      0.00      317.54     317.54    &lt;/p&gt;

&lt;p&gt;1 thread, modified:&lt;br/&gt;
write        1256.26    1256.26     1256.26      0.00      314.06     314.06    &lt;/p&gt;

&lt;p&gt;8 threads, unmodified:&lt;br/&gt;
write         712.33     712.33      712.33      0.00      178.08     178.08    &lt;/p&gt;

&lt;p&gt;8 threads, modified:&lt;br/&gt;
write        1949.85    1949.85     1949.85      0.00      487.46     487.46    &lt;/p&gt;


&lt;p&gt;16 stripes:&lt;/p&gt;

&lt;p&gt;8 threads, unmodified:  &lt;br/&gt;
write        1461.83    1461.83     1461.83      0.00      365.46     365.46 &lt;/p&gt;

&lt;p&gt;8 threads, modified:  &lt;br/&gt;
write        3082.42    3082.42     3082.42      0.00      770.61     770.61&lt;/p&gt;</description>
                <environment></environment>
        <key id="38965">LU-8515</key>
            <summary>OSC: Send RPCs with full extents</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="paf">Patrick Farrell</assignee>
                                    <reporter username="paf">Patrick Farrell</reporter>
                        <labels>
                    </labels>
                <created>Thu, 18 Aug 2016 21:47:10 +0000</created>
                <updated>Thu, 8 Mar 2018 05:39:01 +0000</updated>
                            <resolved>Sat, 17 Dec 2016 14:15:57 +0000</resolved>
                                                    <fixVersion>Lustre 2.10.0</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>7</watches>
                                                                            <comments>
                            <comment id="162461" author="gerrit" created="Thu, 18 Aug 2016 21:51:53 +0000"  >&lt;p&gt;Patrick Farrell (paf@cray.com) uploaded a new patch: &lt;a href=&quot;http://review.whamcloud.com/22012&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/22012&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-8515&quot; title=&quot;OSC: Send RPCs with full extents&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-8515&quot;&gt;&lt;del&gt;LU-8515&lt;/del&gt;&lt;/a&gt; osc: Send RPCs when extents are full&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: eee670db16607a21331c265a1cb041b8654cc586&lt;/p&gt;</comment>
                            <comment id="162498" author="jay" created="Fri, 19 Aug 2016 06:06:08 +0000"  >&lt;p&gt;Were these tests running with 4MB transfer size, and the max_pages_per_rpc is set to the default value that is 256?&lt;/p&gt;</comment>
                            <comment id="162533" author="paf" created="Fri, 19 Aug 2016 15:19:30 +0000"  >&lt;p&gt;Yes, they were all done with 4 MiB transfer size and 1 MiB RPCs (So, 256 max_pages_per_rpc).  I can&apos;t increase the RPC size on this system, but I can reduce the transfer size to 1 MiB, so they&apos;re matched.  (Note that stripe size is 1 MiB.)&lt;/p&gt;

&lt;p&gt;Here&apos;s (a few of) those tests repeated with 1 MiB transfer size.  If you have specific tests you&apos;d like run, let me know and I can try to get time later.  I can also get some rpc_stats data for a few tests if needed.&lt;/p&gt;

&lt;p&gt;For the modified version, the results are the same.  For the unmodified version, speed is up a bit across the board, but still a &lt;b&gt;lot&lt;/b&gt; slower than the modified version.  This is expected, since with 4 MiB transfers, every write touches multiple stripes (all but guaranteeing multiple active extents in each osc object at the same time, which causes us to send small RPCs).  With 1 MiB transfers, this doesn&apos;t happen all the time - But it does happen some of the time.  (When writers, stripe sizes, and stripe counts match up, like in the 8 stripe case, I&apos;m not clear on why we still have problems.  I would expect the writers not to interfere in that case - But the results below show they seem to.  Probably needs separate investigation.)&lt;/p&gt;

&lt;p&gt;Singly striped:&lt;/p&gt;

&lt;p&gt;1 thread, 1 MiB transfer size.  (These results should be the same.  They tend to be, within the margin of error.)&lt;/p&gt;

&lt;p&gt;Unmodified:&lt;br/&gt;
write         337.20     337.20      337.20      0.00      337.20     337.20      337.20      0.00  24.29393   1 1 1 0 1 1 0 0 1 0 1048576 8589934592 -1 POSIX EXCEL&lt;/p&gt;

&lt;p&gt;Modified:&lt;br/&gt;
write         360.45     360.45      360.45      0.00      360.45     360.45      360.45      0.00  22.72692   1 1 1 0 1 1 0 0 1 0 1048576 8589934592 -1 POSIX EXCEL&lt;/p&gt;

&lt;p&gt;Here&apos;s where we see the difference.&lt;/p&gt;

&lt;p&gt;8 threads, 1 MiB transfer size:&lt;br/&gt;
Unmodified:&lt;br/&gt;
write         103.66     103.66      103.66      0.00      103.66     103.66      103.66      0.00  79.02997   8 8 1 0 1 1 0 0 1 1073741824 1048576 8589934592 -1 POSIX EXCEL&lt;/p&gt;

&lt;p&gt;Modified:&lt;br/&gt;
write         342.51     342.51      342.51      0.00      342.51     342.51      342.51      0.00  23.91763   8 8 1 0 1 1 0 0 1 1073741824 1048576 8589934592 -1 POSIX EXCEL&lt;/p&gt;

&lt;p&gt;8 stripes:&lt;/p&gt;

&lt;p&gt;8 threads, 1 MiB transfer size:&lt;br/&gt;
Unmodified:&lt;br/&gt;
write         869.89     869.89      869.89      0.00      869.89     869.89      869.89      0.00   9.41728   8 8 1 0 1 1 0 0 1 1073741824 1048576 8589934592 -1 POSIX EXCEL&lt;/p&gt;

&lt;p&gt;Modified:&lt;br/&gt;
write        1882.86    1882.86     1882.86      0.00     1882.86    1882.86     1882.86      0.00   4.35084   8 8 1 0 1 1 0 0 1 1073741824 1048576 8589934592 -1 POSIX EXCEL&lt;/p&gt;

&lt;p&gt;16 stripes:&lt;/p&gt;

&lt;p&gt;8 threads, 1 MiB transfer size:&lt;br/&gt;
Unmodified:&lt;br/&gt;
write        2062.42    2062.42     2062.42      0.00     2062.42    2062.42     2062.42      0.00  31.77620   8 8 1 0 1 1 0 0 1 0 1048576 68719476736 -1 POSIX EXCEL&lt;/p&gt;

&lt;p&gt;Modified:&lt;br/&gt;
write        3092.02    3092.02     3092.02      0.00     3092.02    3092.02     3092.02      0.00  21.19523   8 8 1 0 1 1 0 0 1 0 1048576 68719476736 -1 POSIX EXCEL&lt;/p&gt;
</comment>
                            <comment id="162539" author="paf" created="Fri, 19 Aug 2016 15:53:10 +0000"  >&lt;p&gt;By the way, from &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-1669&quot; title=&quot;lli-&amp;gt;lli_write_mutex (single shared file performance)&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-1669&quot;&gt;&lt;del&gt;LU-1669&lt;/del&gt;&lt;/a&gt;:&lt;br/&gt;
&lt;a href=&quot;https://jira.hpdd.intel.com/browse/LU-1669?focusedCommentId=57794&amp;amp;page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-57794&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://jira.hpdd.intel.com/browse/LU-1669?focusedCommentId=57794&amp;amp;page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-57794&lt;/a&gt;&lt;/p&gt;
&lt;blockquote&gt;&lt;/blockquote&gt;
&lt;p&gt;SIDE NOTE:&lt;/p&gt;

&lt;p&gt;As an aside, I ran into what appeared to be poor RPC formation when writing to a single shared file from many different threads, and simultaneously hitting the per OSC dirty page limit. During &quot;Test 5&quot; (and to a lesser extent, &quot;Test 2&quot; as well) I began to see many non-1M RPCs being sent to the OSS nodes, whereas with the other tests, nearly all of the RPCs were 1M in size. This effect got worse as the number of tasks increased.&lt;/p&gt;

&lt;p&gt;What I think was happening is this:&lt;/p&gt;


&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;As the client pushes data fast enough to the server, it bumps up&lt;br/&gt;
   against the per OSC dirty limit, thus RPCs are forcefully flushed out&lt;/li&gt;
	&lt;li&gt;As this is happening, threads are continuously trying to write data&lt;br/&gt;
   to their specific region of the file. Some tasks are able to fill a&lt;br/&gt;
   full 1M buffer before the dirty limit forces a flush, but some tasks&lt;br/&gt;
   are not.&lt;/li&gt;
	&lt;li&gt;Buffers to non-contiguous regions of a file are not joined together,&lt;br/&gt;
   so the smaller non-1M buffers are forced out in non-optimal small&lt;br/&gt;
   RPCs.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;I believe this effect was only apparent in Test 2 and Test 5, because the other tests just weren&apos;t able to push data to the server fast enough to bump up against the dirty limit.&lt;/p&gt;

&lt;p&gt;It would be nice if the RPC formation engine would keep these small buffers around, waiting for them to reach a full 1M, before flushing them out to the server. This is especially harmful on ZFS backends because it can force read-modify-write operations, as opposed to only performing writes when the RPC is properly aligned at 1M.&lt;/p&gt;

&lt;p&gt;He was partly right.  It turns out the main reason for non-optimal RPC sizes was actually this choice about when to send in osc_makes_rpc.&lt;/p&gt;</comment>
                            <comment id="162557" author="jay" created="Fri, 19 Aug 2016 18:19:05 +0000"  >&lt;p&gt;The client chooses to send RPCs earlier because there is no more grant, or the dirty pages have reached their limit, so it can&apos;t cache more dirty data.&lt;/p&gt;

&lt;p&gt;The problem with your patch is that it may cause livelock - the server is short of space and there is no chance for this client to make a full RPC, and then individual threads will hold their own partial extents, and nobody can move forward.&lt;/p&gt;</comment>
                            <comment id="162560" author="paf" created="Fri, 19 Aug 2016 18:40:02 +0000"  >&lt;p&gt;Jinshan,&lt;/p&gt;

&lt;p&gt;Hmmm.&lt;/p&gt;

&lt;p&gt;I don&apos;t think this change affects the behavior of the client in the case where we&apos;re out of grant, or dirty pages have reached their limit.  Those cases are handled separately in osc_makes_rpc.  In that case, we send an RPC due to cl_cache_waiters.&lt;/p&gt;

&lt;p&gt;The check I replaced is this one:&lt;br/&gt;
                if (atomic_read(&amp;amp;osc-&amp;gt;oo_nr_writes) &amp;gt;=&lt;br/&gt;
                    cli-&amp;gt;cl_max_pages_per_rpc)&lt;/p&gt;

&lt;p&gt;Which is just to send an RPC when we have enough dirty pages for one.  We already decide to send an RPC when there are cache waiters.&lt;/p&gt;

&lt;p&gt;Are you saying there is a case where we &lt;b&gt;must&lt;/b&gt; write out some data, but we do not generate hp exts, urgent exts, or have cache waiters?  I think that&apos;s what would be required to get a livelock.  If so, then the existing check won&apos;t make us safe either.  It&apos;s a per-object check that won&apos;t fire unless there are &amp;gt;= cl_max_pages_per_rpc dirty pages in that object.  So with the current code, a large number of objects could cause the livelock you describe, if all of them had a small amount of data written to them.  (Since this check wouldn&apos;t catch that.)&lt;/p&gt;</comment>
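The per-object scope of the check being discussed can be illustrated with a hypothetical Python sketch (not Lustre code): the oo_nr_writes threshold never fires when dirty data is spread thinly across many objects, even though the client holds many dirty pages in aggregate.

```python
# Sketch of why the per-object threshold is not a safety net on its own:
# it only looks at one object's dirty-page count.

CL_MAX_PAGES_PER_RPC = 256

def object_makes_rpc(oo_nr_writes):
    # the per-object check under discussion
    return oo_nr_writes >= CL_MAX_PAGES_PER_RPC

# 1000 objects, each with only 16 dirty pages: 16000 dirty pages total,
# yet no single object passes the per-object threshold.
objects = [16] * 1000
total_dirty = sum(objects)
any_fires = any(object_makes_rpc(n) for n in objects)
print(total_dirty, any_fires)  # 16000 False
```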
                            <comment id="162562" author="jay" created="Fri, 19 Aug 2016 19:20:08 +0000"  >&lt;blockquote&gt;
&lt;p&gt; We already decide to send an RPC when there are cache waiters.&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;The function osc_makes_rpc() just tells the I/O engine it &lt;em&gt;could&lt;/em&gt; be possible to issue a BRW RPC, and then the I/O engine scans the OSC page cache to try to compose an RPC. It doesn&apos;t guarantee that it can make one.&lt;/p&gt;

&lt;p&gt;With this in mind, I think we should keep the code:&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;&lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (atomic_read(&amp;amp;osc-&amp;gt;oo_nr_writes) &amp;gt;= cli-&amp;gt;cl_max_pages_per_rpc)
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;but it&apos;s a really good idea to have a full extent list and compose RPCs from that list first.&lt;/p&gt;

&lt;p&gt;Anyway, I think your observation is valuable, and your patch makes a lot of sense to me, just need to tweak a little bit.&lt;/p&gt;</comment>
                            <comment id="162563" author="jay" created="Fri, 19 Aug 2016 19:22:10 +0000"  >&lt;p&gt;btw, is it possible to write a few test cases based on your findings so that we won&apos;t break this in the future?&lt;/p&gt;</comment>
                            <comment id="162567" author="paf" created="Fri, 19 Aug 2016 19:37:50 +0000"  >&lt;p&gt;About test cases:&lt;br/&gt;
Yes, I think it&apos;s probably possible to do that.  I&apos;ll have to think about it - I think it would be something like &quot;run an IOR job with a good I/O pattern and check the rpc sizes&quot;.  Make sure, say, 90 or 95% of RPCs are maximum size.  (I get 99%-100% in my testing.)&lt;/p&gt;</comment>
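The proposed check could be sketched like this in Python (hypothetical; the histogram format is illustrative and is not the actual osc rpc_stats layout):

```python
# Given a histogram mapping "pages per RPC" to RPC counts, compute what
# fraction of write RPCs were full size, as a pass/fail test criterion.

def full_rpc_fraction(pages_hist, max_pages_per_rpc=256):
    total = sum(pages_hist.values())
    full = pages_hist.get(max_pages_per_rpc, 0)
    return full / total if total else 0.0

# e.g. 990 full 256-page RPCs plus 10 stragglers: 99% full,
# which clears a 95% threshold
hist = {256: 990, 64: 6, 32: 4}
frac = full_rpc_fraction(hist)
print(frac >= 0.95)  # True
```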
                            <comment id="162569" author="paf" created="Fri, 19 Aug 2016 19:44:11 +0000"  >&lt;blockquote&gt;
&lt;p&gt;The function osc_makes_rpc() just tells the I/O engine it could be possible to issue a BRW RPC, and then the I/O engine scans the OSC page cache to try to compose an RPC. It doesn&apos;t guarantee that it can make one.&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;Right.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;With this in mind, I think we should keep the code:&lt;/p&gt;

&lt;p&gt;if (atomic_read(&amp;amp;osc-&amp;gt;oo_nr_writes) &amp;gt;= cli-&amp;gt;cl_max_pages_per_rpc)&lt;/p&gt;

&lt;p&gt;but it&apos;s a really good idea to have a full extent list and compose RPCs from that list first.&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;I don&apos;t think we can do that.  I can try adding it back in and running some benchmarks (let me know if you want me to do that), but we call osc_makes_rpc for every page we dirty.  So we&apos;ll try to make an RPC, and (in get_write_extents) if there are no full extents, we&apos;ll still send some data.  So we&apos;ll send our extents before they get to full size.&lt;/p&gt;</comment>
                            <comment id="162572" author="jay" created="Fri, 19 Aug 2016 20:18:43 +0000"  >&lt;blockquote&gt;
&lt;p&gt;but we call osc_makes_rpc for every page we dirty.&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;It&apos;s a bug if this happens - we should only call this whenever it&apos;s possible to make an RPC, for example, releasing an osc_extent, brw_interpret(), or other urgent cases.&lt;/p&gt;

&lt;p&gt;Yes, please do the tests with my suggestions, it will help me understand the code better.&lt;/p&gt;</comment>
                            <comment id="162586" author="paf" created="Fri, 19 Aug 2016 21:19:36 +0000"  >&lt;p&gt;Ah, right, sorry.  I misread the code (there are a lot of ways to call osc_makes_rpc...).  You&apos;re right about when we call it.&lt;/p&gt;

&lt;p&gt;So it probably is safe (in terms of performance) to add back that check, as long as we track full extents and try to send them first in get_write_extents.&lt;/p&gt;

&lt;p&gt;I&apos;ll try it - it would just replace checking list_empty in osc_makes_rpc.  If it works, then it doesn&apos;t really matter which check we have there, the current one or checking the full_ext list directly.&lt;/p&gt;</comment>
                            <comment id="162686" author="paf" created="Mon, 22 Aug 2016 16:22:36 +0000"  >&lt;p&gt;With this check, I think we&apos;re probably getting a few more non-optimally sized RPCs.&lt;/p&gt;

&lt;p&gt;This is because at the end of brw_interpret, osc_io_unplug is called, and that results in calling osc_check_rpcs.  So each time a ptlrpcd thread completes a brw write, it calls osc_check_rpcs, which calls osc_makes_rpc.&lt;/p&gt;

&lt;p&gt;So that&apos;s one case where we call osc_makes_rpc when we don&apos;t know if anything is ready for us.  So I think with the oo_nr_writes check, we will sometimes send small RPCs with incomplete extents (that we&apos;re still writing to).&lt;/p&gt;

&lt;p&gt;Also, I think the oo_nr_writes &amp;gt;= cl_max_pages_per_rpc check is specifically trying to send when we expect to send out a full size RPC.  So I think it&apos;s better to just check for &apos;full extents&apos; - it better describes what we&apos;re really doing.&lt;/p&gt;

&lt;p&gt;But if we do keep the oo_nr_writes check, I think it will work fine. &lt;img class=&quot;emoticon&quot; src=&quot;https://jira.whamcloud.com/images/icons/emoticons/smile.png&quot; height=&quot;16&quot; width=&quot;16&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/p&gt;</comment>
                            <comment id="162720" author="paf" created="Mon, 22 Aug 2016 19:30:57 +0000"  >&lt;p&gt;Sorry about deleting that comment.  It turns out I ran the tests above in FPP mode, not SSF mode.  Ugh.&lt;/p&gt;

&lt;p&gt;Rerunning the tests, I see a significant difference in favor of not using the oo_nr_writes check.  Until we hit the bandwidth limit of the node (16 stripes), !list_empty(full_ext) is &lt;b&gt;significantly&lt;/b&gt; faster.&lt;/p&gt;

&lt;p&gt;My explanation is still the call to osc_makes_rpc we get every time we call brw_interpret.  I think that&apos;s generating a &lt;b&gt;lot&lt;/b&gt; of non-optimal RPCs.&lt;/p&gt;

&lt;p&gt;Here&apos;s some data, using 1 MiB transfer sizes (4 MiB was very similar):&lt;br/&gt;
1 stripe, 8 process:&lt;br/&gt;
oo_nr_writes&lt;br/&gt;
write          90.97      90.97       90.97      0.00       90.97      90.97       90.97      0.00  45.02345   8 8 1 0 1 1 0 0 1 536870912 1048576 4294967296 -1 POSIX EXCEL&lt;br/&gt;
write          90.97      90.97       90.97      0.00       90.97      90.97       90.97      0.00  45.02626   8 8 1 0 1 1 0 0 1 536870912 1048576 4294967296 -1 POSIX EXCEL&lt;br/&gt;
write          90.29      90.29       90.29      0.00       90.29      90.29       90.29      0.00  45.36592   8 8 1 0 1 1 0 0 1 536870912 1048576 4294967296 -1 POSIX EXCEL&lt;br/&gt;
^-- This test gives about 70 MB/s with no changes at all.&lt;/p&gt;

&lt;p&gt;!list_empty(full_ext):&lt;br/&gt;
write         346.29     346.29      346.29      0.00      346.29     346.29      346.29      0.00  11.82837   8 8 1 0 1 1 0 0 1 536870912 1048576 4294967296 -1 POSIX EXCEL&lt;br/&gt;
write         350.43     350.43      350.43      0.00      350.43     350.43      350.43      0.00  11.68849   8 8 1 0 1 1 0 0 1 536870912 1048576 4294967296 -1 POSIX EXCEL&lt;br/&gt;
write         350.00     350.00      350.00      0.00      350.00     350.00      350.00      0.00  11.70296   8 8 1 0 1 1 0 0 1 536870912 1048576 4294967296 -1 POSIX EXCEL&lt;/p&gt;

&lt;p&gt;8 stripes, 8 processes:&lt;br/&gt;
oo_nr_writes&lt;br/&gt;
write         691.72     691.72      691.72      0.00      691.72     691.72      691.72      0.00   5.92146   8 8 1 0 1 1 0 0 1 536870912 1048576 4294967296 -1 POSIX EXCEL&lt;br/&gt;
write         642.44     642.44      642.44      0.00      642.44     642.44      642.44      0.00   6.37567   8 8 1 0 1 1 0 0 1 536870912 1048576 4294967296 -1 POSIX EXCEL&lt;br/&gt;
write         602.00     602.00      602.00      0.00      602.00     602.00      602.00      0.00   6.80393   8 8 1 0 1 1 0 0 1 536870912 1048576 4294967296 -1 POSIX EXCEL&lt;/p&gt;

&lt;p&gt;!list_empty(full_ext):&lt;br/&gt;
write        1844.41    1844.41     1844.41      0.00     1844.41    1844.41     1844.41      0.00   2.22076   8 8 1 0 1 1 0 0 1 536870912 1048576 4294967296 -1 POSIX EXCEL&lt;br/&gt;
write        1939.05    1939.05     1939.05      0.00     1939.05    1939.05     1939.05      0.00   2.11238   8 8 1 0 1 1 0 0 1 536870912 1048576 4294967296 -1 POSIX EXCEL&lt;br/&gt;
write        1866.54    1866.54     1866.54      0.00     1866.54    1866.54     1866.54      0.00   2.19443   8 8 1 0 1 1 0 0 1 536870912 1048576 4294967296 -1 POSIX EXCEL&lt;/p&gt;

&lt;p&gt;16 stripes, 8 processes:&lt;br/&gt;
oo_nr_writes:&lt;br/&gt;
write        2619.29    2619.29     2619.29      0.00     2619.29    2619.29     2619.29      0.00   1.56378   8 8 1 0 1 1 0 0 1 536870912 1048576 4294967296 -1 POSIX EXCEL&lt;br/&gt;
write        3091.95    3091.95     3091.95      0.00     3091.95    3091.95     3091.95      0.00   1.32473   8 8 1 0 1 1 0 0 1 536870912 1048576 4294967296 -1 POSIX EXCEL&lt;br/&gt;
write        3039.58    3039.58     3039.58      0.00     3039.58    3039.58     3039.58      0.00   1.34756   8 8 1 0 1 1 0 0 1 536870912 1048576 4294967296 -1 POSIX EXCEL&lt;br/&gt;
write        3171.61    3171.61     3171.61      0.00     3171.61    3171.61     3171.61      0.00   2.58292   8 8 1 0 1 1 0 0 1 1073741824 1048576 8589934592 -1 POSIX EXCEL&lt;br/&gt;
write        3154.39    3154.39     3154.39      0.00     3154.39    3154.39     3154.39      0.00   2.59701   8 8 1 0 1 1 0 0 1 1073741824 1048576 8589934592 -1 POSIX EXCEL&lt;br/&gt;
write        3138.18    3138.18     3138.18      0.00     3138.18    3138.18     3138.18      0.00   2.61043   8 8 1 0 1 1 0 0 1 1073741824 1048576 8589934592 -1 POSIX EXCEL&lt;/p&gt;

&lt;p&gt;!list_empty(full_ext):&lt;br/&gt;
write        2896.65    2896.65     2896.65      0.00     2896.65    2896.65     2896.65      0.00   1.41405   8 8 1 0 1 1 0 0 1 536870912 1048576 4294967296 -1 POSIX EXCEL&lt;br/&gt;
write        2974.45    2974.45     2974.45      0.00     2974.45    2974.45     2974.45      0.00   1.37706   8 8 1 0 1 1 0 0 1 536870912 1048576 4294967296 -1 POSIX EXCEL&lt;br/&gt;
write        3064.65    3064.65     3064.65      0.00     3064.65    3064.65     3064.65      0.00   1.33653   8 8 1 0 1 1 0 0 1 536870912 1048576 4294967296 -1 POSIX EXCEL&lt;br/&gt;
write        2814.37    2814.37     2814.37      0.00     2814.37    2814.37     2814.37      0.00   2.91078   8 8 1 0 1 1 0 0 1 1073741824 1048576 8589934592 -1 POSIX EXCEL&lt;br/&gt;
write        3078.64    3078.64     3078.64      0.00     3078.64    3078.64     3078.64      0.00   2.66092   8 8 1 0 1 1 0 0 1 1073741824 1048576 8589934592 -1 POSIX EXCEL&lt;br/&gt;
write        3211.55    3211.55     3211.55      0.00     3211.55    3211.55     3211.55      0.00   2.55079   8 8 1 0 1 1 0 0 1 1073741824 1048576 8589934592 -1 POSIX EXCEL&lt;/p&gt;

&lt;p&gt;16 stripes, 16 processes:&lt;br/&gt;
oo_nr_writes:&lt;br/&gt;
write        3145.50    3145.50     3145.50      0.00     3145.50    3145.50     3145.50      0.00   2.60436   8 8 1 0 1 1 0 0 1 1073741824 1048576 8589934592 -1 POSIX EXCEL&lt;br/&gt;
write        3137.90    3137.90     3137.90      0.00     3137.90    3137.90     3137.90      0.00   2.61066   8 8 1 0 1 1 0 0 1 1073741824 1048576 8589934592 -1 POSIX EXCEL&lt;br/&gt;
write        3233.13    3233.13     3233.13      0.00     3233.13    3233.13     3233.13      0.00   2.53377   8 8 1 0 1 1 0 0 1 1073741824 1048576 8589934592 -1 POSIX EXCEL&lt;/p&gt;

&lt;p&gt;!list_empty(full_ext):&lt;br/&gt;
write        2940.08    2940.08     2940.08      0.00     2940.08    2940.08     2940.08      0.00   2.78631   8 8 1 0 1 1 0 0 1 1073741824 1048576 8589934592 -1 POSIX EXCEL&lt;br/&gt;
write        3149.34    3149.34     3149.34      0.00     3149.34    3149.34     3149.34      0.00   2.60118   8 8 1 0 1 1 0 0 1 1073741824 1048576 8589934592 -1 POSIX EXCEL&lt;br/&gt;
write        3132.14    3132.14     3132.14      0.00     3132.14    3132.14     3132.14      0.00   2.61546   8 8 1 0 1 1 0 0 1 1073741824 1048576 8589934592 -1 POSIX EXCEL&lt;/p&gt;

</comment>
                            <comment id="166503" author="adilger" created="Tue, 20 Sep 2016 08:55:34 +0000"  >&lt;p&gt;Patrick, do you have any before/after testing done with clients doing sub-RPC write size (e.g. interleaved 256KB per client so that they don&apos;t merge into a single full RPC)?  One problem that we had in the past was sub-sized dirty pages filling up memory without being flushed in a timely manner, so I just want to make sure that the new code that is deferring RPCs until they are full is not over-zealous in delaying dirty data that doesn&apos;t make up a full RPC.&lt;/p&gt;</comment>
                            <comment id="166513" author="paf" created="Tue, 20 Sep 2016 10:08:28 +0000"  >&lt;p&gt;No, but I will try to get some.  Should be quick once machines are available today.  (I am pretty confident this should be OK.  The new code only waits longer than the old code in the case of multiple writers per stripe.  I don&apos;t know what ensures pages go out in a timely manner, but I don&apos;t think I&apos;ve modified it.)&lt;/p&gt;

&lt;p&gt;Can you clarify what you mean by interleaved?  Other than by mixing direct and buffered I/O, I can&apos;t think of how to prevent extents in that situation from being packaged into one RPC.  I suppose I/O from two different clients could sort of do that, but I think we&apos;d actually get data written out via ldlm lock cancellation.  (I could use group locks to avoid that.)&lt;/p&gt;</comment>
                            <comment id="166525" author="adilger" created="Tue, 20 Sep 2016 13:04:38 +0000"  >&lt;p&gt;By interleaved writes I mean having client A write [0,256KB), client B write [256KB,512KB), ... so that the writes cannot be merged on the client into a single 1MB RPC.  That doesn&apos;t test the issue I&apos;m wondering about if these writes are just being done by different threads on the same client.&lt;/p&gt;</comment>
                            <comment id="166559" author="paf" created="Tue, 20 Sep 2016 15:03:26 +0000"  >&lt;p&gt;OK.  My concern with that is that due to ldlm lock exchange (think lock ahead), those bytes will be written out by lock cancellation, so they won&apos;t have a chance to hang around anyway.&lt;/p&gt;

&lt;p&gt;I&apos;ll see about running such a job and checking the results.&lt;/p&gt;</comment>
                            <comment id="166584" author="jay" created="Tue, 20 Sep 2016 16:34:55 +0000"  >&lt;blockquote&gt;
&lt;p&gt;One problem that we had in the past was sub-sized dirty pages filling up memory without being flushed in a timely manner&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;In that case, the writeback daemon should be started to write back those pages, and they should appear in the urgent list, which shouldn&apos;t be affected by this patch.&lt;/p&gt;</comment>
                            <comment id="178192" author="gerrit" created="Sat, 17 Dec 2016 05:37:27 +0000"  >&lt;p&gt;Oleg Drokin (oleg.drokin@intel.com) merged in patch &lt;a href=&quot;https://review.whamcloud.com/22012/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/22012/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-8515&quot; title=&quot;OSC: Send RPCs with full extents&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-8515&quot;&gt;&lt;del&gt;LU-8515&lt;/del&gt;&lt;/a&gt; osc: Send RPCs when extents are full&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: ecb6712a19fa836ecdba41ccda80de0a10b1336a&lt;/p&gt;</comment>
                            <comment id="178247" author="pjones" created="Sat, 17 Dec 2016 14:15:57 +0000"  >&lt;p&gt;Landed for 2.10&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                                        </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzylaf:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>