<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:59:37 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-6370] Read performance degrades with increasing read block size.</title>
                <link>https://jira.whamcloud.com/browse/LU-6370</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;We&apos;re finding substantial read performance degradations with increasing read block sizes.  This has been observed in eslogin nodes as well as in internal login nodes.  Data provided below was gathered on an external node.&lt;/p&gt;

&lt;p&gt;ext7:/lustre # dd if=/dev/urandom of=3gurandomdata bs=4M count=$((256*3))&lt;br/&gt;
ext7:/lustre # for i in 4K 1M 4M 16M 32M 64M 128M 256M 512M 1G 2G 3G ; do echo -en &quot;$i\t&quot; ; dd if=3gurandomdata bs=${i} of=/dev/null 2&amp;gt;&amp;amp;1 | egrep copied  ; done&lt;/p&gt;

&lt;p&gt; 4K      3221225472 bytes (3.2 GB) copied, 13.9569 s, 231 MB/s&lt;br/&gt;
 1M      3221225472 bytes (3.2 GB) copied, 4.94163 s, 652 MB/s&lt;br/&gt;
 4M      3221225472 bytes (3.2 GB) copied, 6.24378 s, 516 MB/s&lt;br/&gt;
 16M     3221225472 bytes (3.2 GB) copied, 5.24595 s, 614 MB/s&lt;br/&gt;
 32M     3221225472 bytes (3.2 GB) copied, 5.48208 s, 588 MB/s&lt;br/&gt;
 64M     3221225472 bytes (3.2 GB) copied, 5.36964 s, 600 MB/s&lt;br/&gt;
 128M    3221225472 bytes (3.2 GB) copied, 5.12867 s, 628 MB/s&lt;br/&gt;
 256M    3221225472 bytes (3.2 GB) copied, 5.1467 s, 626 MB/s&lt;br/&gt;
 512M    3221225472 bytes (3.2 GB) copied, 5.31232 s, 606 MB/s&lt;br/&gt;
 1G      3221225472 bytes (3.2 GB) copied, 12.4088 s, 260 MB/s&lt;br/&gt;
 2G      3221225472 bytes (3.2 GB) copied, 339.646 s, 9.5 MB/s&lt;br/&gt;
 3G      3221225472 bytes (3.2 GB) copied, 350.071 s, 9.2 MB/s&lt;/p&gt;

&lt;p&gt;This shows up on a 1008-OST striped file system, but on smaller systems the impact is not nearly so substantial. On our 56 OST system we get&lt;br/&gt;
3G      3221225472 bytes (3.2 GB) copied, 4.77246 s, 675 MB/s&lt;/p&gt;

&lt;p&gt;Another test case, using C code with an fread call rather than dd, produced similar results:&lt;/p&gt;

&lt;p&gt;int read_size = 256*1024*1024*2;&lt;br/&gt;
fread(buffer, sizeof(float), read_size, fp_in);&lt;/p&gt;

&lt;p&gt;Also, file striping information on production and tds filesystems:&lt;br/&gt;
 ext8:/lustre # lfs getstripe 3gurandomdata&lt;br/&gt;
 3gurandomdata&lt;br/&gt;
 lmm_stripe_count:   4&lt;br/&gt;
 lmm_stripe_size:    1048576&lt;br/&gt;
 lmm_pattern:        1&lt;br/&gt;
 lmm_layout_gen:     0&lt;br/&gt;
 lmm_stripe_offset:  833&lt;br/&gt;
         obdidx           objid           objid           group&lt;br/&gt;
            833         5978755       0x5b3a83                0&lt;br/&gt;
            834         5953949       0x5ad99d                0&lt;br/&gt;
            835         5958818       0x5aeca2                0&lt;br/&gt;
            836         5966400       0x5b0a40                0&lt;/p&gt;

&lt;p&gt; ext8:/lustretds # lfs getstripe 3gurandomdata&lt;br/&gt;
 3gurandomdata&lt;br/&gt;
 lmm_stripe_count:   4&lt;br/&gt;
 lmm_stripe_size:    1048576&lt;br/&gt;
 lmm_pattern:        1&lt;br/&gt;
 lmm_layout_gen:     0&lt;br/&gt;
 lmm_stripe_offset:  51&lt;br/&gt;
         obdidx           objid           objid           group&lt;br/&gt;
             51         1451231       0x1624df                0&lt;br/&gt;
             52         1452258       0x1628e2                0&lt;br/&gt;
             53         1450278       0x162126                0&lt;br/&gt;
             54         1444772       0x160ba4                0&lt;/p&gt;

&lt;p&gt;So this appears to only be happening on wide-stripe file systems.  Here&apos;s the output from &apos;perf top&apos; while a &apos;bad&apos; dd is running:&lt;/p&gt;

&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;8.74% &amp;#91;kernel&amp;#93; &amp;#91;k&amp;#93; _spin_lock - _spin_lock&lt;/li&gt;
	&lt;li&gt;22.23% osc_ap_completion&lt;br/&gt;
                osc_extent_finish&lt;br/&gt;
                brw_interpret &lt;br/&gt;
                ptlrpc_check_set &lt;br/&gt;
                ptlrpcd_check &lt;br/&gt;
                ptlrpcd &lt;br/&gt;
                kthread &lt;br/&gt;
               child_rip &lt;br/&gt;
 + 13.76% cl_env_put &lt;br/&gt;
 + 12.37% cl_env_get &lt;br/&gt;
 + 7.10% vvp_write_complete &lt;br/&gt;
 + 6.51% kfree&lt;br/&gt;
 + 4.62% osc_teardown_async_page &lt;br/&gt;
 + 3.96% osc_page_delete &lt;br/&gt;
 + 3.89% osc_lru_add_batch &lt;br/&gt;
 + 2.69% kmem_cache_free &lt;br/&gt;
 + 2.23% osc_page_init &lt;br/&gt;
 + 1.71% sptlrpc_import_sec_ref &lt;br/&gt;
 + 1.64% osc_page_transfer_add &lt;br/&gt;
 + 1.57% osc_io_submit &lt;br/&gt;
 + 1.43% cfs_percpt_lock &lt;/li&gt;
&lt;/ul&gt;


</description>
                <environment>Clients running Cray 2.5, which contains a backport of the CLIO changes from 2.6. The problem has also been observed with vanilla 2.6 and 2.7 clients.</environment>
        <key id="29110">LU-6370</key>
            <summary>Read performance degrades with increasing read block size.</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="jay">Jinshan Xiong</assignee>
                                    <reporter username="simmonsja">James A Simmons</reporter>
                        <labels>
                            <label>clio</label>
                    </labels>
                <created>Mon, 16 Mar 2015 20:45:15 +0000</created>
                <updated>Wed, 12 Aug 2015 22:07:45 +0000</updated>
                            <resolved>Tue, 26 May 2015 23:32:19 +0000</resolved>
                                    <version>Lustre 2.6.0</version>
                    <version>Lustre 2.7.0</version>
                    <version>Lustre 2.8.0</version>
                                    <fixVersion>Lustre 2.8.0</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>15</watches>
                                                                            <comments>
                            <comment id="109791" author="ezell" created="Mon, 16 Mar 2015 21:05:31 +0000"  >&lt;p&gt;While the dd process is running, &apos;top&apos; shows 100% CPU for a core.&lt;/p&gt;</comment>
                            <comment id="109794" author="yujian" created="Mon, 16 Mar 2015 22:01:10 +0000"  >&lt;p&gt;Hi Jinshan,&lt;/p&gt;

&lt;p&gt;Could you please take a look at this performance degradation issue? Thank you.&lt;/p&gt;</comment>
                            <comment id="109811" author="adilger" created="Tue, 17 Mar 2015 03:41:23 +0000"  >&lt;p&gt;It seems to me that performance only degrades when the read() syscall size is very large (e.g. 1GB or more).  This causes extra overhead to allocate and double-buffer the pages in both the kernel and userspace.  How much RAM is on this client?&lt;/p&gt;</comment>
                            <comment id="109813" author="ezell" created="Tue, 17 Mar 2015 04:10:48 +0000"  >&lt;blockquote&gt;&lt;p&gt;How much RAM is on this client?&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;The 2.7GA client was a dedicated spare node with 64GB ram.&lt;br/&gt;
Our external login nodes (running Cray 2.5+CLIO changes) have 256GB.  We&apos;ve seen the issue when only root is logged into the machine (should be no memory pressure).&lt;br/&gt;
Our production compute nodes on Titan (where the problem was originally reported, running Cray 2.5+CLIO changes) have 32GB of ram.&lt;/p&gt;

&lt;p&gt;I agree this might be a bit of a pathological case.  A workaround would be to tell users to split up read calls&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;&lt;span class=&quot;code-keyword&quot;&gt;while&lt;/span&gt; (bytes_read &amp;lt; bytes_needed)
  bytes_read += read(...)
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;but it seems that we have existing applications that do not do this.  From a programmer&apos;s point of view, if they need to load 500 million floats from a file, it&apos;s straightforward to do&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;fread(buffer, sizeof(&lt;span class=&quot;code-object&quot;&gt;float&lt;/span&gt;), 500000000, fp_in);&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;This works OK in vanilla 2.5, so it seems like a regression from the CLIO simplification re-work.&lt;/p&gt;</comment>
                            <comment id="109844" author="paf" created="Tue, 17 Mar 2015 15:28:11 +0000"  >&lt;p&gt;This definitely seems to be file system size dependent.  I&apos;ve tested various Intel client versions and an almost identical Cray client in a VM environment and while I (sometimes) saw a slight drop off at higher read sizes, it&apos;s on the order of 5-20%; nothing like that reported by ORNL.&lt;/p&gt;

&lt;p&gt;I&apos;ve found similar results on our in house hardware.  Not surprisingly, we don&apos;t have anything in house that approaches the size of the file systems connected to Titan.&lt;/p&gt;

&lt;p&gt;I&apos;d be very curious to see actual test results from a 2.x client version that does not have this issue so we can confirm this as a regression from a particular point.  I&apos;ve been told that&apos;s the case, but I have yet to see data.&lt;/p&gt;</comment>
                            <comment id="109850" author="jay" created="Tue, 17 Mar 2015 16:21:52 +0000"  >&lt;p&gt;Since the problem can be seen on vanilla 2.6, it&apos;s not related to the CLIO simplification work. Maybe some changes between 2.5 and 2.6 caused this problem.&lt;/p&gt;

&lt;p&gt;From the test results, there is a huge performance degradation when the read size increases from 1G to 2G. Could you run those two tests again and use perf to collect performance data?&lt;/p&gt;</comment>
                            <comment id="109853" author="paf" created="Tue, 17 Mar 2015 16:31:15 +0000"  >&lt;p&gt;Jinshan - They&apos;re referring to &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3321&quot; title=&quot;2.x single thread/process throughput degraded from 1.8&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3321&quot;&gt;&lt;del&gt;LU-3321&lt;/del&gt;&lt;/a&gt;.  I wouldn&apos;t call it CLIO work either. &lt;img class=&quot;emoticon&quot; src=&quot;https://jira.whamcloud.com/images/icons/emoticons/smile.png&quot; height=&quot;16&quot; width=&quot;16&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/p&gt;</comment>
                            <comment id="109880" author="lewisj" created="Tue, 17 Mar 2015 18:03:59 +0000"  >&lt;p&gt;perf results from bs={1G,2G} testing&lt;/p&gt;

&lt;p&gt;ext8:/lustre # (perf record dd if=3gurandomdata of=/dev/null bs=1G ; rm perf.data) &amp;gt;/dev/null 2&amp;gt;&amp;amp;1&lt;br/&gt;
ext8:/lustre # perf record dd if=3gurandomdata of=/dev/null bs=1G ; mv perf.data perf.1G&lt;br/&gt;
3+0 records in&lt;br/&gt;
3+0 records out&lt;br/&gt;
3221225472 bytes (3.2 GB) copied, 10.8859 s, 296 MB/s&lt;br/&gt;
[ perf record: Woken up 1 times to write data ]&lt;br/&gt;
[ perf record: Captured and wrote 0.423 MB perf.data (~18500 samples) ]&lt;br/&gt;
titan-ext8:/lustre/atlas/scratch/lewisj/ven004/read_test_dd # ls -lrt&lt;br/&gt;
total 3145745&lt;br/&gt;
-rw-r--r-- 1 root root 3221225472 Mar 16 14:21 3gurandomdata&lt;br/&gt;
-rw------- 1 root root     445452 Mar 17 13:05 perf.1G&lt;br/&gt;
ext8:/lustre # perf record dd if=3gurandomdata of=/dev/null bs=2G ; mv perf.data perf.2G&lt;br/&gt;
0+2 records in&lt;br/&gt;
0+2 records out&lt;br/&gt;
3221225472 bytes (3.2 GB) copied, 416.124 s, 7.7 MB/s&lt;br/&gt;
[ perf record: Woken up 61 times to write data ]&lt;br/&gt;
[ perf record: Captured and wrote 15.869 MB perf.data (~693322 samples) ]&lt;/p&gt;</comment>
                            <comment id="109884" author="jay" created="Tue, 17 Mar 2015 18:14:09 +0000"  >&lt;p&gt;I can&apos;t parse the results on my node:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@mercury ~]# perf report -v -i perf.1G 
legacy perf.data format
magic/endian check failed
incompatible file format (rerun with -v to learn more)[root@mercury ~]# 
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Can you do it on your side and post the result here?&lt;/p&gt;

&lt;p&gt;Obviously there are many more perf samples collected for the 2G block size. Please try to read only 1 block with block size 1G and 2G respectively and see what we get. If we can reproduce the problem with only 1 block, then please turn on debug, run it again, and post the Lustre log here. Thanks,&lt;/p&gt;</comment>
                            <comment id="109999" author="lewisj" created="Wed, 18 Mar 2015 15:37:55 +0000"  >&lt;p&gt;Jinshan, please see attached reports.  count=1 was used for both of these files.&lt;/p&gt;</comment>
                            <comment id="110095" author="lewisj" created="Thu, 19 Mar 2015 14:50:00 +0000"  >&lt;p&gt;We&apos;ve found some interesting results while running permutations on file parameters.  While the problem is very repeatable in a given configuration, it&apos;s not at all stable as we change parameters.  Specifically, changing the stripe count from 4 to {1,16,56} alleviates the huge performance degradation.&lt;/p&gt;

&lt;p&gt;The following table lists the time (in seconds, rounded to the nearest second) to copy 3 GB of data from Lustre to /dev/null using dd.  Column headers indicate the block size used; row headers indicate the stripe count of the file.  Tests were performed on two separate filesystems: a dedicated 56 OST TDS filesystem as well as a shared (not quiet) 1008 OST filesystem.  Timings across multiple runs vary, but are stable to within approximately 10%.&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;1008 OST fs
                bs=512M bs=1G   bs=2G
Count 1         15      16      23
Count 4         5       12      384
Count 16        4       2       2
Count 56        4       2       2

56 OST fs
                bs=512M bs=1G   bs=2G
Count 1         2       2       3
Count 4         2       2       2
Count 16        2       2       2
Count 56        2       2       4
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Results above were with stripe widths of 1M.  4M widths were also tested, but have not been listed as they tracked the 1M results.  This testing was performed on a whitebox IB lustre client using a 2.5 Cray client.  Tests with other client builds are planned.&lt;/p&gt;</comment>
                            <comment id="110181" author="jay" created="Fri, 20 Mar 2015 00:12:56 +0000"  >&lt;p&gt;Can you read just 1 block and collect Lustre log with block size as 1G and 2G specifically?&lt;/p&gt;</comment>
                            <comment id="110720" author="yujian" created="Thu, 26 Mar 2015 06:38:21 +0000"  >&lt;blockquote&gt;&lt;p&gt;Can you read just 1 block and collect Lustre log with block size as 1G and 2G specifically?&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;Hi John,&lt;br/&gt;
With a stripe count of 4 on a 1008 OST filesystem, read performance degraded substantially when the read block size was increased from 1G to 2G. You&#8217;ve reproduced the problem with only 1 block, and collected the performance data with 1G and 2G block sizes separately. Could you please reproduce the issue again and gather Lustre debug logs as per Jinshan&#8217;s suggestion above so as to help him do further investigation? Thank you! &lt;/p&gt;</comment>
                            <comment id="110726" author="lewisj" created="Thu, 26 Mar 2015 11:13:21 +0000"  >&lt;p&gt;Jinshan, what tracing options would you like to see, and for how many seconds?  With &quot;-1&quot;, I expect a bs=2G count=1 to take approximately 1 hour and to overflow any reasonable buffer many times over.&lt;/p&gt;</comment>
                            <comment id="110748" author="jay" created="Thu, 26 Mar 2015 16:43:48 +0000"  >&lt;p&gt;In that case, I just need the log after it has been running for a few minutes.&lt;/p&gt;</comment>
                            <comment id="110822" author="lewisj" created="Fri, 27 Mar 2015 13:45:50 +0000"  >&lt;p&gt;My hosts won&apos;t be quiet and available until tomorrow, I&apos;m coordinating with Dustin to hopefully get these uploaded today.&lt;/p&gt;</comment>
                            <comment id="110858" author="yujian" created="Fri, 27 Mar 2015 16:22:10 +0000"  >&lt;p&gt;Thank you John and Dustin.&lt;/p&gt;</comment>
                            <comment id="111091" author="dustb100" created="Tue, 31 Mar 2015 13:26:53 +0000"  >&lt;p&gt;I have gathered the requested debug data for LU-6370. &lt;/p&gt;

&lt;p&gt;I used &quot;-1&quot; for LNET debugging, and ran the dd test with 1G and 2G blocksizes on a lustre-2.7 client for 180 seconds. &lt;/p&gt;

&lt;p&gt;Each output file is roughly 30MB, compressed. &lt;/p&gt;

&lt;p&gt;Thanks,&lt;br/&gt;
Dustin &lt;/p&gt;</comment>
                            <comment id="111259" author="jay" created="Wed, 1 Apr 2015 18:56:42 +0000"  >&lt;p&gt;Hi Dustin,&lt;/p&gt;

&lt;p&gt;Thanks for the debug information. From the 2G output, I can see some strange things were happening.&lt;/p&gt;

&lt;p&gt;The reading process is 11235.&lt;/p&gt;

&lt;p&gt;At the time 1427805640.741591, this process just finished a reading, and then it was preempted and rescheduled to processor 15.&lt;/p&gt;

&lt;p&gt;From time 1427805640.741714 to 1427805644.468771, kiblnd_cq_completion(), which is running inside softirq context, took over the CPU for over 3 seconds. &lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00000080:00000001:13.0:1427805640.741591:0:11235:0:(rw.c:1140:ll_readpage()) Process leaving (rc=0 : 0 : 0)
00000800:00000200:15.2:1427805640.741714:0:11235:0:(o2iblnd_cb.c:3325:kiblnd_cq_completion()) conn[ffff88081ed6e000] (130)++
....
00000800:00000200:15.2:1427805643.543220:0:11235:0:(o2iblnd_cb.c:3325:kiblnd_cq_completion()) conn[ffff88081ed6e000] (130)++
00000800:00000200:15.2:1427805643.543271:0:11235:0:(o2iblnd_cb.c:3325:kiblnd_cq_completion()) conn[ffff88081ed6e000] (131)++
00000800:00000200:15.2:1427805643.543458:0:11235:0:(o2iblnd_cb.c:3325:kiblnd_cq_completion()) conn[ffff88081ed6e000] (131)++
00000800:00000200:15.2:1427805643.543499:0:11235:0:(o2iblnd_cb.c:3325:kiblnd_cq_completion()) conn[ffff88081ed6e000] (131)++
00000800:00000200:15.2:1427805643.543588:0:11235:0:(o2iblnd_cb.c:3325:kiblnd_cq_completion()) conn[ffff88081ed6e000] (130)++
00000800:00000200:15.2:1427805644.468771:0:11235:0:(o2iblnd_cb.c:3325:kiblnd_cq_completion()) conn[ffff88081ed6ea00] (131)++
00000020:00000001:15.0:1427805644.468787:0:11235:0:(cl_io.c:930:cl_page_list_del()) Process leaving
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This happened quite often, from time 1427805640 to 1427805644, and then from 1427805644 to 1427805647. Roughly speaking, the pattern was to run for 1 second and then stall for 3 seconds. I didn&apos;t see this happening for the 1G read, but I don&apos;t know why.&lt;/p&gt;

&lt;p&gt;Have you ever run this kind of 2G block size read job on a pre-2.6 client?&lt;/p&gt;

&lt;p&gt;I&apos;m still reading the log and trying to dig more things. Stay tuned.&lt;/p&gt;</comment>
                            <comment id="111263" author="dustb100" created="Wed, 1 Apr 2015 19:11:36 +0000"  >&lt;p&gt;Jinshan, &lt;br/&gt;
      We did the same testing on the 2.5.3 client and do not see the issue. &lt;/p&gt;

&lt;p&gt;The data I gave you was from a client that was IB attached, but we are seeing this on our Cray systems as well (GNI attached). I wanted to give you a heads up in case that helps you with tracking down the issue. &lt;/p&gt;

&lt;p&gt;-Dustin &lt;/p&gt;</comment>
                            <comment id="111265" author="jay" created="Wed, 1 Apr 2015 19:21:25 +0000"  >&lt;p&gt;After taking a further look, kiblnd_cq_completion() had been running on CPU 15 for an even longer time before the reading process was rescheduled to that CPU. It had been using the idle process&apos;s stack, so the kernel scheduler considered that CPU idle; this is why the reading process was scheduled there. If a task can take this long, it should really be using process context.&lt;/p&gt;

&lt;p&gt;It&apos;s still unclear why this is not happening with the 1G block size. Not sure if this is related to LRU management.&lt;/p&gt;

&lt;p&gt;Hi Dustin,&lt;/p&gt;

&lt;p&gt;If you have another chance to run the test, can you collect some logs of CPU usage and the Lustre LRU stats (the output of `lctl get_param llite.&amp;#42;.max_cached_mb osc.&amp;#42;.osc_cached_mb&apos;)?&lt;/p&gt;

&lt;p&gt;Thanks,&lt;/p&gt;</comment>
                            <comment id="111276" author="jay" created="Wed, 1 Apr 2015 20:07:35 +0000"  >&lt;p&gt;btw, Dustin, you don&apos;t need to wait for the test to finish; just clear the log beforehand, let it run for a few minutes, then collect the log. Thanks again.&lt;/p&gt;</comment>
                            <comment id="111279" author="jay" created="Wed, 1 Apr 2015 20:12:29 +0000"  >&lt;p&gt;Can you see the same problem for even bigger block size, for example 4G?&lt;/p&gt;</comment>
                            <comment id="111355" author="dustb100" created="Thu, 2 Apr 2015 12:51:43 +0000"  >&lt;p&gt;I do not see the same issue with a block size of 4G on the 2.5.3 client (takes ~6 seconds), but the problem still exists on the 2.7 client (let it run for about 30min then killed it).&lt;/p&gt;

&lt;p&gt;I am working on gathering the requested CPU and LRU info. I will have it for you soon. &lt;/p&gt;

&lt;p&gt;-Dustin &lt;/p&gt;</comment>
                            <comment id="111360" author="paf" created="Thu, 2 Apr 2015 13:25:49 +0000"  >&lt;p&gt;Sorry if I&apos;ve missed this in the shuffle, but have you tested 2.6, Dustin?&lt;/p&gt;

&lt;p&gt;The latest Cray clients on your mainframe have the performance improvements from &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3321&quot; title=&quot;2.x single thread/process throughput degraded from 1.8&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3321&quot;&gt;&lt;del&gt;LU-3321&lt;/del&gt;&lt;/a&gt; back ported in to their version of 2.5, so if you see it not in Intel&apos;s 2.5, but in Cray&apos;s 2.5 and Intel&apos;s 2.6 and 2.7, that would suggest a place to look.&lt;/p&gt;</comment>
                            <comment id="111368" author="dustb100" created="Thu, 2 Apr 2015 14:48:28 +0000"  >&lt;p&gt;Jinshan, &lt;br/&gt;
      I have gathered the CPU data during the run using sar. The output is kinda messy, but it is all there. I also gathered the max_cached_mb data you requested. &lt;/p&gt;

&lt;p&gt;Please let me know if you need additional info. &lt;/p&gt;

&lt;p&gt;Thanks,&lt;br/&gt;
Dustin&lt;/p&gt;</comment>
                            <comment id="111400" author="jay" created="Thu, 2 Apr 2015 17:16:10 +0000"  >&lt;p&gt;Hi Dustin,&lt;/p&gt;

&lt;p&gt;Thanks for the debug info. You mentioned that you&apos;ve back ported &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3321&quot; title=&quot;2.x single thread/process throughput degraded from 1.8&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3321&quot;&gt;&lt;del&gt;LU-3321&lt;/del&gt;&lt;/a&gt;, is it possible for me to get access to the code?&lt;/p&gt;</comment>
                            <comment id="111402" author="ezell" created="Thu, 2 Apr 2015 17:18:53 +0000"  >&lt;p&gt;&amp;gt; You mentioned that you&apos;ve back ported &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3321&quot; title=&quot;2.x single thread/process throughput degraded from 1.8&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3321&quot;&gt;&lt;del&gt;LU-3321&lt;/del&gt;&lt;/a&gt;, is it possible for me to get access to the code?&lt;/p&gt;

&lt;p&gt;Cray (Patrick?) would need to provide this.  We can either track this down with Cray&apos;s client that includes &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3321&quot; title=&quot;2.x single thread/process throughput degraded from 1.8&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3321&quot;&gt;&lt;del&gt;LU-3321&lt;/del&gt;&lt;/a&gt;, or with b2_7/master that also includes this patch (plus many more things).&lt;/p&gt;</comment>
                            <comment id="111405" author="paf" created="Thu, 2 Apr 2015 17:37:45 +0000"  >&lt;p&gt;We can certainly do that (the GPL requires we do it &lt;img class=&quot;emoticon&quot; src=&quot;https://jira.whamcloud.com/images/icons/emoticons/wink.png&quot; height=&quot;16&quot; width=&quot;16&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt; ), but I can&apos;t personally without permission from my boss (who&apos;s not available for a few days).  As our customer, ORNL has the legal right to have the source, but I think Cray would want some sort of formal request...?  If the ORNL staff feel it&apos;s important, I can look in to getting permission.&lt;/p&gt;

&lt;p&gt;At least, I can&apos;t do it right away, so for this bug it&apos;s likely going to be easier to proceed with examining it from a 2.7 perspective.&lt;/p&gt;

&lt;p&gt;Matt/Dustin - Are you able to confirm this happens in 2.6?  (Sorry if you already have and I missed it)  If it doesn&apos;t happen in 2.6 (but does in Cray&apos;s 2.5 and Intel&apos;s 2.7), then &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3321&quot; title=&quot;2.x single thread/process throughput degraded from 1.8&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3321&quot;&gt;&lt;del&gt;LU-3321&lt;/del&gt;&lt;/a&gt; is unlikely to be the culprit.&lt;/p&gt;</comment>
                            <comment id="111416" author="dustb100" created="Thu, 2 Apr 2015 18:05:17 +0000"  >&lt;p&gt;Patrick: I did not have a lustre-2.6 client lying around and had to build and test one. The answer to your question, though, is that the problem &lt;b&gt;is&lt;/b&gt; reproducible in lustre-2.6. &lt;/p&gt;

&lt;p&gt;Thanks,&lt;br/&gt;
Dustin &lt;/p&gt;</comment>
                            <comment id="111420" author="jay" created="Thu, 2 Apr 2015 18:12:33 +0000"  >&lt;p&gt;Sooner or later I will need the code base for a fix.&lt;/p&gt;

&lt;p&gt;Dustin, can you please help me do one more thing - while the read is going on, run `echo t &amp;gt; /proc/sysrq-trigger; dmesg &amp;gt; dmesg-$(date +%s).log&apos; on the console so that it generates stack traces for all processes currently running on the system. You may do it a couple of times at 1 second intervals to make sure it collects sufficient log. Once this is done, please send me all the dmesg-xyz.log files.&lt;/p&gt;

&lt;p&gt;Thanks,&lt;/p&gt;</comment>
                            <comment id="111428" author="paf" created="Thu, 2 Apr 2015 18:45:16 +0000"  >&lt;p&gt;Jinshan - &lt;/p&gt;

&lt;p&gt;Since the problem is in 2.6/2.7, Cray should be able to handle back porting a patch done against a newer version.  In the end, our client is our responsibility.  If it does prove problematic and ORNL would like to ask for your help in getting that fix on Cray&apos;s 2.5 (and you feel that&apos;s covered by ORNL&apos;s arrangements with Intel), I assume we&apos;d be able to provide the code base (and would be happy to have the help if it&apos;s needed).&lt;/p&gt;

&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;Patrick&lt;/li&gt;
&lt;/ul&gt;
</comment>
                            <comment id="111461" author="jay" created="Fri, 3 Apr 2015 00:06:06 +0000"  >&lt;p&gt;I&apos;ve found the root cause of this issue - and at last I can reproduce it by adding some tricks to the code. The problem boils down to an inconsistency between LRU management and the read ahead algorithm. The end result is that read ahead brings a lot of pages into memory but LRU drops them due to the tightness of the per-OSC LRU budget.&lt;/p&gt;

&lt;p&gt;It will take a huge effort to make a general policy that fits every I/O case. However, I can make a hot fix to solve the problem you&apos;re experiencing. If that&apos;s okay with you, I can make a patch for master.&lt;/p&gt;</comment>
                            <comment id="111463" author="ezell" created="Fri, 3 Apr 2015 01:19:11 +0000"  >&lt;p&gt;Jinshan - good news that you understand the root cause of the issue.&lt;/p&gt;

&lt;p&gt;Is readahead not bounded by:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;# lctl get_param llite.*.max_read_ahead_mb
llite.atlas1-ffff8817e6a85000.max_read_ahead_mb=40
llite.atlas2-ffff880fef58fc00.max_read_ahead_mb=40
llite.linkfarm-ffff8827ef28c000.max_read_ahead_mb=40
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Is there a different way that we can tune readahead?&lt;/p&gt;

&lt;p&gt;Anyway, we would appreciate a &quot;hot fix&quot; against master.  We can probably trivially backport it to b2_7 for our non-Cray clients.  James Simmons and Cray can figure out how easy it will be to backport to Cray&apos;s 2.5 branch.&lt;/p&gt;

&lt;p&gt;Thanks!&lt;/p&gt;</comment>
                            <comment id="111476" author="gerrit" created="Fri, 3 Apr 2015 05:57:50 +0000"  >&lt;p&gt;Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: &lt;a href=&quot;http://review.whamcloud.com/14347&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/14347&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6370&quot; title=&quot;Read performance degrades with increasing read block size.&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6370&quot;&gt;&lt;del&gt;LU-6370&lt;/del&gt;&lt;/a&gt; osc: disable to control per-OSC LRU budget&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 19ac637dc7256ba64544df5ab3d2c176c364a27e&lt;/p&gt;</comment>
                            <comment id="111506" author="paf" created="Fri, 3 Apr 2015 16:08:08 +0000"  >&lt;p&gt;Matt - &lt;/p&gt;

&lt;p&gt;The code in question is effectively identical in Cray&apos;s 2.5, so porting the patch is trivial.&lt;/p&gt;

&lt;p&gt;Jinshan -&lt;/p&gt;

&lt;p&gt;I&apos;ve been thinking about this and had a suggestion.&lt;/p&gt;

&lt;p&gt;First, I&apos;ll explain my understanding in case I&apos;ve missed something.&lt;br/&gt;
The current code first checks to see if we&apos;re running out of LRU slots.  If so, it decides to free pages (more or fewer of them depending on whether or not the OSC is over budget).&lt;/p&gt;

&lt;p&gt;Separately, it also checks to see if the OSC in question is at more than 2*budget, and if so, decides to free pages.&lt;/p&gt;

&lt;p&gt;The problem is that these large reads are overflowing that 2*budget limit for a particular OSC, so your patch comments out that limit, allowing a particular OSC to consume any amount of cache as long as LRU slots are available.&lt;/p&gt;

&lt;p&gt;The reason this is a particular issue for ORNL is that the per-OSC cache budget is calculated by dividing the total budget by the number of OSCs.  Since ORNL has a very large number of OSTs, the budget for each OSC can be quite small.&lt;/p&gt;

&lt;p&gt;In general, freeing down to max_cache/number_of_OSCs when low on LRU pages seems correct, but we&apos;d like to let a particular OSC use a larger portion of the cache if it&apos;s available - though probably not ALL of it.&lt;/p&gt;

&lt;p&gt;And without your patch, the limit on that larger OSC is 2*budget.  How about instead making it a percentage of the total cache?  That would cover the case where the budget is small due to the number of OSCs, without letting a single OSC totally dominate the cache (which is the downside of your quick fix).  It could perhaps be made a tunable - &quot;max_single_osc_cache_percent&quot; or similar?&lt;/p&gt;</comment>
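The two triggers described in the comment above can be sketched as follows. This is a hedged illustration in Python rather than the actual Lustre osc_lru code, and the names (should_shrink, lru_slots_low, and so on) are invented for clarity:

```python
# Sketch of the two independent page-freeing triggers described above.
# Not the real Lustre osc_lru code; all names here are illustrative.

def should_shrink(osc_pages_in_use, osc_budget, lru_slots_low,
                  over_limit_factor=2):
    """Return True when an OSC should start freeing LRU pages."""
    # Trigger 1: the client as a whole is running out of LRU slots.
    if lru_slots_low:
        return True
    # Trigger 2: this OSC alone exceeds over_limit_factor * budget.
    # The proposed patch effectively removes this check, letting one
    # OSC keep growing as long as free LRU slots remain.
    if osc_pages_in_use > over_limit_factor * osc_budget:
        return True
    return False
```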
                            <comment id="111517" author="jay" created="Fri, 3 Apr 2015 18:20:59 +0000"  >&lt;p&gt;The ultimate solution would be a self-adaptive policy: an OSC can use as many LRU slots as it wants if there is no competition from other OSCs.  However, once other OSCs start consuming LRU slots, the over-budget OSC should release slots faster to maintain fairness.&lt;/p&gt;

&lt;p&gt;From my point of view, there is no difference in using a percentage of the slots as the maximum an OSC can use.&lt;/p&gt;</comment>
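The self-adaptive idea described above could look roughly like the following sketch. The reclaim formula and all names are invented for illustration and do not come from any patch on this ticket:

```python
# Illustrative sketch of a self-adaptive release policy: an over-budget
# OSC frees nothing while unchallenged, and frees faster as more OSCs
# compete.  The formula is an assumption, not code from a patch.

def reclaim_batch(osc_pages_in_use, osc_budget, competing_oscs,
                  base_batch=64):
    """Number of LRU pages this OSC should free in one round."""
    overshoot = max(0, osc_pages_in_use - osc_budget)
    if competing_oscs == 0 or overshoot == 0:
        return 0  # no competition, or within budget: keep the pages
    # Scale the release rate with the number of competitors, but never
    # free more than the amount we are over budget.
    return min(overshoot, base_batch * competing_oscs)
```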
                            <comment id="111523" author="paf" created="Fri, 3 Apr 2015 19:33:00 +0000"  >&lt;p&gt;Jinshan - Ah, OK.  Makes sense.  I guess doing a % would only help a little in certain edge cases.&lt;/p&gt;

&lt;p&gt;Matt - Cray can make sure the patch passes sanity testing before sending it your way, but it&apos;s tricky for us to verify it solves the problem with a large OST/OSC count.&lt;/p&gt;

&lt;p&gt;It would be very helpful if ORNL could verify it clears up the performance problem on your non-Cray clients before we proceed with the patch - and it would be more convincing than attempting to replicate it here, since we don&apos;t have physical hardware on the scale of Titan available for testing.  We can use various tricks to get the OST count up on limited hardware, but it&apos;s not nearly the same.&lt;/p&gt;</comment>
                            <comment id="111770" author="dustb100" created="Wed, 8 Apr 2015 20:00:47 +0000"  >&lt;p&gt;I was able to test this on our TDS this afternoon and it appears to have fixed the problem. Our next step is to test this on a Cray client. We will send out the results. &lt;/p&gt;

&lt;p&gt;Below is the output from my run:&lt;/p&gt;

&lt;p&gt;[root@atlas-tds-mds2 leverman]# dd if=10GB.out of=10GB.out.test bs=2G count=1&lt;br/&gt;
0+1 records in&lt;br/&gt;
0+1 records out&lt;br/&gt;
2147479552 bytes (2.1 GB) copied, 6.37421 s, 337 MB/s&lt;br/&gt;
[root@atlas-tds-mds2 leverman]# rpm -qa| grep lustre&lt;br/&gt;
lustre-client-2.7.0-2.6.322.6.322.6.322.6.32_431.17.1.el6.wc.x86_64.x86_64&lt;br/&gt;
lustre-client-modules-2.7.0-2.6.322.6.322.6.322.6.32_431.17.1.el6.wc.x86_64.x86_64&lt;/p&gt;</comment>
                            <comment id="111794" author="gerrit" created="Thu, 9 Apr 2015 03:23:10 +0000"  >&lt;p&gt;Oleg Drokin (oleg.drokin@intel.com) merged in patch &lt;a href=&quot;http://review.whamcloud.com/14347/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/14347/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6370&quot; title=&quot;Read performance degrades with increasing read block size.&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6370&quot;&gt;&lt;del&gt;LU-6370&lt;/del&gt;&lt;/a&gt; osc: disable to control per-OSC LRU budget&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 642dd7e50b4ff39e91c5fd0a771a26c59b5b6637&lt;/p&gt;</comment>
                            <comment id="116465" author="pjones" created="Tue, 26 May 2015 23:32:19 +0000"  >&lt;p&gt;Landed for 2.8&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                                        </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="17405" name="LU6370_1GB_BS.lctldk.out.gz" size="250" author="dustb100" created="Tue, 31 Mar 2015 13:26:46 +0000"/>
                            <attachment id="17406" name="LU6370_2GB_BS.lctldk.out.gz" size="250" author="dustb100" created="Tue, 31 Mar 2015 13:26:46 +0000"/>
                            <attachment id="17421" name="LU6370_cpu_log_20150402.out.gz" size="20378" author="dustb100" created="Thu, 2 Apr 2015 14:55:41 +0000"/>
                            <attachment id="17422" name="LU6370_max_cached_mb_20150402.out.gz" size="533057" author="dustb100" created="Thu, 2 Apr 2015 14:55:41 +0000"/>
                            <attachment id="17315" name="lu-6370-perf.tgz" size="4230083" author="lewisj" created="Tue, 17 Mar 2015 18:03:59 +0000"/>
                            <attachment id="17319" name="lu-6370_perf_data.tgz" size="5369" author="lewisj" created="Wed, 18 Mar 2015 15:37:55 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                    <customfield id="customfield_10030" key="com.atlassian.jira.plugin.system.customfieldtypes:labels">
                        <customfieldname>Epic/Theme</customfieldname>
                        <customfieldvalues>
                                        <label>Performance</label>
    
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzx8lr:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>