<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:32:54 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-3321] 2.x single thread/process throughput degraded from 1.8</title>
                <link>https://jira.whamcloud.com/browse/LU-3321</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Single thread/process throughput on tag 2.3.64 is degraded from 1.8.9, and significantly degraded when the client hits its caching limit (llite.*.max_cached_mb).  The attached graph shows lnet stats sampled every second for a single dd writing two 64 GB files, followed by dropping the cache and reading the same two files.  The tests were not done simultaneously, but the graph shows them starting from the same point.  It also takes a significant amount of time to drop the cache on 2.3.64.&lt;/p&gt;

&lt;p&gt;Lustre 2.3.64&lt;br/&gt;
Write (dd if=/dev/zero of=testfile bs=1M)&lt;br/&gt;
 68719476736 bytes (69 GB) copied, 110.459 s, 622 MB/s&lt;br/&gt;
 68719476736 bytes (69 GB) copied, 147.935 s, 465 MB/s&lt;/p&gt;

&lt;p&gt;Drop caches (echo 1 &amp;gt; /proc/sys/vm/drop_caches)&lt;br/&gt;
 real	0m43.075s&lt;/p&gt;

&lt;p&gt;Read (dd if=testfile of=/dev/null bs=1M)&lt;br/&gt;
 68719476736 bytes (69 GB) copied, 99.2963 s, 692 MB/s&lt;br/&gt;
 68719476736 bytes (69 GB) copied, 142.611 s, 482 MB/s&lt;/p&gt;


&lt;p&gt;Lustre 1.8.9&lt;br/&gt;
Write (dd if=/dev/zero of=testfile bs=1M)&lt;br/&gt;
 68719476736 bytes (69 GB) copied, 63.3077 s, 1.1 GB/s&lt;br/&gt;
 68719476736 bytes (69 GB) copied, 67.4487 s, 1.0 GB/s&lt;/p&gt;

&lt;p&gt;Drop caches (echo 1 &amp;gt; /proc/sys/vm/drop_caches)&lt;br/&gt;
 real	0m9.189s&lt;/p&gt;

&lt;p&gt;Read (dd if=testfile of=/dev/null bs=1M)&lt;br/&gt;
 68719476736 bytes (69 GB) copied, 46.4591 s, 1.5 GB/s&lt;br/&gt;
 68719476736 bytes (69 GB) copied, 52.3635 s, 1.3 GB/s&lt;/p&gt;</description>
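The dd rates quoted above are simply byte count over elapsed time, with dd's decimal megabytes (10^6 bytes); a quick sanity check of the first 2.3.64 write line:

```shell
# Recompute dd's reported rate from the description above:
# 68719476736 bytes in 110.459 s, using dd's decimal MB (1e6 bytes).
awk 'BEGIN { printf "%.0f MB/s\n", 68719476736 / 110.459 / 1e6 }'
# prints "622 MB/s"
```

This matches the "622 MB/s" dd printed, confirming the numbers are raw wall-clock throughput.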
                <environment>Tested on 2.3.64 and 1.8.9 clients with 4 OSS x 3 x 32 GB OST ramdisks</environment>
        <key id="18906">LU-3321</key>
            <summary>2.x single thread/process throughput degraded from 1.8</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="1" iconUrl="https://jira.whamcloud.com/images/icons/priorities/blocker.svg">Blocker</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="jay">Jinshan Xiong</assignee>
                                    <reporter username="jfilizetti">Jeremy Filizetti</reporter>
                        <labels>
                            <label>HB</label>
                    </labels>
                <created>Mon, 13 May 2013 01:56:39 +0000</created>
                <updated>Wed, 3 May 2017 20:03:55 +0000</updated>
                            <resolved>Thu, 6 Feb 2014 07:11:51 +0000</resolved>
                                    <version>Lustre 2.4.0</version>
                                    <fixVersion>Lustre 2.6.0</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>23</watches>
                                                                            <comments>
                            <comment id="58309" author="pjones" created="Mon, 13 May 2013 19:15:57 +0000"  >&lt;p&gt;Jeremy&lt;/p&gt;

&lt;p&gt;Could you please repeat these tests with 2.1.5 and 2.3 to see whether the behaviour is consistent with 2.3.65?&lt;/p&gt;


&lt;p&gt;Also, could you please check whether the patches in gerrit for &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-2139&quot; title=&quot;Tracking unstable pages&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-2139&quot;&gt;&lt;del&gt;LU-2139&lt;/del&gt;&lt;/a&gt; help improve this situation?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="58528" author="jfilizetti" created="Wed, 15 May 2013 03:03:13 +0000"  >&lt;p&gt;Based on some feedback from Jinshan, I retried this test with &lt;a href=&quot;http://review.whamcloud.com/#change,5446&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#change,5446&lt;/a&gt; from &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-2622&quot; title=&quot;All CPUs spinning on cl_envs_guard lock under ll_releasepage during memory reclaim&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-2622&quot;&gt;&lt;del&gt;LU-2622&lt;/del&gt;&lt;/a&gt;.  This gives about a 50% gain in performance when the cache limit is hit, but is still quite a drop from 1.8.9.  I also moved to 2.3.65 (from 2.3.64), but verified that without the patch it behaved the same as 2.3.64.  When I get a chance I will test again with 2.3 and the &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-2139&quot; title=&quot;Tracking unstable pages&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-2139&quot;&gt;&lt;del&gt;LU-2139&lt;/del&gt;&lt;/a&gt; patches.&lt;/p&gt;

&lt;p&gt;Here are the times from this last test (and graph):&lt;br/&gt;
1.8.9&lt;br/&gt;
 Write test  &lt;br/&gt;
 68719476736 bytes (69 GB) copied, 60.8574 s, 1.1 GB/s&lt;br/&gt;
 68719476736 bytes (69 GB) copied, 70.3939 s, 976 MB/s&lt;br/&gt;
 Dropping cache &lt;br/&gt;
 real   0m12.435s&lt;br/&gt;
 Read test  &lt;br/&gt;
 68719476736 bytes (69 GB) copied, 48.445 s, 1.4 GB/s&lt;br/&gt;
 68719476736 bytes (69 GB) copied, 47.7765 s, 1.4 GB/s&lt;/p&gt;

&lt;p&gt;2.3.65 + &lt;a href=&quot;http://review.whamcloud.com/#change,5446&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#change,5446&lt;/a&gt;&lt;br/&gt;
 Write test  &lt;br/&gt;
 68719476736 bytes (69 GB) copied, 87.1735 s, 788 MB/s&lt;br/&gt;
 68719476736 bytes (69 GB) copied, 98.4901 s, 698 MB/s&lt;br/&gt;
 Dropping cache &lt;br/&gt;
 real   0m16.063s&lt;br/&gt;
 Read test  &lt;br/&gt;
 68719476736 bytes (69 GB) copied, 77.9799 s, 881 MB/s&lt;br/&gt;
 68719476736 bytes (69 GB) copied, 93.733 s, 733 MB/s&lt;/p&gt;
</comment>
                            <comment id="60978" author="pjones" created="Fri, 21 Jun 2013 12:36:16 +0000"  >&lt;p&gt;Jinshan&lt;/p&gt;

&lt;p&gt;Will this situation be improved with the CLIO simplification work that has been proposed?&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="61014" author="jay" created="Fri, 21 Jun 2013 17:25:06 +0000"  >&lt;p&gt;Hi Peter,&lt;/p&gt;

&lt;p&gt;Yes, I think so.&lt;/p&gt;</comment>
                            <comment id="68680" author="jay" created="Wed, 9 Oct 2013 17:24:28 +0000"  >&lt;p&gt;Please check patch for master at:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://review.whamcloud.com/7888&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/7888&lt;/a&gt;&lt;br/&gt;
&lt;a href=&quot;http://review.whamcloud.com/7889&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/7889&lt;/a&gt;&lt;br/&gt;
&lt;a href=&quot;http://review.whamcloud.com/7890&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/7890&lt;/a&gt;&lt;br/&gt;
&lt;a href=&quot;http://review.whamcloud.com/7891&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/7891&lt;/a&gt;&lt;br/&gt;
&lt;a href=&quot;http://review.whamcloud.com/7892&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/7892&lt;/a&gt;&lt;br/&gt;
&lt;a href=&quot;http://review.whamcloud.com/7893&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/7893&lt;/a&gt;&lt;br/&gt;
&lt;a href=&quot;http://review.whamcloud.com/7894&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/7894&lt;/a&gt;&lt;br/&gt;
&lt;a href=&quot;http://review.whamcloud.com/7895&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/7895&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;and patch for b2_4 at:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://review.whamcloud.com/7896&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/7896&lt;/a&gt;&lt;br/&gt;
&lt;a href=&quot;http://review.whamcloud.com/7897&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/7897&lt;/a&gt;&lt;br/&gt;
&lt;a href=&quot;http://review.whamcloud.com/7898&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/7898&lt;/a&gt;&lt;br/&gt;
&lt;a href=&quot;http://review.whamcloud.com/7899&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/7899&lt;/a&gt;&lt;br/&gt;
&lt;a href=&quot;http://review.whamcloud.com/7900&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/7900&lt;/a&gt;&lt;br/&gt;
&lt;a href=&quot;http://review.whamcloud.com/7901&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/7901&lt;/a&gt;&lt;br/&gt;
&lt;a href=&quot;http://review.whamcloud.com/7902&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/7902&lt;/a&gt;&lt;br/&gt;
&lt;a href=&quot;http://review.whamcloud.com/7903&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/7903&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="68782" author="jay" created="Fri, 11 Oct 2013 00:33:33 +0000"  >&lt;p&gt;Let me share a test result from Jeremy which showed performance improvement. &lt;/p&gt;</comment>
                            <comment id="68783" author="jay" created="Fri, 11 Oct 2013 00:34:01 +0000"  >&lt;p&gt;And the test I did in our lab&lt;/p&gt;</comment>
                            <comment id="70830" author="efocht" created="Wed, 6 Nov 2013 10:56:19 +0000"  >&lt;p&gt;Jinshan, a quick question: are the results you&apos;ve uploaded on Oct. 11 measured on a striped file? How many stripes? Were these multiple threads on one striped file, or each thread with its own file?&lt;/p&gt;

&lt;p&gt;I&apos;ve seen very poor performance with striped files and no increase in performance when increasing the number of stripes with Lustre 2.x; I just want to make sure whether your measurements cover &quot;my&quot; use case or not.&lt;/p&gt;

&lt;p&gt;Thanks,&lt;br/&gt;
Erich&lt;/p&gt;</comment>
                            <comment id="70952" author="jay" created="Thu, 7 Nov 2013 06:01:13 +0000"  >&lt;p&gt;Hi Erich,&lt;/p&gt;

&lt;p&gt;I didn&apos;t use files striped over multiple OSTs, and each thread wrote its own file in the multi-threaded testing.&lt;/p&gt;

&lt;p&gt;Did you use my performance patch to do the test?&lt;/p&gt;

&lt;p&gt;I assume you&apos;re benchmarking a multi-striped file with a single thread. In my experience, if your OSTs are fast, striping files over multiple OSTs won&apos;t improve performance, because the bottleneck is the client CPU. You can take a look at the rpc_stats of each OSC (lctl get_param osc.*.rpc_stats): if the value of rpcs_in_flight is low, it implies the client can&apos;t generate data fast enough to saturate the OSTs.&lt;/p&gt;</comment>
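A minimal sketch of the check suggested here; the "RPCs in flight:" header lines are an assumption about the rpc_stats output format, which can vary between Lustre versions:

```shell
# Pull the in-flight RPC counters out of each OSC's rpc_stats.
# Assumes lines of the form "read RPCs in flight: N" and
# "write RPCs in flight: N"; adjust the pattern for your version.
lctl get_param osc.*.rpc_stats 2>/dev/null |
    awk '/RPCs in flight:/ { print $1, $NF }'
```

Low counts here, as Jinshan notes, mean the client is not producing data fast enough to keep the OSTs busy.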
                            <comment id="71185" author="morrone" created="Sat, 9 Nov 2013 00:28:25 +0000"  >&lt;p&gt;We are rather disappointed by the revert of commit 93fe562.  While I can understand not liking the performance impact, basic functionality trumps performance.  On nodes with high processor counts, the lustre client thrashes to the point of being unusable without that patch.&lt;/p&gt;

&lt;p&gt;I think this ticket is now a blocker for 2.6, because we can&apos;t operate with the tree in its current state.&lt;/p&gt;</comment>
                            <comment id="71195" author="jay" created="Sat, 9 Nov 2013 06:35:29 +0000"  >&lt;p&gt;Not to worry, we&apos;ve already got a patch for a per-CPU cl_env cache. It turns out that the overhead of allocating a cl_env is really high, so caching cl_env is necessary; we just need a smarter way to cache them.&lt;/p&gt;</comment>
                            <comment id="74088" author="jlevi" created="Thu, 26 Dec 2013 15:03:09 +0000"  >&lt;p&gt;Once &lt;a href=&quot;http://review.whamcloud.com/#/c/8523/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/8523/&lt;/a&gt; lands this ticket can be closed&lt;/p&gt;</comment>
                            <comment id="75034" author="jfc" created="Wed, 15 Jan 2014 21:51:07 +0000"  >&lt;p&gt;Can anyone tell me about the status of the final patch for this issue? Looks like some recent testing has been successful, but I don&apos;t know the other tools well enough to know if the patch is ready to land, or even already landed. Thanks.&lt;/p&gt;</comment>
                            <comment id="75037" author="paf" created="Wed, 15 Jan 2014 22:00:50 +0000"  >&lt;p&gt;John - There are a few components to the review/landing process.&lt;/p&gt;

&lt;p&gt;The patch is built automatically by Jenkins, which contributes a +1 if it builds correctly (that&apos;s already true for &lt;a href=&quot;http://review.whamcloud.com/#/c/8523/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/8523/&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Maloo then runs several sets of tests, currently I think that&apos;s three.  Once all the sets of tests have passed, Maloo contributes a +1.  It looks like two of the three test sets have completed for the patch in question.&lt;/p&gt;

&lt;p&gt;Separately, human reviewers contribute code reviews.  A positive review gives a +1.  The general standard is two +1s before landing a patch.&lt;/p&gt;

&lt;p&gt;Then, finally, someone in the gatekeeper role - I believe that&apos;s currently Oleg Drokin and Andreas Dilger - approves the patch, which appears as +2.  Then the patch is cherry-picked on to master (also by a gatekeeper).&lt;/p&gt;

&lt;p&gt;Once the patch has been cherry-picked, it has landed.&lt;/p&gt;

&lt;p&gt;This one is almost ready to land.  The tests need to complete, then it should be approved and cherry-picked quickly.  (Since this ticket is a blocker for 2.6, it&apos;ll definitely be in before that release.) &lt;/p&gt;</comment>
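For anyone wanting to try the patch before it is cherry-picked, Gerrit exposes every change under refs/changes/&lt;NN&gt;/&lt;change&gt;/&lt;patchset&gt;, where NN is the change number modulo 100; the patchset number and repository URL below are illustrative, not taken from the review:

```shell
# Build the Gerrit ref for change 8523 (the patch discussed above).
n=8523; ps=1   # patchset number is illustrative
ref=$(printf 'refs/changes/%02d/%d/%d' $((n % 100)) "$n" "$ps")
echo "$ref"    # refs/changes/23/8523/1

# Then, assuming the review.whamcloud.com repository URL:
# git fetch http://review.whamcloud.com/lustre-release "$ref" && git checkout FETCH_HEAD
```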
                            <comment id="75062" author="jay" created="Thu, 16 Jan 2014 06:18:34 +0000"  >&lt;p&gt;the maloo test failed due to an unrelated bug. I will retrigger the test.&lt;/p&gt;</comment>
                            <comment id="76330" author="jay" created="Thu, 6 Feb 2014 07:11:45 +0000"  >&lt;p&gt;phew, the last patch was landed.&lt;/p&gt;</comment>
                            <comment id="76352" author="rjh" created="Thu, 6 Feb 2014 15:15:09 +0000"  >&lt;p&gt;does this also fix &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-2032&quot; title=&quot;small random read i/o performance regression&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-2032&quot;&gt;LU-2032&lt;/a&gt; ?&lt;/p&gt;</comment>
                            <comment id="76358" author="jay" created="Thu, 6 Feb 2014 17:06:10 +0000"  >&lt;p&gt;Hi Robin,&lt;/p&gt;

&lt;p&gt;No, they are not related.&lt;/p&gt;</comment>
                            <comment id="77598" author="pichong" created="Fri, 21 Feb 2014 15:02:49 +0000"  >&lt;p&gt;Could you provide the details of the configuration where you made your measurements (client node sockets, memory size, network interface, max_cached_mb setting, other client tuning, OSS node and OST storage, file striping, IO size, RPC size)?&lt;/p&gt;

&lt;p&gt;Does the dd test include a final fsync of data to storage?&lt;/p&gt;

&lt;p&gt;Do you have performance results with a version of Lustre after all the patches have been landed?&lt;/p&gt;

&lt;p&gt;Thanks.&lt;/p&gt;</comment>
                            <comment id="78619" author="jay" created="Thu, 6 Mar 2014 18:56:52 +0000"  >&lt;p&gt;Hi Pichon,&lt;/p&gt;

&lt;p&gt;Since you&apos;re asking, I assume the performance numbers didn&apos;t reach your expectations, so I performed the test again with the latest master to make sure everything is fine. If that is the case, please collect statistics on your node and I will take a look.&lt;/p&gt;

&lt;p&gt;I just performed the performance testing again on OpenSFS nodes with the following hardware configuration:&lt;/p&gt;

&lt;p&gt;Client nodes:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@c01 lustre]# free
             total       used       free     shared    buffers     cached
Mem:      32870020   26477056    6392964          0     147936   21561448
-/+ buffers/cache:    4767672   28102348
Swap:     16506872          0   16506872
[root@c01 lustre]# lscpu 
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 44
Stepping:              2
CPU MHz:               1600.000
BogoMIPS:              4800.10
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              12288K
NUMA node0 CPU(s):     0-7
[root@c01 lustre]# lspci |grep InfiniBand
03:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;So the client node has 32 GB of memory and 4 cores with 2 threads per core. The network is InfiniBand with 40 Gb/s throughput.&lt;/p&gt;

&lt;p&gt;The server node is another node identical to the client, with patch &lt;a href=&quot;http://review.whamcloud.com/5164&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/5164&lt;/a&gt; applied. I used a ramdisk as the OST because we don&apos;t have a fast disk array; Jeremy saw a real performance improvement on his real disk storage. I disabled writethrough_cache_enable on the OST to avoid consuming too much memory caching data.&lt;/p&gt;

&lt;p&gt;Here is the test result:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@c01 lustre]# dd if=/dev/zero of=/mnt/lustre/testfile bs=1M count=40960
40960+0 records in
40960+0 records out
42949672960 bytes (43 GB) copied, 39.5263 s, 1.1 GB/s
[root@c01 lustre]# lfs getstripe /mnt/lustre/testfile 
/mnt/lustre/testfile
lmm_stripe_count:   1
lmm_stripe_size:    1048576
lmm_pattern:        1
lmm_layout_gen:     0
lmm_stripe_offset:  0
	obdidx		 objid		 objid		 group
	     0	             2	          0x2	             0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I didn&apos;t do any tuning on the client node, not even disabling checksums. Also, a snapshot of `collectl -scml&apos;:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@c01 ~]# collectl -scml
waiting for 1 second sample...
#&amp;lt;----CPU[HYPER]-----&amp;gt;&amp;lt;-----------Memory-----------&amp;gt;&amp;lt;--------Lustre Client--------&amp;gt;
#cpu sys inter  ctxsw Free Buff Cach Inac Slab  Map  KBRead  Reads  KBWrite Writes
  20  20  5500  26162  18G 144M  10G   9G   1G  45M       0      0  1110016   1084
  24  24  5513  23691  17G 144M  11G  10G   1G  45M       0      0  1025024   1001
  20  20  5657  26083  15G 144M  12G  11G   2G  45M       0      0  1112064   1086
  21  21  5434  25963  14G 144M  13G  12G   2G  45M       0      0  1110016   1084
  20  20  5690  26326  13G 144M  14G  13G   2G  45M       0      0  1104896   1079
  21  21  5646  26094  11G 144M  15G  14G   2G  45M       0      0  1105920   1080
  21  21  5466  24678  10G 144M  16G  15G   3G  45M       0      0  1046528   1022
  20  20  5634  25563   9G 144M  17G  16G   3G  45M       0      0  1097728   1072
  20  20  5818  26008   8G 144M  18G  17G   3G  45M       0      0  1111040   1085
  20  20  5673  26467   6G 144M  20G  18G   3G  45M       0      0  1104896   1079
  24  24  6346  25027   6G 144M  20G  19G   4G  45M       0      0  1060864   1036
  33  32  7162  21258   6G 144M  20G  19G   4G  45M       0      0   960512    938
  28  28  7021  22865   6G 144M  20G  19G   4G  45M       0      0  1042432   1018
  28  28  7177  23890   6G 144M  20G  19G   4G  45M       0      0  1039360   1015
  28  28  7326  24888   6G 144M  20G  19G   4G  45M       0      0  1090560   1065
  28  28  7465  24162   6G 144M  20G  19G   4G  45M       0      0  1029120   1005
  31  31  7382  22865   6G 144M  20G  19G   4G  45M       0      0   980992    958
  28  28  7263  24392   6G 144M  20G  19G   4G  45M       0      0  1075200   1050
  28  28  7278  24312   6G 144M  20G  19G   4G  45M       0      0  1080320   1055
  28  28  7252  25150   6G 144M  20G  19G   4G  45M       0      0  1059840   1035
  28  28  7241  25082   6G 144M  20G  19G   4G  45M       0      0  1076224   1051
  33  32  7343  22373   6G 144M  20G  19G   4G  45M       0      0   966656    944
#&amp;lt;----CPU[HYPER]-----&amp;gt;&amp;lt;-----------Memory-----------&amp;gt;&amp;lt;--------Lustre Client--------&amp;gt;
#cpu sys inter  ctxsw Free Buff Cach Inac Slab  Map  KBRead  Reads  KBWrite Writes
  28  28  7340  24704   6G 144M  20G  19G   4G  45M       0      0  1091584   1066
  27  27  7212  24694   6G 144M  20G  19G   4G  45M       0      0  1055744   1031
  28  28  7191  24909   6G 144M  20G  19G   4G  45M       0      0  1073152   1048
  28  28  7257  25058   6G 144M  20G  19G   4G  45M       0      0  1037312   1013
  33  33  7435  22787   6G 144M  20G  19G   4G  45M       0      0   988160    965
  28  28  6961  23635   6G 144M  20G  19G   4G  45M       0      0  1044480   1020
  27  27  7129  24866   6G 144M  20G  19G   4G  45M       0      0  1045504   1021
  28  27  7024  24380   6G 144M  20G  19G   4G  45M       0      0  1053666   1029
  28  28  7058  24489   6G 144M  20G  19G   4G  45M       0      0  1041426   1017
  33  33  7234  22235   6G 144M  20G  19G   4G  45M       0      0   970752    948
  27  27  7127  24555   6G 144M  20G  19G   4G  45M       0      0  1067008   1042
  28  28  7189  24215   6G 144M  20G  19G   4G  45M       0      0  1082368   1057
  28  28  7201  24734   6G 144M  20G  19G   4G  45M       0      0  1064960   1040
  27  27  7046  24564   6G 144M  20G  19G   4G  44M       0      0  1040384   1016
   0   0    67    110   6G 144M  20G  19G   4G  44M       0      0        0      0
   0   0    63    113   6G 144M  20G  19G   4G  44M       0      0        0      0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;


&lt;blockquote&gt;
&lt;p&gt;Does the dd test include a final fsync of data to storage ?&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;No, I didn&apos;t, but I don&apos;t think this affects the result, because I ran dd with a 1M block size, so the dirty data is sent out immediately.&lt;/p&gt;</comment>
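If one did want the final flush included in dd's timing, GNU dd's conv=fsync performs an fsync on the output file before dd exits and reports; a sketch, with an illustrative path and size rather than the original test's:

```shell
# Same write pattern as the tests above, but conv=fsync makes dd
# fsync the file before reporting, so the final flush is included
# in the elapsed time. Path and size are illustrative.
testfile=${TESTFILE:-/tmp/lu3321-testfile}
dd if=/dev/zero of="$testfile" bs=1M count=4 conv=fsync
```

This is also what IOR's fsync=1 option (used later in this ticket) accomplishes for its own timing.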
                            <comment id="81739" author="pichong" created="Wed, 16 Apr 2014 15:19:37 +0000"  >&lt;p&gt;Hi Jinshan,&lt;/p&gt;

&lt;p&gt;Here are the results of the performance measurements I have done.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Configuration&lt;/b&gt;&lt;br/&gt;
Client is a node with 2 Ivybridge sockets (24 cores, 2.7GHz), 32GB memory, 1 FDR Infiniband adapter.&lt;br/&gt;
OSS is a node with 2 Sandybridge sockets (16 cores, 2.2GHZ), 32GB memory, 1 FDR Infiniband adapter, with 5 OSTs devices from a disk array and 1 OST ramdisk device.&lt;br/&gt;
Each disk array OST reaches 900 MiB/s write and 1100 MiB/s read with obdfilter-survey.&lt;/p&gt;

&lt;p&gt;Two Lustre versions have been tested: 2.5.57 and 1.8.8-wc1&lt;br/&gt;
OSS cache is disabled (writethrough_cache_enable=0 and read_cache_enable=0)&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Benchmark&lt;/b&gt;&lt;br/&gt;
IOR with following options:&lt;br/&gt;
api=POSIX&lt;br/&gt;
filePerProc=1&lt;br/&gt;
blockSize=64G&lt;br/&gt;
transferSize=1M&lt;br/&gt;
numTasks=1&lt;br/&gt;
fsync=1&lt;/p&gt;

&lt;p&gt;Server and client system cache is cleared before each write test and read test&lt;br/&gt;
Tests are repeated 3 times, average value is computed.&lt;br/&gt;
Tests are run as a standard user.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Results&lt;/b&gt;&lt;br/&gt;
With disk array OSTs, best results are achieved with a stripecount of 3.&lt;/p&gt;
&lt;div class=&apos;table-wrap&apos;&gt;
&lt;table class=&apos;confluenceTable&apos;&gt;&lt;tbody&gt;
&lt;tr&gt;
&lt;th class=&apos;confluenceTh&apos;&gt;lustre version&lt;/th&gt;
&lt;th class=&apos;confluenceTh&apos;&gt;write&lt;/th&gt;
&lt;th class=&apos;confluenceTh&apos;&gt;read&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;lustre 2.5.57&lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;886 MiB/s&lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;1020 MiB/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;lustre 1.8.8-wc1&lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;823 MiB/s&lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;1135 MiB/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;


&lt;p&gt;The write performance is under the 1 GiB/s target which was my goal. Do you think this is a performance we could achieve? What tuning would you recommend? I will attach monitoring data for one of the Lustre 2.5.57 runs.&lt;/p&gt;


&lt;p&gt;For comparison, the results with the ramdisk OST:&lt;/p&gt;
&lt;div class=&apos;table-wrap&apos;&gt;
&lt;table class=&apos;confluenceTable&apos;&gt;&lt;tbody&gt;
&lt;tr&gt;
&lt;th class=&apos;confluenceTh&apos;&gt;lustre version&lt;/th&gt;
&lt;th class=&apos;confluenceTh&apos;&gt;write&lt;/th&gt;
&lt;th class=&apos;confluenceTh&apos;&gt;read&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;lustre 2.5.57&lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;856 MiB/s&lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;941 MiB/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;lustre 1.8.8-wc1&lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;919 MiB/s&lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;1300 MiB/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;


&lt;p&gt;Various tunings have been tested but gave no improvement:&lt;br/&gt;
IOR transferSize, llite max_cached_mb, OSS cache enabled.&lt;/p&gt;


&lt;p&gt;What makes a significant difference with Lustre 2.5.57 is the write performance when the test is run as the root user: it reaches 926 MiB/s (+4.5% compared to a standard user). Should I open a separate ticket to track this difference?&lt;/p&gt;

&lt;p&gt;Greg.&lt;/p&gt;</comment>
                            <comment id="81741" author="pichong" created="Wed, 16 Apr 2014 15:24:56 +0000"  >&lt;p&gt;Attached is the monitoring data from one of the Lustre 2.5.57 runs, as a standard user.&lt;/p&gt;</comment>
                            <comment id="81787" author="jay" created="Thu, 17 Apr 2014 00:03:52 +0000"  >&lt;p&gt;As for the difference between the root and normal user, I guess this is due to the quota check. Though you may not have quota enabled for this normal user, it will still check whether quota is enabled for this specific user for each IO.&lt;/p&gt;

&lt;p&gt;What write speed do you get with 1 stripe? Please be sure to disable the debug log with `lctl set_param debug=0&apos; on the client side. Also, please monitor client-side CPU usage specifically when writing single-stripe and multi-striped files.&lt;/p&gt;

&lt;p&gt;In general, if CPU is the bottleneck on the client side, it won&apos;t help improve IO speed by adding more stripes.&lt;/p&gt;</comment>
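Gathered in one place, the client-side knobs mentioned in this thread; debug and max_cached_mb appear verbatim in the comments above, while the checksum parameter name is my assumption of the usual knob and should be verified on your Lustre version:

```shell
lctl set_param debug=0                  # disable client debug logging (as suggested above)
lctl get_param llite.*.max_cached_mb    # client data cache limit from the description
lctl set_param osc.*.checksums=0        # disable wire checksums (assumed knob name)
```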
                            <comment id="81821" author="pichong" created="Thu, 17 Apr 2014 12:04:41 +0000"  >&lt;p&gt;I agree with quota being the potential culprit. But why is it so expensive to check whether a user has quota enabled? Even if it&apos;s done for each IO, a single thread has no concurrency problem.&lt;/p&gt;


&lt;p&gt;The performance with a stripecount of 1 is 720 MiB/s write and 896 MiB/s read (Lustre 2.5.57).&lt;/p&gt;

&lt;p&gt;Attached is the monitoring data for a run with a stripecount of 1 (lu-3321-singlethreadperf2.tgz).&lt;/p&gt;


&lt;p&gt;All runs have been launched with /proc/sys/lnet/debug set to 0 on both client and server.&lt;/p&gt;

&lt;p&gt;Monitoring data for client CPU usage is in client/mo85/cpustat directory:&lt;/p&gt;
&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;cpustat-global and cpustat-global.png give all cpus usage&lt;/li&gt;
	&lt;li&gt;cpustat-cpu0 and cpustat-cpu0.png give CPU0 usage (IOR is bound to that core).&lt;/li&gt;
&lt;/ul&gt;



&lt;p&gt;Since IOR performance is close to OST device raw performance, using multiple stripes might help exceed this limit.&lt;/p&gt;</comment>
                            <comment id="82062" author="jay" created="Mon, 21 Apr 2014 17:09:56 +0000"  >&lt;p&gt;From the CPU stats, it is clear that CPU usage is around 80% for a single-thread, single-stripe write; this is why you only see a slight performance improvement with a multi-striped file. CLIO is still CPU intensive, and your CPU can only drive ~900 MB/s of IO on the client side. As a comparison, the CPU on the OpenSFS cluster can drive ~1.2 GB/s.&lt;/p&gt;

&lt;p&gt;Can you please provide the test script you&apos;re using to collect data and generate the diagrams, so that I can reproduce this on the OpenSFS cluster?&lt;/p&gt;</comment>
                            <comment id="82095" author="paf" created="Mon, 21 Apr 2014 21:46:23 +0000"  >&lt;p&gt;Just as a favor to anyone else interested, this is a complete list of patches landed against &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3321&quot; title=&quot;2.x single thread/process throughput degraded from 1.8&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3321&quot;&gt;&lt;del&gt;LU-3321&lt;/del&gt;&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://review.whamcloud.com/#/c/7888&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/7888&lt;/a&gt;&lt;br/&gt;
&lt;a href=&quot;http://review.whamcloud.com/#/c/7890&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/7890&lt;/a&gt;&lt;br/&gt;
&lt;a href=&quot;http://review.whamcloud.com/#/c/7891&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/7891&lt;/a&gt;&lt;br/&gt;
&lt;a href=&quot;http://review.whamcloud.com/#/c/7892&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/7892&lt;/a&gt;&lt;br/&gt;
&lt;a href=&quot;http://review.whamcloud.com/#/c/8174&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/8174&lt;/a&gt;&lt;br/&gt;
&lt;a href=&quot;http://review.whamcloud.com/#/c/7893&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/7893&lt;/a&gt;&lt;br/&gt;
&lt;a href=&quot;http://review.whamcloud.com/#/c/7894&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/7894&lt;/a&gt;&lt;br/&gt;
&lt;a href=&quot;http://review.whamcloud.com/#/c/7895&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/7895&lt;/a&gt;&lt;br/&gt;
&lt;a href=&quot;http://review.whamcloud.com/#/c/8523&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/8523&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;7889, listed in Jinshan&apos;s earlier list of patches, was abandoned.&lt;/p&gt;</comment>
                            <comment id="82126" author="pichong" created="Tue, 22 Apr 2014 07:54:14 +0000"  >&lt;p&gt;Jinshan,&lt;/p&gt;

&lt;p&gt;Attached is the script that generates the CPU usage graphs with gnuplot. The file &quot;filename&quot; contains the data, where each line has the following format:&lt;br/&gt;
&lt;em&gt;time user system idle iowait&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This can be obtained with the vmstat command for global CPU usage, or from the /proc/stat file for per-CPU usage.&lt;/p&gt;


&lt;p&gt;What model of CPU is present on the OpenSFS cluster?&lt;/p&gt;
</comment>
                            <comment id="82198" author="jay" created="Tue, 22 Apr 2014 19:37:19 +0000"  >&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@c01 ~]# lscpu 
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 44
Stepping:              2
CPU MHz:               1600.000
BogoMIPS:              4800.65
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              12288K
NUMA node0 CPU(s):     0-7

[root@c01 ~]# cat /proc/cpuinfo 
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 44
model name	: Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
stepping	: 2
cpu MHz		: 1600.000
cache size	: 12288 KB
physical id	: 0
siblings	: 8
core id		: 0
cpu cores	: 4
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt lahf_lm ida arat epb dts tpr_shadow vnmi flexpriority ept vpid
bogomips	: 4800.65
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="12054">LU-744</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="15971">LU-2139</issuekey>
        </issuelink>
                            </outwardlinks>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="16129">LU-2032</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="21828">LU-4201</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="17179">LU-2622</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="23785">LU-4786</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="35571">LU-7912</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="14763" name="cpustat.scr" size="550" author="pichong" created="Tue, 22 Apr 2014 07:54:14 +0000"/>
                            <attachment id="12711" name="dd_throughput_comparison.png" size="6338" author="jfilizetti" created="Mon, 13 May 2013 01:56:39 +0000"/>
                            <attachment id="12822" name="dd_throughput_comparison_with_change_5446.png" size="6960" author="jfilizetti" created="Wed, 15 May 2013 03:03:49 +0000"/>
                            <attachment id="14718" name="lu-3321-singlethreadperf.tgz" size="400599" author="pichong" created="Wed, 16 Apr 2014 15:24:56 +0000"/>
                            <attachment id="14724" name="lu-3321-singlethreadperf2.tgz" size="577616" author="pichong" created="Thu, 17 Apr 2014 12:04:40 +0000"/>
                            <attachment id="13612" name="mcm8_wcd.png" size="8980" author="jay" created="Fri, 11 Oct 2013 00:33:33 +0000"/>
                            <attachment id="13613" name="perf3.png" size="105368" author="jay" created="Fri, 11 Oct 2013 00:34:01 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                    <customfield id="customfield_10030" key="com.atlassian.jira.plugin.system.customfieldtypes:labels">
                        <customfieldname>Epic/Theme</customfieldname>
                        <customfieldvalues>
                                        <label>Performance</label>
    
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzvqzr:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>8259</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>