<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:41:05 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-4257] parallel dds are slower than serial dds</title>
                <link>https://jira.whamcloud.com/browse/LU-4257</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Sanger has an interesting test in which 20 processes read from the same file. They first run in parallel and then run serially (after flushing the cache). Their expected result is that the serial and parallel runs should take about the same amount of time. What they see, however, is that parallel reads are about 50% slower than serial reads:&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;client1# cat readfile.sh
#!/bin/sh

dd &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt;=/lustre/scratch110/sanger/jb23/test/delete bs=4M of=/dev/&lt;span class=&quot;code-keyword&quot;&gt;null&lt;/span&gt;

client1# &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; i in `seq -w 1 20 `
&lt;span class=&quot;code-keyword&quot;&gt;do&lt;/span&gt;
  (time $LOC/readfile.sh )  &amp;gt; $LOC/results/${i}_out 2&amp;gt;&amp;amp;1 &amp;amp;
done
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;In parallel &lt;/p&gt;

&lt;p&gt;01_out:real     3m36.228s &lt;br/&gt;
02_out:real     3m36.227s &lt;br/&gt;
03_out:real     3m36.226s &lt;br/&gt;
04_out:real     3m36.224s &lt;br/&gt;
05_out:real     3m36.224s &lt;br/&gt;
06_out:real     3m36.224s &lt;br/&gt;
07_out:real     3m36.222s &lt;br/&gt;
08_out:real     3m36.221s &lt;br/&gt;
09_out:real     3m36.228s &lt;br/&gt;
10_out:real     3m36.222s &lt;br/&gt;
11_out:real     3m36.220s &lt;br/&gt;
12_out:real     3m36.220s &lt;br/&gt;
13_out:real     3m36.228s &lt;br/&gt;
14_out:real     3m36.219s &lt;br/&gt;
15_out:real     3m36.217s &lt;br/&gt;
16_out:real     3m36.218s &lt;br/&gt;
17_out:real     3m36.214s &lt;br/&gt;
18_out:real     3m36.214s &lt;br/&gt;
19_out:real     3m36.211s &lt;br/&gt;
20_out:real     3m36.212s &lt;/p&gt;

&lt;p&gt;A serial read (I expect all the time to be in the first read).&lt;/p&gt;

&lt;p&gt;grep -i real *_serial &lt;br/&gt;
01_out_serial:real      2m31.372s &lt;br/&gt;
02_out_serial:real      0m1.190s &lt;br/&gt;
03_out_serial:real      0m0.654s &lt;br/&gt;
04_out_serial:real      0m0.562s &lt;br/&gt;
05_out_serial:real      0m0.574s &lt;br/&gt;
06_out_serial:real      0m0.570s &lt;br/&gt;
07_out_serial:real      0m0.574s &lt;br/&gt;
08_out_serial:real      0m0.461s &lt;br/&gt;
09_out_serial:real      0m0.456s &lt;br/&gt;
10_out_serial:real      0m0.462s &lt;br/&gt;
11_out_serial:real      0m0.475s &lt;br/&gt;
12_out_serial:real      0m0.473s &lt;br/&gt;
13_out_serial:real      0m0.582s &lt;br/&gt;
14_out_serial:real      0m0.580s &lt;br/&gt;
15_out_serial:real      0m0.569s &lt;br/&gt;
16_out_serial:real      0m0.679s &lt;br/&gt;
17_out_serial:real      0m0.565s &lt;br/&gt;
18_out_serial:real      0m0.573s &lt;br/&gt;
19_out_serial:real      0m0.579s &lt;br/&gt;
20_out_serial:real      0m0.472s &lt;/p&gt;

&lt;p&gt;And try the same experiment with nfs &lt;/p&gt;

&lt;p&gt;Serial access. &lt;/p&gt;

&lt;p&gt;root@farm3-head4:~/tmp/test/results# grep -i real * &lt;br/&gt;
results/01_out_serial:real      0m19.923s &lt;br/&gt;
results/02_out_serial:real      0m1.373s &lt;br/&gt;
results/03_out_serial:real      0m1.237s &lt;br/&gt;
results/04_out_serial:real      0m1.276s &lt;br/&gt;
results/05_out_serial:real      0m1.289s &lt;br/&gt;
results/06_out_serial:real      0m1.297s &lt;br/&gt;
results/07_out_serial:real      0m1.265s &lt;br/&gt;
results/08_out_serial:real      0m1.278s &lt;br/&gt;
results/09_out_serial:real      0m1.224s &lt;br/&gt;
results/10_out_serial:real      0m1.225s &lt;br/&gt;
results/11_out_serial:real      0m1.221s &lt;br/&gt;
...&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;So the question is:&lt;br/&gt;
Why is the access slower when we access the file in parallel and it is not in the cache?&lt;/p&gt;

&lt;p&gt;Is there some lock contention going on with multiple readers? Or is the Lustre client sending multiple RPCs for the same data, even though there is already an outstanding request? They have tried this on 1.8.x clients as well as 2.5.0. &lt;/p&gt;
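For reference, the cold-cache experiment above can be sketched in C (a hypothetical reconstruction for illustration; the function names, reader count and buffer size are not taken from this ticket):

```c
/* Hypothetical sketch of the 20-reader test: each child process reads
 * the whole file with 4 MiB reads and discards the data, mimicking
 * "dd if=FILE bs=4M of=/dev/null". */
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/wait.h>

#define BUFSZ (4 << 20)               /* 4 MiB, matching dd bs=4M */

static void read_to_eof(const char *path)
{
    char *buf = malloc(BUFSZ);
    int fd = open(path, O_RDONLY);

    if (buf == NULL || fd < 0)
        exit(1);                      /* child reports failure via exit status */
    while (read(fd, buf, BUFSZ) > 0)
        ;                             /* discard the data, like of=/dev/null */
    close(fd);
    free(buf);
}

/* Fork nproc concurrent readers of path; 0 if all succeeded, -1 otherwise. */
int parallel_read(const char *path, int nproc)
{
    int status, rc = 0;

    for (int i = 0; i < nproc; i++) {
        pid_t pid = fork();

        if (pid < 0)
            return -1;
        if (pid == 0) {               /* each child is one "dd" */
            read_to_eof(path);
            _exit(0);
        }
    }
    while (wait(&status) > 0)         /* collect every reader */
        if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
            rc = -1;
    return rc;
}
```

Timing parallel_read(path, 20) against 20 back-to-back read_to_eof() calls (dropping caches in between) reproduces the parallel-vs-serial comparison.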

&lt;p&gt;Thanks.&lt;/p&gt;</description>
                <environment></environment>
        <key id="22045">LU-4257</key>
            <summary>parallel dds are slower than serial dds</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="jay">Jinshan Xiong</assignee>
                                    <reporter username="ihara">Shuichi Ihara</reporter>
                        <labels>
                    </labels>
                <created>Fri, 15 Nov 2013 15:43:16 +0000</created>
                <updated>Tue, 16 Jun 2020 09:00:53 +0000</updated>
                            <resolved>Fri, 3 Jun 2016 20:47:39 +0000</resolved>
                                    <version>Lustre 2.5.0</version>
                                    <fixVersion>Lustre 2.9.0</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>29</watches>
                                                                            <comments>
                            <comment id="71663" author="pjones" created="Fri, 15 Nov 2013 19:50:36 +0000"  >&lt;p&gt;Bobijam&lt;/p&gt;

&lt;p&gt;Could you please advise on this one?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="71683" author="adilger" created="Fri, 15 Nov 2013 21:11:09 +0000"  >&lt;p&gt;How large are the files being tested?  One stripe or multiple stripes per file?  It surprises me that Lustre is slower than NFS, but I guess that isn&apos;t directly relevant for the problem itself.&lt;/p&gt;

&lt;p&gt;It would be useful to collect +rpctrace +dlmtrace +vfstrace +reada debug logs on the client for the parallel and serial runs.  That would allow us to see if there are duplicate RPCs being sent.  This would be most useful with a 2.5.0 client.  It may be that Jinshan&apos;s client-side single-threaded performance patches would help improve the overall performance, but it isn&apos;t totally clear if they would affect the parallel vs. serial behaviour.&lt;/p&gt;</comment>
                            <comment id="71686" author="kitwestneat" created="Fri, 15 Nov 2013 21:23:08 +0000"  >&lt;p&gt;They are 2GB, single-striped files. The client nodes have 4GB of RAM and are mounting with the localflock option. &lt;/p&gt;

&lt;p&gt;I&apos;ll work on getting the debug logs.&lt;/p&gt;</comment>
                            <comment id="71702" author="adilger" created="Fri, 15 Nov 2013 23:59:53 +0000"  >&lt;p&gt;It seems really terrible performance to be reading a 2GB file in 151s (=13.5MB/s) in the serial case, or even 216s (=9.5MB/s) in the parallel case.  It seems like something else is broken in this filesystem that is causing everything to run slowly?  Even my 8-year-old home system runs this fast.&lt;/p&gt;</comment>
                            <comment id="71763" author="james beal" created="Mon, 18 Nov 2013 12:08:06 +0000"  >&lt;p&gt;Note that when I filtered on +rpctrace +dlmtrace +vfstrace +reada I got no events, so I have given the full logs...&lt;/p&gt;

&lt;p&gt;Sorry.&lt;/p&gt;</comment>
                            <comment id="71764" author="james beal" created="Mon, 18 Nov 2013 12:08:26 +0000"  >&lt;p&gt;This is using my desktop which is a lustre 2.5.1 client. &lt;/p&gt;

&lt;p&gt;root@deskpro21498:/tmp/results# grep copied /tmp/results/*ser*&lt;br/&gt;
/tmp/results/01_out_ser:2147483648 bytes (2.1 GB) copied, 19.4224 s, 111 MB/s&lt;br/&gt;
/tmp/results/02_out_ser:2147483648 bytes (2.1 GB) copied, 1.1007 s, 2.0 GB/s&lt;br/&gt;
/tmp/results/03_out_ser:2147483648 bytes (2.1 GB) copied, 1.03958 s, 2.1 GB/s&lt;br/&gt;
/tmp/results/04_out_ser:2147483648 bytes (2.1 GB) copied, 1.03101 s, 2.1 GB/s&lt;br/&gt;
/tmp/results/05_out_ser:2147483648 bytes (2.1 GB) copied, 1.03045 s, 2.1 GB/s&lt;br/&gt;
/tmp/results/06_out_ser:2147483648 bytes (2.1 GB) copied, 1.03191 s, 2.1 GB/s&lt;br/&gt;
/tmp/results/07_out_ser:2147483648 bytes (2.1 GB) copied, 1.0305 s, 2.1 GB/s&lt;br/&gt;
/tmp/results/08_out_ser:2147483648 bytes (2.1 GB) copied, 1.03164 s, 2.1 GB/s&lt;br/&gt;
/tmp/results/09_out_ser:2147483648 bytes (2.1 GB) copied, 1.03211 s, 2.1 GB/s&lt;br/&gt;
/tmp/results/10_out_ser:2147483648 bytes (2.1 GB) copied, 1.0314 s, 2.1 GB/s&lt;br/&gt;
/tmp/results/11_out_ser:2147483648 bytes (2.1 GB) copied, 1.03315 s, 2.1 GB/s&lt;br/&gt;
/tmp/results/12_out_ser:2147483648 bytes (2.1 GB) copied, 1.03305 s, 2.1 GB/s&lt;br/&gt;
/tmp/results/13_out_ser:2147483648 bytes (2.1 GB) copied, 1.03047 s, 2.1 GB/s&lt;br/&gt;
/tmp/results/14_out_ser:2147483648 bytes (2.1 GB) copied, 1.03324 s, 2.1 GB/s&lt;br/&gt;
/tmp/results/15_out_ser:2147483648 bytes (2.1 GB) copied, 1.03284 s, 2.1 GB/s&lt;br/&gt;
/tmp/results/16_out_ser:2147483648 bytes (2.1 GB) copied, 1.03156 s, 2.1 GB/s&lt;br/&gt;
/tmp/results/17_out_ser:2147483648 bytes (2.1 GB) copied, 1.03013 s, 2.1 GB/s&lt;br/&gt;
/tmp/results/18_out_ser:2147483648 bytes (2.1 GB) copied, 1.03125 s, 2.1 GB/s&lt;br/&gt;
/tmp/results/19_out_ser:2147483648 bytes (2.1 GB) copied, 1.03179 s, 2.1 GB/s&lt;br/&gt;
/tmp/results/20_out_ser:2147483648 bytes (2.1 GB) copied, 1.02893 s, 2.1 GB/s&lt;br/&gt;
root@deskpro21498:/tmp/results# grep copied /tmp/results/*par*&lt;br/&gt;
/tmp/results/01_out_par:2147483648 bytes (2.1 GB) copied, 28.5889 s, 75.1 MB/s&lt;br/&gt;
/tmp/results/02_out_par:2147483648 bytes (2.1 GB) copied, 28.535 s, 75.3 MB/s&lt;br/&gt;
/tmp/results/03_out_par:2147483648 bytes (2.1 GB) copied, 28.5606 s, 75.2 MB/s&lt;br/&gt;
/tmp/results/04_out_par:2147483648 bytes (2.1 GB) copied, 28.5644 s, 75.2 MB/s&lt;br/&gt;
/tmp/results/05_out_par:2147483648 bytes (2.1 GB) copied, 28.5557 s, 75.2 MB/s&lt;br/&gt;
/tmp/results/06_out_par:2147483648 bytes (2.1 GB) copied, 28.5399 s, 75.2 MB/s&lt;br/&gt;
/tmp/results/07_out_par:2147483648 bytes (2.1 GB) copied, 28.5578 s, 75.2 MB/s&lt;br/&gt;
/tmp/results/08_out_par:2147483648 bytes (2.1 GB) copied, 28.5544 s, 75.2 MB/s&lt;br/&gt;
/tmp/results/09_out_par:2147483648 bytes (2.1 GB) copied, 28.5705 s, 75.2 MB/s&lt;br/&gt;
/tmp/results/10_out_par:2147483648 bytes (2.1 GB) copied, 28.5436 s, 75.2 MB/s&lt;br/&gt;
/tmp/results/11_out_par:2147483648 bytes (2.1 GB) copied, 28.5596 s, 75.2 MB/s&lt;br/&gt;
/tmp/results/12_out_par:2147483648 bytes (2.1 GB) copied, 28.5437 s, 75.2 MB/s&lt;br/&gt;
/tmp/results/13_out_par:2147483648 bytes (2.1 GB) copied, 28.5611 s, 75.2 MB/s&lt;br/&gt;
/tmp/results/14_out_par:2147483648 bytes (2.1 GB) copied, 28.5543 s, 75.2 MB/s&lt;br/&gt;
/tmp/results/15_out_par:2147483648 bytes (2.1 GB) copied, 28.5604 s, 75.2 MB/s&lt;br/&gt;
/tmp/results/16_out_par:2147483648 bytes (2.1 GB) copied, 28.5546 s, 75.2 MB/s&lt;br/&gt;
/tmp/results/17_out_par:2147483648 bytes (2.1 GB) copied, 28.5358 s, 75.3 MB/s&lt;br/&gt;
/tmp/results/18_out_par:2147483648 bytes (2.1 GB) copied, 28.5552 s, 75.2 MB/s&lt;br/&gt;
/tmp/results/19_out_par:2147483648 bytes (2.1 GB) copied, 28.5478 s, 75.2 MB/s&lt;br/&gt;
/tmp/results/20_out_par:2147483648 bytes (2.1 GB) copied, 28.5414 s, 75.2 MB/s&lt;/p&gt;</comment>
                            <comment id="71945" author="james beal" created="Wed, 20 Nov 2013 10:11:17 +0000"  >
&lt;p&gt;Can I add any information?&lt;/p&gt;

&lt;p&gt;Can the problem be reproduced elsewhere?&lt;/p&gt;
                            <comment id="72066" author="kitwestneat" created="Thu, 21 Nov 2013 20:34:11 +0000"  >&lt;p&gt;Hi James,&lt;/p&gt;

&lt;p&gt;Can you confirm how large the original file was? Also can you double check your debug settings? When I run it on a Lustre 2.1.5 VM I get a ton of debug data. &lt;/p&gt;

&lt;p&gt;sysctl lnet.debug=&quot;+rpctrace +dlmtrace +vfstrace +reada&quot;&lt;br/&gt;
lnet.debug = ioctl neterror warning dlmtrace error emerg ha rpctrace vfstrace reada config console&lt;/p&gt;

&lt;p&gt;Here&apos;s the script I used:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;#!/bin/bash

lctl clear
lctl debug_daemon start /scratch/debug_file 2048
echo 3 &amp;gt; /proc/sys/vm/drop_caches
lctl mark parallel read start
&lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; x in 1 2 3 4 5; &lt;span class=&quot;code-keyword&quot;&gt;do&lt;/span&gt; dd &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt;=/lustre/pfs/client/testfile bs=4M of=/dev/&lt;span class=&quot;code-keyword&quot;&gt;null&lt;/span&gt; &amp;amp; done
wait
lctl mark parallel read done
echo 3 &amp;gt; /proc/sys/vm/drop_caches
lctl mark serial read start
&lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; x in 1 2 3 4 5; &lt;span class=&quot;code-keyword&quot;&gt;do&lt;/span&gt; dd &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt;=/lustre/pfs/client/testfile bs=4M of=/dev/&lt;span class=&quot;code-keyword&quot;&gt;null&lt;/span&gt; ; done
lctl mark serial read done
lctl debug_daemon stop
lctl df /scratch/debug_file &amp;gt; /scratch/debug_file.out
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I was able to reproduce the slow down you mentioned. I&apos;ll attach the debug logs that I got. &lt;/p&gt;

&lt;p&gt;parallel:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;62+1 records in
62+1 records out
262144000 bytes (262 MB) copied, 4.40159 s, 59.6 MB/s
62+1 records in
62+1 records out
262144000 bytes (262 MB) copied, 4.40117 s, 59.6 MB/s
62+1 records in
62+1 records out
262144000 bytes (262 MB) copied, 4.39914 s, 59.6 MB/s
62+1 records in
62+1 records out
262144000 bytes (262 MB) copied, 4.39789 s, 59.6 MB/s
62+1 records in
62+1 records out
262144000 bytes (262 MB) copied, 4.39852 s, 59.6 MB/s
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;serial:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;62+1 records in
62+1 records out
262144000 bytes (262 MB) copied, 4.57006 s, 57.4 MB/s
62+1 records in
62+1 records out
262144000 bytes (262 MB) copied, 0.0536567 s, 4.9 GB/s
62+1 records in
62+1 records out
262144000 bytes (262 MB) copied, 0.046017 s, 5.7 GB/s
62+1 records in
62+1 records out
262144000 bytes (262 MB) copied, 0.0454 s, 5.8 GB/s
62+1 records in
62+1 records out
262144000 bytes (262 MB) copied, 0.0453329 s, 5.8 GB/s
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="72089" author="adilger" created="Fri, 22 Nov 2013 02:04:04 +0000"  >&lt;p&gt;In the most recent results with the 2.5.0 client (we don&apos;t have a 2.5.1 yet?) the parallel reads take only 28.5s, while the serial reads take 39s in total, so the parallel reads are faster in aggregate than serial reads, though the single serial read is faster (19.4s) than any of the parallel reads.  These results also are more in line with NFS (19.9s for the first serial read, 1.3s for each later read).&lt;/p&gt;

&lt;p&gt;I updated the script a bit and ran it on my 2.4.0 x86_64 2-core 4-thread client, 2.4.1 server:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;#!/bin/sh
FILE=/myth/tmp/100M  # previously created
OST=myth-OST0000

lctl set_param ldlm.namespaces.$OST*.lru_size=clear
lctl set_param llite.*.stats=0 osc.$OST*.rpc_stats=0

&lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; i in $(seq -w 1 20); &lt;span class=&quot;code-keyword&quot;&gt;do&lt;/span&gt;
	(time dd &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt;=$FILE bs=4M of=/dev/&lt;span class=&quot;code-keyword&quot;&gt;null&lt;/span&gt; )  &amp;gt; /tmp/par_${i}.out 2&amp;gt;&amp;amp;1 &amp;amp;
done

wait
lctl get_param llite.*.stats osc.$OST*.rpc_stats | tee /tmp/par_stats.out

lctl set_param ldlm.namespaces.$OST*.lru_size=clear
lctl set_param llite.*.stats=0 osc.$OST*.rpc_stats=0

&lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; i in $(seq -w 1 20); &lt;span class=&quot;code-keyword&quot;&gt;do&lt;/span&gt;
	(time dd &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt;=$FILE bs=4M of=/dev/&lt;span class=&quot;code-keyword&quot;&gt;null&lt;/span&gt; )  &amp;gt; /tmp/ser_${i}.out 2&amp;gt;&amp;amp;1
done
lctl get_param llite.*.stats osc.$OST*.rpc_stats | tee /tmp/par_stats.out
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This produced the following from rpc_stats, which showed that there are not multiple RPCs being sent:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;SERIAL  		read			write
pages per rpc         rpcs   % cum % |       rpcs   % cum %
64:		         2   1   1   |          0   0   0
256:		       100  98 100   |          0   0   0

PARALLEL		read			write
pages per rpc         rpcs   % cum % |       rpcs   % cum %
256:		       100 100 100   |          0   0   0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The time profile is as I would expect, shorter single time for serial but longer total time (no parallelism), and a shorter total time for parallel (runtime happens in parallel):&lt;/p&gt;

&lt;p&gt;SERIAL&lt;br/&gt;
real 0m9.657s sys 0m1.556s&lt;br/&gt;
real 0m0.284s sys 0m0.281s&lt;br/&gt;
real 0m0.244s sys 0m0.241s&lt;br/&gt;
real 0m0.377s sys 0m0.368s&lt;br/&gt;
real 0m0.202s sys 0m0.201s&lt;br/&gt;
real 0m0.207s sys 0m0.195s&lt;br/&gt;
real 0m0.200s sys 0m0.196s&lt;br/&gt;
real 0m0.210s sys 0m0.209s&lt;br/&gt;
real 0m0.198s sys 0m0.197s&lt;br/&gt;
real 0m0.197s sys 0m0.194s&lt;br/&gt;
real 0m0.198s sys 0m0.196s&lt;br/&gt;
real 0m0.202s sys 0m0.198s&lt;br/&gt;
real 0m0.295s sys 0m0.291s&lt;br/&gt;
real 0m0.164s sys 0m0.162s&lt;br/&gt;
real 0m0.163s sys 0m0.162s&lt;br/&gt;
real 0m0.162s sys 0m0.160s&lt;br/&gt;
real 0m0.162s sys 0m0.160s&lt;br/&gt;
real 0m0.162s sys 0m0.159s&lt;br/&gt;
real 0m0.164s sys 0m0.160s&lt;br/&gt;
real 0m0.162s sys 0m0.160s&lt;/p&gt;

&lt;p&gt;PARALLEL&lt;br/&gt;
real 0m12.817s sys 0m0.918s&lt;br/&gt;
real 0m12.816s sys 0m1.018s&lt;br/&gt;
real 0m12.804s sys 0m0.837s&lt;br/&gt;
real 0m12.791s sys 0m0.865s&lt;br/&gt;
real 0m12.783s sys 0m0.922s&lt;br/&gt;
real 0m12.768s sys 0m0.866s&lt;br/&gt;
real 0m12.773s sys 0m0.888s&lt;br/&gt;
real 0m12.773s sys 0m0.968s&lt;br/&gt;
real 0m12.761s sys 0m0.842s&lt;br/&gt;
real 0m12.741s sys 0m0.897s&lt;br/&gt;
real 0m12.734s sys 0m0.856s&lt;br/&gt;
real 0m12.718s sys 0m0.869s&lt;br/&gt;
real 0m12.700s sys 0m0.896s&lt;br/&gt;
real 0m12.711s sys 0m0.933s&lt;br/&gt;
real 0m12.685s sys 0m0.931s&lt;br/&gt;
real 0m12.656s sys 0m0.854s&lt;br/&gt;
real 0m12.669s sys 0m0.970s&lt;br/&gt;
real 0m12.663s sys 0m0.848s&lt;br/&gt;
real 0m12.655s sys 0m0.947s&lt;br/&gt;
real 0m12.649s sys 0m0.880s&lt;/p&gt;</comment>
                            <comment id="72358" author="james beal" created="Tue, 26 Nov 2013 23:27:01 +0000"  >&lt;p&gt;Just as a note, I think I can see the problem; however, I am busy upgrading a lustre file system... and I want to ensure that my communication is clear. Apologies for the delay.&lt;/p&gt;</comment>
                            <comment id="72710" author="jkb" created="Tue, 3 Dec 2013 17:07:21 +0000"  >&lt;p&gt;I have a basic C program that does a while(read()) style loop until EOF on an ~800MB file with varying buffer sizes. I tested this on lustre, tmpfs and nfs using a mix of block sizes and 1, 4 and 32 concurrent copies. For example, on lustre with 4 copies:&lt;/p&gt;

&lt;p&gt;jkb@sf-7-1-02&lt;span class=&quot;error&quot;&gt;&amp;#91;work/benchmarks&amp;#93;&lt;/span&gt; fn=/lustre/scratch110/srpipe/references/Human/default/all/bowtie/human_g1k_v37.fasta.1.ebwt; for i in `seq 8 30`;do bs=$(perl -e &quot;print 2 ** $i&quot;);echo -n &quot;$bs&quot;;(for x in `seq 1 4`;do ./linear2 $fn &amp;amp; done) | grep Took|tail -1;done&lt;br/&gt;
256 Took 243.905528 seconds&lt;br/&gt;
512 Took 121.146812 seconds&lt;br/&gt;
1024 Took 62.789902 seconds&lt;br/&gt;
2048 Took 30.558433 seconds&lt;br/&gt;
4096 Took 14.908553 seconds&lt;br/&gt;
8192 Took 7.506444 seconds&lt;br/&gt;
16384 Took 3.727981 seconds&lt;br/&gt;
32768 Took 1.864213 seconds&lt;br/&gt;
65536 Took 0.882102 seconds&lt;br/&gt;
131072 Took 0.468359 seconds&lt;br/&gt;
262144 Took 0.405715 seconds&lt;br/&gt;
524288 Took 0.411299 seconds&lt;br/&gt;
1048576 Took 0.420036 seconds&lt;br/&gt;
2097152 Took 0.470832 seconds&lt;br/&gt;
4194304 Took 0.680612 seconds&lt;br/&gt;
8388608 Took 0.776593 seconds&lt;br/&gt;
16777216 Took 0.752007 seconds&lt;br/&gt;
33554432 Took 0.765050 seconds&lt;br/&gt;
67108864 Took 0.741756 seconds&lt;br/&gt;
134217728 Took 0.712213 seconds&lt;br/&gt;
268435456 Took 0.677054 seconds&lt;br/&gt;
536870912 Took 0.753955 seconds&lt;br/&gt;
1073741824 Took 0.931833 seconds&lt;/p&gt;

&lt;p&gt;I then graphed these. Thin lines are 1 process running. The thicker ones are 4, and the thickest are 32 (the labels lustre, lustre4 and lustre32 correspond to the crosses, circles and squares on the data points).&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;image-wrap&quot; style=&quot;&quot;&gt;&lt;img src=&quot;https://jira.whamcloud.com/secure/attachment/13883/13883_io.png&quot; style=&quot;border: 0px solid black&quot; /&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;Naturally any read incurs a per-system-call overhead, so tiny buffers for read calls are inefficient. This is evident for all file systems.&lt;/p&gt;

&lt;p&gt;Also there is clearly a per-byte overhead. Even reading from a file in cached memory, it takes time to copy from the kernel-space buffer cache to the user-space buffer. This is a per-byte delay and scales linearly with the data size.&lt;/p&gt;

&lt;p&gt;These two costs combine to give the net speed of the while(read()) loops. Indeed we see this for NFS. Below 1k block size, speed decreases as the block size decreases, indicating that the overhead of making more system calls is the major bottleneck. At 10k and above the time is more or less constant, so the per-system-call overhead has become insignificant. Between 1k and 10k we see the gradual bend in the graph where real time is more evenly spread between memory copying and per-system-call costs. Lustre single-threaded code demonstrates the same bend between 10k and 100k, implying the per-system-call cost is higher.&lt;/p&gt;

&lt;p&gt;Now if we have 4 processes running, we could expect the per-system-call overhead to be the same on each process, and much the same as with 1 or 32 processes running. This is visible in the NFS benchmark: the 32-process version still shows a linear decrease in performance for blocks below 1k (as for the single-process NFS stats), and the bend is &quot;more or less&quot; in the same place.&lt;/p&gt;

&lt;p&gt;However, Lustre shows a marked difference. As we add more simultaneous processes, the crossover between per-system-call and per-byte costs moves from ~10KB, through ~100KB for 4 processes, and up to ~1MB for 32 processes. Why is this? Basically it indicates that the CPU overhead of each system call goes up as other processes make more system calls. The net effect is that even on moderately sized buffers (e.g. 10KB) Lustre becomes a couple of orders of magnitude slower than NFS.&lt;/p&gt;

&lt;p&gt;To my eyes that implies there is some contention between independent system calls, possibly a thread lock. I appreciate that coherency is HARD, so I wonder if there is some way we can provide hints to the lustre client to make things easier. E.g. is it possible to set the immutable flag on a file to avoid this lock? Does opening in read-only mode help? Does having the file itself be read-only help?&lt;/p&gt;

&lt;p&gt;James&lt;/p&gt;</comment>
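The while(read()) loop described in the comment above can be sketched like this (a hypothetical reconstruction; the original linear2 program is not attached to the ticket, so the names here are illustrative):

```c
/* Sketch of a while(read()) benchmark loop: read `path` to EOF with a
 * caller-chosen buffer size and return the total byte count (-1 on error).
 * Timing this call for buffer sizes from 256 bytes up to 1 GiB reproduces
 * the per-syscall vs per-byte trade-off discussed above. */
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

long long read_all(const char *path, size_t bufsize)
{
    char *buf = malloc(bufsize);
    long long total = 0;
    ssize_t n;
    int fd;

    if (buf == NULL)
        return -1;
    fd = open(path, O_RDONLY);
    if (fd < 0) {
        free(buf);
        return -1;
    }
    while ((n = read(fd, buf, bufsize)) > 0)  /* one syscall per bufsize bytes */
        total += n;
    close(fd);
    free(buf);
    return n < 0 ? -1 : total;
}
```

With a tiny bufsize the run time is dominated by the syscall count; with a large one, by the kernel-to-user copy, which matches the bend visible in the graph.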
                            <comment id="72713" author="kitwestneat" created="Tue, 3 Dec 2013 17:29:12 +0000"  >&lt;p&gt;Hi James,&lt;/p&gt;

&lt;p&gt;Can you confirm what version of Lustre the clients and servers are running? Are you able to get the debug logs that Andreas requested while running these tests? I had to use the debug_daemon function in order to capture all the logs. &lt;/p&gt;

&lt;p&gt;Thanks,&lt;br/&gt;
Kit&lt;/p&gt;</comment>
                            <comment id="72721" author="james beal" created="Tue, 3 Dec 2013 18:09:50 +0000"  >&lt;p&gt;I have done my test on lustre 1.8.9 clients and 2.5.0 clients; however, James above would have only used lustre 1.8.9. I am rather deep in a set of lustre server upgrades right now, so I will not be able to look at this until Tuesday, unless it is critical before then?&lt;/p&gt;</comment>
                            <comment id="72723" author="adilger" created="Tue, 3 Dec 2013 18:27:05 +0000"  >&lt;p&gt;James, you are correct that there is per-syscall overhead in Lustre, and contention will indeed increase with multiple threads. Lustre has fully coherent locking between nodes, while NFS has no locking at all. That means if one node is writing to a file and another node is reading at the same time, or if two nodes are writing on non-block-aligned boundaries, NFS does not guarantee anything about the state of your data at the end. NFS only ensures that local data is flushed when the file is closed, and that another client can see this data when the file is opened afterward.&lt;/p&gt;

&lt;p&gt;That said, I agree the client-side locking overhead is higher than it should be, and we are working to improve this for the 2.6 release. If you are able to test prototype code on a client node, it would be possible to see if the current performance patch series is showing improvement on your system.  Most of the patches are already in the lustre-master branch in Git.&lt;/p&gt;

&lt;p&gt;The last patch in this series that I&apos;m aware of is:&lt;br/&gt;
&lt;a href=&quot;http://review.whamcloud.com/7895&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/7895&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are pre-built RPMs available at (click on blue circle for your arch distro combo):&lt;br/&gt;
&lt;a href=&quot;http://build.whamcloud.com/job/lustre-reviews/20005/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://build.whamcloud.com/job/lustre-reviews/20005/&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="73351" author="james beal" created="Thu, 12 Dec 2013 12:14:13 +0000"  >&lt;p&gt;This is a set of test results from a production node that has been taken out of one of our clusters. Note that the lustre client is mounted with localflock. We expect a number of processes all reading the same file to be quicker than a number of processes each reading a different file, since the file is in cache and we would not expect additional IO; however, there appears to be significant contention which we do not understand. I note that the block size we read the file with makes a very significant difference. The problem was discovered with a program that read the file one character at a time. I will repeat the runs with a lustre 2.1 client pulled down from lustre-reviews.&lt;/p&gt;

&lt;p&gt;A little table which summarises the results.&lt;/p&gt;

&lt;div class=&apos;table-wrap&apos;&gt;
&lt;table class=&apos;confluenceTable&apos;&gt;&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;&amp;nbsp;&lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; 4M reads  &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;    4K reads &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; slow down &lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;lustre reading multiple files concurrently &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; 4843 secs &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;   7202 secs &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;  1.5      &lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;lustre reading a single file  concurrently &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; 8470 secs &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; 131779 secs &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; 15.6      &lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;tmpfs  reading multiple files concurrently &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; 8019 secs &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;   6142 secs &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;  0.8      &lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;tmpfs  reading a single file  concurrently &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; 8161 secs &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;   6131 secs &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;  0.8      &lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;


&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;
Lustre 1.8.9 client, lustre 2.4.1 server (32 files, each 512MB in size, reading with dd 4MB block size), mounted using localflock

Creating source files................................
Dropping cache
All threads reading the same file
Dropping cache
Reading the same file in a single thread 
Time taken using read pattern
Core_A Core_B Core_C Core_D Core_E Core_F Core_G Core_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
Pattern takes 4843.25
Time taken using read pattern
Core_A Core_B Core_C Core_D Core_E Core_F Core_G Core_H
File_A File_A File_A File_A File_A File_A File_A File_A
File_B File_B File_B File_B File_B File_B File_B File_B
File_C File_C File_C File_C File_C File_C File_C File_C
File_D File_D File_D File_D File_D File_D File_D File_D
File_E File_E File_E File_E File_E File_E File_E File_E
File_F File_F File_F File_F File_F File_F File_F File_F
File_G File_G File_G File_G File_G File_G File_G File_G
File_H File_H File_H File_H File_H File_H File_H File_H
Pattern takes 8469.83

Lustre 1.8.9 client, lustre 2.4.1 server  ( 32 files each 512MB in size , reading with dd 4KB block size), mounted using localflock

root@sf-7-1-02:/root# ./test.sh
Creating source files................................
Dropping cache
All threads reading the same file
Dropping cache
Reading the same file in a single thread
Time taken using read pattern
Core_A Core_B Core_C Core_D Core_E Core_F Core_G Core_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
Pattern takes 7202.32
Time taken using read pattern
Core_A Core_B Core_C Core_D Core_E Core_F Core_G Core_H
File_A File_A File_A File_A File_A File_A File_A File_A
File_B File_B File_B File_B File_B File_B File_B File_B
File_C File_C File_C File_C File_C File_C File_C File_C
File_D File_D File_D File_D File_D File_D File_D File_D
File_E File_E File_E File_E File_E File_E File_E File_E
File_F File_F File_F File_F File_F File_F File_F File_F
File_G File_G File_G File_G File_G File_G File_G File_G
File_H File_H File_H File_H File_H File_H File_H File_H
Pattern takes 131779

tmpfs ( 32 files each 512MB in size , reading with dd 4MB block size)

Creating source files................................
Dropping cache
All threads reading the same file
Dropping cache
Reading the same file in a single thread 
Time taken using read pattern
Core_A Core_B Core_C Core_D Core_E Core_F Core_G Core_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
Pattern takes 8018.69
Time taken using read pattern
Core_A Core_B Core_C Core_D Core_E Core_F Core_G Core_H
File_A File_A File_A File_A File_A File_A File_A File_A
File_B File_B File_B File_B File_B File_B File_B File_B
File_C File_C File_C File_C File_C File_C File_C File_C
File_D File_D File_D File_D File_D File_D File_D File_D
File_E File_E File_E File_E File_E File_E File_E File_E
File_F File_F File_F File_F File_F File_F File_F File_F
File_G File_G File_G File_G File_G File_G File_G File_G
File_H File_H File_H File_H File_H File_H File_H File_H
Pattern takes 8160.59

tmpfs ( 32 files each 512MB in size , reading with dd 4KB block size)

Creating source files................................
Dropping cache
All threads reading the same file
Dropping cache
Reading the same file in a single thread 
Time taken using read pattern
Core_A Core_B Core_C Core_D Core_E Core_F Core_G Core_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H
Pattern takes 6141.82
Time taken using read pattern
Core_A Core_B Core_C Core_D Core_E Core_F Core_G Core_H
File_A File_A File_A File_A File_A File_A File_A File_A
File_B File_B File_B File_B File_B File_B File_B File_B
File_C File_C File_C File_C File_C File_C File_C File_C
File_D File_D File_D File_D File_D File_D File_D File_D
File_E File_E File_E File_E File_E File_E File_E File_E
File_F File_F File_F File_F File_F File_F File_F File_F
File_G File_G File_G File_G File_G File_G File_G File_G
File_H File_H File_H File_H File_H File_H File_H File_H
Pattern takes 6130.82

The scripts are as follows:

cat readfile.sh 
#!/bin/sh
TYPE=$1
LOC=$2
PAR=$3
MULT=$4

if [ ${TYPE} = &quot;SEQ&quot; ] ; then 
   for i in `seq -w 1 ${PAR}`
   do
     echo ${LOC}/${i}
     for j in `seq 1 ${MULT}`
     do
       dd if=${LOC}/${i} of=/dev/null bs=4k
     done
   done
fi
if [ ${TYPE} = &quot;PAR&quot; ] ; then
  echo $LOC
  for i in `seq -w 1 ${PAR}`
  do
    echo ${LOC}
    for j in `seq 1 ${MULT}`
    do    
      dd if=${LOC} of=/dev/null bs=4k
    done
  done
fi

And the test script

cat test.sh 
#!/bin/sh

LOC=/mnt/tmp/4k
#Where to store the results
RES=`echo $LOC| sed -e &apos;s#/#-#g&apos; -e &apos;s/^-//&apos; `
# How many processes to run at one time
PAR=32
BASE=`dirname $0`
# How many times to read the file ( to multiply the caching effect )
MULT=4
mkdir -p  ${RES}

echo -n Creating source files
for i in `seq -w  1 ${PAR}`
do
  dd if=/dev/zero of=${LOC}/${i} bs=4M count=128 &amp;gt; ${RES}/${i}_create 2&amp;gt;&amp;amp;1 
  echo -n .
done

echo
echo Dropping cache
echo 3 &amp;gt; /proc/sys/vm/drop_caches
echo &quot;All threads reading the same file&quot;
for i in `seq -w 1 ${PAR}`
do
   (time $BASE/readfile.sh SEQ ${LOC} ${PAR} ${MULT})  &amp;gt; ${RES}/${i}_parallel_same_file  2&amp;gt;&amp;amp;1 &amp;amp;
echo -n
done
wait

echo Dropping cache
echo 3 &amp;gt; /proc/sys/vm/drop_caches
echo &quot;Reading the same file in a single thread &quot;
for i in `seq -w 1 ${PAR} `
do
   (time ${BASE}/readfile.sh PAR ${LOC}/${i} ${PAR} ${MULT}  )  &amp;gt; ${RES}/${i}_parallel_different_file  2&amp;gt;&amp;amp;1 &amp;amp;
done
wait

echo &quot;Time taken using read pattern&quot;
echo &quot;Core_A Core_B Core_C Core_D Core_E Core_F Core_G Core_H&quot;
echo &quot;FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H&quot;
echo &quot;FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H&quot;
echo &quot;FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H&quot;
echo &quot;FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H&quot;
echo &quot;FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H&quot;
echo &quot;FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H&quot;
echo &quot;FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H&quot;
echo &quot;FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H&quot;
echo -n &quot;Pattern takes &quot;
grep -h real ${RES}/*_parallel_different_file | sed -e &apos;s/real *[^ ]//&apos; -e &apos;s/s$//&apos; -e &apos;s/m/*60+/&apos;  | bc | awk &apos;{ sum+=$1} END {print sum}&apos;

echo &quot;Time taken using read pattern&quot;
echo &quot;Core_A Core_B Core_C Core_D Core_E Core_F Core_G Core_H&quot;
echo &quot;File_A File_A File_A File_A File_A File_A File_A File_A&quot;
echo &quot;File_B File_B File_B File_B File_B File_B File_B File_B&quot;
echo &quot;File_C File_C File_C File_C File_C File_C File_C File_C&quot;
echo &quot;File_D File_D File_D File_D File_D File_D File_D File_D&quot;
echo &quot;File_E File_E File_E File_E File_E File_E File_E File_E&quot;
echo &quot;File_F File_F File_F File_F File_F File_F File_F File_F&quot;
echo &quot;File_G File_G File_G File_G File_G File_G File_G File_G&quot;
echo &quot;File_H File_H File_H File_H File_H File_H File_H File_H&quot;
echo -n &quot;Pattern takes &quot;
grep -h real ${RES}/*_parallel_same_file | sed -e &apos;s/real *[^ ]//&apos; -e &apos;s/s$//&apos; -e &apos;s/m/*60+/&apos;  | bc | awk &apos;{ sum+=$1} END {print sum}&apos;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
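The timing totals in test.sh come from the grep/sed/bc/awk pipeline at the end of the script. A single-awk equivalent (a sketch only; the helper name is hypothetical) computes the same sum of "real" times without the fragile sed-to-bc step:

```shell
# Hypothetical helper: total the "real" lines from bash's `time` output
# (e.g. "real    2m13.456s") across result files, printing seconds.
sum_real_times() {
  grep -h real "$@" | awk '
    { split($2, t, /[ms]/); sum += t[1] * 60 + t[2] }
    END { printf "%.2f\n", sum }'
}
```

Called as `sum_real_times ${RES}/*_parallel_same_file`, it prints the same aggregate that the "Pattern takes" lines report.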
                            <comment id="73388" author="kitwestneat" created="Thu, 12 Dec 2013 17:00:42 +0000"  >&lt;p&gt;Hi James,&lt;/p&gt;

&lt;p&gt;Can you also try with a master branch client? If you have time, a 2.5.x client would also be interesting I think.&lt;/p&gt;

&lt;p&gt;I&apos;ll try to test with different client versions as well. &lt;/p&gt;

&lt;p&gt;Thanks,&lt;br/&gt;
Kit&lt;/p&gt;</comment>
                            <comment id="73390" author="james beal" created="Thu, 12 Dec 2013 17:17:55 +0000"  >&lt;p&gt;This is from an old system that has only 16 GB of RAM and only 8 cores. The size of each file was increased to 1 GB, but the parallelism was reduced to 8.&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[   67.821266] Lustre: Lustre: Build Version: jenkins-gb641cc0-PRISTINE-2.6.32-41-server
[   68.182300] Lustre: Added LNI 172.17.96.27@tcp [8/256/0/180]
[   68.182364] Lustre: Accept secure, port 988
[   68.414016] Lustre: Lustre OSC module (ffffffffa08556e0).
[   68.538800] Lustre: Lustre LOV module (ffffffffa08ea0e0).
[   68.701672] Lustre: Lustre client module (ffffffffa09d6ca0).
[  174.074540] Lustre: MGC172.17.128.135@tcp: Reactivating import
[  174.074569] Lustre: Server MGS version (2.4.1.0) is much newer than client version. Consider upgrading client (2.1.6)
[  174.086076] LustreError: 1200:0:(mgc_request.c:247:do_config_log_add()) failed processing sptlrpc log: -2
[  174.180021] Lustre: Server lus03-MDT0000_UUID version (2.4.1.0) is much newer than client version. Consider upgrading client (2.1.6)
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;


&lt;div class=&apos;table-wrap&apos;&gt;
&lt;table class=&apos;confluenceTable&apos;&gt;&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;&amp;nbsp;&lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; 4M reads &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;  4K reads &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; slow down &lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;lustre reading multiple files concurrently &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; 868 secs &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; 1702 secs &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;  2.0      &lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;lustre reading a single file  concurrently &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; 988 secs &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; 4744 secs &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;  4.8      &lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;tmpfs  reading multiple files concurrently &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; 424 secs &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;  207 secs &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;  0.5      &lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;tmpfs  reading a single file  concurrently &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; 537 secs &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;  191 secs &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;  0.4      &lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;



&lt;p&gt;4M read, lustre&lt;/p&gt;

&lt;p&gt;root@isg-dev4:/root# bash test.sh &lt;br/&gt;
Creating source files........&lt;br/&gt;
Dropping cache&lt;br/&gt;
All threads reading the same file&lt;br/&gt;
Dropping cache&lt;br/&gt;
Reading the same file in a single thread &lt;br/&gt;
Time taken using read pattern&lt;br/&gt;
Core_A Core_B Core_C Core_D Core_E Core_F Core_G Core_H&lt;br/&gt;
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H&lt;br/&gt;
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H&lt;br/&gt;
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H&lt;br/&gt;
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H&lt;br/&gt;
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H&lt;br/&gt;
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H&lt;br/&gt;
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H&lt;br/&gt;
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H&lt;br/&gt;
Pattern takes 867.87&lt;br/&gt;
Time taken using read pattern&lt;br/&gt;
Core_A Core_B Core_C Core_D Core_E Core_F Core_G Core_H&lt;br/&gt;
File_A File_A File_A File_A File_A File_A File_A File_A&lt;br/&gt;
File_B File_B File_B File_B File_B File_B File_B File_B&lt;br/&gt;
File_C File_C File_C File_C File_C File_C File_C File_C&lt;br/&gt;
File_D File_D File_D File_D File_D File_D File_D File_D&lt;br/&gt;
File_E File_E File_E File_E File_E File_E File_E File_E&lt;br/&gt;
File_F File_F File_F File_F File_F File_F File_F File_F&lt;br/&gt;
File_G File_G File_G File_G File_G File_G File_G File_G&lt;br/&gt;
File_H File_H File_H File_H File_H File_H File_H File_H&lt;br/&gt;
Pattern takes 988.253&lt;br/&gt;
root@isg-dev4:/root# &lt;/p&gt;

&lt;p&gt;4K read,lustre&lt;/p&gt;

&lt;p&gt;root@isg-dev4:/root# bash test.sh &lt;br/&gt;
Creating source files........&lt;br/&gt;
Dropping cache&lt;br/&gt;
All threads reading the same file&lt;br/&gt;
Dropping cache&lt;br/&gt;
Reading the same file in a single thread &lt;br/&gt;
Time taken using read pattern&lt;br/&gt;
Core_A Core_B Core_C Core_D Core_E Core_F Core_G Core_H&lt;br/&gt;
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H&lt;br/&gt;
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H&lt;br/&gt;
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H&lt;br/&gt;
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H&lt;br/&gt;
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H&lt;br/&gt;
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H&lt;br/&gt;
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H&lt;br/&gt;
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H&lt;br/&gt;
Pattern takes 1702.89&lt;br/&gt;
Time taken using read pattern&lt;br/&gt;
Core_A Core_B Core_C Core_D Core_E Core_F Core_G Core_H&lt;br/&gt;
File_A File_A File_A File_A File_A File_A File_A File_A&lt;br/&gt;
File_B File_B File_B File_B File_B File_B File_B File_B&lt;br/&gt;
File_C File_C File_C File_C File_C File_C File_C File_C&lt;br/&gt;
File_D File_D File_D File_D File_D File_D File_D File_D&lt;br/&gt;
File_E File_E File_E File_E File_E File_E File_E File_E&lt;br/&gt;
File_F File_F File_F File_F File_F File_F File_F File_F&lt;br/&gt;
File_G File_G File_G File_G File_G File_G File_G File_G&lt;br/&gt;
File_H File_H File_H File_H File_H File_H File_H File_H&lt;br/&gt;
Pattern takes 4744.43&lt;br/&gt;
root@isg-dev4:/root# &lt;/p&gt;

&lt;p&gt;4k reads, tmpfs&lt;/p&gt;

&lt;p&gt;root@isg-dev4:/root# ./test.sh &lt;br/&gt;
Creating source files........&lt;br/&gt;
Dropping cache&lt;br/&gt;
All threads reading the same file&lt;br/&gt;
Dropping cache&lt;br/&gt;
Reading the same file in a single thread &lt;br/&gt;
Time taken using read pattern&lt;br/&gt;
Core_A Core_B Core_C Core_D Core_E Core_F Core_G Core_H&lt;br/&gt;
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H&lt;br/&gt;
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H&lt;br/&gt;
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H&lt;br/&gt;
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H&lt;br/&gt;
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H&lt;br/&gt;
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H&lt;br/&gt;
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H&lt;br/&gt;
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H&lt;br/&gt;
Pattern takes 207.198&lt;br/&gt;
Time taken using read pattern&lt;br/&gt;
Core_A Core_B Core_C Core_D Core_E Core_F Core_G Core_H&lt;br/&gt;
File_A File_A File_A File_A File_A File_A File_A File_A&lt;br/&gt;
File_B File_B File_B File_B File_B File_B File_B File_B&lt;br/&gt;
File_C File_C File_C File_C File_C File_C File_C File_C&lt;br/&gt;
File_D File_D File_D File_D File_D File_D File_D File_D&lt;br/&gt;
File_E File_E File_E File_E File_E File_E File_E File_E&lt;br/&gt;
File_F File_F File_F File_F File_F File_F File_F File_F&lt;br/&gt;
File_G File_G File_G File_G File_G File_G File_G File_G&lt;br/&gt;
File_H File_H File_H File_H File_H File_H File_H File_H&lt;br/&gt;
Pattern takes 191.962&lt;br/&gt;
root@isg-dev4:/root# &lt;/p&gt;

&lt;p&gt;4M reads,tmpfs&lt;/p&gt;

&lt;p&gt;root@isg-dev4:/root# ./test.sh &lt;br/&gt;
Creating source files........&lt;br/&gt;
Dropping cache&lt;br/&gt;
All threads reading the same file&lt;br/&gt;
Dropping cache&lt;br/&gt;
Reading the same file in a single thread &lt;br/&gt;
Time taken using read pattern&lt;br/&gt;
Core_A Core_B Core_C Core_D Core_E Core_F Core_G Core_H&lt;br/&gt;
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H&lt;br/&gt;
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H&lt;br/&gt;
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H&lt;br/&gt;
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H&lt;br/&gt;
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H&lt;br/&gt;
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H&lt;br/&gt;
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H&lt;br/&gt;
FIle_A FIle_B FIle_C FIle_D FIle_E FIle_F FIle_G FIle_H&lt;br/&gt;
Pattern takes 424.451&lt;br/&gt;
Time taken using read pattern&lt;br/&gt;
Core_A Core_B Core_C Core_D Core_E Core_F Core_G Core_H&lt;br/&gt;
File_A File_A File_A File_A File_A File_A File_A File_A&lt;br/&gt;
File_B File_B File_B File_B File_B File_B File_B File_B&lt;br/&gt;
File_C File_C File_C File_C File_C File_C File_C File_C&lt;br/&gt;
File_D File_D File_D File_D File_D File_D File_D File_D&lt;br/&gt;
File_E File_E File_E File_E File_E File_E File_E File_E&lt;br/&gt;
File_F File_F File_F File_F File_F File_F File_F File_F&lt;br/&gt;
File_G File_G File_G File_G File_G File_G File_G File_G&lt;br/&gt;
File_H File_H File_H File_H File_H File_H File_H File_H&lt;br/&gt;
Pattern takes 537.192&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="73487" author="jay" created="Fri, 13 Dec 2013 16:37:35 +0000"  >&lt;p&gt;Sorry, I just saw this ticket today. I haven&apos;t read all the comments, but the result mentioned in &quot;James Beal added a comment - 18/Nov/13 4:08 AM&quot; seems quite reasonable to me.&lt;/p&gt;

&lt;p&gt;For serial reading, only the first read comes from the server; subsequent reads are served from the client cache, which is why they are so fast.&lt;/p&gt;

&lt;p&gt;For parallel reading, things are different. Essentially only one process issues the read RPC, and the other 19 processes just wait for its result. For example, for the first 4M of pages, the first process sends the read RPC; the other processes then try to read the same portion of data, find the pages locked, and have to wait for that RPC to finish. The same thing happens for the second 4M block, and so on.&lt;/p&gt;</comment>
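This serialisation is easy to probe from the shell. The sketch below (a hypothetical helper; it assumes a directory of numbered files like those created by test.sh above) launches n concurrent dd readers against either one shared file or one file each, so the two cases can be timed side by side:

```shell
# Hypothetical probe: read with n concurrent dd processes, either all
# hitting the first file ("same") or one file per process ("each").
# Drops the page cache first when run with sufficient privilege.
read_concurrently() {
  local mode=$1 dir=$2 n=$3 first
  first=$(seq -w 1 "$n" | head -n1)
  if [ -w /proc/sys/vm/drop_caches ]; then echo 3 > /proc/sys/vm/drop_caches; fi
  if [ "$mode" = same ]; then
    seq -w 1 "$n" | xargs -P "$n" -I{} dd if="$dir/$first" of=/dev/null bs=4k 2>/dev/null
  else
    seq -w 1 "$n" | xargs -P "$n" -I{} dd if="$dir/{}" of=/dev/null bs=4k 2>/dev/null
  fi
}
```

On a Lustre mount, wrapping each call in `time` reproduces the same-file vs per-file comparison shown in the tables above.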
                            <comment id="73489" author="kitwestneat" created="Fri, 13 Dec 2013 17:07:38 +0000"  >&lt;p&gt;But then wouldn&apos;t you expect the first serial process time to be about the same as the first parallel process time? You can see from the data that the first serial process completes in about 1/3 less time than any of the parallel processes. &lt;/p&gt;</comment>
                            <comment id="73493" author="jay" created="Fri, 13 Dec 2013 17:46:28 +0000"  >&lt;p&gt;Actually you can&apos;t expect that, because the parallel processes spend a great deal of time waiting on page locks, whereas the serial reader benefits substantially from readahead.&lt;/p&gt;</comment>
                            <comment id="73500" author="james beal" created="Fri, 13 Dec 2013 18:47:23 +0000"  >&lt;p&gt;I note that the files are only being read and the file system is mounted with localflock. If we compare the ratio of speeds, tmpfs shows effectively no difference, while Lustre slows down by a factor of 15.6. Note that we have applications that read files one character at a time. Our current workaround involves an LD_PRELOAD hack which changes the reads into memory-mapped access.&lt;/p&gt;


&lt;div class=&apos;table-wrap&apos;&gt;
&lt;table class=&apos;confluenceTable&apos;&gt;&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;&amp;nbsp;&lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; 4M reads  &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;    4K reads &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; slow down &lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;lustre reading multiple files concurrently &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; 4843 secs &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;   7202 secs &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;  1.5      &lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;lustre reading a single file  concurrently &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; 8470 secs &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; 131779 secs &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; 15.6      &lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;tmpfs  reading multiple files concurrently &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; 8019 secs &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;   6142 secs &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;  0.8      &lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;tmpfs  reading a single file  concurrently &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; 8161 secs &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;   6131 secs &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;  0.8      &lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
</comment>
                            <comment id="73519" author="adilger" created="Fri, 13 Dec 2013 22:13:48 +0000"  >&lt;p&gt;Unless there is a difference in what is being tested between the filesystems, or I&apos;m misunderstanding your results, I find it interesting that Lustre concurrent 4MB reads (@4843s) are significantly &lt;em&gt;faster&lt;/em&gt; than tmpfs 4MB reads (@~8019s)?  &lt;img class=&quot;emoticon&quot; src=&quot;https://jira.whamcloud.com/images/icons/emoticons/smile.png&quot; height=&quot;16&quot; width=&quot;16&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;  Even with 4k reads the total time for Lustre (@7202s) is less than the 4MB reads for tmpfs (@8019s).&lt;/p&gt;

&lt;p&gt;It is reasonable that there is overhead for Lustre with smaller read sizes and with concurrent IO compared to tmpfs.  Lustre needs to do significantly more work per syscall than tmpfs.  Even with a single client, there is locking overhead to maintain consistency (whether it is needed in this particular use or not), overhead in being able to potentially distribute the IO across multiple servers, etc., while tmpfs is as close to a &quot;no-op&quot; as a filesystem can be.  That said, I agree the 15x slowdown for 4kB reads is excessive, and indicates there is too much contention/overhead in this part of the code.&lt;/p&gt;

&lt;p&gt;In 1.8 there was a fast path implemented for small reads, since most of the pages were prefetched via readahead and already in the page cache, so only a minimal amount of work was needed before returning the page to userspace.  I think Jinshan is working on a patch to improve this for 2.x as well.&lt;/p&gt;</comment>
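The readahead effect described above can be illustrated on any local filesystem: once pages have been prefetched into the page cache, a 4k-block pass over a file costs little more than a 4M-block pass. This is a rough local illustration only, not Lustre-specific:

```shell
# Rough illustration: the timed passes read a file that the warm-up pass
# has already pulled into the page cache, so even tiny 4k reads are
# served from prefetched pages rather than the device.
f=$(mktemp)
dd if=/dev/zero of="$f" bs=4M count=8 2>/dev/null    # 32 MB test file
dd if="$f" of=/dev/null bs=4M 2>/dev/null            # warm the cache
time dd if="$f" of=/dev/null bs=4M 2>/dev/null       # large cached reads
time dd if="$f" of=/dev/null bs=4k 2>/dev/null       # small cached reads
rm -f "$f"
```

The interesting question in this ticket is why the Lustre 2.x client loses that near-parity for small reads even when the pages are cached.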
                            <comment id="73524" author="james beal" created="Fri, 13 Dec 2013 22:52:19 +0000"  >&lt;p&gt;Andreas, the numbers I pasted were for a 1.8.9 client; I will try a 2.5 client and see what it gives us.&lt;/p&gt;

&lt;div class=&apos;table-wrap&apos;&gt;
&lt;table class=&apos;confluenceTable&apos;&gt;&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; Lustre 1.8.9                                          &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; 4M reads  &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;    4K reads &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; slow down &lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;lustre reading multiple files concurrently &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; 4843 secs &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;   7202 secs &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;  1.5      &lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;lustre reading a single file  concurrently &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; 8470 secs &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; 131779 secs &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; 15.6      &lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;tmpfs  reading multiple files concurrently &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; 8019 secs &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;   6142 secs &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;  0.8      &lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;tmpfs  reading a single file  concurrently &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; 8161 secs &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;   6131 secs &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;  0.8      &lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;


&lt;p&gt;For all of these reads the results should be &quot;in the cache&quot;. I will try to do a run with 1-character reads, which is what the program that originally exposed this issue does, so that it is apparent how painful the difference is.&lt;/p&gt;</comment>
                            <comment id="73546" author="james beal" created="Sun, 15 Dec 2013 17:44:01 +0000"  >&lt;p&gt;The change between lustre 1.8.9 and lustre 2.5.0 looks very unpleasant. (The run with one-character reads has been going over the weekend and has not yet completed the first part.) Given that we are moving to lustre 2.5 as the client on all our hosts very soon, I am very wary of these numbers.&lt;/p&gt;

&lt;div class=&apos;table-wrap&apos;&gt;
&lt;table class=&apos;confluenceTable&apos;&gt;&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;&amp;nbsp;&lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;  4M reads  &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;     4K reads &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; slow down  &lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;lustre 1.8.9 reading multiple files concurrently &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;  4843 secs &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;    7202 secs &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;   1.5      &lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;lustre 2.5.0 reading multiple files concurrently &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;  4578 secs &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;   55267 secs &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;  12.1      &lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;lustre 1.8.9 reading a single file  concurrently &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;  8470 secs &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;  131779 secs &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;  15.6      &lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;lustre 2.5.0 reading a single file  concurrently &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; 18500 secs &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; 2033260 secs &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; 110.0      &lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;tmpfs        reading multiple files concurrently &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;  8019 secs &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;    6142 secs &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;   0.8      &lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;tmpfs        reading a single file  concurrently &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;  8161 secs &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;    6131 secs &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;   0.8      &lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
</comment>
                            <comment id="73553" author="jay" created="Mon, 16 Dec 2013 06:05:54 +0000"  >&lt;p&gt;It seems bad. Let&apos;s start by investigating the single-file concurrent read case. Can you please send me the output of:&lt;/p&gt;

&lt;p&gt;lctl get_param llite.*.read_ahead_stats&lt;br/&gt;
lctl get_param osc.*.rpc_stats&lt;/p&gt;

&lt;p&gt;when running the 1.8 and 2.5 clients specifically. Please remount the client before running the test case so that we get accurate stats.&lt;/p&gt;

&lt;p&gt;Also, can you please run it with a single thread on both clients to check how much difference there is between them?&lt;/p&gt;</comment>
                            <comment id="73565" author="james beal" created="Mon, 16 Dec 2013 13:26:57 +0000"  >&lt;p&gt;To complete the tests in a reasonable time I had to reduce the size of the file to 128 MB.&lt;/p&gt;

&lt;div class=&apos;table-wrap&apos;&gt;
&lt;table class=&apos;confluenceTable&apos;&gt;&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;&amp;nbsp;&lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;  4M reads  &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;     4K reads &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; slow down &lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;lustre 1.8.9 reading a single file  concurrently &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;   768 secs &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;   5373 secs  &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;  7        &lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;lustre 2.5.0 reading a single file  concurrently &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;  1363 secs &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; 132559 secs  &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; 97        &lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;  Slow down                                      &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;  1.8       &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; 24.7         &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;&amp;nbsp;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;


&lt;div class=&apos;table-wrap&apos;&gt;
&lt;table class=&apos;confluenceTable&apos;&gt;&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;&amp;nbsp;&lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;  4M reads  &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;   4K reads &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; slow down  &lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;lustre 1.8.9 reading a single file                       &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;  0.45 secs &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; 0.35  secs &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;  0.78      &lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;lustre 2.5.0 reading a single file                       &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;  0.38 secs &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; 1.25  secs &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;  3.29      &lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; Slowdown (2.5.0 / 1.8.9) &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;  0.85      &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; 3.57       &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;&amp;nbsp;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;


&lt;p&gt;The single-client reads with a larger file&lt;/p&gt;

&lt;p&gt;This is with a 4G file, which seems to follow the smaller-file pattern.&lt;/p&gt;

&lt;div class=&apos;table-wrap&apos;&gt;
&lt;table class=&apos;confluenceTable&apos;&gt;&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;&amp;nbsp;&lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;  4M reads  &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;   4K reads &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; slowdown (4K / 4M) &lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;lustre 1.8.9 reading a single file                       &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;  9.1 secs  &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;  9.0  secs &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;  1.01      &lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;lustre 2.5.0 reading a single file                       &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;  8.8 secs  &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; 36.9  secs &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;  4.19      &lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; Slowdown (2.5.0 / 1.8.9) &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;  0.97      &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; 4.1        &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;&amp;nbsp;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
</comment>
                            <comment id="73571" author="james beal" created="Mon, 16 Dec 2013 14:17:56 +0000"  >&lt;p&gt;The logs have been added as attachments, along with a new copy of the test script.&lt;/p&gt;</comment>
                            <comment id="73632" author="jay" created="Mon, 16 Dec 2013 23:31:01 +0000"  >&lt;p&gt;It looks like the per-IO overhead of 2.x is much higher than that of 1.8. I did some experiments on my local machine and cl_lock contributes most of the overhead.&lt;/p&gt;

&lt;p&gt;I hope we can solve this problem in the ongoing CLIO simplification project.&lt;/p&gt;</comment>
                            <comment id="75097" author="kitwestneat" created="Thu, 16 Jan 2014 16:57:04 +0000"  >&lt;p&gt;Hi Jinshan,&lt;/p&gt;

&lt;p&gt;Can you elaborate on the clio simplification project and what sort of impact it might have on this IO pattern?&lt;/p&gt;

&lt;p&gt;Thanks,&lt;br/&gt;
Kit&lt;/p&gt;</comment>
                            <comment id="75113" author="jay" created="Thu, 16 Jan 2014 18:38:26 +0000"  >&lt;p&gt;The CLIO project is funded by OpenSFS. We will simplify the cl_lock implementation there. There will be a performance tuning phase at the end of the project, which will give us a chance to look at this problem.&lt;/p&gt;

&lt;p&gt;I once set up an environment to check which components slow it down. From what I have seen, cl_lock() contributes some of the overhead. Other components, for example cl_io data initialization, layout refresh, etc., also contribute significant overhead.&lt;/p&gt;

&lt;p&gt;I&apos;m on a really tight schedule, so I haven&apos;t been able to find time to work on this recently.&lt;/p&gt;</comment>
                            <comment id="75126" author="rhenwood" created="Thu, 16 Jan 2014 19:31:38 +0000"  >&lt;p&gt;Kit: for clarity, OpenSFS have funded a design phase for CLIO Simplification. Implementation and an associated performance analysis are not within scope for the CLIO Simplification design phase.&lt;/p&gt;</comment>
                            <comment id="76034" author="jay" created="Sat, 1 Feb 2014 05:34:28 +0000"  >&lt;p&gt;I&apos;d like to share some results I&apos;ve got so far.&lt;/p&gt;

&lt;p&gt;First of all, I can reproduce the problem mentioned by James. I worked out a patch and, together with the prototype of the cl_lock simplification, I have some results to share.&lt;/p&gt;

&lt;p&gt;There are two types of reads in my test:&lt;br/&gt;
1. 32 threads read 32 individual files;&lt;br/&gt;
2. 32 threads read the same file.&lt;/p&gt;

&lt;p&gt;I also ran the above test cases both with the data already in cache and with the cache not populated.&lt;/p&gt;

&lt;p&gt;Here are the results (in seconds):&lt;/p&gt;

&lt;div class=&apos;table-wrap&apos;&gt;
&lt;table class=&apos;confluenceTable&apos;&gt;&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;&amp;nbsp;&lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; 4M reads &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; 4K read&lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; Slow Down&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; ind + cache    &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; 24.96    &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; 198.53 &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;&amp;nbsp;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; par + cache    &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; 20.65    &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; 111.02 &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;&amp;nbsp;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; ind + no cache &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; 161.8    &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; 157.37 &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;&amp;nbsp;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; par + no cache &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; 13.44    &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; 115.51 &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;&amp;nbsp;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;


&lt;p&gt;ind: threads read individual files&lt;br/&gt;
par: threads read the same file&lt;br/&gt;
cache: before reading, the data is already in memory&lt;br/&gt;
no cache: clean up memory cache before reading&lt;/p&gt;

&lt;p&gt;Raw data is here:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@c01 tests]# CACHE=yes ~/test.sh 
Dropping cache
Read everything into memory cache..................................done
Creating source files................................
Block size 4k
cache: yes, Threads read individual files - wall: 16, real: 198.53 seconds

cache: yes, All threads reading the same file - wall: 43, real: 111.02 seconds
Block size 4M
cache: yes, Threads read individual files - wall: 1, real: 24.96 seconds

cache: yes, All threads reading the same file - wall: 1, real: 20.65 seconds
[root@c01 tests]# CACHE=no ~/test.sh 
Dropping cache
Creating source files................................
Block size 4k
Dropping cache
cache: no, Threads read individual files - wall: 101, real: 157.37 seconds

Dropping cache
cache: no, All threads reading the same file - wall: 43, real: 115.51 seconds
Block size 4M
Dropping cache
cache: no, Threads read individual files - wall: 94, real: 161.8 seconds

Dropping cache
cache: no, All threads reading the same file - wall: 1, real: 13.44 seconds
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;and this is my test script:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@c01 tests]# cat ~/test.sh 
#!/bin/bash

CACHE=${CACHE:-&quot;no&quot;}
LOC=/mnt/lustre/lustre_test
mkdir -p $LOC

#Where to store the results
RES=~/res
# How many processes to run at one time
PAR=32
BASE=`dirname $0`
# How many times to read the file (to multiply the caching effect)
MULT=1

rm -rf ${RES} &amp;amp;&amp;amp; mkdir -p ${RES}

function remount_lustre {
  echo Dropping cache
  echo 3 &amp;gt; /proc/sys/vm/drop_caches
  umount /mnt/lustre
  mount | grep c01
  mount -t lustre -o localflock  c01@tcp0:/lustre /mnt/lustre 

  if [ &quot;$CACHE&quot; = &quot;yes&quot; ]; then
    echo -n Read everything into memory cache..
    for i in `seq -f &quot;%03.f&quot; 1 ${PAR}`; do
          dd if=${LOC}/${i} of=/dev/null bs=4M &amp;gt; /dev/null 2&amp;gt;&amp;amp;1 
          echo -n .
    done
    echo done
  fi
}

remount_lustre

echo -n Creating source files
for i in `seq -f &quot;%03.f&quot; 1 ${PAR}`
do
  [ -e ${LOC}/${i} ] || dd if=/dev/zero of=${LOC}/${i} bs=4M count=32 &amp;gt; ${RES}/${i}_create 2&amp;gt;&amp;amp;1 
  echo -n .
done
echo

for BLOCK in &quot;4k&quot; &quot;4M&quot;
do
  echo &quot;Block size $BLOCK&quot;

  [ &quot;$CACHE&quot; = &quot;no&quot; ] &amp;amp;&amp;amp; remount_lustre

  echo -n &quot;cache: $CACHE, Threads read individual files - &quot;

  st=$(date +%s)
  for i in `seq -f &quot;%03.f&quot; 1 ${PAR}`; do
    /usr/bin/time -p -a -o ${RES}/single.${BLOCK} dd if=${LOC}/$i of=/dev/null bs=${BLOCK} &amp;gt; /dev/null 2&amp;gt;&amp;amp;1 &amp;amp;
  done
  wait

  echo -n &quot;wall: $((`date +%s`-st)), &quot;
  t=`grep -h real ${RES}/single.${BLOCK} | sed -e &apos;s/real *[^ ]//&apos; -e &apos;s/s$//&apos; -e &apos;s/m/*60+/&apos;  | bc | awk &apos;{ sum+=$1} END {print sum}&apos;`
  echo &quot;real: $t seconds&quot;
  echo

  [ &quot;$CACHE&quot; = &quot;no&quot; ] &amp;amp;&amp;amp; remount_lustre

  echo -n &quot;cache: $CACHE, All threads reading the same file - &quot;

  st=$(date +%s)
  for i in `seq -f &quot;%03.f&quot; 1 ${PAR}`; do
    /usr/bin/time -p -a -o ${RES}/par.${BLOCK} dd if=${LOC}/001 of=/dev/null bs=${BLOCK} &amp;gt; /dev/null 2&amp;gt;&amp;amp;1 &amp;amp;
  done
  wait

  echo -n &quot;wall: $((`date +%s`-st)), &quot;
  t=`grep -h real ${RES}/par.${BLOCK} | sed -e &apos;s/real *[^ ]//&apos; -e &apos;s/s$//&apos; -e &apos;s/m/*60+/&apos;  | bc | awk &apos;{ sum+=$1} END {print sum}&apos;`
  echo &quot;real: $t seconds&quot;
done
[root@c01 tests]# 
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="76035" author="jay" created="Sat, 1 Feb 2014 05:40:30 +0000"  >&lt;p&gt;Some of the patches are still in the prototype phase, and I&apos;ll share them with you as soon as I can, so that you can try to reproduce what I have done.&lt;/p&gt;

&lt;p&gt;Today is Chinese New Year. Happy Chinese New Year, everybody!&lt;/p&gt;</comment>
                            <comment id="76039" author="james beal" created="Sat, 1 Feb 2014 08:23:27 +0000"  >&lt;p&gt;Happy New Year to you too &lt;img class=&quot;emoticon&quot; src=&quot;https://jira.whamcloud.com/images/icons/emoticons/smile.png&quot; height=&quot;16&quot; width=&quot;16&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/p&gt;

&lt;p&gt;I completely understand about prototypes. It would be good to have before-and-after numbers in the tables.&lt;/p&gt;

&lt;p&gt;Thank you for looking at this for us.&lt;/p&gt;</comment>
                            <comment id="76074" author="dmiter" created="Mon, 3 Feb 2014 16:18:22 +0000"  >&lt;p&gt;Could you test with my patch &lt;a href=&quot;http://review.whamcloud.com/#/c/9095?&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/9095?&lt;/a&gt;&lt;br/&gt;
This patch significantly reduces the number of context switches on ll_inode_size_lock(). With that overhead gone, I see a speedup in parallel dds on my machine.&lt;/p&gt;</comment>
                            <comment id="76087" author="james beal" created="Mon, 3 Feb 2014 17:27:38 +0000"  >&lt;p&gt;I will build a new client tomorrow.&lt;/p&gt;</comment>
                            <comment id="76120" author="jay" created="Mon, 3 Feb 2014 19:27:01 +0000"  >&lt;p&gt;Here are the results before and after my patches are applied (in seconds):&lt;/p&gt;

&lt;div class=&apos;table-wrap&apos;&gt;
&lt;table class=&apos;confluenceTable&apos;&gt;&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&apos;confluenceTd&apos;&gt;&amp;nbsp;&lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; single &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; parallel &lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; master          &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; 3.28   &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; 18.91    &lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; master + patch  &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; 2.84   &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; 5.80     &lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; b1_8            &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; 0.62   &lt;/td&gt;
&lt;td class=&apos;confluenceTd&apos;&gt; 11.54    &lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;


&lt;p&gt;single: a single thread reads a single file&lt;br/&gt;
parallel: NR_CPUS threads read a single file in parallel (8 threads in my case).&lt;/p&gt;

&lt;p&gt;The file size is 1G and the file is already in cache before the threads start; the read block size is 4k.&lt;/p&gt;</comment>
                            <comment id="79241" author="james beal" created="Thu, 13 Mar 2014 15:19:42 +0000"  >&lt;p&gt;While we wait, I found the following interesting:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://lwn.net/Articles/590243/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://lwn.net/Articles/590243/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Performance-oriented patches should, of course, always be accompanied by benchmark results. In this case, Waiman included a set of AIM7 benchmark results with his patch set (which did not include the pending-bit optimization). Some workloads regressed a little, but others shows improvements of 1-2% &#8212; a good result for a low-level locking improvement. The disk benchmark runs, however, improved by as much as 116%; that benchmark suffers from especially strong contention for locks in the virtual filesystem layer and ext4 filesystem code.&lt;/p&gt;</comment>
                            <comment id="81373" author="manish" created="Thu, 10 Apr 2014 14:44:59 +0000"  >&lt;p&gt;Hi Jinshan,&lt;/p&gt;

&lt;p&gt;Any updates on the request made by James? Based on the above comment, we are looking to try out the new patch for the performance issues.&lt;/p&gt;

&lt;p&gt;Thank You,&lt;br/&gt;
          Manish&lt;/p&gt;</comment>
                            <comment id="87747" author="jay" created="Sat, 28 Jun 2014 02:20:08 +0000"  >&lt;p&gt;I came up with a solution for this issue and the initial testing results are exciting.&lt;/p&gt;

&lt;p&gt;The test case I used is based on the test cases shared by Sanger; please see below:&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;#!/bin/bash

nr_cpus=$(grep -c ^processor /proc/cpuinfo)

&lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; CACHE in no yes; &lt;span class=&quot;code-keyword&quot;&gt;do&lt;/span&gt;
	&lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; BS in 4k 1M; &lt;span class=&quot;code-keyword&quot;&gt;do&lt;/span&gt;
		echo &lt;span class=&quot;code-quote&quot;&gt;&quot;===== cache: $CACHE, block size: $BS =====&quot;&lt;/span&gt;
		[ &lt;span class=&quot;code-quote&quot;&gt;&quot;$CACHE&quot;&lt;/span&gt; = &lt;span class=&quot;code-quote&quot;&gt;&quot;yes&quot;&lt;/span&gt; ] &amp;amp;&amp;amp; { dd &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt;=/mnt/lustre/testfile bs=1M of=/dev/&lt;span class=&quot;code-keyword&quot;&gt;null&lt;/span&gt; &amp;gt; /dev/&lt;span class=&quot;code-keyword&quot;&gt;null&lt;/span&gt; 2&amp;gt;&amp;amp;1; }
		[ &lt;span class=&quot;code-quote&quot;&gt;&quot;$CACHE&quot;&lt;/span&gt; = &lt;span class=&quot;code-quote&quot;&gt;&quot;no&quot;&lt;/span&gt; ] &amp;amp;&amp;amp; { echo 3 &amp;gt; /proc/sys/vm/drop_caches; }

		echo -n &lt;span class=&quot;code-quote&quot;&gt;&quot;      single read: &quot;&lt;/span&gt;
		dd &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt;=/mnt/lustre/testfile bs=$BS of=/dev/&lt;span class=&quot;code-keyword&quot;&gt;null&lt;/span&gt; 2&amp;gt;&amp;amp;1 |grep copied |awk -F, &lt;span class=&quot;code-quote&quot;&gt;&apos;{print $3}&apos;&lt;/span&gt;

		[ &lt;span class=&quot;code-quote&quot;&gt;&quot;$CACHE&quot;&lt;/span&gt; = &lt;span class=&quot;code-quote&quot;&gt;&quot;no&quot;&lt;/span&gt; ] &amp;amp;&amp;amp; { echo 3 &amp;gt; /proc/sys/vm/drop_caches; }

		echo -n &lt;span class=&quot;code-quote&quot;&gt;&quot;      parallel read: &quot;&lt;/span&gt;
		&lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; i in `seq -w 1 ${nr_cpus}`; &lt;span class=&quot;code-keyword&quot;&gt;do&lt;/span&gt;
			dd &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt;=/mnt/lustre/testfile bs=$BS of=/dev/&lt;span class=&quot;code-keyword&quot;&gt;null&lt;/span&gt; &amp;gt; results/${i}_out 2&amp;gt;&amp;amp;1 &amp;amp;
		done
		wait
		grep copied results/1_out | awk -F, &lt;span class=&quot;code-quote&quot;&gt;&apos;{print $3}&apos;&lt;/span&gt;
	done
done
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The test file is 2G in size so that it can fit in memory for cache-enabled testing. I applied the patches and compared the test results with and without them.&lt;/p&gt;

&lt;p&gt;The results with my patches:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;===== cache: no, block size: 4k =====
      single read:  1.2 GB/s
      parallel read:  576 MB/s
===== cache: no, block size: 1M =====
      single read:  1.4 GB/s
      parallel read:  566 MB/s
===== cache: yes, block size: 4k =====
      single read:  3.8 GB/s
      parallel read:  1.8 GB/s
===== cache: yes, block size: 1M =====
      single read:  6.4 GB/s
      parallel read:  1.3 GB/s
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The results without my patches:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;===== cache: no, block size: 4k =====
      single read:  257 MB/s
      parallel read:  148 MB/s
===== cache: no, block size: 1M =====
      single read:  1.1 GB/s
      parallel read:  420 MB/s
===== cache: yes, block size: 4k =====
      single read:  361 MB/s
      parallel read:  147 MB/s
===== cache: yes, block size: 1M =====
      single read:  5.8 GB/s
      parallel read:  1.3 GB/s
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The small-IO performance improved significantly. I&apos;m still doing some fine tuning of the patches and will release them as soon as I can so that you can evaluate them.&lt;/p&gt;</comment>
                            <comment id="154358" author="gerrit" created="Thu, 2 Jun 2016 01:18:00 +0000"  >&lt;p&gt;Bobi Jam (bobijam@hotmail.com) uploaded a new patch: &lt;a href=&quot;http://review.whamcloud.com/20574&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/20574&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4257&quot; title=&quot;parallel dds are slower than serial dds&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4257&quot;&gt;&lt;del&gt;LU-4257&lt;/del&gt;&lt;/a&gt; clio: replace semaphore with mutex&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_4&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 3e1cbe0b81eaee6e509c825455669c89df157915&lt;/p&gt;</comment>
                            <comment id="154540" author="gerrit" created="Fri, 3 Jun 2016 04:34:57 +0000"  >&lt;p&gt;Oleg Drokin (oleg.drokin@intel.com) merged in patch &lt;a href=&quot;http://review.whamcloud.com/20254/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/20254/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4257&quot; title=&quot;parallel dds are slower than serial dds&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4257&quot;&gt;&lt;del&gt;LU-4257&lt;/del&gt;&lt;/a&gt; obdclass: Get rid of cl_env hash table&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 45332712783a4756bf5930d6bd5f697bbc27acdb&lt;/p&gt;</comment>
                            <comment id="154541" author="gerrit" created="Fri, 3 Jun 2016 04:35:05 +0000"  >&lt;p&gt;Oleg Drokin (oleg.drokin@intel.com) merged in patch &lt;a href=&quot;http://review.whamcloud.com/20256/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/20256/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4257&quot; title=&quot;parallel dds are slower than serial dds&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4257&quot;&gt;&lt;del&gt;LU-4257&lt;/del&gt;&lt;/a&gt; llite: fix up iov_iter implementation&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 1101120d3258509fa74f952cd8664bfdc17bd97d&lt;/p&gt;</comment>
                            <comment id="154542" author="gerrit" created="Fri, 3 Jun 2016 04:35:14 +0000"  >&lt;p&gt;Oleg Drokin (oleg.drokin@intel.com) merged in patch &lt;a href=&quot;http://review.whamcloud.com/20255/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/20255/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4257&quot; title=&quot;parallel dds are slower than serial dds&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4257&quot;&gt;&lt;del&gt;LU-4257&lt;/del&gt;&lt;/a&gt; llite: fast read implementation&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 172048eaefa834e310e6a0fa37e506579f4079df&lt;/p&gt;</comment>
                            <comment id="154633" author="pjones" created="Fri, 3 Jun 2016 20:47:39 +0000"  >&lt;p&gt;Landed for 2.9&lt;/p&gt;</comment>
                            <comment id="154975" author="adilger" created="Tue, 7 Jun 2016 20:39:15 +0000"  >&lt;p&gt;This patch introduced an intermittent test failure &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-8248&quot; title=&quot;sanity test_248: fast read was not 4 times faster&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-8248&quot;&gt;&lt;del&gt;LU-8248&lt;/del&gt;&lt;/a&gt; in sanity.sh test_248.&lt;/p&gt;</comment>
                            <comment id="155197" author="gerrit" created="Thu, 9 Jun 2016 00:16:24 +0000"  >&lt;p&gt;Andreas Dilger (andreas.dilger@intel.com) merged in patch &lt;a href=&quot;http://review.whamcloud.com/20647/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/20647/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4257&quot; title=&quot;parallel dds are slower than serial dds&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4257&quot;&gt;&lt;del&gt;LU-4257&lt;/del&gt;&lt;/a&gt; test: Correct error_ignore message&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 5f03bf91e68e925149f2331a44d1e4ad858b8006&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10010">
                    <name>Duplicate</name>
                                            <outwardlinks description="duplicates">
                                                        </outwardlinks>
                                                                <inwardlinks description="is duplicated by">
                                                        </inwardlinks>
                                    </issuelinktype>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="37448">LU-8248</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="43718">LU-9106</issuekey>
        </issuelink>
                            </outwardlinks>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="32994">LU-7382</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="46033">LU-9491</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="23208">LU-4650</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="23016">LU-4588</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="13846" name="debug_file.out.gz" size="227" author="kitwestneat" created="Thu, 21 Nov 2013 20:37:24 +0000"/>
                            <attachment id="13883" name="io.png" size="77269" author="jkb" created="Tue, 3 Dec 2013 16:46:09 +0000"/>
                            <attachment id="13836" name="lu-4257.tar.gz" size="223" author="james beal" created="Mon, 18 Nov 2013 12:08:06 +0000"/>
                            <attachment id="13922" name="lustre_1.8.9" size="869965" author="james beal" created="Mon, 16 Dec 2013 13:26:44 +0000"/>
                            <attachment id="13925" name="lustre_2.5" size="816735" author="james beal" created="Mon, 16 Dec 2013 13:26:44 +0000"/>
                            <attachment id="13924" name="readfile.sh" size="459" author="james beal" created="Mon, 16 Dec 2013 13:26:44 +0000"/>
                            <attachment id="13923" name="test.sh" size="1696" author="james beal" created="Mon, 16 Dec 2013 13:26:44 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzw907:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>11618</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>