<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:27:01 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-2649] slow IO at NOAA</title>
                <link>https://jira.whamcloud.com/browse/LU-2649</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;(Sorry, this is kind of a book.)&lt;/p&gt;

&lt;p&gt;For the past couple of weeks, NOAA has been seeing significantly reduced performance on one of their filesystems, scratch2. Both scratch2 and scratch1 have the same type of hardware and are connected to the same IB switches and network. The problem was first reported by a user doing a lot of opens and reads. The streaming rates appear to be normal, but the open rates have dropped significantly.&lt;/p&gt;

&lt;p&gt;We ran mdtest to confirm the drop:&lt;/p&gt;

&lt;p&gt;[Dennis.Nelson@fe7 ~/mdtest-1.8.3]$ mpirun -np 8&lt;br/&gt;
/home/Dennis.Nelson/mdtest-1.8.3/mdtest -z 2 -b 3 -n 2000 -i 3 -u -w 256&lt;br/&gt;
-d /scratch1/portfolios/BMC/nesccmgmt/Dennis.Nelson/mdtest&lt;br/&gt;
&amp;#8211; started at 01/14/2013 15:34:17 &amp;#8211;&lt;/p&gt;

&lt;p&gt;mdtest-1.8.3 was launched with 8 total task(s) on 1 nodes&lt;br/&gt;
Command line used: /home/Dennis.Nelson/mdtest-1.8.3/mdtest -z 2 -b 3 -n&lt;br/&gt;
2000 -i 3 -u -w 256 -d&lt;br/&gt;
/scratch1/portfolios/BMC/nesccmgmt/Dennis.Nelson/mdtest&lt;br/&gt;
Path: /scratch1/portfolios/BMC/nesccmgmt/Dennis.Nelson&lt;br/&gt;
FS: 2504.0 TiB   Used FS: 33.7%   Inodes: 2250.0 Mi   Used Inodes: 5.1%&lt;/p&gt;

&lt;p&gt;8 tasks, 15912 files/directories&lt;/p&gt;

&lt;p&gt;SUMMARY: (of 3 iterations)&lt;br/&gt;
   Operation                  Max        Min       Mean    Std Dev&lt;br/&gt;
   ---------                  ---        ---       ----    -------&lt;br/&gt;
   Directory creation:   3567.963   3526.380   3544.592     17.364&lt;br/&gt;
   Directory stat    :  38138.970  32977.207  36170.723   2278.437&lt;br/&gt;
   Directory removal :   2184.674   2171.584   2178.737      5.412&lt;br/&gt;
   File creation     :   3251.627   2585.204   3024.967    311.008&lt;br/&gt;
   File stat         :  14236.137  13053.559  13622.039    483.862&lt;br/&gt;
   File removal      :   3274.005   3046.231   3138.367     97.944&lt;br/&gt;
   Tree creation     :    511.472    407.753    460.822     42.378&lt;br/&gt;
   Tree removal      :    290.478    282.657    286.577      3.193&lt;/p&gt;

&lt;p&gt;&amp;#8211; finished at 01/14/2013 15:35:29 &amp;#8211;&lt;/p&gt;

&lt;p&gt;[Dennis.Nelson@fe7 ~/mdtest-1.8.3]$ mpirun -np 8&lt;br/&gt;
/home/Dennis.Nelson/mdtest-1.8.3/mdtest -z 2 -b 3 -n 2000 -i 3 -u -w 256&lt;br/&gt;
-d /scratch2/portfolios/BMC/nesccmgmt/Dennis.Nelson/mdtest&lt;br/&gt;
&amp;#8211; started at 01/14/2013 15:25:24 &amp;#8211;&lt;/p&gt;

&lt;p&gt;mdtest-1.8.3 was launched with 8 total task(s) on 1 nodes&lt;br/&gt;
Command line used: /home/Dennis.Nelson/mdtest-1.8.3/mdtest -z 2 -b 3 -n&lt;br/&gt;
2000 -i 3 -u -w 256 -d&lt;br/&gt;
/scratch2/portfolios/BMC/nesccmgmt/Dennis.Nelson/mdtest&lt;br/&gt;
Path: /scratch2/portfolios/BMC/nesccmgmt/Dennis.Nelson&lt;br/&gt;
FS: 3130.0 TiB   Used FS: 39.7%   Inodes: 2250.0 Mi   Used Inodes: 14.2%&lt;/p&gt;

&lt;p&gt;8 tasks, 15912 files/directories&lt;/p&gt;

&lt;p&gt;SUMMARY: (of 3 iterations)&lt;br/&gt;
   Operation                  Max        Min       Mean    Std Dev&lt;br/&gt;
   ---------                  ---        ---       ----    -------&lt;br/&gt;
   Directory creation:   2327.187   1901.660   2094.272    176.043&lt;br/&gt;
   Directory stat    :  10265.979   8306.610   9315.476    800.973&lt;br/&gt;
   Directory removal :   1600.981   1301.208   1407.570    136.989&lt;br/&gt;
   File creation     :   1592.205   1426.700   1528.690     72.839&lt;br/&gt;
   File stat         :    913.205    581.097    740.446    135.914&lt;br/&gt;
   File removal      :   1733.900   1288.562   1492.555    183.717&lt;br/&gt;
   Tree creation     :    303.718    241.506    266.777     26.705&lt;br/&gt;
   Tree removal      :    159.793     66.400    122.692     40.470&lt;/p&gt;

&lt;p&gt;&amp;#8211; finished at 01/14/2013 15:28:37 &amp;#8211;&lt;/p&gt;

&lt;p&gt;We&apos;ve investigated a few different areas without much luck. First we looked at the storage to make sure it looked ok, and it does look remarkably clean. Drive and LUN latencies are all low. We also looked at iostat on the MDS to make sure that the average wait time was in line with what was seen on scratch1, and it also all looks normal.&lt;/p&gt;

&lt;p&gt;We took a look at the average request wait time on the client, and found that requests to the OSTs and MDTs are taking significantly longer to complete on scratch2 vs scratch1:&lt;br/&gt;
[Kit.Westneat@bmem5 ~]$ for x in /proc/fs/lustre/osc/scratch2-OST*/stats; do cat $x | awk &apos;/req_wait/ {print $7 / $2&quot; &apos;$(echo $x | grep -o OST....)&apos;&quot; }&apos;; done | awk &apos;{s+=$1/220} END {print s}&apos;&lt;br/&gt;
4326.63&lt;br/&gt;
[Kit.Westneat@bmem5 ~]$ for x in /proc/fs/lustre/osc/scratch1-OST*/stats; do cat $x | awk &apos;/req_wait/ {print $7 / $2&quot; &apos;$(echo $x | grep -o OST....)&apos;&quot; }&apos;; done | awk &apos;{s+=$1/220} END {print s}&apos;&lt;br/&gt;
883.17&lt;br/&gt;
&lt;br/&gt;
scratch1 mdc avg req_waittime  191.04&lt;br/&gt;
scratch2 mdc avg req_waittime  1220.99&lt;br/&gt;
&lt;br/&gt;
We investigated the IB network some, but couldn&apos;t find anything obviously wrong. One interesting thing to note is that some of the OSSes on scratch2 use asymmetric routes between themselves and the test client (bmem5). For example:&lt;br/&gt;
[root@lfs-mds-2-1 ~]# ibtracert -C mlx4_1 -P2  62 31&lt;br/&gt;
From ca {0x0002c903000faa4a} portnum 1 lid 62-62 &quot;bmem5 mlx4_1&quot;&lt;br/&gt;
[1] -&gt; switch port {0x00066a00e3002e7c}[2] lid 4-4 &quot;QLogic 12300 GUID=0x00066a00e3002e7c&quot;&lt;br/&gt;
[20] -&gt; switch port {0x00066a00e30032de}[2] lid 2-2 &quot;QLogic 12300 GUID=0x00066a00e30032de&quot;&lt;br/&gt;
[28] -&gt; switch port {0x00066a00e3003338}[10] lid 6-6 &quot;QLogic 12300 GUID=0x00066a00e3003338&quot;&lt;br/&gt;
[32] -&gt; ca port {0x0002c903000f9194}[2] lid 31-31 &quot;lfs-oss-2-01 HCA-2&quot;&lt;br/&gt;
To ca {0x0002c903000f9192} portnum 2 lid 31-31 &quot;lfs-oss-2-01 HCA-2&quot;&lt;br/&gt;
&lt;br/&gt;
[root@lfs-mds-2-1 ~]# ibtracert -C mlx4_1 -P2  31 62&lt;br/&gt;
From ca {0x0002c903000f9192} portnum 2 lid 31-31 &quot;lfs-oss-2-01 HCA-2&quot;&lt;br/&gt;
[2] -&gt; switch port {0x00066a00e3003338}[32] lid 6-6 &quot;QLogic 12300 GUID=0x00066a00e3003338&quot;&lt;br/&gt;
[1] -&gt; switch port {0x00066a00e3003339}[19] lid 3-3 &quot;QLogic 12300 GUID=0x00066a00e3003339&quot;&lt;br/&gt;
[10] -&gt; switch port {0x00066a00e3002e7c}[28] lid 4-4 &quot;QLogic 12300 GUID=0x00066a00e3002e7c&quot;&lt;br/&gt;
[2] -&gt; ca port {0x0002c903000faa4b}[1] lid 62-62 &quot;bmem5 mlx4_1&quot;&lt;br/&gt;
To ca {0x0002c903000faa4a} portnum 1 lid 62-62 &quot;bmem5 mlx4_1&quot;&lt;br/&gt;
&lt;br/&gt;
The path from the client to the OSS uses switch 0x...32de, but the return path uses switch 0x...3339. I&apos;m not sure whether this has any effect on latency, but it&apos;s interesting to note.&lt;br/&gt;
&lt;br/&gt;
The slow average request times also hold from the MDSes to the OSTs:&lt;br/&gt;
[root@lfs-mds-1-1 ~]# for x in /proc/fs/lustre/osc/scratch1-OST*/stats; do cat $x | awk &apos;/req_wait/ {print $7 / $2&quot; &apos;$(echo $x | grep -o OST....)&apos;&quot; }&apos;; done | awk &apos;{s+=$1/220} END {print s}&apos;&lt;br/&gt;
2445.21&lt;br/&gt;
[root@lfs-mds-2-2 ~]# for x in /proc/fs/lustre/osc/scratch2-OST*/stats; do cat $x | awk &apos;/req_wait/ {print $7 / $2&quot; &apos;$(echo $x | grep -o OST....)&apos;&quot; }&apos;; done | awk &apos;{s+=$1/220} END {print s}&apos;&lt;br/&gt;
42442.6&lt;/p&gt;

&lt;p&gt;One individual OST had an average response time of 61s:&lt;br/&gt;
61061.2 OST0053&lt;/p&gt;

&lt;p&gt;Checking the brw_stats for that OST, we found that the vast majority of IOs are completed in less than 1s. Less than 1% of all IO took 1s or more, and most IOs (51%) were done in &amp;lt;= 16ms. This leads me to believe that we can rule out the storage.&lt;/p&gt;

&lt;p&gt;We were looking at loading the OST metadata using debugfs and realized that scratch2 has its caches enabled on the OSTs, while scratch1 does not. We disabled the read and write caches, and performance increased, though still not to the levels of scratch1:&lt;br/&gt;
[root@fe1 Dennis.Nelson]# mpirun -np 8 /home/Dennis.Nelson/mdtest-1.8.3/mdtest -z 2 -b 3 -n 2000 -i 3 -u -w 256 -d /scratch2/portfolios/BMC/nesccmgmt/Dennis.Nelson/mdtest&lt;br/&gt;
&amp;#8211; started at 01/18/2013 22:33:08 &amp;#8211;&lt;/p&gt;

&lt;p&gt;mdtest-1.8.3 was launched with 8 total task(s) on 1 nodes&lt;br/&gt;
Command line used: /home/Dennis.Nelson/mdtest-1.8.3/mdtest -z 2 -b 3 -n 2000 -i 3 -u -w 256 -d /scratch2/portfolios/BMC/nesccmgmt/Dennis.Nelson/mdtest&lt;br/&gt;
Path: /scratch2/portfolios/BMC/nesccmgmt/Dennis.Nelson&lt;br/&gt;
FS: 3130.0 TiB   Used FS: 37.8%   Inodes: 2250.0 Mi   Used Inodes: 15.2%&lt;/p&gt;

&lt;p&gt;8 tasks, 15912 files/directories&lt;/p&gt;

&lt;p&gt;SUMMARY: (of 3 iterations)&lt;br/&gt;
   Operation                  Max        Min       Mean    Std Dev&lt;br/&gt;
   ---------                  ---        ---       ----    -------&lt;br/&gt;
   Directory creation:   2551.993   2263.761   2403.099    117.864&lt;br/&gt;
   Directory stat    :  13057.811  10732.722  11987.916    958.214&lt;br/&gt;
   Directory removal :   1635.204   1550.824   1602.331     36.882&lt;br/&gt;
   File creation     :   2396.830   2291.217   2330.893     46.945&lt;br/&gt;
   File stat         :   9299.840   6609.414   7960.092   1098.389&lt;br/&gt;
   File removal      :   2082.180   1886.251   1966.489     83.824&lt;br/&gt;
   Tree creation     :    392.155    276.295    330.162     47.647&lt;br/&gt;
   Tree removal      :    223.157    185.368    203.008     15.529&lt;/p&gt;

&lt;p&gt;&amp;#8211; finished at 01/18/2013 22:34:53 &amp;#8211;&lt;/p&gt;

&lt;p&gt;The next thing we&apos;re going to try is small-message LNET testing to verify that latency isn&apos;t worse on the scratch2 paths versus the scratch1 paths. The OSSes themselves are not reporting abnormal req_waittimes: oss-2-08 reports an average wait time of 185.83ms, and oss-1-01 reports 265.72ms (checked using llstat ost).&lt;/p&gt;

&lt;p&gt;I was wondering if you all had any suggestions for what we could check next. Is there any information I can get you? Also, I was wondering what exactly the req_waittimes measure. Am I reading them correctly?&lt;/p&gt;

&lt;p&gt;Thanks,&lt;br/&gt;
Kit &lt;/p&gt;
</description>
                <environment></environment>
        <key id="17232">LU-2649</key>
            <summary>slow IO at NOAA</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="6">Not a Bug</resolution>
                                        <assignee username="cliffw">Cliff White</assignee>
                                    <reporter username="kitwestneat">Kit Westneat</reporter>
                        <labels>
                            <label>ptr</label>
                    </labels>
                <created>Fri, 18 Jan 2013 18:05:18 +0000</created>
                <updated>Fri, 8 Feb 2013 13:55:56 +0000</updated>
                            <resolved>Fri, 8 Feb 2013 13:55:56 +0000</resolved>
                                    <version>Lustre 1.8.7</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>8</watches>
                                                                            <comments>
                            <comment id="50842" author="cliffw" created="Fri, 18 Jan 2013 18:20:57 +0000"  >&lt;p&gt;The mdtest results would point toward MDS/network. I believe req_waittime for llstat oat is the time for the IO in the disk queue. I will ask our networking people for further advice.&lt;/p&gt;</comment>
                            <comment id="50844" author="cliffw" created="Fri, 18 Jan 2013 18:46:48 +0000"  >&lt;p&gt;They have suggested using LNET self-test to determine if the routing differences are significant. A test such as:&lt;br/&gt;
 &quot;lst add_test --batch bulk_rw --concurrency 16 --from client --to server brw write size=1M&quot;&lt;br/&gt;
would be useful, where &apos;client&apos; and &apos;server&apos; are a client and the MDS node. &lt;br/&gt;
See the Lustre Manual for lnet self-test example scripts if you haven&apos;t used it before. &lt;/p&gt;</comment>
                            <comment id="50846" author="kitwestneat" created="Fri, 18 Jan 2013 19:08:43 +0000"  >&lt;p&gt;LNET selftest actually performed better on the scratch2 (slow) MDT than on the scratch1 MDT to the same client. scratch2 was 2.9-3GB/s and scratch1 was 2.8ish. I tried doing a single rpc in flight test too, in order to see if latency was an issue, but they both ran at the same speed, 1.5-1.6GB/s. I tried doing a 4k test and RPC rates were lower on scratch2 by about 10% with concurrency 8. When I lowered the concurrency to 1, I got 30k rpcs on scratch2 vs. 40k rpcs on scratch1. That is somewhat significant, but it still not a large enough difference to explain the giant dirstat differences.&lt;/p&gt;</comment>
                            <comment id="50871" author="kitwestneat" created="Sun, 20 Jan 2013 23:22:47 +0000"  >&lt;p&gt;Any more ideas for things to check? The metadata performance delta between the filesystems is still fairly significant and we still haven&apos;t narrowed it down to any component. &lt;/p&gt;</comment>
                            <comment id="50872" author="kitwestneat" created="Sun, 20 Jan 2013 23:35:28 +0000"  >&lt;p&gt;Just to throw another ? into the equation. I looked at import on scratch1 and scratch2 right after running mdtests on both, and they both say 31s as their service estimates.. Though scratch1 has a waittime of 182ms vs a waittime of 262ms for scratch2. &lt;/p&gt;</comment>
                            <comment id="50874" author="niu" created="Mon, 21 Jan 2013 02:04:04 +0000"  >&lt;p&gt;Looks like it&apos;s a network problem between MDT to OSTs (and looks the connection between client and OSTs aren&apos;t good as well), I think we&apos;d run LNET selftest between MDTs and OSTs but not the client to MDTs.&lt;/p&gt;</comment>
                            <comment id="50904" author="kitwestneat" created="Mon, 21 Jan 2013 12:06:37 +0000"  >&lt;p&gt;I ran MDS &amp;lt;-&amp;gt; OSS testing with lnet_selftest. There was a lot of variation in the numbers due to the active workloads on them (the system is in production), but I couldn&apos;t see a significant difference between the scratch1 filesystem and scratch2. I will try to get better data, but at the moment the LST data is inconclusive.&lt;/p&gt;

&lt;p&gt;I was looking at more routes and it looks like some of the scratch1 routes are asymmetric as well, so I don&apos;t think that can be pointed at as the root cause. I am still seeing a 2x avg req_waittime between the MDT and OSTs on scratch2 vs scratch1 though. I&apos;ll try to look at the individual OSTs to see if I can see a pattern.&lt;/p&gt;</comment>
                            <comment id="50919" author="kitwestneat" created="Mon, 21 Jan 2013 14:19:51 +0000"  >&lt;p&gt;Hi Niu,&lt;/p&gt;

&lt;p&gt;I was wondering if you could expand on what makes it seem like a networking issue as opposed to anything else. The customer is interested. Also, is there anything else we could check to definitively rule out any other issues?&lt;/p&gt;

&lt;p&gt;Thanks,&lt;br/&gt;
Kit&lt;/p&gt;</comment>
                            <comment id="50921" author="doug" created="Mon, 21 Jan 2013 14:26:09 +0000"  >&lt;p&gt;Question: have the tunable LND parameters (credits, peer_credits, etc) been modified on any of these systems?  To help see if credits are playing a role in throttling at the LNet layer, can you post the results of these two &quot;cat&quot; commands on the systems with potentially slow networking:&lt;/p&gt;

&lt;p&gt;cat /proc/sys/lnet/nis&lt;br/&gt;
cat /proc/sys/lnet/peers&lt;br/&gt;
cat /proc/sys/lnet/stats&lt;/p&gt;

&lt;p&gt;These three proc files will show information about credits as well as potential networking errors.&lt;/p&gt;

&lt;p&gt;Are there LNet routers involved? If so, can you post the results of this &quot;cat&quot; command on those routers:&lt;/p&gt;

&lt;p&gt;cat /proc/sys/lnet/buffers&lt;/p&gt;

&lt;p&gt;That one will show if we are running out of routing buffers thereby creating a bottleneck.&lt;/p&gt;</comment>
                            <comment id="50943" author="cliffw" created="Mon, 21 Jan 2013 18:28:33 +0000"  >&lt;p&gt;To some extent, we are going by you initial problem statement, you said you checked the storage, and it appeared to be performing properly. In the case of the mdtest results you reported, those depend quite a bit on network/MDS storage performance.  Anything further you can check related to storage performance would help eliminate that as an issue.&lt;/p&gt;</comment>
                            <comment id="50950" author="kitwestneat" created="Mon, 21 Jan 2013 21:24:02 +0000"  >&lt;p&gt;Ok, I&apos;ll attach that output from OSSes and MDSes on scratch1 and scratch2. &lt;/p&gt;

&lt;p&gt;In getting this info, I realized that I had been testing the wrong fabric before for the OSS &amp;lt;-&amp;gt; MDS tests. The LST tests again looked pretty decent, except when I used the MDS as the server and the OSS as the client and did a write-only test. Then it bounced between 1.2-2.2 GB/s (most of the time around 2GB/s though). If I used the MDS as the client, and/or did a read-only test, the results were around 3GB/s. This behavior is seen on both the scratch1 and scratch2 filesystems. &lt;/p&gt;

&lt;p&gt;The IO to disk looks decent, but I was wondering if there was some way to rule out the layers above the disk. Is there a way to see how long IOs are waiting in Lustre&apos;s queues before going to the disk? Is it possible for ldiskfs to add significant time? Is there anything else we can check? &lt;/p&gt;</comment>
                            <comment id="50951" author="kitwestneat" created="Mon, 21 Jan 2013 21:24:26 +0000"  >&lt;p&gt;lnet data&lt;/p&gt;</comment>
                            <comment id="50963" author="doug" created="Tue, 22 Jan 2013 04:29:28 +0000"  >&lt;p&gt;In those collected LNet stats are a lot of negative &quot;min credits&quot;.  That represents the low water mark for credits for either a network interface or peer (depends on the proc file being viewed).  A negative number indicates that queuing has been going on and shows how deep the queue became.&lt;/p&gt;

&lt;p&gt;This indicates to me that some level of throttling is going on.  I cannot say for sure that this is enough to create the bandwidth reductions you are seeing.  One way to find out is to bump up the credits and peer_credits on critical nodes like the MDS and OSSes.  See the Lustre manual on the module parameters used to modify the credits and peer_credits.&lt;/p&gt;</comment>
                            <comment id="50974" author="kitwestneat" created="Tue, 22 Jan 2013 10:25:17 +0000"  >&lt;p&gt;One thing I noticed is that the OSS has a smaller min for the MDS peer line for scratch2 (-14) than for the scratch1 counterparts (-421). What does this mean? Does this mean that there is &lt;b&gt;less&lt;/b&gt; queuing on the scratch2 network than scratch1?&lt;/p&gt;</comment>
                            <comment id="50977" author="doug" created="Tue, 22 Jan 2013 12:31:39 +0000"  >&lt;p&gt;That is correct.  There seems to be less queuing on the scratch2 network.&lt;/p&gt;</comment>
                            <comment id="51043" author="kitwestneat" created="Wed, 23 Jan 2013 13:40:15 +0000"  >&lt;p&gt;Hmm, would that imply that the network is most likely not a factor in scratch2 being slower? If there are not many requests waiting in queue, it seems like the problem is in processing the requests and getting them into the queue. &lt;/p&gt;

&lt;p&gt;We have made some progress: CPU speed regulation was enabled on the OSSes and MDSes, and turning that off has improved performance across the board. There is still a large performance difference between scratch1 and scratch2, of the same magnitude as before. &lt;/p&gt;

&lt;p&gt;Are there any other stats we should check?&lt;/p&gt;</comment>
                            <comment id="51213" author="kitwestneat" created="Fri, 25 Jan 2013 11:31:23 +0000"  >&lt;p&gt;I did a trace and rpctrace debug dump. It looks like the time delta is almost entirely in the ldlm code. Do you have any thoughts as to what might cause that? &lt;/p&gt;

&lt;p&gt;I am going to try to enable ldlm debugging and see if I can get any more information. &lt;/p&gt;</comment>
                            <comment id="51214" author="cliffw" created="Fri, 25 Jan 2013 11:35:05 +0000"  >&lt;p&gt;Can you attach/upload the trace data?&lt;/p&gt;</comment>
                            <comment id="51215" author="cliffw" created="Fri, 25 Jan 2013 11:35:56 +0000"  >&lt;p&gt;Also, have you tried adjusting the peer credits, as Doug mentioned above?&lt;/p&gt;</comment>
                            <comment id="51216" author="kitwestneat" created="Fri, 25 Jan 2013 11:37:38 +0000"  >&lt;p&gt;rpctrace and trace of a single request with timestamp deltas added&lt;/p&gt;</comment>
                            <comment id="51217" author="kitwestneat" created="Fri, 25 Jan 2013 11:38:37 +0000"  >&lt;p&gt;The customer won&apos;t take a downtime to change the credits unless I can say that I think there is a high probability of solving the issue, and I don&apos;t have that level of confidence right now. &lt;/p&gt;</comment>
                            <comment id="51242" author="kitwestneat" created="Fri, 25 Jan 2013 14:25:19 +0000"  >&lt;p&gt;dlm traces, I&apos;m not sure if these actually add anything&lt;/p&gt;</comment>
                            <comment id="51341" author="kitwestneat" created="Mon, 28 Jan 2013 12:34:14 +0000"  >&lt;p&gt;Hello, any updates on this issue? I have been having regular status calls about it, so any ideas would be very helpful in keeping the customer content. &lt;/p&gt;</comment>
                            <comment id="51343" author="cliffw" created="Mon, 28 Jan 2013 12:42:09 +0000"  >&lt;p&gt;I am talking to our engineers, hoping to have some ideas for you soon. &lt;/p&gt;</comment>
                            <comment id="51346" author="green" created="Mon, 28 Jan 2013 12:57:35 +0000"  >&lt;p&gt;It seems pretty clear that entire MDS operations on one filesystem are slower than on hte other and it&apos;s not affected by OST performance (see how dir operations are also degraded and those don&apos;t really touch OSTs).&lt;br/&gt;
I wonder if the load on the more slow MDS is just different and as such leads to slower performance for other clients? (I assume there&apos;s a constant load in background in there?)&lt;br/&gt;
Another interestign thing would be to compare rpcstats and sdiostats from the two MDSes to see how those differ, I think&lt;/p&gt;</comment>
                            <comment id="51365" author="kitwestneat" created="Mon, 28 Jan 2013 21:12:22 +0000"  >&lt;p&gt;There is a variable background load, to be sure. scratch2 is often more loaded, but I&apos;ve tried to run the tests when there isn&apos;t very much activity on the MDSes (as seen by the export stats). I&apos;ll try to get a more precise measurement of relative activity via rpcstats. When you say sdiostats, which stats in particular are you thinking of? I&apos;ve used iostat to watch the disk activity and I don&apos;t see a lot of differences between the two.&lt;/p&gt;

&lt;p&gt;One thing that occurred to me is that scratch2 has 20 OSSes, while scratch1 only has 16 (and scratch2 has 25% more OSTs as well). Could the time spent in the LDLM code be proportional to the number of OSSes/OSTs? It would explain the 25% reduction in performance, to be sure.  &lt;/p&gt;</comment>
                            <comment id="51371" author="green" created="Mon, 28 Jan 2013 23:13:02 +0000"  >&lt;p&gt;LDLM on MDS does not deal with OSTs at all. The only place on MDT that does is in LOV (and OSCs) to do object pre-allocations (also setattrs and such).&lt;br/&gt;
Also when doing directory creates/removals, OSTs are not contacted at all, so the only impact they could make is if there&apos;s a lot of say creations going in background that wait for ost precreations and block other threads from executing, but it&apos;s unlikely to be happening all the time, other mds activities could have similar impact.&lt;br/&gt;
What&apos;s the typical mds wqueue waittimes before requests are serviced?&lt;/p&gt;

&lt;p&gt;I am interested in iostat data about typical IO completion times and IO sizes.&lt;/p&gt;</comment>
                            <comment id="51511" author="kitwestneat" created="Thu, 31 Jan 2013 02:37:27 +0000"  >&lt;p&gt;I looked at the /proc/fs/lustre/mds/*/stats and the /proc/fs/lustre/mdt/MDS/mds/stats files to see how loaded the system was. scratch2 does seem more loaded, but it might have just been when I was looking. I tried to find a moment when scratch2 was less loaded in order to run the benchmark. The mdtest showed some improvement but the dirstats were still significantly lower than on scratch1. Clearly the load on the system has some effect, and there is a downtime next week we&apos;ll use to test.&lt;/p&gt;

&lt;p&gt;What sorts of tests would be useful on a quiet system? I was going to enable various debug flags in order to be able to hopefully trace out what is going on. Is there anything else you would like to look at?&lt;/p&gt;

&lt;p&gt;What could cause dirstat performance to be so much worse than the other metadata operations?&lt;/p&gt;

&lt;p&gt;Here are some example req_* llstat mds output:&lt;br/&gt;
scratch1:&lt;br/&gt;
/proc/fs/lustre/mdt/MDS/mds/stats @ 1359617590.869072&lt;br/&gt;
Name                      Cur.Count  Cur.Rate   #Events   Unit           last        min          avg        max    stddev&lt;br/&gt;
req_waittime              14886      7443       4161976   [usec]       300812          3        18.38       3017     14.82&lt;br/&gt;
req_qdepth                14886      7443       4161976   [reqs]         1870          0         0.18         14      0.44&lt;br/&gt;
req_active                14886      7443       4161976   [reqs]        22011          1         1.89         55      1.59&lt;br/&gt;
req_timeout               14886      7443       4161976   [sec]         44658          3        26.59         31     10.20&lt;br/&gt;
reqbuf_avail              31265      15632      9080719   [bufs]     96035468       3044      3071.47       3072      1.07&lt;br/&gt;
ldlm_flock_enqueue        5          2          3190      [reqs]            5          1         1.00          1      0.00&lt;br/&gt;
ldlm_ibits_enqueue        14199      7099       3808598   [reqs]        14199          1         1.00          1      0.00&lt;/p&gt;

&lt;p&gt;scratch2:&lt;br/&gt;
/proc/fs/lustre/mdt/MDS/mds/stats @ 1359617591.258278&lt;br/&gt;
Name                      Cur.Count  Cur.Rate   #Events   Unit           last        min          avg        max    stddev&lt;br/&gt;
req_waittime              18790      9395       22080943  [usec]       389637          3        21.72     103629    171.90&lt;br/&gt;
req_qdepth                18790      9395       22080943  [reqs]         5206          0         0.33        147      0.92&lt;br/&gt;
req_active                18790      9395       22080943  [reqs]        64994          1         5.17        127     12.89&lt;br/&gt;
req_timeout               18790      9395       22080943  [sec]         93950          2        15.56         33     14.32&lt;br/&gt;
reqbuf_avail              44790      22395      53001699  [bufs]    894286323      19850     19965.41      19968      8.94&lt;br/&gt;
ldlm_flock_enqueue        38         19         19939     [reqs]           38          1         1.00          1      0.00&lt;br/&gt;
ldlm_ibits_enqueue        13264      6632       15674307  [reqs]        13264          1         1.00          1      0.00&lt;/p&gt;

&lt;p&gt;I was wondering why the reqbuf_avail on scratch2 was so much higher?&lt;/p&gt;</comment>
                            <comment id="51512" author="kitwestneat" created="Thu, 31 Jan 2013 02:38:02 +0000"  >&lt;p&gt;IO statistics for the sd devices, they look very similar between the FSes&lt;/p&gt;</comment>
                            <comment id="51513" author="kitwestneat" created="Thu, 31 Jan 2013 02:40:07 +0000"  >&lt;p&gt;differences in /proc/fs/lustre/mds/scratch2-MDT0000/stats over the run of an mdtest. You can see scratch2 was much more loaded during this time. The opens, closes, and unlinks stick out as major differences too. &lt;/p&gt;</comment>
                            <comment id="51573" author="kitwestneat" created="Thu, 31 Jan 2013 16:54:09 +0000"  >&lt;p&gt;I&apos;ve attached the test plan for the downtime next week, can you review it and see if there is anything else we should be looking at or trying?&lt;/p&gt;</comment>
                            <comment id="51575" author="cliffw" created="Thu, 31 Jan 2013 17:48:11 +0000"  >&lt;p&gt;What about the peer credits tuning mentioned earlier? &lt;/p&gt;</comment>
                            <comment id="51577" author="kitwestneat" created="Thu, 31 Jan 2013 17:50:50 +0000"  >&lt;p&gt;What are the risks in making that change and how likely do you think it is that those changes will affect performance?&lt;/p&gt;</comment>
                            <comment id="51578" author="cliffw" created="Thu, 31 Jan 2013 18:03:59 +0000"  >&lt;p&gt;I think the risks are minimal, shouldn&apos;t hurt anything. I am not certain how likely it is that it will help this issue.&lt;/p&gt;</comment>
                            <comment id="51579" author="kitwestneat" created="Thu, 31 Jan 2013 18:07:53 +0000"  >&lt;p&gt;Ok, the reason I ask is that any changes have to be approved through a change review meeting, and I&apos;m sure that they will ask me those questions.&lt;/p&gt;

&lt;p&gt;Actually another question I have is: can those be set only on the servers, or do they need to be set on the clients too? I remember the last time I played around with ko2iblnd settings, there were some that had to be set on both the client and the servers, or they would stop communicating.   &lt;/p&gt;</comment>
                            <comment id="51656" author="doug" created="Fri, 1 Feb 2013 18:08:36 +0000"  >&lt;p&gt;Credit changes affect the &quot;throttling&quot; on the system they are set on.  I suspect the important systems to change here are just the servers.  The clients (assuming there are many of them) should be fine as is.  The changes don&apos;t need to be the same on both sides of a connection.&lt;/p&gt;</comment>
                            <comment id="51668" author="green" created="Sat, 2 Feb 2013 00:09:14 +0000"  >&lt;p&gt;Note if you enable any extra debug flags, you&apos;ll likely have (significant, depending on flags) slower metadata performance for certain debug flags like +inode or +dlmtrace.&lt;/p&gt;

&lt;p&gt;It looks like scratch1 is at times loaded almost 2x less than scratch2, and I suspect that on a fully quiet system you&apos;ll see the same level of performance.&lt;/p&gt;

&lt;p&gt;dirstat is only affected by network latency and by how busy the MDS/disk is (and even then I suspect that for an mdtest run the created directories are fully in cache, so the disk does not play any role other than when it slows down other threads).&lt;br/&gt;
I suspect that if you forcefully increase the number of MDT threads (a module parameter for the mds module), dirstat should go up if you are running with a small number by default.&lt;/p&gt;</comment>
                            <comment id="51763" author="kitwestneat" created="Mon, 4 Feb 2013 21:20:50 +0000"  >&lt;p&gt;It seems like credits changes need to be made on all systems (clients and servers).&lt;/p&gt;

&lt;p&gt;I got these errors trying to ping from a node with credit changes to one without:&lt;br/&gt;
LustreError: 9688:0:(o2iblnd.c:806:kiblnd_create_conn()) Can&apos;t create QP: -22, send_wr: 16448, recv_wr: 256&lt;br/&gt;
LustreError: 7641:0:(o2iblnd_cb.c:2190:kiblnd_passive_connect()) Can&apos;t accept 10.175.31.242@o2ib: incompatible queue depth 8 (128 wanted)&lt;/p&gt;</comment>
                            <comment id="51805" author="doug" created="Tue, 5 Feb 2013 13:24:34 +0000"  >&lt;p&gt;Looks like I was wrong about the &quot;credits&quot; tuneable.  For IB, it controls the QP depth which should be symmetric (and we ensure it is symmetric as seen in the error messages).  All the systems on the same IB network (could be logical networks) need to have the same credits setting.&lt;/p&gt;</comment>
                            <comment id="52051" author="kitwestneat" created="Fri, 8 Feb 2013 13:47:48 +0000"  >&lt;p&gt;we can close this. We found a potential microcode issue with the Intel chipset and upgraded that. We also split the two filesystems over different fabrics. Metadata improved on both filesystems.&lt;/p&gt;</comment>
                            <comment id="52053" author="pjones" created="Fri, 8 Feb 2013 13:55:56 +0000"  >&lt;p&gt;ok thanks Kit!&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                            <attachment id="12204" name="dlm_trace.scratch1" size="37256" author="kitwestneat" created="Fri, 25 Jan 2013 14:25:19 +0000"/>
                            <attachment id="12203" name="dlm_trace.scratch2" size="37206" author="kitwestneat" created="Fri, 25 Jan 2013 14:25:19 +0000"/>
                            <attachment id="12180" name="lnet_data.tgz" size="53820" author="kitwestneat" created="Mon, 21 Jan 2013 21:24:26 +0000"/>
                            <attachment id="12200" name="processed_trace.scratch1" size="31317" author="kitwestneat" created="Fri, 25 Jan 2013 11:37:38 +0000"/>
                            <attachment id="12199" name="processed_trace.scratch2" size="31065" author="kitwestneat" created="Fri, 25 Jan 2013 11:37:38 +0000"/>
                            <attachment id="12230" name="scratch_1_2.deltas" size="957" author="kitwestneat" created="Thu, 31 Jan 2013 02:40:07 +0000"/>
                            <attachment id="12229" name="scratch_1_2.sd_iostats" size="2561" author="kitwestneat" created="Thu, 31 Jan 2013 02:38:02 +0000"/>
                            <attachment id="12233" name="test_plan.txt" size="1657" author="kitwestneat" created="Thu, 31 Jan 2013 16:54:09 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzvfp3:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>6189</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>