<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:03:18 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-58] poor LNet performance over QLogic HCAs</title>
                <link>https://jira.whamcloud.com/browse/LU-58</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;We have been testing QLogic HCAs for several customers and have run into an issue at our lab where rdma_bw is able to get 2.5GB/s or so, but lnet_selftest only gets 1GB/s. Actually I have gotten as much as ~1200MB/s, which leads me to believe it&apos;s capping out at 10Gb/s. &lt;/p&gt;

&lt;p&gt;Have you ever seen this? Is there anything we can do to debug this from a ko2iblnd point of view? We have already engaged QLogic and they can&apos;t find anything wrong.&lt;/p&gt;</description>
                <environment></environment>
        <key id="10323">LU-58</key>
            <summary>poor LNet performance over QLogic HCAs</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="6" iconUrl="https://jira.whamcloud.com/images/icons/statuses/closed.png" description="The issue is considered finished, the resolution is correct. Issues which are closed can be reopened.">Closed</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="2">Won&apos;t Fix</resolution>
                                        <assignee username="laisiyao">Lai Siyao</assignee>
                                    <reporter username="kitwestneat">Kit Westneat</reporter>
                        <labels>
                    </labels>
                <created>Wed, 2 Feb 2011 15:15:21 +0000</created>
                <updated>Wed, 21 Sep 2011 13:04:52 +0000</updated>
                            <resolved>Mon, 13 Jun 2011 19:32:40 +0000</resolved>
                                    <version>Lustre 1.8.6</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>5</watches>
                                                                            <comments>
                            <comment id="10514" author="cliffw" created="Wed, 2 Feb 2011 16:28:56 +0000"  >&lt;p&gt;How many CPUs does the system have?&lt;/p&gt;</comment>
                            <comment id="10515" author="kitwestneat" created="Wed, 2 Feb 2011 19:32:32 +0000"  >&lt;p&gt;2 socket, 8 cores total&lt;/p&gt;

&lt;p&gt;model name	: Intel(R) Xeon(R) CPU           E5520  @ 2.27GHz&lt;/p&gt;</comment>
                            <comment id="10516" author="liang" created="Wed, 2 Feb 2011 21:47:51 +0000"  >&lt;p&gt;Could you post your test script here? I would like to see the details of the test.&lt;/p&gt;</comment>
                            <comment id="10517" author="kitwestneat" created="Wed, 2 Feb 2011 22:09:26 +0000"  >&lt;p&gt;Here is the rdma_bw test I ran:&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;root@oss0 ~&amp;#93;&lt;/span&gt;# rdma_bw oss1-ib0&lt;br/&gt;
22891: | port=18515 | ib_port=1 | size=65536 | tx_depth=100 | sl=0 | iters=1000 | duplex=0 | cma=0 |&lt;br/&gt;
22891: Local address:  LID 0x06, QPN 0x0035, PSN 0x31620c RKey 0x7fdfe00 VAddr 0x002acb8070e000&lt;br/&gt;
22891: Remote address: LID 0x02, QPN 0x005d, PSN 0xc4e9d0, RKey 0x4191a00 VAddr 0x002b70d50bf000&lt;/p&gt;


&lt;p&gt;22891: Bandwidth peak (#19 to #999): 2633.54 MB/sec&lt;br/&gt;
22891: Bandwidth average: 2606.78 MB/sec&lt;br/&gt;
22891: Service Demand peak (#19 to #999): 838 cycles/KB&lt;br/&gt;
22891: Service Demand Avg  : 847 cycles/KB&lt;/p&gt;


&lt;p&gt;For the LNET test, I&apos;m using a wrapper script to call lst; I&apos;ll attach it:&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;root@oss1 ~&amp;#93;&lt;/span&gt;# lnet_selftest.sh -c &quot;192.168.99.10&lt;span class=&quot;error&quot;&gt;&amp;#91;1,2&amp;#93;&lt;/span&gt;@o2ib&quot; -s 192.168.99.103@o2ib -w&lt;br/&gt;
You need to manually load lnet_selftest on all nodes&lt;br/&gt;
modprobe lnet_selftest&lt;br/&gt;
LST_SESSION=8760&lt;br/&gt;
SESSION: read/write TIMEOUT: 300 FORCE: No&lt;br/&gt;
192.168.99.103@o2ib are added to session&lt;br/&gt;
192.168.99.10&lt;span class=&quot;error&quot;&gt;&amp;#91;1,2&amp;#93;&lt;/span&gt;@o2ib are added to session&lt;br/&gt;
Test was added successfully&lt;br/&gt;
batch is running now&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;LNet Rates of servers&amp;#93;&lt;/span&gt;&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;R&amp;#93;&lt;/span&gt; Avg: 1145     RPC/s Min: 1145     RPC/s Max: 1145     RPC/s&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;W&amp;#93;&lt;/span&gt; Avg: 2294     RPC/s Min: 2294     RPC/s Max: 2294     RPC/s&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;LNet Bandwidth of servers&amp;#93;&lt;/span&gt;&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;R&amp;#93;&lt;/span&gt; Avg: 0.17     MB/s  Min: 0.17     MB/s  Max: 0.17     MB/s&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;W&amp;#93;&lt;/span&gt; Avg: 1147.23  MB/s  Min: 1147.23  MB/s  Max: 1147.23  MB/s&lt;br/&gt;
session is ended&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;root@oss1 ~&amp;#93;&lt;/span&gt;# lnet_selftest.sh -c &quot;192.168.99.101@o2ib&quot; -s 192.168.99.103@o2ib -w&lt;br/&gt;
You need to manually load lnet_selftest on all nodes&lt;br/&gt;
modprobe lnet_selftest&lt;br/&gt;
LST_SESSION=8815&lt;br/&gt;
SESSION: read/write TIMEOUT: 300 FORCE: No&lt;br/&gt;
192.168.99.103@o2ib are added to session&lt;br/&gt;
192.168.99.101@o2ib are added to session&lt;br/&gt;
Test was added successfully&lt;br/&gt;
batch is running now&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;LNet Rates of servers&amp;#93;&lt;/span&gt;&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;R&amp;#93;&lt;/span&gt; Avg: 1242     RPC/s Min: 1242     RPC/s Max: 1242     RPC/s&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;W&amp;#93;&lt;/span&gt; Avg: 2484     RPC/s Min: 2484     RPC/s Max: 2484     RPC/s&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;LNet Bandwidth of servers&amp;#93;&lt;/span&gt;&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;R&amp;#93;&lt;/span&gt; Avg: 0.19     MB/s  Min: 0.19     MB/s  Max: 0.19     MB/s&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;W&amp;#93;&lt;/span&gt; Avg: 1241.94  MB/s  Min: 1241.94  MB/s  Max: 1241.94  MB/s&lt;br/&gt;
session is ended&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;root@oss1 ~&amp;#93;&lt;/span&gt;# lnet_selftest.sh -c &quot;192.168.99.102@o2ib&quot; -s 192.168.99.103@o2ib -w&lt;br/&gt;
You need to manually load lnet_selftest on all nodes&lt;br/&gt;
modprobe lnet_selftest&lt;br/&gt;
LST_SESSION=8837&lt;br/&gt;
SESSION: read/write TIMEOUT: 300 FORCE: No&lt;br/&gt;
192.168.99.103@o2ib are added to session&lt;br/&gt;
192.168.99.102@o2ib are added to session&lt;br/&gt;
Test was added successfully&lt;br/&gt;
batch is running now&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;LNet Rates of servers&amp;#93;&lt;/span&gt;&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;R&amp;#93;&lt;/span&gt; Avg: 1087     RPC/s Min: 1087     RPC/s Max: 1087     RPC/s&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;W&amp;#93;&lt;/span&gt; Avg: 2175     RPC/s Min: 2175     RPC/s Max: 2175     RPC/s&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;LNet Bandwidth of servers&amp;#93;&lt;/span&gt;&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;R&amp;#93;&lt;/span&gt; Avg: 0.17     MB/s  Min: 0.17     MB/s  Max: 0.17     MB/s&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;W&amp;#93;&lt;/span&gt; Avg: 1087.11  MB/s  Min: 1087.11  MB/s  Max: 1087.11  MB/s&lt;/p&gt;</comment>
                            <comment id="10518" author="kitwestneat" created="Wed, 2 Feb 2011 22:10:10 +0000"  >&lt;p&gt;driver script for lst&lt;/p&gt;</comment>
                            <comment id="10521" author="liang" created="Thu, 3 Feb 2011 15:08:06 +0000"  >&lt;p&gt;Kit,&lt;/p&gt;

&lt;p&gt;We do have an SMP performance issue with lnet_selftest (we will have a patch for this within weeks), but I&apos;m not sure whether 2 * 4 cores would hit it.&lt;br/&gt;
If possible, could you please try these to help us investigate:&lt;/p&gt;
&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;disable one socket to see whether it helps lnet_selftest performance&lt;/li&gt;
	&lt;li&gt;disable two cores on each socket, and measure performance with selftest&lt;/li&gt;
	&lt;li&gt;run it with 2 clients and 1 server, and use lst stat on the server to see performance&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Thanks&lt;br/&gt;
Liang&lt;/p&gt;</comment>
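Before disabling cores for the socket experiments above, it helps to know which core IDs belong to which socket. A read-only sketch using standard Linux sysfs paths (core numbering varies by machine, so check before offlining anything):

```shell
#!/bin/sh
# Read-only sketch: map each CPU to its physical socket, so you know
# which core IDs to take offline for the single-socket test.
found=0
for c in /sys/devices/system/cpu/cpu[0-9]*; do
    [ -d "$c" ] || continue
    found=1
    pkg=$(cat "$c/topology/physical_package_id" 2>/dev/null || echo "?")
    echo "${c##*/}: socket $pkg"
done
[ "$found" -eq 1 ] || echo "no sysfs CPU topology available"
# To take a core offline (as root): echo 0 > /sys/devices/system/cpu/cpuN/online
```

Cores taken offline this way come back after a reboot, or by writing 1 to the same file.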
                            <comment id="10562" author="laisiyao" created="Tue, 8 Feb 2011 17:49:29 +0000"  >&lt;p&gt;Peter, I will talk with Liang and work on this.&lt;/p&gt;</comment>
                            <comment id="10569" author="liang" created="Wed, 9 Feb 2011 04:15:40 +0000"  >&lt;p&gt;Kit, another question here is about NUMA: is NUMA enabled on your system (2 nodes or 1 node)?&lt;/p&gt;

&lt;p&gt;Thanks&lt;br/&gt;
Liang&lt;/p&gt;</comment>
                            <comment id="10584" author="kitwestneat" created="Wed, 9 Feb 2011 07:56:22 +0000"  >&lt;p&gt;Hi, sorry I haven&apos;t had a lot of time to do testing recently. It looks like NUMA is enabled (I don&apos;t know very much about NUMA yet):&lt;/p&gt;

&lt;p&gt;available: 2 nodes (0-1)&lt;br/&gt;
node 0 size: 12120 MB&lt;br/&gt;
node 0 free: 11446 MB&lt;br/&gt;
node 1 size: 12090 MB&lt;br/&gt;
node 1 free: 11668 MB&lt;br/&gt;
node distances:&lt;br/&gt;
node   0   1 &lt;br/&gt;
  0:  10  20 &lt;br/&gt;
  1:  20  10 &lt;/p&gt;

&lt;p&gt;Thanks,&lt;br/&gt;
Kit&lt;/p&gt;</comment>
                            <comment id="10624" author="morrone" created="Fri, 11 Feb 2011 13:41:36 +0000"  >&lt;p&gt;FYI, LLNL also had trouble getting good performance out of our QLogic cards with LNet.  The main trouble we found is that while they implement the verbs interface for RDMA calls, the operations are not actually RDMA.  &lt;/p&gt;

&lt;p&gt;In other words, &quot;RDMA&quot; operations with the qlogic cards are not zero-copy.  &lt;br/&gt;
The qlogic cards can&apos;t write directly into the destination buffer on the node; they need to do a memory copy.&lt;/p&gt;

&lt;p&gt;There were other tweaks we made that got performance a little higher, but ultimately the lack of true RDMA support on the card was the limiting factor.&lt;/p&gt;

&lt;p&gt;Our IB guy is out today, and I don&apos;t remember the details of what he did to tweak the qlogic performance.  I think that the in-kernel verbs interface only has a single qlogic ring buffer by default, and I believe that he increased that to 4 and we saw some benefit.&lt;/p&gt;</comment>
                            <comment id="10625" author="weiny2" created="Fri, 11 Feb 2011 14:26:34 +0000"  >&lt;p&gt;Disclaimer: We are running the 7340 card so if you have another card I don&apos;t know if this will apply or not.&lt;/p&gt;


&lt;p&gt;I have forgotten some of the details but check your driver for the following options.  Here are the settings we are using.&lt;/p&gt;

&lt;p&gt;options ib_qib krcvqs=4&lt;br/&gt;
options ib_qib rcvhdrcnt=32768&lt;/p&gt;


&lt;p&gt;The krcvqs option increases the number of receive queues used by the driver. We have 12 cores/node and the card has 18 contexts; one of those is used for something I don&apos;t remember. For the rest, QLogic recommends allocating one per core, which left us with 5 spare (you will have more). We played around and 4 seemed to give the best performance. However, this required a patch to the module to make it actually use all 4 contexts. QLogic has the final patch and should be able to provide it.&lt;/p&gt;

&lt;p&gt;The rcvhdrcnt option increases a header descriptor count (again, I would have to dig up the details). Regardless, this option required another patch to the driver, which is now in the upstream kernel. We came across the need for it when we got hangs from the card. QLogic fixed the hang with another patch, so you might need to make sure that one is available as well. Anyway, during all that testing we found performance was a bit better with rcvhdrcnt set higher, so we left it.&lt;/p&gt;

&lt;p&gt;I will check with QLogic and make sure but I don&apos;t see why you could not pull our version of the driver.  I have a git tree which will build stand alone against a current RHEL5 kernel.  (It may need other modifications for other kernels).  Let me know if you would like that.&lt;/p&gt;

&lt;p&gt;Hope this helps,&lt;br/&gt;
Ira&lt;/p&gt;</comment>
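For reference, Ira's two settings above would normally live together in the module configuration; a sketch only (the file path is an assumption and varies by distribution):

```conf
# e.g. /etc/modprobe.conf on RHEL5-era systems, or a file under
# /etc/modprobe.d/ on newer ones (path is an assumption)
options ib_qib krcvqs=4 rcvhdrcnt=32768
```

The module must be reloaded (or the node rebooted) for changed options to take effect.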
                            <comment id="10626" author="liang" created="Fri, 11 Feb 2011 22:40:42 +0000"  >&lt;p&gt;Chris, Ira,&lt;/p&gt;

&lt;p&gt;Thanks for the information.&lt;br/&gt;
We do have performance issues on NUMA systems (with selftest, obdfilter-survey, and the whole Lustre stack), and as you said, QLogic doesn&apos;t have true RDMA support on the card, so the memory copy might make it worse, especially since Lustre/LNet has many threads context-switching across the stack...&lt;/p&gt;

&lt;p&gt;I&apos;ve got a branch to support NUMA better, but the patch for lnet_selftest is still in progress (I have an old patch for lnet_selftest, but it no longer applies to any branch). Most of the work in the other modules has been done; I will post it here when I finish the lnet_selftest patch.&lt;/p&gt;

&lt;p&gt;Kit,&lt;br/&gt;
If you have a chance to run those tests, please also collect the output of numastat before &amp;amp; after each test (I don&apos;t know whether there is any way to reset the counters...)&lt;/p&gt;

&lt;p&gt;Thanks again&lt;br/&gt;
Liang&lt;/p&gt;</comment>
                            <comment id="10642" author="kitwestneat" created="Mon, 14 Feb 2011 11:32:47 +0000"  >&lt;p&gt;Unfortunately I don&apos;t currently have access to the systems, so I&apos;m unable to do more testing. Here is the modprobe.conf line I was using:&lt;br/&gt;
options ib_qib singleport=1 krcvqs=8 rcvhdrcnt=4096&lt;/p&gt;

&lt;p&gt;We&apos;re using the latest engineering build of the qib driver. I had the QLogic folks on the system looking at it, but they couldn&apos;t see anything particularly wrong. The severe performance difference between rdma_bw and lst is what makes me think it&apos;s an issue at the LNet level; I&apos;ve never seen such a large difference.&lt;/p&gt;

&lt;p&gt;I&apos;ll let you know how the NUMA testing goes when I&apos;m able to get back on the system.&lt;/p&gt;</comment>
                            <comment id="10655" author="kitwestneat" created="Tue, 15 Feb 2011 12:49:49 +0000"  >&lt;p&gt;Ira, Chris,&lt;/p&gt;

&lt;p&gt;In your tests, did lnet_selftest performance match rdma_bw performance or did you see a mismatch? Did you see Lustre performance more on the level of lnet_selftest? I just want to see if my experience matches yours.&lt;/p&gt;

&lt;p&gt;Thanks,&lt;br/&gt;
Kit&lt;/p&gt;</comment>
                            <comment id="10694" author="ihara" created="Sun, 20 Feb 2011 01:54:59 +0000"  >&lt;p&gt;Here are various benchmarks on InfiniBand comparing Mellanox and QLogic.&lt;br/&gt;
We know that Lustre performance with Mellanox QDR is good and close to wire speed. However, with QLogic QDR, we only see 2.2-2.5GB/sec with the RDMA benchmark, and 1.4GB/sec on the OSS and 700MB/sec on the client with LNET.&lt;br/&gt;
MPI on QLogic QDR performs well, but everything else is really not good compared with Mellanox.&lt;/p&gt;</comment>
                            <comment id="10750" author="ihara" created="Fri, 25 Feb 2011 03:42:30 +0000"  >&lt;p&gt;Ira, &lt;/p&gt;

&lt;p&gt;Have you measured rdma_bw or LNET performance before?&lt;br/&gt;
We are also working with QLogic and tried to run LNET selftest with the QLogic HCA, but we are only getting around 2GB/sec per server. Please see the lnet_selftest numbers below for one server and 4 clients.&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt; 
[LNet Bandwidth of s1]
[R] Avg: 1891.21  MB/s  Min: 1891.21  MB/s  Max: 1891.21  MB/s
[W] Avg: 0.29     MB/s  Min: 0.29     MB/s  Max: 0.29     MB/s
[LNet Bandwidth of c1]
[R] Avg: 0.07     MB/s  Min: 0.07     MB/s  Max: 0.07     MB/s
[W] Avg: 460.37   MB/s  Min: 460.37   MB/s  Max: 460.37   MB/s
[LNet Bandwidth of c2]
[R] Avg: 0.06     MB/s  Min: 0.06     MB/s  Max: 0.06     MB/s
[W] Avg: 387.04   MB/s  Min: 387.04   MB/s  Max: 387.04   MB/s
[LNet Bandwidth of c3]
[R] Avg: 0.08     MB/s  Min: 0.08     MB/s  Max: 0.08     MB/s
[W] Avg: 533.66   MB/s  Min: 533.66   MB/s  Max: 533.66   MB/s
[LNet Bandwidth of c4]
[R] Avg: 0.07     MB/s  Min: 0.07     MB/s  Max: 0.07     MB/s
[W] Avg: 466.42   MB/s  Min: 466.42   MB/s  Max: 466.42   MB/s
[LNet Bandwidth of s1]
[R] Avg: 2201.17  MB/s  Min: 2201.17  MB/s  Max: 2201.17  MB/s
[W] Avg: 0.34     MB/s  Min: 0.34     MB/s  Max: 0.34     MB/s
[LNet Bandwidth of c1]
[R] Avg: 0.07     MB/s  Min: 0.07     MB/s  Max: 0.07     MB/s
[W] Avg: 485.16   MB/s  Min: 485.16   MB/s  Max: 485.16   MB/s
[LNet Bandwidth of c2]
[R] Avg: 0.07     MB/s  Min: 0.07     MB/s  Max: 0.07     MB/s
[W] Avg: 471.09   MB/s  Min: 471.09   MB/s  Max: 471.09   MB/s
[LNet Bandwidth of c3]
[R] Avg: 0.10     MB/s  Min: 0.10     MB/s  Max: 0.10     MB/s
[W] Avg: 668.38   MB/s  Min: 668.38   MB/s  Max: 668.38   MB/s
[LNet Bandwidth of c4]
[R] Avg: 0.10     MB/s  Min: 0.10     MB/s  Max: 0.10     MB/s
[W] Avg: 627.77   MB/s  Min: 627.77   MB/s  Max: 627.77   MB/s
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt; 

&lt;p&gt;Regarding the RDMA benchmark, we did some QLogic tuning and could get 3GB/sec at peak, but it is still low when the message size is big (e.g. 512K, 1M, 2M...) compared with the Mellanox HCA. So I wonder how much RDMA and LNET performance you are getting.&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt; 
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]
 2          5000           0.68               0.67   
 4          5000           1.39               1.36   
 8          5000           2.77               2.73   
 16         5000           5.61               5.44   
 32         5000           11.19              10.88  
 64         5000           22.40              21.88  
 128        5000           45.89              43.70  
 256        5000           92.18              89.20  
 512        5000           194.19             187.29 
 1024       5000           397.34             370.98 
 2048       5000           798.60             776.63 
 4096       5000           1343.20            1281.17
 8192       5000           1920.83            1865.76
 16384      5000           2588.50            2537.42
 32768      5000           3159.68            3153.84
 65536      5000           3162.81            3162.80
 131072     5000           3075.97            3056.87
 262144     5000           3011.65            2432.95
 524288     5000           2948.59            2757.53
 1048576    5000           2910.32            2754.89
 2097152    5000           2884.98            2761.72
 4194304    5000           2860.59            2769.83
 8388608    5000           2764.13            2667.08
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt; </comment>
                            <comment id="10765" author="ihara" created="Sun, 27 Feb 2011 22:46:31 +0000"  >&lt;p&gt;Hello Liang,&lt;/p&gt;

&lt;p&gt;After some QLogic tuning, we got 9.6GB/sec write from 4 OSSs (2.4-2.5GB/sec per OSS). This is not perfect, but according to the QLogic RDMA benchmark (~2.7GB/sec), the number is reasonable.&lt;/p&gt;

&lt;p&gt;But read is still slow on Lustre. The problem seems to be that the kiblnd_sd_XX threads are consuming a lot of CPU. Please see the &quot;top&quot; output below, captured during the benchmark. I&apos;m also collecting oprofile data and will post it. Could you have a look at them, please?&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;top - 22:38:03 up 2 days, 35 min,  2 users,  load average: 48.77, 42.98, 31.63
Tasks: 512 total,  51 running, 461 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.8%us, 84.7%sy,  0.0%ni,  0.0%id,  0.0%wa, 13.8%hi,  0.7%si,  0.0%st
Mem:  24545172k total, 24328468k used,   216704k free,     3188k buffers
Swap:  2096376k total,      612k used,  2095764k free, 23775256k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                         
19702 root      25   0     0    0    0 R 80.0  0.0 129:29.76 kiblnd_sd_06                                                     
19698 root      25   0     0    0    0 R 79.4  0.0 129:37.71 kiblnd_sd_02                                                     
19701 root      25   0     0    0    0 R 78.4  0.0 129:43.95 kiblnd_sd_05                                                     
19703 root      25   0     0    0    0 R 77.8  0.0 129:28.93 kiblnd_sd_07                                                     
19700 root      25   0     0    0    0 R 77.5  0.0 130:07.11 kiblnd_sd_04                                                     
19697 root      25   0     0    0    0 R 77.2  0.0 129:00.45 kiblnd_sd_01                                                     
19699 root      25   0     0    0    0 R 66.3  0.0 129:06.17 kiblnd_sd_03                                                     
19696 root      25   0     0    0    0 R 49.6  0.0 129:31.55 kiblnd_sd_00                                                     
29257 root      15   0 24052 4744 1576 S 16.0  0.0   0:47.91 oprofiled                                                        
  564 root      10  -5     0    0    0 S  5.4  0.0  91:42.85 kswapd0                                                          
  565 root      10  -5     0    0    0 S  2.9  0.0  25:56.07 kswapd1                                                          
19965 root      15   0     0    0    0 R  2.6  0.0   7:54.08 ll_ost_io_08                                                     
20022 root      15   0     0    0    0 S  2.6  0.0   7:17.86 ll_ost_io_65                                                     
20037 root      15   0     0    0    0 R  2.6  0.0   7:24.11 ll_ost_io_80                                                     
20055 root      15   0     0    0    0 S  2.6  0.0   7:12.28 ll_ost_io_98   
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="10766" author="ihara" created="Sun, 27 Feb 2011 22:55:58 +0000"  >&lt;p&gt;opreport output; it seems ib_qib (the QLogic driver) is the highest.&lt;/p&gt;</comment>
                            <comment id="10767" author="ihara" created="Sun, 27 Feb 2011 22:56:31 +0000"  >&lt;p&gt;&quot;opreport -l&quot; output.&lt;/p&gt;</comment>
                            <comment id="10768" author="liang" created="Sun, 27 Feb 2011 23:19:01 +0000"  >&lt;p&gt;Ihara, could you run opreport like this: &quot;opreport -l -p /lib/modules/`uname -r` &amp;gt; output_file&quot;, so we can see the symbols of the Lustre modules?&lt;/p&gt;

&lt;p&gt;Thanks&lt;br/&gt;
Liang&lt;/p&gt;</comment>
                            <comment id="10769" author="ihara" created="Sun, 27 Feb 2011 23:27:46 +0000"  >&lt;p&gt;Liang, attached is output of &quot;opreport -l -p /lib/modules/`uname -r`&quot;.&lt;/p&gt;</comment>
                            <comment id="10771" author="liang" created="Mon, 28 Feb 2011 03:04:10 +0000"  >&lt;p&gt;Ihara, thanks, I have a few more questions:&lt;/p&gt;
&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;how many clients were in your tests?&lt;/li&gt;
	&lt;li&gt;as you said, read performance is &quot;slow&quot;; do you have any numbers at hand?&lt;/li&gt;
	&lt;li&gt;I assume the opreport output is from the OSS and is for the &quot;read&quot; tests, right?&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;I think we probably can&apos;t help much with the high CPU load if QLogic doesn&apos;t have true RDMA. Also, would it be possible for you to run lnet_selftest (read and write separately, 2 or more clients with one server, concurrency=8, and brw_test size=1M) to see LNet performance?&lt;br/&gt;
I&apos;m digging into the QLogic driver; at the same time, it would be very helpful if you could also try ko2iblnd map_on_demand=32 (sorry, but this has to be set on all nodes) to see if it helps performance.&lt;/p&gt;

&lt;p&gt;Thanks&lt;br/&gt;
Liang&lt;/p&gt;</comment>
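Liang's requested configuration (read and write run separately, concurrency=8, 1M bulk) maps onto an lst session roughly as follows. This is a sketch only: the NIDs are placeholders taken from Kit's earlier runs in this ticket, and the script checks for the lst binary so it degrades to a message where LNet selftest isn't available.

```shell
#!/bin/sh
# Sketch of the lst run requested above: 2 clients, 1 server,
# concurrency=8, 1M bulk reads.  NIDs are placeholders from earlier
# comments; swap "read" for "write" to get the matching write test.
if command -v lst >/dev/null 2>/dev/null; then
    export LST_SESSION=$$
    lst new_session read_write
    lst add_group servers 192.168.99.103@o2ib
    lst add_group clients 192.168.99.101@o2ib 192.168.99.102@o2ib
    lst add_batch bulk_read
    lst add_test --batch bulk_read --concurrency 8 \
        --from clients --to servers brw read size=1M
    lst run bulk_read
    timeout 30 lst stat servers     # sample server-side bandwidth
    lst end_session
    status=ran
else
    echo "lst not installed; sketch only"
    status=skipped
fi
```

As the wrapper script earlier in the ticket notes, the lnet_selftest module must be loaded on every node (modprobe lnet_selftest) before the session is created.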
                            <comment id="10772" author="ihara" created="Mon, 28 Feb 2011 04:32:48 +0000"  >&lt;p&gt;Hi Liang,&lt;/p&gt;

&lt;p&gt;We have 40 clients (AMD, 48 cores, 128GB memory per client). The server is an Intel Westmere with 8 cores and 24GB memory. Each client&apos;s I/O throughput is also slow (500MB/sec per client); I think this is another problem (many-cores related). However, we are not focusing on that problem; we just need aggregate server throughput for write/read.&lt;/p&gt;

&lt;p&gt;Here are the IOR results with 40 clients and 4 OSSs. We get 9.6GB/sec for write, but only 6.7GB/sec for read.&lt;br/&gt;
Max Write: 9197.14 MiB/sec (9643.90 MB/sec)&lt;br/&gt;
Max Read:  6408.91 MiB/sec (6720.23 MB/sec)&lt;/p&gt;

&lt;p&gt;The opreport I sent is what I got on the OSS during the read I/O testing, so your assumption is correct.&lt;/p&gt;

&lt;p&gt;Here is the LNET write/read testing with 8 clients and a single server. I haven&apos;t tried map_on_demand=32 yet; let me try it soon and I will send you updates.&lt;/p&gt;

&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;Write&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;LNet Bandwidth of s&amp;#93;&lt;/span&gt;&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;R&amp;#93;&lt;/span&gt; Avg: 3191.25  MB/s  Min: 3191.25  MB/s  Max: 3191.25  MB/s&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;W&amp;#93;&lt;/span&gt; Avg: 0.49     MB/s  Min: 0.49     MB/s  Max: 0.49     MB/s&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;LNet Bandwidth of c&amp;#93;&lt;/span&gt;&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;R&amp;#93;&lt;/span&gt; Avg: 0.06     MB/s  Min: 0.06     MB/s  Max: 0.06     MB/s&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;W&amp;#93;&lt;/span&gt; Avg: 399.11   MB/s  Min: 398.03   MB/s  Max: 401.14   MB/s&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;LNet Bandwidth of s&amp;#93;&lt;/span&gt;&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;R&amp;#93;&lt;/span&gt; Avg: 3188.86  MB/s  Min: 3188.86  MB/s  Max: 3188.86  MB/s&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;W&amp;#93;&lt;/span&gt; Avg: 0.49     MB/s  Min: 0.49     MB/s  Max: 0.49     MB/s&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;LNet Bandwidth of c&amp;#93;&lt;/span&gt;&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;R&amp;#93;&lt;/span&gt; Avg: 0.06     MB/s  Min: 0.06     MB/s  Max: 0.06     MB/s&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;W&amp;#93;&lt;/span&gt; Avg: 399.04   MB/s  Min: 397.43   MB/s  Max: 400.66   MB/s&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;LNet Bandwidth of s&amp;#93;&lt;/span&gt;&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;R&amp;#93;&lt;/span&gt; Avg: 3194.23  MB/s  Min: 3194.23  MB/s  Max: 3194.23  MB/s&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;W&amp;#93;&lt;/span&gt; Avg: 0.49     MB/s  Min: 0.49     MB/s  Max: 0.49     MB/s&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;LNet Bandwidth of c&amp;#93;&lt;/span&gt;&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;R&amp;#93;&lt;/span&gt; Avg: 0.06     MB/s  Min: 0.06     MB/s  Max: 0.06     MB/s&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;W&amp;#93;&lt;/span&gt; Avg: 399.31   MB/s  Min: 398.11   MB/s  Max: 401.14   MB/s&lt;/li&gt;
&lt;/ul&gt;


&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;Read&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;LNet Bandwidth of s&amp;#93;&lt;/span&gt;&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;R&amp;#93;&lt;/span&gt; Avg: 0.25     MB/s  Min: 0.25     MB/s  Max: 0.25     MB/s&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;W&amp;#93;&lt;/span&gt; Avg: 1598.90  MB/s  Min: 1598.90  MB/s  Max: 1598.90  MB/s&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;LNet Bandwidth of c&amp;#93;&lt;/span&gt;&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;R&amp;#93;&lt;/span&gt; Avg: 199.59   MB/s  Min: 196.99   MB/s  Max: 203.74   MB/s&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;W&amp;#93;&lt;/span&gt; Avg: 0.03     MB/s  Min: 0.03     MB/s  Max: 0.03     MB/s&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;LNet Bandwidth of s&amp;#93;&lt;/span&gt;&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;R&amp;#93;&lt;/span&gt; Avg: 0.25     MB/s  Min: 0.25     MB/s  Max: 0.25     MB/s&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;W&amp;#93;&lt;/span&gt; Avg: 1600.68  MB/s  Min: 1600.68  MB/s  Max: 1600.68  MB/s&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;LNet Bandwidth of c&amp;#93;&lt;/span&gt;&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;R&amp;#93;&lt;/span&gt; Avg: 200.03   MB/s  Min: 198.05   MB/s  Max: 204.79   MB/s&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;W&amp;#93;&lt;/span&gt; Avg: 0.03     MB/s  Min: 0.03     MB/s  Max: 0.03     MB/s&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;LNet Bandwidth of s&amp;#93;&lt;/span&gt;&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;R&amp;#93;&lt;/span&gt; Avg: 0.24     MB/s  Min: 0.24     MB/s  Max: 0.24     MB/s&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;W&amp;#93;&lt;/span&gt; Avg: 1552.72  MB/s  Min: 1552.72  MB/s  Max: 1552.72  MB/s&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;LNet Bandwidth of c&amp;#93;&lt;/span&gt;&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;R&amp;#93;&lt;/span&gt; Avg: 194.65   MB/s  Min: 190.99   MB/s  Max: 198.14   MB/s&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;W&amp;#93;&lt;/span&gt; Avg: 0.03     MB/s  Min: 0.03     MB/s  Max: 0.03     MB/s&lt;/li&gt;
&lt;/ul&gt;
</comment>
                            <comment id="10798" author="ihara" created="Mon, 28 Feb 2011 21:21:13 +0000"  >&lt;p&gt;Liang,&lt;/p&gt;

&lt;p&gt;With map_on_demand=32, it shows better performance.&lt;/p&gt;

&lt;p&gt;Max Write: 9723.80 MiB/sec (10196.15 MB/sec)&lt;br/&gt;
Max Read:  7797.74 MiB/sec (8176.52 MB/sec)&lt;/p&gt;

&lt;p&gt;Is it worth trying smaller values for map_on_demand (e.g. map_on_demand=16) to see? We want 9.4GB/sec for read/write.&lt;/p&gt;

&lt;p&gt;btw, what does map_on_demand mean?&lt;/p&gt;

&lt;p&gt;Thanks&lt;br/&gt;
Ihara&lt;/p&gt;</comment>
                            <comment id="10799" author="liang" created="Mon, 28 Feb 2011 22:01:53 +0000"  >&lt;p&gt;Ihara, map_on_demand means that we will enable FMR; map_on_demand=32 will use FMR for any RDMA &amp;gt; 32 * 4K (128K). So I suspect a smaller map_on_demand will not help unless the IO request size is &amp;lt; 128K.&lt;/p&gt;

&lt;p&gt;ko2iblnd doesn&apos;t use FMR by default; it just creates a global MR and premaps all memory. That is quick on some HCAs, especially for small IO requests, because we don&apos;t need to map again before RDMA. However, I took a quick look at the QIB source code, and it looks heavy to send fragments one by one in qib_post_send-&amp;gt;qib_post_one_send: for a 1M IO request, qib_post_one_send will be called 256 times. So I think that if we enable FMR and map pages into one fragment, it will reduce a lot of overhead.&lt;/p&gt;

&lt;p&gt;Could you please gather oprofiles with map_on_demand enabled, so we can try to find out whether there is more we can improve?&lt;/p&gt;

&lt;p&gt;Regards&lt;br/&gt;
Liang&lt;/p&gt;</comment>
                            <comment id="10801" author="ihara" created="Mon, 28 Feb 2011 23:03:16 +0000"  >&lt;p&gt;Liang, &lt;/p&gt;

&lt;p&gt;Thanks for the explanation. I&apos;ve just attached the output of &quot;opreport -l -p /lib/modules/`uname -r`&quot; after setting ko2iblnd map_on_demand=32. I collected this data during the read testing.&lt;/p&gt;</comment>
                            <comment id="10806" author="liang" created="Tue, 1 Mar 2011 01:23:08 +0000"  >&lt;p&gt;I suspect ib_post_send does a memory copy for QIB; if so, probably the only way is to move ib_post_send out of the o2iblnd spinlock (conn:ibc_lock). It&apos;s not easy, because the credits system of o2iblnd relies on this spinlock. I need some time to think about it.&lt;/p&gt;

&lt;p&gt;Thanks&lt;br/&gt;
Liang&lt;/p&gt;</comment>
                            <comment id="10828" author="liang" created="Tue, 1 Mar 2011 20:52:18 +0000"  >&lt;p&gt;Ihara, &lt;/p&gt;

&lt;p&gt;I guess I was wrong in my previous comment; there is no reason to call ib_post_send while holding ibc_lock. I&apos;m not 100% sure, but I think we only have this because o2iblnd inherited this piece of code from an old LND (iiblnd), which didn&apos;t allow re-entry on the same QP; that is not the case with OFED.&lt;br/&gt;
I&apos;ve posted a patch at here: &lt;a href=&quot;http://review.whamcloud.com/#change,285&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#change,285&lt;/a&gt;&lt;br/&gt;
This patch releases ibc_lock before calling ib_post_send, which avoids a lot of contention when other threads want to post more data on the same connection.&lt;/p&gt;

&lt;p&gt;If possible, could you please try this patch with and without map_on_demand and collect oprofiles?&lt;br/&gt;
It&apos;s still an experimental patch, so please don&apos;t be surprised if you hit any problems with it... &lt;img class=&quot;emoticon&quot; src=&quot;https://jira.whamcloud.com/images/icons/emoticons/sad.png&quot; height=&quot;16&quot; width=&quot;16&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/p&gt;

&lt;p&gt;Liang&lt;/p&gt;</comment>
                            <comment id="10830" author="ihara" created="Tue, 1 Mar 2011 22:39:06 +0000"  >&lt;p&gt;Liang, &lt;/p&gt;

&lt;p&gt;Thanks. Attached is the oprofile data I collected on the OSS while running the read benchmark without the map_on_demand setting, after applying your patch.&lt;/p&gt;</comment>
                            <comment id="10831" author="liang" created="Tue, 1 Mar 2011 22:56:05 +0000"  >&lt;p&gt;Hi Ihara, could you provide performance data as well? thanks&lt;br/&gt;
Liang&lt;/p&gt;</comment>
                            <comment id="10832" author="ihara" created="Tue, 1 Mar 2011 22:58:40 +0000"  >&lt;p&gt;Liang,&lt;/p&gt;

&lt;p&gt;Nope, it was slower than map_on_demand=32 without the patch. I&apos;m going to run the benchmark again with map_on_demand=32 and the patch.&lt;/p&gt;</comment>
                            <comment id="10833" author="ihara" created="Wed, 2 Mar 2011 01:44:13 +0000"  >&lt;p&gt;With map_on_demand=32 and the patch, it was slower than map_on_demand=32 without the patch.&lt;br/&gt;
Max Write: 8752.85 MiB/sec (9178.03 MB/sec)&lt;br/&gt;
Max Read:  6623.76 MiB/sec (6945.51 MB/sec)&lt;/p&gt;

&lt;p&gt;Even though the ko2iblnd functions are reduced in the profile, the performance doesn&apos;t go up. Instead, qib_sdma_verbs_send() now goes up really high.&lt;/p&gt;

&lt;p&gt;So, is fixing qib the only way to improve the performance?&lt;/p&gt;</comment>
                            <comment id="10862" author="liang" created="Thu, 3 Mar 2011 17:37:56 +0000"  >&lt;p&gt;Ihara,&lt;br/&gt;
When we run lnet_selftest with the &quot;read&quot; test, o2iblnd has one more message than with the &quot;write&quot; test:&lt;/p&gt;

&lt;p&gt;READ&lt;br/&gt;
(selftest req)&lt;br/&gt;
server: &amp;lt;-- PUT_REQ   &amp;lt;-- client&lt;br/&gt;
server: --&amp;gt; PUT_NOACK --&amp;gt; client &lt;br/&gt;
(selftest bulk)&lt;br/&gt;
server: --&amp;gt; PUT_REQ   --&amp;gt; client&lt;br/&gt;
server: &amp;lt;-- PUT_ACK   &amp;lt;-- client&lt;br/&gt;
server: --&amp;gt; PUT_DONE  --&amp;gt; client&lt;br/&gt;
(selftest reply)&lt;br/&gt;
server: --&amp;gt; PUT_REQ   --&amp;gt; client&lt;br/&gt;
server: &amp;lt;-- PUT_NOACK &amp;lt;-- client&lt;/p&gt;

&lt;p&gt;WRITE&lt;br/&gt;
(selftest req)&lt;br/&gt;
server: &amp;lt;-- PUT_REQ   &amp;lt;-- client&lt;br/&gt;
server: --&amp;gt; PUT_NOACK --&amp;gt; client &lt;br/&gt;
(selftest bulk)&lt;br/&gt;
server: --&amp;gt; GET_REQ   --&amp;gt; client&lt;br/&gt;
server: --&amp;gt; GET_DONE  --&amp;gt; client&lt;br/&gt;
(selftest reply)&lt;br/&gt;
server: --&amp;gt; PUT_REQ   --&amp;gt; client&lt;br/&gt;
server: &amp;lt;-- PUT_NOACK &amp;lt;-- client&lt;/p&gt;

&lt;p&gt;So we normally see &quot;read&quot; performance a little lower than &quot;write&quot; performance with the same tuning parameters.&lt;br/&gt;
But this can&apos;t explain why &quot;read&quot; performance dropped significantly as the number of clients increased:&lt;/p&gt;

&lt;p&gt;Data from your email:&lt;br/&gt;
-----------------------------------------------------&lt;br/&gt;
Intel (server) &amp;lt;-&amp;gt; Intel (client)&lt;br/&gt;
(2 x Intel E5620 2.4GHz, 8 core CPU, 24GB memory)&lt;br/&gt;
#client      write(GB/sec)        read(GB/sec)&lt;br/&gt;
1                  2.0                      2.4&lt;br/&gt;
2                  2.6                      2.4&lt;br/&gt;
3                  3.2                      2.2&lt;br/&gt;
4                  3.2                      2.0&lt;br/&gt;
5                  3.2                      2.0&lt;/p&gt;

&lt;p&gt;Intel (server) &amp;lt;-&amp;gt; AMD (client)&lt;br/&gt;
(2 x Intel E5620 2.4GHz, 24GB memory) - ( 4 x AMD Opteron 6174, 8 core CPU, 128GB memory)&lt;br/&gt;
#client      write(GB/sec)        read(GB/sec)&lt;br/&gt;
 1                 1.2                      0.9&lt;br/&gt;
 2                 1.4                      1.9&lt;br/&gt;
 3                 2.9                      2.3&lt;br/&gt;
 4                 3.1                      2.2&lt;br/&gt;
 5                 3.1                      2.1&lt;br/&gt;
 6                 3.1                      2.0&lt;br/&gt;
 7                 3.1                      1.9&lt;br/&gt;
 8                 3.1                      1.8&lt;br/&gt;
 9                 3.1                      1.8&lt;br/&gt;
10                 3.1                      1.7&lt;/p&gt;

&lt;p&gt;I noticed the default &quot;credits&quot; of o2iblnd is a little low, so it might be worth trying higher values on both client &amp;amp; server:&lt;br/&gt;
ko2iblnd map_on_demand=16 ntx=1024 credits=512 peer_credits=32 &lt;br/&gt;
Though I&apos;m not sure how much it can help.&lt;/p&gt;

&lt;p&gt;As I said in my mail, I think qib_sdma_verbs_send is still suspicious because it holds spin_lock_irqsave the whole time, which could have an impact on performance; I hope we can get some help from the QLogic engineers.&lt;/p&gt;

&lt;p&gt;Regards&lt;br/&gt;
Liang&lt;/p&gt;</comment>
                            <comment id="16100" author="pjones" created="Mon, 13 Jun 2011 14:42:54 +0000"  >&lt;p&gt;Ihara&lt;/p&gt;

&lt;p&gt;Any update on this?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="16132" author="ihara" created="Mon, 13 Jun 2011 19:22:48 +0000"  >&lt;p&gt;We finally replaced the QLogic HCAs with Mellanox, so at this moment it&apos;s OK to close this. We still can&apos;t achieve the same numbers (which I got on Mellanox) on the QLogic HCAs, though.&lt;/p&gt;</comment>
                            <comment id="16136" author="pjones" created="Mon, 13 Jun 2011 19:32:40 +0000"  >&lt;p&gt;ok, thanks Ihara&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                            <attachment id="10102" name="lnet_selftest.sh" size="1710" author="kitwestneat" created="Wed, 2 Feb 2011 22:10:10 +0000"/>
                            <attachment id="10136" name="opreport-l-p-2.out" size="136055" author="ihara" created="Mon, 28 Feb 2011 23:03:16 +0000"/>
                            <attachment id="10137" name="opreport-l-p-3.out" size="130816" author="ihara" created="Tue, 1 Mar 2011 22:39:06 +0000"/>
                            <attachment id="10135" name="opreport-l-p.out" size="144245" author="ihara" created="Sun, 27 Feb 2011 23:27:46 +0000"/>
                            <attachment id="10134" name="opreport-l.out" size="112410" author="ihara" created="Sun, 27 Feb 2011 22:56:31 +0000"/>
                            <attachment id="10133" name="opreport.out" size="3170" author="ihara" created="Sun, 27 Feb 2011 22:55:58 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                    <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzw10f:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>10244</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>