<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:56:18 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-5995] Apparent scale issue with 2.5.2 clients to 2.5.3 servers</title>
                <link>https://jira.whamcloud.com/browse/LU-5995</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;I am running into a significant performance issue when running IOR, primarily against the file system described in the environment, but it is being observed on other global Lustre file systems as well.&lt;/p&gt;

&lt;p&gt;On the smaller of the two systems, with a single router, I am getting near wire speed running IOR on 32 nodes with 8 threads. On the larger system I am getting only approximately 10% of what I expect to see going through the routers, and where I&apos;m expecting to see ~10GB/sec to the file system I am only seeing roughly 6% of that performance.&lt;/p&gt;

&lt;p&gt;I have run both netperf and lnet_selftest from the routers to the servers, and in the case of netperf I am seeing basically wire speed. lnet_selftest shows approximately the same result when using a concurrency of 8, but with a concurrency of 1 I am seeing about half.&lt;/p&gt;

&lt;p&gt;Loads on the servers and routers are insignificant. I have increased the credits on both the servers and gateways with no observable impact (although the router changes have been implemented on only 2 gateways due to it being a production environment). Credit changes have not been made on the client side (again because this is a production system).&lt;/p&gt;

&lt;p&gt;I am unsure what to try next, and I do not know whether there is a known compatibility issue between the client and server versions.&lt;/p&gt;

&lt;p&gt;Any help would be greatly appreciated. &lt;/p&gt;</description>
                <environment>Compute Clusters - Sun X6275 blades, QDR IB torus &lt;br/&gt;
Cluster 1: Toss-2.2-6, Lustre 2.4.2-17 2854 nodes. 12 IB/10gigE Routers Cluster 2: Toss-2.1.1.4, Lustre 2.4.0-21 90 nodes, 1 IB/10gigE Routers&lt;br/&gt;
Storage Cluster - Dell 720 servers, NetApp 5524/60 dual homed IB/10gigE Toss2.2.1.1, Lustre 2.5.3-2 - mixed mode ldiskfs MDT/zfs OSTs</environment>
        <key id="27818">LU-5995</key>
            <summary>Apparent scale issue with 2.5.2 clients to 2.5.3 servers</summary>
                <type id="9" iconUrl="https://jira.whamcloud.com/images/icons/issuetypes/undefined.png">Question/Request</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="green">Oleg Drokin</assignee>
                                    <reporter username="jamervi">Joe Mervini</reporter>
                        <labels>
                    </labels>
                <created>Fri, 5 Dec 2014 20:20:32 +0000</created>
                <updated>Tue, 7 Jun 2016 15:38:21 +0000</updated>
                                            <version>Lustre 2.4.2</version>
                    <version>Lustre 2.5.3</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>5</watches>
                                                                            <comments>
                            <comment id="101670" author="pjones" created="Tue, 16 Dec 2014 01:03:30 +0000"  >&lt;p&gt;Oleg&lt;/p&gt;

&lt;p&gt;What do you advise here?&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="101687" author="green" created="Tue, 16 Dec 2014 06:17:40 +0000"  >&lt;p&gt;There should not be any compatibility issues between 2.4 and 2.5 versions.&lt;/p&gt;

&lt;p&gt;I guess you have already looked into the router statistics to ensure the load is evenly distributed across all the routers, and not going through only one or two of them and making them the bottleneck of the entire thing (same goes for your file striping, but I guess you have checked that too already?)&lt;br/&gt;
If you get roughly 10% of the expected performance on the large configuration with 12 routers but the expected performance on the small system with just one router, I think this might be the first thing to double-check and rule out.&lt;/p&gt;

&lt;p&gt;When you say &quot;lnet_selftest shows approximately the same result&quot;, do you mean that you get wire speed in such a test as well?&lt;/p&gt;</comment>
                            <comment id="101745" author="jamervi" created="Tue, 16 Dec 2014 19:29:36 +0000"  >&lt;p&gt;The data through the (12) routers was relatively balanced. I was monitoring the router stats during all my testing. The lnet_selftest performance was consistent between the 2 systems: on a single router the results were similar, and the average went up on the larger system when additional routers were added to the test (with lst running only between the router nodes themselves and the Lustre servers, as opposed to running lst on clients through the routers).&lt;/p&gt;

&lt;p&gt;One thing we were wondering about is credits and peer_credits and whether their settings might be a factor. To be honest we haven&apos;t really monkeyed with the module load options in years, with the exception of adding new networks and routers. The load-time options we are using are shown below:&lt;/p&gt;

&lt;p&gt;# Device aliases&lt;br/&gt;
alias ib0  ib_ipoib&lt;br/&gt;
alias ib1  ib_ipoib&lt;/p&gt;

&lt;p&gt;###############################################################################&lt;br/&gt;
# LNET options&lt;/p&gt;

&lt;p&gt;options lnet tiny_router_buffers=4096&lt;br/&gt;
options lnet small_router_buffers=65536&lt;br/&gt;
options lnet large_router_buffers=4096&lt;br/&gt;
options lnet live_router_check_interval=60&lt;br/&gt;
options lnet dead_router_check_interval=60&lt;br/&gt;
options lnet check_routers_before_use=1&lt;/p&gt;

&lt;p&gt;# TCP LND options&lt;/p&gt;

&lt;p&gt;# OpenIB LND options&lt;br/&gt;
options ko2iblnd timeout=100&lt;/p&gt;
</comment>
                            <comment id="102147" author="ashehata" created="Sat, 20 Dec 2014 17:28:56 +0000"  >&lt;p&gt;Here are the tunings that might affect performance:&lt;br/&gt;
1. router buffers: increasing these will increase the number of messages a router can handle&lt;br/&gt;
2. peer_buffer_credits: # router buffer credits per peer. These are receive credits and apply to the router only. Increasing these will increase the number of messages a router can handle simultaneously&lt;br/&gt;
3. credits (default = 256): # concurrent sends to all peers&lt;br/&gt;
4. peer_credits (default = 8): # concurrent sends to 1 peer (this overrides the peer_buffer_credits)&lt;/p&gt;

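&lt;p&gt;As a rough sketch only, a modprobe fragment that raises the credit settings listed above might look like the lines below. The values are placeholders for illustration, not tested recommendations, and the parameter names should be confirmed against the ko2iblnd and ksocklnd modules actually installed (for example with modinfo):&lt;/p&gt;

&lt;p&gt;# Illustrative values only (assumed, not measured)&lt;br/&gt;
options ko2iblnd credits=1024 peer_credits=16 peer_buffer_credits=32&lt;br/&gt;
options ksocklnd credits=1024 peer_credits=16 peer_buffer_credits=32&lt;/p&gt;
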
&lt;p&gt;You can try manipulating these on clients and servers to see if they increase performance.&lt;/p&gt;</comment>
                            <comment id="102152" author="liang" created="Sun, 21 Dec 2014 02:39:49 +0000"  >&lt;p&gt;Hi Joe, did you try lnet_selftest with all 32 clients and all servers (OSSs), or at least 12 servers? What was the average server performance in that case? Also, when you ran IOR on the large system, were IO requests from clients evenly spread over all servers?&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10490" key="com.atlassian.jira.plugin.system.customfieldtypes:datepicker">
                        <customfieldname>End date</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>Fri, 19 Dec 2014 20:20:32 +0000</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                        <customfield id="customfield_10030" key="com.atlassian.jira.plugin.system.customfieldtypes:labels">
                        <customfieldname>Epic/Theme</customfieldname>
                        <customfieldvalues>
                                        <label>Performance</label>
    
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzx20n:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>16720</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                <customfield id="customfield_10493" key="com.atlassian.jira.plugin.system.customfieldtypes:datepicker">
                        <customfieldname>Start date</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>Fri, 5 Dec 2014 20:20:32 +0000</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                    </customfields>
    </item>
</channel>
</rss>