<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:43:41 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-11415] ksocklnd performance improvement on 40Gbps ethernet</title>
                <link>https://jira.whamcloud.com/browse/LU-11415</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;I have recently been benchmarking newly set up Lustre servers with a 40Gbps Ethernet connection. Jumbo frames are enabled and the MTU is set to 9000 on the NICs of both the client and the server. The connection between client and server is simple: they sit under the same TOR switch, with no routers in between.&lt;/p&gt;

&lt;p&gt;First I used iperf3 to verify the throughput between client and server; it is stable at 30~32 Gbit/s in either direction. However, when I run lnet selftest, I usually see lower throughput than with iperf3, about 2500MiB/s.&lt;/p&gt;

&lt;p&gt;After speaking with Amir and Doug, I monitored the ksocklnd threads on both the client and the server. The problem we&apos;re seeing is that when lnet selftest runs a read test, only one ksocklnd thread consumes 100% CPU time while the other threads take no work; the write test is similar, except that it is a single server ksocklnd thread that is busy. The workload doesn&apos;t seem to be spread across all the threads in the pool.&lt;/p&gt;

&lt;p&gt;It is possible that a single thread is enough to handle all the traffic, so there is no need to hand work to the other threads, but it is also possible that there is a scheduling problem in the ksocklnd implementation. Doug mentioned that o2iblnd spreads the workload well.&lt;/p&gt;

</description>
                <environment></environment>
        <key id="53385">LU-11415</key>
            <summary>ksocklnd performance improvement on 40Gbps ethernet</summary>
                <type id="4" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11310&amp;avatarType=issuetype">Improvement</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="ashehata">Amir Shehata</assignee>
                                    <reporter username="Jinshan">Jinshan Xiong</reporter>
                        <labels>
                    </labels>
                <created>Fri, 21 Sep 2018 20:27:22 +0000</created>
                <updated>Mon, 8 Apr 2019 14:10:59 +0000</updated>
                            <resolved>Fri, 4 Jan 2019 05:28:13 +0000</resolved>
                                                    <fixVersion>Lustre 2.13.0</fixVersion>
                    <fixVersion>Lustre 2.12.1</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>8</watches>
                                                                            <comments>
                            <comment id="233884" author="jinshan" created="Fri, 21 Sep 2018 21:46:21 +0000"  >&lt;p&gt;This is typically what I see on the server side during a write test:&lt;/p&gt;

&lt;p&gt;&lt;tt&gt;top - 21:45:16 up 21:57, 1 user, load average: 11.02, 7.19, 5.00&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;Tasks: 1175 total, 3 running, 1172 sleeping, 0 stopped, 0 zombie&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;%Cpu(s): 0.0 us, 11.3 sy, 0.0 ni, 80.5 id, 6.9 wa, 0.0 hi, 1.3 si, 0.0 st&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;KiB Mem : 13174288+total, 56613536 free, 68320632 used, 6808716 buff/cache&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;KiB Swap: 4194300 total, 4194300 free, 0 used. 62550556 avail Mem&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;21319 root 20 0 0 0 0 R 100.0 0.0 15:11.80 socknal_sd00_02&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;2 root 20 0 0 0 0 S 9.2 0.0 4:09.04 kthreadd&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;1734 root 0 -20 0 0 0 R 5.6 0.0 2:25.93 spl_dynamic_tas&lt;/tt&gt;&lt;/p&gt;</comment>
                            <comment id="237134" author="ashehata" created="Sat, 17 Nov 2018 02:10:56 +0000"  >&lt;p&gt;I looked at the socklnd scheduling code and it&apos;s very similar to the o2iblnd code. There is a scheduler created per CPT. A thread is created for each CPU in the CPT (if the number of threads is not explicitly configured). When creating a connection, the CPT is derived from the peer NID using the lnet_cpt_of_nid() hashing function. The CPT is used to grab the appropriate scheduler and assign it to the connection. All operations on the connection use the assigned scheduler. This means that if we&apos;re running a 1-1 test, the same scheduler will always be used. If there are multiple threads in the scheduler then we should round robin over them.&lt;/p&gt;

&lt;p&gt;In the MLX o2iblnd case most of the work is offloaded to the HW. In the case of OPA, I believe the HFI driver has its own set of threads that do the work. But in socklnd all the work is done in the socklnd scheduler thread, which causes that thread to consume a lot of CPU if it&apos;s the only thread in the scheduler.&lt;/p&gt;

&lt;p&gt;In the test above how many CPUs are in each CPT?&lt;/p&gt;</comment>
                            <comment id="237152" author="jinshan" created="Sat, 17 Nov 2018 23:03:58 +0000"  >&lt;p&gt;Everything is at its default; there are no settings in `lnet.conf` other than specifying the NIC for Lustre.&lt;/p&gt;

&lt;p&gt;This node has 2 NUMA nodes with 24 cores. What would be the recommended CPT configuration?&lt;/p&gt;</comment>
                            <comment id="237218" author="ashehata" created="Mon, 19 Nov 2018 23:24:56 +0000"  >&lt;p&gt;How many socklnd threads were started?&lt;/p&gt;</comment>
                            <comment id="237229" author="jinshan" created="Tue, 20 Nov 2018 03:57:29 +0000"  >&lt;p&gt;These are all the threads:&lt;/p&gt;

&lt;p&gt;&lt;tt&gt;# ps ax | grep sock&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;43583 pts/0 S+ 0:00 grep --color=auto sock&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;44194 ? S 0:00 [socknal_cd00]&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;44195 ? S 0:00 [socknal_cd01]&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;44196 ? S 0:00 [socknal_cd02]&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;44197 ? S 0:00 [socknal_cd03]&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;44198 ? S 0:03 [socknal_reaper]&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;44199 ? S 42:59 [socknal_sd00_00]&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;44200 ? S 0:00 [socknal_sd00_01]&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;44201 ? S 584:14 [socknal_sd00_02]&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;44202 ? S 0:00 [socknal_sd00_03]&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;44203 ? S 0:00 [socknal_sd00_04]&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;44204 ? S 0:00 [socknal_sd00_05]&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;44205 ? S 0:27 [socknal_sd01_00]&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;44206 ? S 0:00 [socknal_sd01_01]&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;44207 ? S 0:00 [socknal_sd01_02]&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;44208 ? S 0:26 [socknal_sd01_03]&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;44209 ? S 0:00 [socknal_sd01_04]&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;44210 ? S 0:00 [socknal_sd01_05]&lt;/tt&gt;&lt;/p&gt;</comment>
                            <comment id="237294" author="ashehata" created="Wed, 21 Nov 2018 02:31:05 +0000"  >&lt;p&gt;I think I found the issue in the code. In socklnd.c:ksocknal_choose_scheduler_locked():&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
select_sched:
        sched = &amp;amp;info-&amp;gt;ksi_scheds[0];
        /*
         * NB: it&apos;s safe so far, but info-&amp;gt;ksi_nthreads could be changed
         * at runtime when we have dynamic LNet configuration, then we
         * need to take care of this.
         */
        CDEBUG(D_NET, &lt;span class=&quot;code-quote&quot;&gt;&quot;info-&amp;gt;ksi_nthreads = %d \n&quot;&lt;/span&gt;, info-&amp;gt;ksi_nthreads);
        &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; (i = 1; i &amp;lt; info-&amp;gt;ksi_nthreads; i++) {
                CDEBUG(D_NET, &lt;span class=&quot;code-quote&quot;&gt;&quot;sched-&amp;gt;kss_nconns = %d info-&amp;gt;ksi_scheds[%d].kss_nconns = %d\n&quot;&lt;/span&gt;,
                       sched-&amp;gt;kss_nconns, i, info-&amp;gt;ksi_scheds[i].kss_nconns);
                &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (sched-&amp;gt;kss_nconns &amp;gt; info-&amp;gt;ksi_scheds[i].kss_nconns)
                        sched = &amp;amp;info-&amp;gt;ksi_scheds[i];
        }

        &lt;span class=&quot;code-keyword&quot;&gt;return&lt;/span&gt; sched;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;When running an lnet_selftest script, the connection is created once and is used for the duration of the run. As a result, the same scheduler thread is used continuously for that connection. This algorithm works when you have multiple connections, because each connection will use a different scheduler thread. But if most of the traffic is between the same two peers, we never change the scheduler thread and we run into this bottleneck.&lt;/p&gt;

&lt;p&gt;It seems we ought to balance the traffic not only at connection creation time, but also whenever a new message is received. I&apos;m going to create a patch within the next day or so. Would you be able to test it and see if it performs better?&lt;/p&gt;</comment>
                            <comment id="237309" author="jinshan" created="Wed, 21 Nov 2018 04:13:38 +0000"  >&lt;p&gt;I will be happy to try that out. Thanks.&lt;/p&gt;</comment>
                            <comment id="237402" author="ashehata" created="Fri, 23 Nov 2018 00:18:53 +0000"  >&lt;p&gt;Summarized the issue and the proposed solution here:&lt;br/&gt;
&lt;a href=&quot;https://wiki.whamcloud.com/display/LNet/Socklnd+Scheduler+Improvements&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://wiki.whamcloud.com/display/LNet/Socklnd+Scheduler+Improvements&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Will try and get a patch in soon.&lt;/p&gt;</comment>
                            <comment id="237580" author="gerrit" created="Wed, 28 Nov 2018 02:10:38 +0000"  >&lt;p&gt;Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/33740&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/33740&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11415&quot; title=&quot;ksocklnd performance improvement on 40Gbps ethernet&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11415&quot;&gt;&lt;del&gt;LU-11415&lt;/del&gt;&lt;/a&gt; socklnd: improve scheduling algorithm&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 69fbcfc89f165ba4238286afcbcd3a2059615b4f&lt;/p&gt;</comment>
                            <comment id="237581" author="ashehata" created="Wed, 28 Nov 2018 02:12:31 +0000"  >&lt;p&gt;Hey Jinshan, can you try this patch? Run a 1-1 selftest and monitor the CPU usage of the socknal_sd_* threads.&lt;br/&gt;
Thanks&lt;/p&gt;</comment>
                            <comment id="239353" author="gerrit" created="Fri, 4 Jan 2019 04:48:25 +0000"  >&lt;p&gt;Oleg Drokin (green@whamcloud.com) merged in patch &lt;a href=&quot;https://review.whamcloud.com/33740/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/33740/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11415&quot; title=&quot;ksocklnd performance improvement on 40Gbps ethernet&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11415&quot;&gt;&lt;del&gt;LU-11415&lt;/del&gt;&lt;/a&gt; socklnd: improve scheduling algorithm&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 89df5e712ffd40064f1d4ce2f00f9156f68a2262&lt;/p&gt;</comment>
                            <comment id="239389" author="pjones" created="Fri, 4 Jan 2019 05:28:13 +0000"  >&lt;p&gt;Landed for 2.13&lt;/p&gt;</comment>
                            <comment id="242688" author="gerrit" created="Mon, 25 Feb 2019 17:23:49 +0000"  >&lt;p&gt;Minh Diep (mdiep@whamcloud.com) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/34299&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/34299&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11415&quot; title=&quot;ksocklnd performance improvement on 40Gbps ethernet&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11415&quot;&gt;&lt;del&gt;LU-11415&lt;/del&gt;&lt;/a&gt; socklnd: improve scheduling algorithm&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_12&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 35864dd4370c4ece26198033beced450ed6443d0&lt;/p&gt;</comment>
                            <comment id="245368" author="gerrit" created="Mon, 8 Apr 2019 06:27:39 +0000"  >&lt;p&gt;Oleg Drokin (green@whamcloud.com) merged in patch &lt;a href=&quot;https://review.whamcloud.com/34299/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/34299/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11415&quot; title=&quot;ksocklnd performance improvement on 40Gbps ethernet&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11415&quot;&gt;&lt;del&gt;LU-11415&lt;/del&gt;&lt;/a&gt; socklnd: improve scheduling algorithm&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_12&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: ec964395b249087c28e82b1afa1db4a7c9322196&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10010">
                    <name>Duplicate</name>
                                            <outwardlinks description="duplicates">
                                                        </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i002vj:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                </customfields>
    </item>
</channel>
</rss>