<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 03:24:50 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-16191] ksocklnd tries to open connections forever if there is a mismatch between conns_per_peer</title>
                <link>https://jira.whamcloud.com/browse/LU-16191</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;If two peers have different conns_per_peer values, the peer with the larger conns_per_peer will continually try to create additional connections to the peer with the lower conns_per_peer. The peer with the lower conns_per_peer rejects these connection requests.&lt;/p&gt;

&lt;p&gt;In this test &quot;n00&quot; is running 2.15 and &quot;n03&quot; is running 2.12. I&apos;m issuing a single ping from n00 to n03. There is no Lustre, so there is no LNet traffic other than this single ping:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;cassini-hosta:~ # pdsh -w n0[0,3] &apos;lctl --net tcp conn_list&apos;
n03: &amp;lt;no connections&amp;gt;
n00: &amp;lt;no connections&amp;gt;
cassini-hosta:~ # lctl ping 172.18.2.4@tcp
12345-0@lo
12345-172.18.2.4@tcp
cassini-hosta:~ # pdsh -w n0[0,3] &apos;lctl --net tcp conn_list&apos; | dshbak -c
----------------
n00
----------------
12345-172.18.2.4@tcp O[2]172.18.2.1-&amp;gt;172.18.2.4:988 332800/131072 nonagle
12345-172.18.2.4@tcp I[2]172.18.2.1-&amp;gt;172.18.2.4:988 332800/131072 nonagle
12345-172.18.2.4@tcp C[2]172.18.2.1-&amp;gt;172.18.2.4:988 332800/131072 nonagle
----------------
n03
----------------
12345-172.18.2.1@tcp I[3]s-lmo-gaz38b-&amp;gt;172.18.2.1:1021 332800/235392 nonagle
12345-172.18.2.1@tcp O[3]s-lmo-gaz38b-&amp;gt;172.18.2.1:1022 332800/235392 nonagle
12345-172.18.2.1@tcp C[3]s-lmo-gaz38b-&amp;gt;172.18.2.1:1023 332800/235392 nonagle
cassini-hosta:~ # lctl dk &amp;gt; /tmp/dk.log
cassini-hosta:~ # grep -c ksocknal_create_conn /tmp/dk.log
131
cassini-hosta:~ # sleep 30; lctl dk &amp;gt; /tmp/dk.log2
cassini-hosta:~ # grep -c ksocknal_create_conn /tmp/dk.log2
350
cassini-hosta:~ #
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;conns_per_peer on n00 (the 2.15 node) is left at the default of 0 and ends up at 4 because of link speed:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;cassini-hosta:~ # lnetctl net show --net tcp -v | grep -e net -e conns
net:
    - net type: tcp
              conns_per_peer: 4
cassini-hosta:~ #
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The debug log on n00 (with +net and +malloc) is mostly just this sequence repeated:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00000800:00000200:24.0:1663961292.792229:0:17357:0:(socklnd.c:946:ksocknal_create_conn()) ksocknal_send_hello conn 0000000089a5bbf4 returned 0
00000800:00000200:24.0:1663961292.792396:0:17357:0:(socklnd.c:958:ksocknal_create_conn()) ksocknal_recv_hello conn 0000000089a5bbf4 returned 114
00000800:00000200:24.0:1663961292.792405:0:17357:0:(socklnd.c:1237:ksocknal_create_conn()) Not creating conn 12345-172.18.2.4@tcp(0000000089a5bbf4) type 2: lost conn race
00000800:00000010:24.0:1663961292.792407:0:17357:0:(socklnd.c:1267:ksocknal_create_conn()) kfreed &apos;hello&apos;: 144 at 0000000051a6b08c (tot 1492135).
00000800:00000010:24.0:1663961292.792409:0:17357:0:(socklnd.c:1269:ksocknal_create_conn()) kfreed &apos;conn&apos;: 4712 at 0000000089a5bbf4 (tot 1487423).

cassini-hosta:~ # grep -c &apos;Not creating conn&apos; /tmp/dk.log2
50
cassini-hosta:~ # grep &apos;Not creating conn&apos; /tmp/dk.log2 | grep -c &apos;type 2:&apos;
50
cassini-hosta:~ #
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;If I repeat the test with conns_per_peer=1 on the 2.15 node then we don&apos;t see the extra calls to ksocknal_create_conn():&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;cassini-hosta:~ # lctl ping 172.18.2.4@tcp
12345-0@lo
12345-172.18.2.4@tcp
cassini-hosta:~ # lctl dk &amp;gt; /tmp/dk.log4
cassini-hosta:~ # sleep 30; lctl dk &amp;gt; /tmp/dk.log5
cassini-hosta:~ # grep -c ksocknal_create_conn /tmp/dk.log5
0
cassini-hosta:~ # lnetctl net show --net tcp -v | grep -e net -e conns
net:
    - net type: tcp
              conns_per_peer: 1
cassini-hosta:~ #
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</description>
                <environment></environment>
        <key id="72531">LU-16191</key>
            <summary>ksocklnd tries to open connections forever if there is a mismatch between conns_per_peer</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="ssmirnov">Serguei Smirnov</assignee>
                                    <reporter username="hornc">Chris Horn</reporter>
                        <labels>
                    </labels>
                <created>Mon, 26 Sep 2022 18:34:39 +0000</created>
                <updated>Fri, 17 Feb 2023 21:49:17 +0000</updated>
                            <resolved>Fri, 17 Feb 2023 21:49:16 +0000</resolved>
                                                    <fixVersion>Lustre 2.16.0</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>4</watches>
                                                                            <comments>
                            <comment id="347964" author="adilger" created="Mon, 26 Sep 2022 21:35:50 +0000"  >&lt;p&gt;It definitely doesn&apos;t make sense to try too hard to add extra connections. In addition to the conns_per_peer mismatch (which might also happen in the future if the speed-&amp;gt;conns heuristic changes), there could be TCP port limits if there are too many peers.&lt;/p&gt;

&lt;p&gt;If the second or later connection to a peer fails to be established then the node shouldn&apos;t loop trying to reconnect. Maybe retry with exponential backoff?&lt;/p&gt;</comment>
                            <comment id="347971" author="gerrit" created="Mon, 26 Sep 2022 23:49:38 +0000"  >&lt;p&gt;&quot;Serguei Smirnov &amp;lt;ssmirnov@whamcloud.com&amp;gt;&quot; uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/48664&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/48664&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-16191&quot; title=&quot;ksocklnd tries to open connections forever if there is a mismatch between conns_per_peer&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-16191&quot;&gt;&lt;del&gt;LU-16191&lt;/del&gt;&lt;/a&gt; socklnd: limit retries on conns_per_peer mismatch&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: ca622a6d036b5c2f3272184e10d30906370227ea&lt;/p&gt;</comment>
                            <comment id="349091" author="gerrit" created="Mon, 10 Oct 2022 05:39:02 +0000"  >&lt;p&gt;&quot;Oleg Drokin &amp;lt;green@whamcloud.com&amp;gt;&quot; merged in patch &lt;a href=&quot;https://review.whamcloud.com/c/fs/lustre-release/+/48664/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/c/fs/lustre-release/+/48664/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-16191&quot; title=&quot;ksocklnd tries to open connections forever if there is a mismatch between conns_per_peer&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-16191&quot;&gt;&lt;del&gt;LU-16191&lt;/del&gt;&lt;/a&gt; socklnd: limit retries on conns_per_peer mismatch&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: da893c6c9707ca3b2e7532d05f754fccf1cffc74&lt;/p&gt;</comment>
                            <comment id="349118" author="pjones" created="Mon, 10 Oct 2022 13:00:58 +0000"  >&lt;p&gt;Landed for 2.16&lt;/p&gt;</comment>
                            <comment id="349986" author="simmonsja" created="Tue, 18 Oct 2022 12:54:53 +0000"  >&lt;p&gt;With the latest 2.15 LTS with these patches we still hit this bug on our production machine:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[408577.737098] LNetError: 16137:0:(socklnd_cb.c:1985:ksocknal_connect()) ASSERTION( (wanted &amp;amp; (1 &amp;lt;&amp;lt; 3)) != 0 ) failed:
[408577.748695] LNetError: 16137:0:(socklnd_cb.c:1985:ksocknal_connect()) LBUG
[408577.756620] Kernel panic - not syncing: LBUG in interrupt.
[408577.765600] CPU: 14 PID: 16137 Comm: socknal_cd01 Kdump: loaded Tainted: P OE ------------ T 3.10.0-1160.71.1.el7.x86_64 #1
[408577.779586] Hardware name: Dell Inc. PowerEdge R640/0RGP26, BIOS 2.14.2 03/21/2022
[408577.788151] Call Trace:
[408577.791600] [&amp;lt;ffffffffa35865c9&amp;gt;] dump_stack+0x19/0x1b
[408577.797718] [&amp;lt;ffffffffa35802d1&amp;gt;] panic+0xe8/0x21f
[408577.803495] [&amp;lt;ffffffffc0d0b8bd&amp;gt;] lbug_with_loc+0x8d/0xa0 [libcfs]
[408577.810638] [&amp;lt;ffffffffc0e2ffbb&amp;gt;] ksocknal_connd+0xc4b/0xd80 [ksocklnd]
[408577.818201] [&amp;lt;ffffffffa2edaf20&amp;gt;] ? wake_up_state+0x20/0x20
[408577.824710] [&amp;lt;ffffffffc0e2f370&amp;gt;] ? ksocknal_thread_fini+0x30/0x30 [ksocklnd]
[408577.832775] [&amp;lt;ffffffffa2ec5f91&amp;gt;] kthread+0xd1/0xe0
[408577.838575] [&amp;lt;ffffffffa2ec5ec0&amp;gt;] ? insert_kthread_work+0x40/0x40
[408577.845571] [&amp;lt;ffffffffa3599ddd&amp;gt;] ret_from_fork_nospec_begin+0x7/0x21
[408577.852903] [&amp;lt;ffffffffa2ec5ec0&amp;gt;] ? insert_kthread_work+0x40/0x40
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="350000" author="hornc" created="Tue, 18 Oct 2022 14:14:21 +0000"  >&lt;p&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=jsimmons&quot; class=&quot;user-hover&quot; rel=&quot;jsimmons&quot;&gt;jsimmons&lt;/a&gt; That looks like a new issue. Are you saying it is a regression with this patch?&lt;/p&gt;</comment>
                            <comment id="350005" author="simmonsja" created="Tue, 18 Oct 2022 14:41:59 +0000"  >&lt;p&gt;We saw this before the patches of this work landed. Since it touched this area I was hoping it was fixed by this work. Do you think it&apos;s a separate issue?&lt;/p&gt;</comment>
                            <comment id="350007" author="hornc" created="Tue, 18 Oct 2022 15:10:28 +0000"  >&lt;p&gt;Yes, this ticket is about behavior when conns_per_peer does not match between two peers. It is not about any ASSERTION.&lt;/p&gt;</comment>
                            <comment id="350008" author="ssmirnov" created="Tue, 18 Oct 2022 15:14:44 +0000"  >&lt;p&gt;James, does the build you&apos;re using include this fix?: &lt;a href=&quot;https://review.whamcloud.com/47361&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/47361&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="350017" author="simmonsja" created="Tue, 18 Oct 2022 16:04:56 +0000"  >&lt;p&gt;Let me try that. Will report back.&lt;/p&gt;</comment>
                            <comment id="350027" author="simmonsja" created="Tue, 18 Oct 2022 17:28:29 +0000"  >&lt;p&gt;Doesn&apos;t apply for 2.12. The dynamic conns_per_peer doesn&apos;t exist for 2.12.&lt;/p&gt;</comment>
                            <comment id="363314" author="hornc" created="Fri, 17 Feb 2023 21:35:13 +0000"  >&lt;p&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=pjones&quot; class=&quot;user-hover&quot; rel=&quot;pjones&quot;&gt;pjones&lt;/a&gt; I think this can be closed&lt;/p&gt;</comment>
                            <comment id="363316" author="pjones" created="Fri, 17 Feb 2023 21:49:17 +0000"  >&lt;p&gt;ok thanks - feel free to close things yourself in future though - you&apos;ve got permissions &lt;img class=&quot;emoticon&quot; src=&quot;https://jira.whamcloud.com/images/icons/emoticons/smile.png&quot; height=&quot;16&quot; width=&quot;16&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i0319j:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>