<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 03:27:49 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-16530] OOM on routers with a faulty link/interface with 1 node</title>
                <link>https://jira.whamcloud.com/browse/LU-16530</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;An LNet router regularly crashes with OOM on a compute partition at CEA.&lt;br/&gt;
Each time, the router complains about a compute node (with RDMA timeout) and then crashes with OOM.&lt;br/&gt;
This issue seems to be linked to a defective compute rack or InfiniBand interface, but this should not cause the LNet router to crash.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Environment:&lt;/b&gt;&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;x32         infiniband    x12       infiniband    ~ x100
computes    &amp;lt;--o2ib1--&amp;gt;   routers   &amp;lt;--o2ib0--&amp;gt;   servers

peer_credits = 42
discovery = 0
health_sensitivity = 0
transaction_timeout = 50
retry_count = 0

router RAM amount : 48GB
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&lt;b&gt;Kdump information:&lt;/b&gt;&lt;br/&gt;
On the peer interface (lnet_peer_ni) to the faulty compute node: &lt;br/&gt;
tx credits: ~ -4500&lt;br/&gt;
I read the message tx queue (lpni_txq) and sorted the messages by source NID: for 69 NIDs, I counted 42 messages each (the peer_credits value) blocked in the tx queue.&lt;/p&gt;

&lt;p&gt;I found the peer interface of a server NID that has 42 messages blocked on tx:&lt;br/&gt;
peer buffer credit: ~ -17000&lt;br/&gt;
On the peer router queue (lpni_rtrq), messages seem to be linked to a different kib_conn (kib_rx.rx_conn) every 42 messages.&lt;br/&gt;
These connections are in the disconnected state, with ibc_list and ibc_sched_list not linked (poison value inside), but their QP and CQ are not freed.&lt;br/&gt;
A QP takes 512 pages and a CQ takes 256 pages, so ~ 3&#160;MB per connection.&lt;/p&gt;
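
&lt;p&gt;As a rough check of the per-connection figure above, here is a minimal sketch of the arithmetic. It assumes 4 KiB pages (not stated in this report) and reads the 48GB of router RAM as 48 GiB; all names are illustrative only and are not LNet symbols.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;
```python
PAGE_SIZE = 4096                   # assumption: 4 KiB pages
QP_PAGES = 512                     # pages held by one QP (from the kdump)
CQ_PAGES = 256                     # pages held by one CQ (from the kdump)
ROUTER_RAM_BYTES = 48 * 1024 ** 3  # assumption: 48GB read as 48 GiB


def bytes_per_connection(page_size=PAGE_SIZE):
    """Memory pinned by the QP and CQ of one unfreed connection."""
    return (QP_PAGES + CQ_PAGES) * page_size


def connections_to_fill(ram_bytes=ROUTER_RAM_BYTES):
    """Number of leaked connections whose QP and CQ would fill the RAM."""
    return ram_bytes // bytes_per_connection()


print(bytes_per_connection() / 1024 ** 2)  # 3.0 (MiB per connection)
print(connections_to_fill())               # 16384
```
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Under these assumptions, about 16384 leaked connections would account for all of the router&apos;s memory.&lt;/p&gt;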

&lt;p&gt;So it seems to be a connection leak.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Analysis&lt;/b&gt;&lt;br/&gt;
Here is what I understood from the lnet/ko2iblnd sources:&lt;/p&gt;

&lt;p&gt;1. The compute node has an issue and does not answer the router (or answers only partially).&lt;br/&gt;
2. Messages from the servers to the compute node are queued and the peer tx credits go negative.&lt;br/&gt;
3. When a server peer interface has more than 42 messages blocked on tx, peer_buffer_credits goes negative (by default, peer_credits == peer_buffer_credits). In that case, new messages from the server are queued in lpni_rtrq.&lt;br/&gt;
4. After that, the server is not able to send any messages to the router because peer_buffer_credits &amp;lt; 0. All messages sent from the server to the router time out (RDMA timeout).&lt;br/&gt;
5. The server disconnects from and reconnects to the routers, resets its tx credits, and resends its messages.&lt;br/&gt;
6. On the router, the old connection is marked as disconnected but not freed, because the old Rx messages are not cleaned up and still reference the old connection.&lt;/p&gt;

&lt;p&gt;Can someone help me with this?&lt;br/&gt;
I am not used to debugging LNet/ko2iblnd.&lt;/p&gt;</description>
                <environment>Production, Lustre 2.12.7 on router and computes, Lustre 2.12.9 + patches on servers&lt;br/&gt;
peer_credits = 42&lt;br/&gt;
infiniband (mofed 5.4 on router and on computes, mofed 4.7 on servers)</environment>
        <key id="74332">LU-16530</key>
            <summary>OOM on routers with a faulty link/interface with 1 node</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="cbordage">Cyril Bordage</assignee>
                                    <reporter username="eaujames">Etienne Aujames</reporter>
                        <labels>
                            <label>LNet</label>
                            <label>ko2iblnd</label>
                            <label>lnet</label>
                            <label>router</label>
                    </labels>
                <created>Fri, 3 Feb 2023 17:04:13 +0000</created>
                <updated>Wed, 26 Jul 2023 10:18:06 +0000</updated>
                                                                                <due></due>
                            <votes>0</votes>
                                    <watches>5</watches>
                                                                            <comments>
                            <comment id="361566" author="pjones" created="Fri, 3 Feb 2023 17:59:15 +0000"  >&lt;p&gt;Cyril&lt;/p&gt;

&lt;p&gt;Can you please advise?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="364517" author="eaujames" created="Wed, 1 Mar 2023 14:43:18 +0000"  >&lt;p&gt;Hi,&lt;/p&gt;

&lt;p&gt;We successfully reproduced the issue on a test filesystem with InfiniBand:&lt;/p&gt;


&lt;p&gt;&lt;b&gt;Configuration&lt;/b&gt;&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;1 MDT/MDS&lt;/li&gt;
	&lt;li&gt;5 OSS/OST&lt;/li&gt;
	&lt;li&gt;2 Clients&lt;/li&gt;
	&lt;li&gt;1 router IB &amp;lt;--&amp;gt; IB&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Lustre 2.12.7 LTS on all nodes.&lt;/p&gt;

&lt;p&gt;LNet configuration:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;options lnet lnet_peer_discovery_disabled=1 lnet_health_sensitivity=0
options ko2iblnd peer_credits=42
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;servers use o2ib50&lt;br/&gt;
clients use o2ib51&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Reproducer&lt;/b&gt;&lt;/p&gt;
&lt;ol&gt;
	&lt;li&gt;Mount clients&lt;/li&gt;
	&lt;li&gt;Do some IO with the clients (I used multithreaded fio)&lt;/li&gt;
	&lt;li&gt;Spam client1 with lnet ping from all the servers
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;clush -w@servers &quot;while true; do seq 1 100 | xargs -P100 -I{} lnetctl ping client1@o2ib51; done&quot;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;&lt;/li&gt;
	&lt;li&gt;Add a 2 s delay rule on client1 for incoming traffic from *@o2ib50
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;ssh client1 lctl net_delay_add -s &quot;*.o2ib50&quot; -d &quot;o2ib51&quot; -l 2 --rate=1
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;&lt;/li&gt;
&lt;/ol&gt;


&lt;p&gt;That&apos;s it!&lt;/p&gt;

&lt;p&gt;On the router:&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;The tx credits value is at its minimum for the client1 peer_ni (available_tx_credits = - peer_buffer_credit_param * server_nodes = - 42 * 5 = -210)&lt;/li&gt;
	&lt;li&gt;The available_rtr_credits (peer_buffer_credit) keeps decreasing on all the server peer_ni&lt;/li&gt;
	&lt;li&gt;The number of QPs keeps increasing&lt;/li&gt;
&lt;/ul&gt;
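
&lt;p&gt;The credit floor in the first item follows directly from the formula given there. A minimal sketch of that arithmetic, assuming every server queues up to its full peer buffer credit allowance toward the stalled client (the function name is hypothetical, not an LNet symbol):&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;
```python
def min_available_tx_credits(peer_buffer_credits, server_nodes):
    """Lowest tx credit value expected on the router for the stalled peer.

    Uses the formula quoted in the list above: each server can have up to
    peer_buffer_credits messages queued toward the stalled peer.
    """
    return -peer_buffer_credits * server_nodes


# Test setup from this comment: peer_credits=42 and 5 OSS servers.
print(min_available_tx_credits(42, 5))  # -210
```
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;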


&lt;p&gt;client2 is not able to communicate with the servers. All the Rx peer_ni of the servers are saturated on the router (peer_buffer_credit &amp;lt; 0).&lt;/p&gt;

&lt;p&gt;The servers keep trying to reconnect to the clients.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Remarks&lt;/b&gt;&lt;br/&gt;
A drop rule or an InfiniBand device reset on client1 does not reproduce the issue: communication errors are detected by the router, the peer_ni on the router is set to down (see the auto_down feature), and messages are dropped.&lt;/p&gt;

&lt;p&gt;I tried increasing peer_buffer_credit and setting lnet_health_sensitivity; this did not change the behavior.&lt;/p&gt;</comment>
                            <comment id="364531" author="cbordage" created="Wed, 1 Mar 2023 16:08:46 +0000"  >&lt;p&gt;Hello Etienne,&lt;/p&gt;

&lt;p&gt;Thank you for the reproducer. I will look into it when I am back in one week.&lt;/p&gt;</comment>
                            <comment id="367836" author="eaujames" created="Thu, 30 Mar 2023 08:18:58 +0000"  >&lt;p&gt;Hi Cyril,&lt;/p&gt;

&lt;p&gt;Have you had time to look into this issue?&lt;/p&gt;</comment>
                            <comment id="367837" author="cbordage" created="Thu, 30 Mar 2023 08:39:08 +0000"  >&lt;p&gt;Hello Etienne,&lt;/p&gt;

&lt;p&gt;I did take a look but then got pulled onto something else&#8230; Sorry about that. I plan to work on it again very soon.&lt;/p&gt;

&lt;p&gt;Thank you.&lt;/p&gt;</comment>
                            <comment id="370548" author="cbordage" created="Tue, 25 Apr 2023 15:25:10 +0000"  >&lt;p&gt;Hello Etienne,&lt;/p&gt;

&lt;p&gt;Do you have logs from your tests? Is your setup still available?&lt;/p&gt;

&lt;p&gt;Thank you.&lt;/p&gt;</comment>
                            <comment id="370797" author="eaujames" created="Thu, 27 Apr 2023 07:10:04 +0000"  >&lt;p&gt;Hi Cyril,&lt;/p&gt;

&lt;p&gt;I can&apos;t get you a debug_log (maybe some dmesg output if you want).&lt;br/&gt;
The setup is not available because the issue was reproduced on a router from the cluster (reproducing it needs a node with 2 IB interfaces on different networks).&lt;br/&gt;
I tried to reproduce this with tcp &amp;lt;-&amp;gt; ib, but without success.&lt;/p&gt;</comment>
                            <comment id="370801" author="cbordage" created="Thu, 27 Apr 2023 08:40:22 +0000"  >&lt;p&gt;Hello Etienne,&lt;/p&gt;

&lt;p&gt;yes, dmesg could be useful.&lt;/p&gt;

&lt;p&gt;Thank you.&lt;/p&gt;</comment>
                            <comment id="380148" author="eaujames" created="Wed, 26 Jul 2023 09:03:36 +0000"  >&lt;p&gt;Hi Cyril,&lt;/p&gt;

&lt;p&gt;Sorry for the delay.&lt;/p&gt;

&lt;p&gt;I have submitted 2 dmesg logs:&lt;/p&gt;
&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;&lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/49791/49791_vmcore-dmesg_hide_router272a_20221209_152308_1.txt&quot; title=&quot;vmcore-dmesg_hide_router272a_20221209_152308_1.txt attached to LU-16530&quot;&gt;vmcore-dmesg_hide_router272a_20221209_152308_1.txt&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt;&lt;/li&gt;
	&lt;li&gt;&lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/49792/49792_vmcore-dmesg_hide_router272a_20221220_213634_1.txt&quot; title=&quot;vmcore-dmesg_hide_router272a_20221220_213634_1.txt attached to LU-16530&quot;&gt;vmcore-dmesg_hide_router272a_20221220_213634_1.txt&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Those are logs from 2 crashes of the router in production.&lt;/p&gt;

&lt;p&gt;The situation was stabilized by changing the CPU of the faulty client node.&lt;/p&gt;</comment>
                            <comment id="380162" author="eaujames" created="Wed, 26 Jul 2023 10:18:06 +0000"  >&lt;p&gt;Here is some context for the logs:&lt;/p&gt;
&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;o2ibxx: storage network&lt;/li&gt;
	&lt;li&gt;o2ibyy: compute network&lt;/li&gt;
	&lt;li&gt;for  &lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/49792/49792_vmcore-dmesg_hide_router272a_20221220_213634_1.txt&quot; title=&quot;vmcore-dmesg_hide_router272a_20221220_213634_1.txt attached to LU-16530&quot;&gt;vmcore-dmesg_hide_router272a_20221220_213634_1.txt&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt; : BB.BB.ID8@o2ibyy is the faulty client node&lt;/li&gt;
	&lt;li&gt;for  &lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/49791/49791_vmcore-dmesg_hide_router272a_20221209_152308_1.txt&quot; title=&quot;vmcore-dmesg_hide_router272a_20221209_152308_1.txt attached to LU-16530&quot;&gt;vmcore-dmesg_hide_router272a_20221209_152308_1.txt&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt; : BB.BB.ID17@o2ibyy is the faulty client node&lt;/li&gt;
&lt;/ul&gt;
</comment>
                    </comments>
                    <attachments>
                            <attachment id="49791" name="vmcore-dmesg_hide_router272a_20221209_152308_1.txt" size="753513" author="eaujames" created="Tue, 25 Jul 2023 16:15:34 +0000"/>
                            <attachment id="49792" name="vmcore-dmesg_hide_router272a_20221220_213634_1.txt" size="1034079" author="eaujames" created="Tue, 25 Jul 2023 16:15:35 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i03bz3:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>