<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 03:11:36 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-14652] LNet router stuck generating RDMA tx timeout</title>
                <link>https://jira.whamcloud.com/browse/LU-14652</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Hello, we are still seeing weird behavior with LNet over IB with 2.12+. We have tried to upgrade clients and routers to 2.13 and then 2.14 without success. We went back to 2.12.6 LTS, but we are still seeing occasional kiblnd errors and timeouts. The IB fabrics are healthy, sometimes a little bit of congestion but no discards. I&apos;m starting to suspect a deeper problem with LNet/ko2iblnd, where sometimes credits are exhausted? We didn&apos;t have this with 2.10. To me, the problem seems similar to the one reported in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14026&quot; title=&quot;symptoms of message loss or corruption after upgrading routers to lustre 2.12.5&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14026&quot;&gt;LU-14026&lt;/a&gt; by LLNL.&lt;/p&gt;

&lt;p&gt;We do have the following setup:&lt;/p&gt;

&lt;p&gt;Fir (serves /scratch) o2ib7 &#8212; 8 x lnet routers (IB) &#8212; Sherlock v3 (o2ib3)&lt;/p&gt;

&lt;p&gt;Last night, one of the 8 routers (sh03-fir06) started to have problems. I&apos;ve taken traces so that we can investigate.&lt;/p&gt;

&lt;p&gt;Router NIDs:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@sh03-fir06 ~]# lctl list_nids
10.51.0.116@o2ib3
10.0.10.237@o2ib7
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;LNet config on the router (we have discovery enabled and use a few Multi-Rail nodes on o2ib3):&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@sh03-fir06 ~]# lnetctl global show
global:
    numa_range: 0
    max_intf: 200
    discovery: 1
    drop_asym_route: 0
    retry_count: 0
    transaction_timeout: 50
    health_sensitivity: 0
    recovery_interval: 1
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Fir Lustre servers on o2ib7 started to exhibit the following errors:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;fir-io7-s2: Apr 28 23:25:40 fir-io7-s2 kernel: LustreError: 64485:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9488213ef800
fir-io7-s2: Apr 28 23:25:40 fir-io7-s2 kernel: LustreError: 64485:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9488213ef800
fir-io7-s2: Apr 28 23:25:40 fir-io7-s2 kernel: LustreError: 64485:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9480cfc4c800
fir-io7-s2: Apr 28 23:25:59 fir-io7-s2 kernel: LustreError: 64485:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff94643457fc00
fir-io7-s2: Apr 28 23:25:59 fir-io7-s2 kernel: LustreError: 64485:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff94643457fc00
fir-io7-s2: Apr 28 23:26:05 fir-io7-s2 kernel: LustreError: 64485:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff949e65fc2400
fir-io7-s2: Apr 28 23:26:05 fir-io7-s2 kernel: LustreError: 64485:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff949e65fc2400
fir-io7-s2: Apr 28 23:26:11 fir-io7-s2 kernel: LustreError: 64485:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff945be5538400
fir-io7-s2: Apr 28 23:26:11 fir-io7-s2 kernel: LustreError: 64485:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff94886b2a9000
fir-io7-s2: Apr 28 23:26:17 fir-io7-s2 kernel: LustreError: 64485:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff94428fd56000
fir-io7-s2: Apr 28 23:26:24 fir-io7-s2 kernel: LustreError: 64485:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff947e46cbdc00
fir-io7-s2: Apr 28 23:26:24 fir-io7-s2 kernel: LustreError: 64485:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff947e46cbdc00
fir-io7-s2: Apr 28 23:26:43 fir-io7-s2 kernel: LustreError: 64485:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9462486d2400
fir-io7-s2: Apr 28 23:26:51 fir-io7-s2 kernel: LustreError: 68967:0:(ldlm_lib.c:3279:target_bulk_io()) @@@ Reconnect on bulk WRITE  req@ffff949f1a1a3850 x1696974434037056/t0(0) o4-&amp;gt;12f8a639-7e97-4157-8d89-6e1e00a728eb@10.51.13.20@o2ib3:363/0 lens 488/448 e 0 to 0 dl 1619677703 ref 1 fl Interpret:/0/0 rc 0/0
fir-io6-s1: Apr 28 23:24:11 fir-io6-s1 kernel: LustreError: 24015:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff918fc3b44400
fir-io6-s1: Apr 28 23:24:11 fir-io6-s1 kernel: LustreError: 24015:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff918fc3b44400
fir-io6-s1: Apr 28 23:24:18 fir-io6-s1 kernel: LustreError: 24015:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff91913b76a800
fir-io6-s1: Apr 28 23:24:30 fir-io6-s1 kernel: LustreError: 24015:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff912be4f95000
fir-io6-s1: Apr 28 23:24:30 fir-io6-s1 kernel: LustreError: 24015:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff912be4f95000
fir-io6-s1: Apr 28 23:24:36 fir-io6-s1 kernel: LustreError: 24015:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff911f0108a400
fir-io6-s1: Apr 28 23:24:55 fir-io6-s1 kernel: LustreError: 24015:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff914885d3bc00
fir-io6-s1: Apr 28 23:24:55 fir-io6-s1 kernel: LustreError: 24015:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff914885d3bc00
fir-io6-s1: Apr 28 23:25:08 fir-io6-s1 kernel: LNet: 24015:0:(o2iblnd_cb.c:2081:kiblnd_close_conn_locked()) Closing conn to 10.0.10.237@o2ib7: error -110(waiting)
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;One interesting thing about that router, vs. the 7 others, is that it had a lot of refs (&amp;gt; 3000) in {{/sys/kernel/debug/lnet/nis}} and tx stuck at -367. The high refs count is similar to an issue we noticed with 2.14 routers reported in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14584&quot; title=&quot;LNet: 2 CPTs on a single NUMA node instead of one&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14584&quot;&gt;LU-14584&lt;/a&gt;, and we thought that maybe this was a CPT issue. Here, it happened with Lustre 2.12.6 routers.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;nid                      status alive refs peer  rtr   max    tx   min
0@lo                       down     0    2    0    0     0     0     0
0@lo                       down     0    0    0    0     0     0     0
0@lo                       down     0    0    0    0     0     0     0
0@lo                       down     0    0    0    0     0     0     0
0@lo                       down     0    0    0    0     0     0     0
0@lo                       down     0    0    0    0     0     0     0
0@lo                       down     0    0    0    0     0     0     0
0@lo                       down     0    0    0    0     0     0     0
10.51.0.116@o2ib3            up     0    1    8    0    64    64    47
10.51.0.116@o2ib3            up     0    0    8    0    64    64    47
10.51.0.116@o2ib3            up     0    0    8    0    64    64    46
10.51.0.116@o2ib3            up     0    0    8    0    64    64    46
10.51.0.116@o2ib3            up     0    0    8    0    64    64    41
10.51.0.116@o2ib3            up     0    0    8    0    64    64    48
10.51.0.116@o2ib3            up     0 3493    8    0    64  -367  -367
10.51.0.116@o2ib3            up     0    0    8    0    64    64    46
10.0.10.237@o2ib7            up     0 3318    8    0    64    64    48
10.0.10.237@o2ib7            up     0 3062    8    0    64    64    50
10.0.10.237@o2ib7            up     0 6202    8    0    64    64    47
10.0.10.237@o2ib7            up     0 3032    8    0    64    64    49
10.0.10.237@o2ib7            up     0 6082    8    0    64    64    48
10.0.10.237@o2ib7            up     0 5467    8    0    64    64    48
10.0.10.237@o2ib7            up     0 3115    8    0    64    64    50
10.0.10.237@o2ib7            up     0 3206    8    0    64    64    48
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
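&lt;p&gt;The stuck NI is easy to spot because its tx column goes negative; a quick check on each router (just a sketch, assuming tx is the 8th whitespace-separated column of the nis file as shown above):&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# print any NI whose tx credits have gone negative
awk &apos;$8 &amp;lt; 0&apos; /sys/kernel/debug/lnet/nis
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;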
&lt;p&gt;I took traces on this router at the time of the problem. I&apos;m attaching a zip file &lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/38453/38453_sh03-fir06-20210428.zip&quot; title=&quot;sh03-fir06-20210428.zip attached to LU-14652&quot;&gt;sh03-fir06-20210428.zip&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt; with the output of:&lt;/p&gt;
&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;lnetctl stats show&lt;/li&gt;
	&lt;li&gt;lnetctl peer show&lt;/li&gt;
	&lt;li&gt;cat /sys/kernel/debug/lnet/nis&lt;/li&gt;
	&lt;li&gt;cat /sys/kernel/debug/lnet/peers&lt;/li&gt;
	&lt;li&gt;kernel logs&lt;/li&gt;
	&lt;li&gt;short dk logs with +net enabled, just in case that would show something interesting&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Rebooting the router fixed the problem.&lt;/p&gt;</description>
                <environment>CentOS 7.9 (3.10.0-1160.24.1.el7.x86_64) on routers, Lustre 2.12.6</environment>
        <key id="63981">LU-14652</key>
            <summary>LNet router stuck generating RDMA tx timeout</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="2" iconUrl="https://jira.whamcloud.com/images/icons/priorities/critical.svg">Critical</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="ssmirnov">Serguei Smirnov</assignee>
                                    <reporter username="sthiell">Stephane Thiell</reporter>
                        <labels>
                    </labels>
                <created>Thu, 29 Apr 2021 16:34:11 +0000</created>
                <updated>Tue, 1 Feb 2022 02:06:54 +0000</updated>
                                            <version>Lustre 2.12.6</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>11</watches>
                                                                            <comments>
                            <comment id="300221" author="pjones" created="Fri, 30 Apr 2021 16:47:02 +0000"  >&lt;p&gt;Serguei&lt;/p&gt;

&lt;p&gt;Could you please advise?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="300295" author="ssmirnov" created="Fri, 30 Apr 2021 23:19:56 +0000"  >&lt;p&gt;Hi Stephane,&lt;/p&gt;

&lt;p&gt;Could you please provide the output of&#160;&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
lnetctl net show -v 4&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;from the router, one of the servers, and one of the clients? It looks like the number of credits configured for the router may be low.&lt;/p&gt;

&lt;p&gt;Also, what is the output of&#160;&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
lnetctl global show &lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;on clients and servers?&#160; In the 20 seconds captured in the logs there are a few transactions expiring. Perhaps transaction_timeout can be increased, but first let&apos;s check the situation with the credits.&lt;/p&gt;

&lt;p&gt;Thanks,&lt;/p&gt;

&lt;p&gt;Serguei.&lt;/p&gt;</comment>
                            <comment id="300297" author="sthiell" created="Fri, 30 Apr 2021 23:50:04 +0000"  >&lt;p&gt;Hi Serguei,&lt;/p&gt;

&lt;p&gt;I&apos;m attaching the output of &lt;tt&gt;lnetctl net show -v 4&lt;/tt&gt;:&lt;/p&gt;
&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;from the servers as &lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/38467/38467_net_show_servers.txt&quot; title=&quot;net_show_servers.txt attached to LU-14652&quot;&gt;net_show_servers.txt&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt;&lt;/li&gt;
	&lt;li&gt;from the routers as &lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/38468/38468_net_show_routers.txt&quot; title=&quot;net_show_routers.txt attached to LU-14652&quot;&gt;net_show_routers.txt&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt;&lt;/li&gt;
	&lt;li&gt;from a 2.13 client as &lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/38469/38469_net_show_client_2.13.txt&quot; title=&quot;net_show_client_2.13.txt attached to LU-14652&quot;&gt;net_show_client_2.13.txt&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt;&lt;/li&gt;
	&lt;li&gt;from a 2.12.6 client as &lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/38470/38470_net_show_client_2.12.txt&quot; title=&quot;net_show_client_2.12.txt attached to LU-14652&quot;&gt;net_show_client_2.12.txt&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Output of &lt;tt&gt;lnetctl global show&lt;/tt&gt;:&lt;/p&gt;
&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;on servers:
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# clush -w@mds,@oss -b &apos;lnetctl global show&apos;
---------------
fir-io[1-8]-s[1-2],fir-md1-s[1-4] (20)
---------------
global:
    numa_range: 0
    max_intf: 200
    discovery: 1
    drop_asym_route: 0
    retry_count: 2
    transaction_timeout: 50
    health_sensitivity: 100
    recovery_interval: 1
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;on routers:
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;clush -w sh03-fir[01-08] -b &quot;lnetctl global show&quot;
---------------
sh03-fir[01-08] (8)
---------------
global:
    numa_range: 0
    max_intf: 200
    discovery: 1
    drop_asym_route: 0
    retry_count: 0
    transaction_timeout: 50
    health_sensitivity: 0
    recovery_interval: 1
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;on 2.13 clients (1,294 total):
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;global:                                                                         
    numa_range: 0                                                               
    max_intf: 200                                                               
    discovery: 1                                                                
    drop_asym_route: 0                                                          
    retry_count: 0                                                              
    transaction_timeout: 50                                                     
    health_sensitivity: 0                                                       
    recovery_interval: 5                                                        
    router_sensitivity: 100 
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;on 2.12.6 clients (370 total):
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;global:                                                                         
    numa_range: 0                                                               
    max_intf: 200                                                               
    discovery: 1                                                                
    drop_asym_route: 0                                                          
    retry_count: 0                                                              
    transaction_timeout: 50                                                     
    health_sensitivity: 0                                                       
    recovery_interval: 1 
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;on 2.14 clients (only 6):
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;global:                                                                         
    numa_range: 0                                                               
    max_intf: 200                                                               
    discovery: 1                                                                
    drop_asym_route: 0                                                          
    retry_count: 0                                                              
    transaction_timeout: 50                                                     
    health_sensitivity: 0                                                       
    recovery_interval: 1                                                        
    router_sensitivity: 100                                                     
    lnd_timeout: 49                                                             
    response_tracking: 3                                                        
    recovery_limit: 0 
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;As you can see, we are already making sure that transaction_timeout is 50 on all clients and servers, even with 2.13 clients where it was 10 by default. But maybe you will see some other problems here? Let me know if you need more data from us. Thanks!&lt;/p&gt;</comment>
                            <comment id="300301" author="ssmirnov" created="Sat, 1 May 2021 02:09:13 +0000"  >&lt;p&gt;Stephane,&lt;/p&gt;

&lt;p&gt;On servers:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
 retry_count: 2
 transaction_timeout: 50&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;This means that the resulting lnd_timeout is 50/3 seconds, while on other nodes it is 50. If you want to keep the retry_count at 2, please increase transaction_timeout to 150 on the servers; that will make sure every node is using the same lnd_timeout.&lt;/p&gt;
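&lt;p&gt;In other words, lnd_timeout = transaction_timeout / (retry_count + 1). Something like the following on each server should do it (please double-check the lnetctl set syntax on your version):&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
# lnd_timeout = transaction_timeout / (retry_count + 1)
# servers now: 50 / (2 + 1) seconds; other nodes: 50 / 1 = 50 seconds
lnetctl set transaction_timeout 150    # 150 / (2 + 1) = 50 seconds&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;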

&lt;p&gt;Thanks,&lt;/p&gt;

&lt;p&gt;Serguei.&lt;/p&gt;</comment>
                            <comment id="300302" author="sthiell" created="Sat, 1 May 2021 05:46:40 +0000"  >&lt;p&gt;Thanks for catching that, Serguei! Ok, I&apos;ll try either that or disable lnet health and set retry_count to 0 like for the other nodes.&lt;/p&gt;</comment>
                            <comment id="300304" author="sthiell" created="Sat, 1 May 2021 06:19:25 +0000"  >&lt;p&gt;I fixed that issue:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;---------------
fir-io[1-8]-s[1-2],fir-md1-s[1-4] (20)
---------------
global:
    numa_range: 0
    max_intf: 200
    discovery: 1
    drop_asym_route: 0
    retry_count: 0
    transaction_timeout: 50
    health_sensitivity: 0
    recovery_interval: 1
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;But shortly after that change, another router started to exhibit a similar problem (sh02-fir04). This time it&apos;s a router between the same servers (o2ib7, HDR) and o2ib2 (Sherlock v2, another fabric, EDR-based), but it&apos;s the same idea.&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@sh02-fir04 ~]# lctl list_nids
10.50.0.114@o2ib2
10.0.10.227@o2ib7

[root@sh02-fir04 ~]# cat /sys/kernel/debug/lnet/nis 
nid                      status alive refs peer  rtr   max    tx   min
0@lo                       down     0    2    0    0     0     0     0
0@lo                       down     0    0    0    0     0     0     0
10.50.0.114@o2ib2            up     0    2    8    0   128   127    64
10.50.0.114@o2ib2            up     0    1    8    0   128   127    57
10.0.10.227@o2ib7            up     0   29    8    0   128   127    64
10.0.10.227@o2ib7            up     0  379    8    0   128   128    56
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;and the refs count is growing.&lt;/p&gt;

&lt;p&gt;On a server:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;
Apr 30 23:02:34 fir-io8-s2 kernel: LustreError: 76363:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9f68619b2c00
Apr 30 23:02:34 fir-io8-s2 kernel: LNet: 76363:0:(lib-move.c:976:lnet_post_send_locked()) Aborting message for 12345-10.0.10.227@o2ib7: LNetM[DE]Unlink() already called on the MD/ME.
Apr 30 23:02:34 fir-io8-s2 kernel: LNet: 76363:0:(lib-move.c:976:lnet_post_send_locked()) Skipped 2 previous similar messages
Apr 30 23:02:34 fir-io8-s2 kernel: LustreError: 76363:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff9f8d975ef400
Apr 30 23:02:34 fir-io8-s2 kernel: LustreError: 76363:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff9f8d975ef400

&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;From a server (10.0.10.115@o2ib7):&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@fir-io8-s1 ~]# lnetctl peer show --nid 10.0.10.227@o2ib7 -v 4
peer:
    - primary nid: 10.50.0.114@o2ib2
      Multi-Rail: True
      peer state: 137
      peer ni:
        - nid: 10.0.10.227@o2ib7
          state: up
          max_ni_tx_credits: 8
          available_tx_credits: 7
          min_tx_credits: -64957
          tx_q_num_of_buf: 480
          available_rtr_credits: 8
          min_rtr_credits: 8
          refcount: 5
          statistics:
              send_count: 4282001375
              recv_count: 119463736
              drop_count: 0
          sent_stats:
              put: 3953956139
              get: 328045236
              reply: 0
              ack: 0
              hello: 0
          received_stats:
              put: 3102103707
              get: 493
              reply: 336149354
              ack: 976177478
              hello: 0
          dropped_stats:
              put: 0
              get: 0
              reply: 0
              ack: 0
              hello: 0
          health stats:
              health value: 1000
              dropped: 6
              timeout: 250
              error: 0
              network timeout: 0
        - nid: 10.50.0.114@o2ib2
          state: NA
          max_ni_tx_credits: 0
          available_tx_credits: 0
          min_tx_credits: 0
          tx_q_num_of_buf: 0
          available_rtr_credits: 0
          min_rtr_credits: 0
          refcount: 2
          statistics:
              send_count: 0
              recv_count: 0
              drop_count: 0
          sent_stats:
              put: 0
              get: 0
              reply: 0
              ack: 0
              hello: 0
          received_stats:
              put: 0
              get: 0
              reply: 0
              ack: 0
              hello: 0
          dropped_stats:
              put: 0
              get: 0
              reply: 0
              ack: 0
              hello: 0
          health stats:
              health value: 1000
              dropped: 0
              timeout: 0
              error: 0
              network timeout: 0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Peer show of the server from the router (10.0.10.227@o2ib7/10.50.0.114@o2ib2):&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@sh02-fir04 ~]# lnetctl peer show --nid 10.0.10.115@o2ib7 -v 4
peer:
    - primary nid: 10.0.10.115@o2ib7
      Multi-Rail: True
      peer state: 137
      peer ni:
        - nid: 10.0.10.115@o2ib7
          state: up
          max_ni_tx_credits: 8
          available_tx_credits: 8
          min_tx_credits: -89
          tx_q_num_of_buf: 0
          available_rtr_credits: 6
          min_rtr_credits: -8
          refcount: 3
          statistics:
              send_count: 291018436
              recv_count: 286578888
              drop_count: 607
          sent_stats:
              put: 198315455
              get: 14434
              reply: 21538685
              ack: 71149862
              hello: 0
          received_stats:
              put: 265177056
              get: 21401720
              reply: 55
              ack: 57
              hello: 0
          dropped_stats:
              put: 0
              get: 577
              reply: 26
              ack: 4
              hello: 0
          health stats:
              health value: 1000
              dropped: 577
              timeout: 0
              error: 0
              network timeout: 0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I wonder if there is a way to see what is causing the router to be stuck and accumulating refs. I took another trace with +net on this router while it was stuck. Attaching as  &lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/38471/38471_sh02-fir04-dknet.log.gz&quot; title=&quot;sh02-fir04-dknet.log.gz attached to LU-14652&quot;&gt;sh02-fir04-dknet.log.gz&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt; &lt;/p&gt;</comment>
                            <comment id="300352" author="ssmirnov" created="Mon, 3 May 2021 18:18:16 +0000"  >&lt;p&gt;Stephane,&lt;/p&gt;

&lt;p&gt;From the router net log you provided it looks like the router is not reporting any errors. Was the issue still happening when you were capturing it? I wonder how persistent this issue is, or maybe it just occurs in short bursts. Would it be possible to get net logs from the client and the server, too? That would give some insight into how they qualify failed connections and whether failed connections involve the same router.&#160;&lt;/p&gt;

&lt;p&gt;The &quot;lnetctl net show&quot; output provided for clients and servers indicates that a single ib interface is configured for LNet. Are there other ib interfaces on these machines that are not configured for LNet?&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;</comment>
                            <comment id="301058" author="sthiell" created="Mon, 10 May 2021 17:49:54 +0000"  >&lt;p&gt;Hi Serguei,&lt;/p&gt;

&lt;p&gt;&amp;gt; &quot;lnetctl net show&quot; output provided for clients and servers indicates that a single ib interface is configured for LNet, are there other ib interfaces on these machines that are not configured for LNet?&lt;/p&gt;

&lt;p&gt;Not that we know of. Some clients have dual-port cards but only one port is up. We do have a few MR clients on o2ib3 though.&lt;/p&gt;

&lt;p&gt;Not easy to gather logs on this system (Fir) anymore, as the issue didn&apos;t reappear. But this morning, we had a similar issue between another storage system, Oak (2.12.6, o2ib5), and Sherlock (o2ib&lt;span class=&quot;error&quot;&gt;&amp;#91;1-3&amp;#93;&lt;/span&gt;).&lt;/p&gt;

&lt;p&gt;Timeouts and network errors:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;May 10 00:07:48 oak-io2-s1 kernel: LustreError: 54968:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff8be0d58dcc00
May 10 00:07:48 oak-io2-s1 kernel: LustreError: 54967:0:(events.c:450:server_bulk_callback()) event type 5, status -5, desc ffff8bd84d303c00
May 10 00:07:48 oak-io2-s1 kernel: LustreError: 54966:0:(events.c:450:server_bulk_callback()) event type 5, status -5, desc ffff8bd84d303c00
May 10 00:07:48 oak-io2-s1 kernel: LustreError: 54969:0:(events.c:450:server_bulk_callback()) event type 5, status -5, desc ffff8be0d58dcc00
May 10 00:07:48 oak-io2-s1 kernel: LustreError: 54968:0:(events.c:450:server_bulk_callback()) event type 5, status -5, desc ffff8bf5a32dec00
May 10 00:08:53 oak-io2-s1 kernel: LustreError: 218398:0:(ldlm_lib.c:3344:target_bulk_io()) @@@ network error on bulk READ  req@ffff8bdff554c850 x1698172291403584/t0(0) o3-&amp;gt;a703c65e-c48f-97d7-efaf-c377b3ded349@10.51.15.3@o2ib3:410/0 lens 488/440 e 0 to 0 dl 1620630560 ref 1 fl Interpret:/0/0 rc 0/0
May 10 00:08:53 oak-io2-s1 kernel: Lustre: oak-OST0052: Bulk IO read error with a703c65e-c48f-97d7-efaf-c377b3ded349 (at 10.51.15.3@o2ib3), client will retry: rc -110
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;I&apos;m attaching this OSS (oak-io2-s1 10.0.2.105@o2ib5) kernel logs as &lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/38554/38554_oak-io2-s1.kern.log&quot; title=&quot;oak-io2-s1.kern.log attached to LU-14652&quot;&gt;oak-io2-s1.kern.log&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;Very high refs count on the o2ib5 &amp;lt;&amp;gt; o2ib&lt;span class=&quot;error&quot;&gt;&amp;#91;1-3&amp;#93;&lt;/span&gt; routers:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@sh02-hn01 sthiell.root]# clush -w@rtr_oak -b &quot;cat /sys/kernel/debug/lnet/nis&quot;
---------------
sh01-oak01
---------------
nid                      status alive refs peer  rtr   max    tx   min
0@lo                       down     0    2    0    0     0     0     0
0@lo                       down     0    0    0    0     0     0     0
10.49.0.131@o2ib1            up     0    1    8    0   128   128    97
10.49.0.131@o2ib1            up     0    0    8    0   128   128    91
10.0.2.212@o2ib5             up     0    1    8    0   128   128    92
10.0.2.212@o2ib5             up     0    1    8    0   128   128    94
---------------
sh01-oak02
---------------
nid                      status alive refs peer  rtr   max    tx   min
0@lo                       down     0    2    0    0     0     0     0
0@lo                       down     0    0    0    0     0     0     0
10.49.0.132@o2ib1            up     0    1    8    0   128   128    96
10.49.0.132@o2ib1            up     0    0    8    0   128   128    94
10.0.2.213@o2ib5             up     0    4    8    0   128   128    92
10.0.2.213@o2ib5             up     0    1    8    0   128   128    95
---------------
sh02-oak01
---------------
nid                      status alive refs peer  rtr   max    tx   min
0@lo                       down     0    2    0    0     0     0     0
0@lo                       down     0    0    0    0     0     0     0
10.50.0.131@o2ib2            up     0 1032    8    0   128   125    62
10.50.0.131@o2ib2            up     0  959    8    0   128   126    54
10.0.2.214@o2ib5             up     0    6    8    0   128   127    74
10.0.2.214@o2ib5             up     0 1143    8    0   128   120    61
---------------
sh02-oak02
---------------
nid                      status alive refs peer  rtr   max    tx   min
0@lo                       down     0    2    0    0     0     0     0
0@lo                       down     0    0    0    0     0     0     0
10.50.0.132@o2ib2            up     0  927    8    0   128   128    59
10.50.0.132@o2ib2            up     0  838    8    0   128   128    39
10.0.2.215@o2ib5             up     0  296    8    0   128   120    75
10.0.2.215@o2ib5             up     0  687    8    0   128   120    57
---------------
sh03-oak01
---------------
nid                      status alive refs peer  rtr   max    tx   min
0@lo                       down     0    2    0    0     0     0     0
0@lo                       down     0    0    0    0     0     0     0
0@lo                       down     0    0    0    0     0     0     0
0@lo                       down     0    0    0    0     0     0     0
0@lo                       down     0    0    0    0     0     0     0
0@lo                       down     0    0    0    0     0     0     0
0@lo                       down     0    0    0    0     0     0     0
0@lo                       down     0    0    0    0     0     0     0
10.51.0.131@o2ib3            up     0   60    8    0    64    64    26
10.51.0.131@o2ib3            up     0   61    8    0    64    64    28
10.51.0.131@o2ib3            up     0   87    8    0    64    64     9
10.51.0.131@o2ib3            up     0  180    8    0    64    64    27
10.51.0.131@o2ib3            up     0  126    8    0    64    64    26
10.51.0.131@o2ib3            up     0  150    8    0    64    64    12
10.51.0.131@o2ib3            up     0   63    8    0    64    64    32
10.51.0.131@o2ib3            up     0   93    8    0    64    64    19
10.0.2.216@o2ib5             up     0    1    8    0    64    64    56
10.0.2.216@o2ib5             up     0    0    8    0    64    64    58
10.0.2.216@o2ib5             up     0    0    8    0    64    64    40
10.0.2.216@o2ib5             up     0  160    8    0    64    56    48
10.0.2.216@o2ib5             up     0    0    8    0    64    64    56
10.0.2.216@o2ib5             up     0    0    8    0    64    64    48
10.0.2.216@o2ib5             up     0    0    8    0    64    64    45
10.0.2.216@o2ib5             up     0  290    8    0    64    56    40
---------------
sh03-oak02
---------------
nid                      status alive refs peer  rtr   max    tx   min
0@lo                       down     0    2    0    0     0     0     0
0@lo                       down     0    0    0    0     0     0     0
0@lo                       down     0    0    0    0     0     0     0
0@lo                       down     0    0    0    0     0     0     0
0@lo                       down     0    0    0    0     0     0     0
0@lo                       down     0    0    0    0     0     0     0
0@lo                       down     0    0    0    0     0     0     0
0@lo                       down     0    0    0    0     0     0     0
10.51.0.132@o2ib3            up     0   89    8    0    64    64    26
10.51.0.132@o2ib3            up     0   76    8    0    64    64    28
10.51.0.132@o2ib3            up     0   76    8    0    64    64    22
10.51.0.132@o2ib3            up     0  143    8    0    64    64    21
10.51.0.132@o2ib3            up     0  113    8    0    64    64    29
10.51.0.132@o2ib3            up     0   98    8    0    64    64    14
10.51.0.132@o2ib3            up     0   89    8    0    64    64    22
10.51.0.132@o2ib3            up     0   78    8    0    64    64    25
10.0.2.217@o2ib5             up     0    1    8    0    64    64    56
10.0.2.217@o2ib5             up     0    0    8    0    64    64    56
10.0.2.217@o2ib5             up     0  294    8    0    64    56    40
10.0.2.217@o2ib5             up     0    1    8    0    64    64    48
10.0.2.217@o2ib5             up     0    0    8    0    64    64    56
10.0.2.217@o2ib5             up     0  180    8    0    64    56    48
10.0.2.217@o2ib5             up     0    1    8    0    64    64    40
10.0.2.217@o2ib5             up     0    0    8    0    64    64    40
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Peers with queuing from the routers:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@sh02-hn01 sthiell.root]# clush -w@rtr_oak -b &quot;cat /sys/kernel/debug/lnet/peers  | awk &apos;/^nid|o2ib/ &amp;amp;&amp;amp; \$NF!=0&apos;&quot; 
---------------
sh01-oak[01-02],sh03-oak01 (3)
---------------
nid                      refs state  last   max   rtr   min    tx   min queue
---------------
sh02-oak01
---------------
nid                      refs state  last   max   rtr   min    tx   min queue
10.0.2.106@o2ib5          316    up   176     8     7    -8  -306 -2561 2551117
---------------
sh02-oak02
---------------
nid                      refs state  last   max   rtr   min    tx   min queue
10.0.2.110@o2ib5          206    up    94     8     1    -8  -190 -3121 1520563
10.0.2.101@o2ib5          242    up    84     8     7    -8  -232 -2145 70352
---------------
sh03-oak02
---------------
nid                      refs state  last   max   rtr   min    tx   min queue
10.0.2.105@o2ib5         1163    up    66     8     8   -16 -1154 -1987 7682915
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;I gathered some short +net debug logs from Oak&apos;s OSS oak-io2-s1 10.0.2.105@o2ib5, where we can see a drop:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00000400:00000100:37.0:1620667146.491444:0:54966:0:(lib-move.c:3930:lnet_parse_reply()) 10.0.2.105@o2ib5: Dropping REPLY from 12345-10.51.1.14@o2ib3 for invalid MD 0x1678074439ff2d4c.0x476212d0d
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Attaching as &lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/38555/38555_oak-io2-s1.dknet9.gz&quot; title=&quot;oak-io2-s1.dknet9.gz attached to LU-14652&quot;&gt;oak-io2-s1.dknet9.gz&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt; and &lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/38556/38556_oak-io2-s1.dknet10.gz&quot; title=&quot;oak-io2-s1.dknet10.gz attached to LU-14652&quot;&gt;oak-io2-s1.dknet10.gz&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt; (this one shows a Dropping REPLY)&lt;/p&gt;

&lt;p&gt;Also attaching +net debug logs from a client on Sherlock (10.50.14.15@o2ib2); I saw some queueing from the routers on o2ib2, but other than that I picked this client randomly, so I&apos;m not sure it will be relevant ( &lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/38557/38557_sh02-14n15.dknet1.gz&quot; title=&quot;sh02-14n15.dknet1.gz attached to LU-14652&quot;&gt;sh02-14n15.dknet1.gz&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt; &lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/38558/38558_sh02-14n15.dknet2.gz&quot; title=&quot;sh02-14n15.dknet2.gz attached to LU-14652&quot;&gt;sh02-14n15.dknet2.gz&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt; &lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/38559/38559_sh02-14n15.dknet3.gz&quot; title=&quot;sh02-14n15.dknet3.gz attached to LU-14652&quot;&gt;sh02-14n15.dknet3.gz&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt; )&lt;/p&gt;</comment>
                            <comment id="301064" author="sthiell" created="Mon, 10 May 2021 18:01:22 +0000"  >&lt;p&gt;Also attaching +net debug logs from client&#160;sh02-14n15 10.51.1.14@o2ib3, the one shown above on the OSS with Dropping REPLY. But I think it might have been too late when I gathered the logs.&#160;&lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/38560/38560_sh03-01n14.dknet.gz&quot; title=&quot;sh03-01n14.dknet.gz attached to LU-14652&quot;&gt;sh03-01n14.dknet.gz&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;</comment>
                            <comment id="301536" author="ssmirnov" created="Thu, 13 May 2021 20:40:00 +0000"  >&lt;p&gt;Stephane,&#160;&lt;/p&gt;

&lt;p&gt;I still can&apos;t see what&apos;s going wrong exactly, but for 2.12.6 I&apos;d consider disabling discovery on the routers to avoid extra complexity. For example, it would ensure that messages sent by a node reach the specific NID of the remote node as decided by the sender, rather than the router making that decision when it has discovery enabled. This is more of a general recommendation, though, and may be unrelated to the problem you&apos;re seeing.&lt;/p&gt;

&lt;p&gt;If you choose to disable discovery on the routers, do so one router at a time:&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;Remove the corresponding route from nodes that are using it (lnetctl route del)&lt;/li&gt;
	&lt;li&gt;Disable discovery on the router (if done via the conf file, reload modules)&lt;/li&gt;
	&lt;li&gt;Delete the peer representing the router from the nodes that are using it (lnetctl peer del)&lt;/li&gt;
	&lt;li&gt;Add the routes back (lnetctl route add)&lt;/li&gt;
&lt;/ul&gt;
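The four steps above can be sketched as a small script. This is a minimal sketch, not Serguei's exact procedure: the NID and net name are placeholders borrowed from this ticket's topology, and it prints the commands by default (dry-run) rather than executing them.

```shell
#!/bin/sh
# Dry-run sketch of the one-router-at-a-time procedure above.
# Set RUN= (empty) in the environment to actually execute; needs root.
RUN=${RUN-echo}

cycle_router() {
    net=$1   # remote net reached through the router, e.g. o2ib7
    gw=$2    # the router's NID on the local net

    # 1. On each node using the router: remove the route
    $RUN lnetctl route del --net "$net" --gateway "$gw"

    # 2. On the router itself: disable discovery
    #    (if configured via the conf file, reload the modules instead)
    # $RUN lnetctl set discovery 0

    # 3. On each node: delete the stale peer entry for the router
    $RUN lnetctl peer del --prim_nid "$gw"

    # 4. On each node: add the route back
    $RUN lnetctl route add --net "$net" --gateway "$gw"
}

# Placeholder NID from this ticket's topology:
cycle_router o2ib7 10.0.10.226@o2ib7
```

Steps 1, 3, and 4 run on the nodes that use the router; step 2 runs on the router itself, which is why it is left commented out here.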


</comment>
                            <comment id="316158" author="sthiell" created="Thu, 21 Oct 2021 01:32:45 +0000"  >&lt;p&gt;Hi Serguei,&lt;/p&gt;

&lt;p&gt;We still have this problem with 2.12.7 routers and clients. The servers (Fir) are still running 2.12.5, though. Some routers started to get a high refs count on their NIs and on specific peers, which generates random RDMA timeouts. Rebooting the routers usually doesn&apos;t change anything; they remain unusable. In some very rare cases (like after requeuing a lot of jobs), we&apos;re able to put a router back into production (we were able to do that today), but otherwise the refs count starts to increase immediately.&lt;/p&gt;


&lt;p&gt;Clients are on o2ib2, servers on o2ib7. We&apos;re fairly confident the IB fabrics are OK. The problem occurs between routers and servers (o2ib7). Example of a router (at 10.0.10.226@o2ib7) &amp;lt;-&amp;gt; server (OSS, at 10.0.10.101@o2ib7) that is stuck:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@sh02-fir03 ~]# lnetctl peer show --nid 10.0.10.101@o2ib7 -v4
peer:
    - primary nid: 10.0.10.101@o2ib7
      Multi-Rail: False
      peer ni:
        - nid: 10.0.10.101@o2ib7
          state: up
          max_ni_tx_credits: 8
          available_tx_credits: 8
          min_tx_credits: -3
          tx_q_num_of_buf: 0
          available_rtr_credits: -44          &amp;lt;&amp;lt;&amp;lt;
          min_rtr_credits: -44
          refcount: 53
          statistics:
              send_count: 43532
              recv_count: 2843
              drop_count: 226
[root@sh02-fir03 ~]# lctl ping 10.0.10.101@o2ib7
failed to ping 10.0.10.101@o2ib7: Input/output error
[root@sh02-fir03 ~]# lctl ping 10.0.10.101@o2ib7
failed to ping 10.0.10.101@o2ib7: Input/output error
[root@sh02-fir03 ~]# lctl ping 10.0.10.101@o2ib7
failed to ping 10.0.10.101@o2ib7: Input/output error
[root@sh02-fir03 ~]# lctl ping 10.0.10.101@o2ib7
failed to ping 10.0.10.101@o2ib7: Input/output error
[root@sh02-fir03 ~]# lctl set_param debug=+ALL; lctl clear; lctl ping 10.0.10.101@o2ib7;  lctl dk /tmp/debug.log; lctl set_param debug=-ALL
debug=+ALL
failed to ping 10.0.10.101@o2ib7: Input/output error
Debug log: 193037 lines, 193037 kept, 0 dropped, 0 bad.
debug=-ALL
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Debug logs from the router (sh02-fir03 at 10.0.10.226@o2ib7 ) attached as &lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/41058/41058_sh02-fir03-debug-lctl_ping_sync.log.gz&quot; title=&quot;sh02-fir03-debug-lctl_ping_sync.log.gz attached to LU-14652&quot;&gt;sh02-fir03-debug-lctl_ping_sync.log.gz&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt; &lt;br/&gt;
 Debug logs from the OSS (fir-io1-s1 at 10.0.10.101@o2ib7) at the same time attached as &lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/41059/41059_fir-io1-s1_lctl_ping_sync.log.gz&quot; title=&quot;fir-io1-s1_lctl_ping_sync.log.gz attached to LU-14652&quot;&gt;fir-io1-s1_lctl_ping_sync.log.gz&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt; &lt;br/&gt;
 &#160;&lt;/p&gt;

&lt;p&gt;So the router sh02-fir03 (10.0.10.226@o2ib7) is running out of rtr credits, and from the OSS point of view (fir-io1-s1 at 10.0.10.101@o2ib7), it&apos;s out of tx credits when this happens, as shown below:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@fir-io1-s1 ~]# lnetctl peer show --nid 10.0.10.226@o2ib7 -v 4
peer:
    - primary nid: 10.0.10.226@o2ib7
      Multi-Rail: False
      peer state: 0
      peer ni:
        - nid: 10.0.10.226@o2ib7
          state: up
          max_ni_tx_credits: 8
          available_tx_credits: -2              &amp;lt;&amp;lt;&amp;lt;
          min_tx_credits: -154
          tx_q_num_of_buf: 7230544
          available_rtr_credits: 8
          min_rtr_credits: 8
          refcount: 14
          statistics:
              send_count: 427582
              recv_count: 2109316
              drop_count: 0
          sent_stats:
              put: 389065
              get: 38517
              reply: 0
              ack: 0
              hello: 0
          received_stats:
              put: 1514551
              get: 6
              reply: 122123
              ack: 472636
              hello: 0
          dropped_stats:
              put: 0
              get: 0
              reply: 0
              ack: 0
              hello: 0
          health stats:
              health value: 1000
              dropped: 12
              timeout: 298
              error: 0
              network timeout: 0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;We also noticed that the NI refs counts increase on the routers when the following messages show up on some OSSs (not all):&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Oct 20 13:34:40 fir-io1-s1 kernel: LustreError: 58590:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9ae220efb800
Oct 20 13:34:40 fir-io1-s1 kernel: LustreError: 58590:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9a810b5e6000
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;From the OSS, the lnet pinger shows this:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00000400:00000200:20.0:1634761462.846900:0:58577:0:(router.c:1099:lnet_ping_router_locked()) rtr 10.0.10.224@o2ib7 60: deadline 0 ping_notsent 0 alive 1 alive_count 1 lpni_ping_timestamp 1748161
00000400:00000200:20.0:1634761462.846903:0:58577:0:(router.c:1099:lnet_ping_router_locked()) rtr 10.0.10.225@o2ib7 60: deadline 1748227 ping_notsent 1 alive 1 alive_count 351 lpni_ping_timestamp 1748167
00000400:00000200:20.0:1634761462.846905:0:58577:0:(router.c:1099:lnet_ping_router_locked()) rtr 10.0.10.226@o2ib7 60: deadline 1748200 ping_notsent 1 alive 1 alive_count 384 lpni_ping_timestamp 1748140
00000400:00000200:20.0:1634761462.846908:0:58577:0:(router.c:1099:lnet_ping_router_locked()) rtr 10.0.10.227@o2ib7 60: deadline 0 ping_notsent 0 alive 1 alive_count 85 lpni_ping_timestamp 1748141
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Our problems are with routers 10.0.10.225@o2ib7 and 10.0.10.226@o2ib7 at the moment. Rebooting them doesn&apos;t fix the problem.&lt;/p&gt;

&lt;p&gt;Our OSS have this configured:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;options lnet live_router_check_interval=60
options lnet dead_router_check_interval=300
options lnet router_ping_timeout=60
options lnet avoid_asym_router_failure=1
options lnet check_routers_before_use=1
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Clients have the default settings (so the difference is check_routers_before_use=0).&lt;/p&gt;
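As a quick cross-check that servers and clients really differ only in check_routers_before_use, the effective module parameters can be read back from the kernel. This is a sketch; it assumes the stock /sys/module/lnet/parameters layout, and the parameter directory is an argument so the function can be exercised offline.

```shell
#!/bin/sh
# Print the lnet router-related module parameters actually in effect.
show_lnet_params() {
    base=${1:-/sys/module/lnet/parameters}
    for p in live_router_check_interval dead_router_check_interval \
             router_ping_timeout avoid_asym_router_failure \
             check_routers_before_use; do
        # Skip parameters that don't exist on this lnet version
        [ -r "$base/$p" ] && printf '%s=%s\n' "$p" "$(cat "$base/$p")"
    done
    return 0
}

show_lnet_params
```

Run on one OSS and one client (e.g. via clush) and diff the output.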

&lt;p&gt;Also attaching the peer list of this problematic router (sh02-fir03 at 10.0.10.226@o2ib7) as &lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/41060/41060_sh02-fir03.20211020.peers.txt&quot; title=&quot;sh02-fir03.20211020.peers.txt attached to LU-14652&quot;&gt;sh02-fir03.20211020.peers.txt&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt; , where we can see that OSS 10.0.10.101@o2ib7 and 10.0.10.102@o2ib7 are out of rtr credits.&lt;/p&gt;

&lt;p&gt;Do you know if there is a way to identify in the logs what is holding these high NI/peer refs counts?&lt;/p&gt;
                            <comment id="316388" author="ssmirnov" created="Fri, 22 Oct 2021 20:19:55 +0000"  >&lt;p&gt;Hi Stephane,&lt;/p&gt;

&lt;p&gt;My understanding is that the ref counts go high because the messages on the router queue don&apos;t get cleared, same reason as for the negative &quot;available credits&quot;. The messages can get stuck on the queue if for some reason the OSS is not responding. After some timeout (50 seconds?) this causes transactions to start expiring. I&apos;ll take a closer look at the OSS logs for clues why that may be. We may need to retrieve &quot;ldlm&quot; and &quot;ptlrpc&quot; logs in addition to &quot;net&quot;. In the meantime, is there any information about CPU usage on the OSS? Are there any signs of lock-up? Perhaps we can get the output of &quot;echo l &amp;gt; /proc/sysrq-trigger&quot; on the server and the router to check for lock-ups?&lt;/p&gt;
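A capture sequence for the extra logs requested above might look like the following. This is a sketch, dry-run by default: "dlmtrace" and "rpctrace" are the debug masks assumed to cover the ldlm and ptlrpc traffic, and the sysrq step mirrors the command quoted above.

```shell
#!/bin/sh
# Dry-run sketch of capturing extended debug logs on the OSS/router.
# Set RUN= (empty) in the environment to actually execute; needs root.
RUN=${RUN-echo}

capture_debug() {
    out=$1   # where to save the Lustre debug log

    # Add ldlm and ptlrpc tracing on top of net, then start clean
    $RUN lctl set_param debug=+net
    $RUN lctl set_param debug=+dlmtrace
    $RUN lctl set_param debug=+rpctrace
    $RUN lctl clear

    # ... wait for the problem window, then dump and restore
    $RUN lctl dk "$out"
    $RUN lctl set_param debug=-dlmtrace
    $RUN lctl set_param debug=-rpctrace

    # CPU backtraces to check for lock-ups (lands in the kernel log)
    $RUN sh -c 'echo l > /proc/sysrq-trigger'
}

capture_debug /tmp/oss-debug.log
```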

&lt;p&gt;Thanks,&lt;/p&gt;

&lt;p&gt;Serguei.&lt;/p&gt;</comment>
                            <comment id="316613" author="sthiell" created="Tue, 26 Oct 2021 22:49:44 +0000"  >&lt;p&gt;Hi Serguei,&lt;/p&gt;

&lt;p&gt;Thanks for this. I checked the OSS for lock-ups but found nothing.&lt;/p&gt;

&lt;p&gt;Kernel logs of OSS &lt;tt&gt;fir-io1-s2&lt;/tt&gt; (10.0.10.102@o2ib7) when we started to put two &quot;bad&quot; routers (10.0.10.225@o2ib7 and 10.0.10.226@o2ib7) back online and problems started again: &lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/41115/41115_fir-io1-s2_kernel_sysrq-l_20211026.log.gz&quot; title=&quot;fir-io1-s2_kernel_sysrq-l_20211026.log.gz attached to LU-14652&quot;&gt;fir-io1-s2_kernel_sysrq-l_20211026.log.gz&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt; (including output of &lt;tt&gt;echo l &amp;gt; /proc/sysrq-trigger&lt;/tt&gt; several times)&lt;/p&gt;

&lt;p&gt;Lustre logs on this OSS with dlmtrace and net: &lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/41116/41116_fir-io1-s2_dk_net_dlmtrace_20211026.log.gz&quot; title=&quot;fir-io1-s2_dk_net_dlmtrace_20211026.log.gz attached to LU-14652&quot;&gt;fir-io1-s2_dk_net_dlmtrace_20211026.log.gz&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;Every time we put these two routers back online, the problem shows up with the same two OSSs (10.0.10.101@o2ib7 and 10.0.10.102@o2ib7). These OSSs are fine with the other routers...&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# clush -u5 -Lw sh02-fir[01-04]  -b cat /sys/kernel/debug/lnet/peers | awk &apos;$3 &amp;gt; 10 {print }&apos;
sh02-fir[01-04]: nid                      refs state  last   max   rtr   min    tx   min queue
sh02-fir02: 10.0.10.102@o2ib7         125    up   179     8  -116  -116     8   -15 0
sh02-fir03: 10.0.10.102@o2ib7         115    up    33     8  -106  -106     8   -12 0
sh02-fir03: 10.0.10.101@o2ib7          75    up    32     8   -66   -66     8    -1 0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="316618" author="sthiell" created="Tue, 26 Oct 2021 23:11:52 +0000"  >&lt;p&gt;Adding&#160;&lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/41118/41118_sh02-fir03-kern_sysrq-l-t_20211026.log.gz&quot; title=&quot;sh02-fir03-kern_sysrq-l-t_20211026.log.gz attached to LU-14652&quot;&gt;sh02-fir03-kern_sysrq-l-t_20211026.log.gz&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt;with kernel logs of a router (10.0.10.226@o2ib7) including SysRq l and t dumps, just after the router has been powered on and when the problem happens.&lt;/p&gt;</comment>
                            <comment id="316715" author="eaujames" created="Wed, 27 Oct 2021 16:59:17 +0000"  >&lt;p&gt;Hello,&lt;br/&gt;
Sorry to interfere here. We observe the same type of issue under network load at the CEA.&lt;br/&gt;
Could it be related to &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-15068&quot; title=&quot;Race between commit callback and reply_out_callback::LNET_EVENT_SEND&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-15068&quot;&gt;&lt;del&gt;LU-15068&lt;/del&gt;&lt;/a&gt; (&quot;Race between commit callback and reply_out_callback::LNET_EVENT_SEND&quot;)?&lt;/p&gt;</comment>
                            <comment id="316722" author="sthiell" created="Wed, 27 Oct 2021 17:22:49 +0000"  >&lt;p&gt;Hi Etienne,&lt;/p&gt;

&lt;p&gt;Please interfere! &lt;img class=&quot;emoticon&quot; src=&quot;https://jira.whamcloud.com/images/icons/emoticons/smile.png&quot; height=&quot;16&quot; width=&quot;16&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&#160;&lt;/p&gt;

&lt;p&gt;It&apos;s interesting, and could perhaps match our problem. I also noticed occasional messages like this one on the OSS (here, 10.0.10.225@o2ib7 is one of the bad routers):&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00000400:00000100:24.0:1635289738.440616:0:56886:0:(lib-move.c:976:lnet_post_send_locked()) Aborting message for 12345-10.0.10.225@o2ib7: LNetM[DE]Unlink() already called on the MD/ME.
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I&apos;m still wondering why this issue survives OSS and/or router reboots.&lt;/p&gt;

&lt;p&gt;Have you tried a patch for b2_12?&lt;/p&gt;

&lt;p&gt;Thanks!&lt;/p&gt;</comment>
                            <comment id="317048" author="sthiell" created="Fri, 29 Oct 2021 20:22:11 +0000"  >&lt;p&gt;Hi Etienne and Serguei,&lt;/p&gt;

&lt;p&gt;I added the patch from &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-15068&quot; title=&quot;Race between commit callback and reply_out_callback::LNET_EVENT_SEND&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-15068&quot;&gt;&lt;del&gt;LU-15068&lt;/del&gt;&lt;/a&gt; on top of 2.12.7 and tested it on one OSS and one problematic LNet router; unfortunately, that patch doesn&apos;t fix this problem. It seems to work OK otherwise (no regressions noticed).&lt;/p&gt;

&lt;p&gt;In this instance,  on the router &lt;tt&gt;10.0.10.225@o2ib7&lt;/tt&gt;, the rtr credits for OSS &lt;tt&gt;10.0.10.102@o2ib7&lt;/tt&gt; were quickly dropping way below 0 immediately after putting the router online:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@sh02-fir02 ~]# lnetctl peer show --nid 10.0.10.102@o2ib7 -v
peer:
    - primary nid: 10.0.10.102@o2ib7
      Multi-Rail: False
      peer ni:
        - nid: 10.0.10.102@o2ib7
          state: up
          max_ni_tx_credits: 8
          available_tx_credits: 8
          min_tx_credits: -1
          tx_q_num_of_buf: 0
          available_rtr_credits: -176
          min_rtr_credits: -176
          refcount: 185
          statistics:
              send_count: 18606
              recv_count: 2924
              drop_count: 0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="317051" author="ssmirnov" created="Fri, 29 Oct 2021 21:05:38 +0000"  >&lt;p&gt;Hi Stephane,&lt;/p&gt;

&lt;p&gt;Just to clarify, was the&#160;patch from &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-15068&quot; title=&quot;Race between commit callback and reply_out_callback::LNET_EVENT_SEND&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-15068&quot;&gt;&lt;del&gt;LU-15068&lt;/del&gt;&lt;/a&gt;&#160;applied on the&#160;10.0.10.102@o2ib7 in the example above?&lt;/p&gt;

&lt;p&gt;Thanks,&lt;/p&gt;

&lt;p&gt;Serguei.&lt;/p&gt;</comment>
                            <comment id="317057" author="sthiell" created="Fri, 29 Oct 2021 21:24:48 +0000"  >&lt;p&gt;Hi Serguei,&lt;/p&gt;

&lt;p&gt;Yes, we actually started by applying the patch on this OSS (at 10.0.10.102@o2ib7), tested that without success, then we also applied the patch on the router, but hit the same issue, unfortunately.&lt;/p&gt;</comment>
                            <comment id="317784" author="ssmirnov" created="Tue, 9 Nov 2021 23:04:43 +0000"  >&lt;p&gt;Amir and I went over the logs and tried to track down the issue. I don&apos;t have a conclusive answer, but here&apos;s roughly what appears to be happening, based on the logs provided for the lctl ping above:&lt;/p&gt;

&lt;p&gt;1) There&apos;s a client that has stopped responding. LND-level &quot;no credits&quot; messages in the router debug log point to that.&lt;/p&gt;

&lt;p&gt;2) Because the client is unresponsive, the messages going to it from the server (via the router) are not getting finalized. This causes the router&apos;s credit count for the server to go negative and stay stuck there.&lt;/p&gt;

&lt;p&gt;3) The message queue on the server backs up; there are LND-level &quot;no credits&quot; messages from ko2iblnd in the server debug log. Normally, when the high-watermark level is hit, the node sends a special &quot;NOOP&quot; request that bypasses the queue, asking the other side (the router) to release its credits.&lt;/p&gt;

&lt;p&gt;4) If there&apos;s no reaction to the &quot;NOOP&quot;, LND credits are not released and all of them are used up. The server is then unable to respond to the lnetctl ping or to send any message via the router over this connection.&lt;/p&gt;

&lt;p&gt;There are several possible causes:&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;There&apos;s a bug in sending the &quot;noop&quot; (server or router)&lt;/li&gt;
	&lt;li&gt;There&apos;s a bug in receiving the &quot;noop&quot; (router or client)&lt;/li&gt;
	&lt;li&gt;Nodes are running out of credits legitimately: there&apos;s too much to do and they can&apos;t keep up.&lt;/li&gt;
&lt;/ul&gt;
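&lt;p&gt;The failure sequence above can be sketched as a toy model (illustrative Python only, not Lustre code; the class and method names here are hypothetical):&lt;/p&gt;

```python
# Toy model of the credit exhaustion described above: each send to a
# peer consumes one credit, and credits are only returned when the
# peer finalizes the message. If the peer stops responding, the
# counter goes negative and the queue backs up, which mirrors the
# negative rtr/min values seen in the lnetctl output. Illustrative only.

class PeerConn:
    """One side of a connection with a fixed pool of send credits."""

    def __init__(self, credits):
        self.credits = credits      # currently available credits
        self.min_credits = credits  # lowest value ever observed ("min")
        self.queue = []             # messages waiting for a credit

    def send(self, msg):
        """Consume one credit; queue the message if none are left."""
        self.credits -= 1
        self.min_credits = min(self.min_credits, self.credits)
        if self.credits < 0:
            self.queue.append(msg)  # peer unresponsive: nothing drains this
            return False
        return True

    def finalize(self, n):
        """Peer finalized n messages, returning their credits."""
        self.credits += n
        del self.queue[:min(n, len(self.queue))]


conn = PeerConn(8)                  # e.g. peer_credits = 8
for i in range(16):                 # peer never finalizes anything
    conn.send(i)
assert (conn.credits, conn.min_credits, len(conn.queue)) == (-8, -8, 8)
```

In this toy model the credit counter simply keeps going more negative for every queued message, which is the same shape as the -8820 rtr/min values reported later in this ticket.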


&lt;p&gt;There are some additional questions/ideas:&lt;/p&gt;

&lt;p&gt;At the time the router debug log was taken, there were actually messages for multiple clients with the &quot;no credits&quot; issue. How is routing configured on the clients? Are all clients able to use all routers, or just a subset?&lt;/p&gt;

&lt;p&gt;Have you tried increasing the number of credits (e.g. going up to peer_credits=32, peer_credits_hiw=16, concurrent_sends=64)?&lt;/p&gt;

&lt;p&gt;Thanks,&lt;/p&gt;

&lt;p&gt;Serguei.&lt;/p&gt;</comment>
                            <comment id="317790" author="sthiell" created="Tue, 9 Nov 2021 23:40:15 +0000"  >&lt;p&gt;Hi Serguei,&lt;/p&gt;

&lt;p&gt;Thank you so much for taking the time to look at this in more detail with Amir. We really appreciate it.&lt;/p&gt;

&lt;p&gt;I&apos;m attaching an image to give you an overview of the LNet architecture of /scratch on Sherlock:&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;image-wrap&quot; style=&quot;&quot;&gt;&lt;a id=&quot;41320_thumb&quot; href=&quot;https://jira.whamcloud.com/secure/attachment/41320/41320_image-2021-11-09-15-37-32-277.png&quot; title=&quot;image-2021-11-09-15-37-32-277.png&quot; file-preview-type=&quot;image&quot; file-preview-id=&quot;41320&quot; file-preview-title=&quot;image-2021-11-09-15-37-32-277.png&quot;&gt;&lt;img src=&quot;https://jira.whamcloud.com/secure/thumbnail/41320/_thumb_41320.png&quot; style=&quot;border: 0px solid black&quot; role=&quot;presentation&quot;/&gt;&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;Note: the servers/routers with the recurring credit issues are shown in orange.&lt;/p&gt;

&lt;p&gt;There is nothing too fancy here, I believe. We have clients on 3 generations of IB fabric (o2ib1, o2ib2 and o2ib3), each connected to Fir (the servers) via their own LNet routers (so o2ib1&amp;lt;&amp;gt;o2ib7, o2ib2&amp;lt;&amp;gt;o2ib7 and o2ib3&amp;lt;&amp;gt;o2ib7). Within a single cluster fabric, all compute nodes have the same routing configuration, so yes, all clients use the same routers per fabric. This is an example from a compute node in o2ib2:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@sh02-10n19 ~]# lnetctl route show
route:
&#160; &#160; - net: o2ib5
&#160; &#160; &#160; gateway: 10.50.0.132@o2ib2
&#160; &#160; - net: o2ib5
&#160; &#160; &#160; gateway: 10.50.0.131@o2ib2
&#160; &#160; - net: o2ib7
&#160; &#160; &#160; gateway: 10.50.0.112@o2ib2
&#160; &#160; - net: o2ib7
&#160; &#160; &#160; gateway: 10.50.0.111@o2ib2
&#160; &#160; - net: o2ib7
&#160; &#160; &#160; gateway: 10.50.0.113@o2ib2
&#160; &#160; - net: o2ib7
&#160; &#160; &#160; gateway: 10.50.0.114@o2ib2 &lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;(you can ignore o2ib5 here; those are routes to another storage system)&lt;/p&gt;

&lt;p&gt;To answer your second question, no, we haven&apos;t tried increasing the number of credits; we have been using the default settings. Changing that would mean full cluster downtime and it&apos;s ... complicated. &lt;img class=&quot;emoticon&quot; src=&quot;https://jira.whamcloud.com/images/icons/emoticons/smile.png&quot; height=&quot;16&quot; width=&quot;16&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/p&gt;</comment>
                            <comment id="318283" author="ofaaland" created="Mon, 15 Nov 2021 18:59:07 +0000"  >&lt;p&gt;Hi Serguei,&lt;/p&gt;

&lt;p&gt;You wrote:&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;The message queue on the server backs up. There are lnd-level &quot;no credits&quot; from ko2iblnd in the server debug log. Normally when the high-watermark level is hit, the node sends a special &quot;NOOP&quot; request bypassing the queue, requesting to release the credits from the other side (the router)&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;Assuming the router or client receives the &quot;NOOP&quot; request, does it release those credits by dropping the messages that were sent but not yet acknowledged (i.e., freeing the buffers holding those messages, releasing the credits, and emitting the LNetError message)?&lt;/p&gt;

&lt;p&gt;thanks&lt;/p&gt;</comment>
                            <comment id="318356" author="hornc" created="Tue, 16 Nov 2021 18:07:09 +0000"  >&lt;blockquote&gt;&lt;p&gt;4) If there&apos;s no reaction to the &quot;noop&quot;,  lnd credits are not released and all of them are used up. The server is not able to respond to  the lnetctl ping or send any message via the router using this connection.&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;I don&apos;t understand why the LND doesn&apos;t eventually just timeout these transactions when it can&apos;t acquire the necessary credits. That would then return the tx credits when those messages are finalized.&lt;/p&gt;</comment>
                            <comment id="318363" author="ssmirnov" created="Tue, 16 Nov 2021 22:21:33 +0000"  >&lt;p&gt;Chris,&lt;/p&gt;

&lt;p&gt;I do believe that if the noop message doesn&apos;t come and there are no more credits available to send, there will eventually be a timeout for the transaction. The problem may be that during this time the server was unable to talk to any other node via the given router, causing all kinds of backups.&lt;/p&gt;

&lt;p&gt;Olaf,&lt;/p&gt;

&lt;p&gt;Looks like I got it a bit backwards in my summary. In fact, if the server is sending, it is the receiving side (the router) that, once it figures it has accumulated &quot;high-water-mark&quot; or more credits on the receiving end, is expected to send the &quot;noop&quot; to the server; when received by the server, this releases the credits to be used again for sending.&lt;/p&gt;
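&lt;p&gt;That mechanism can be sketched as a toy Python model of the high-water-mark/noop exchange (the parameter names echo the peer_credits and peer_credits_hiw tunables, but this is illustrative only, not Lustre code):&lt;/p&gt;

```python
# Toy sketch of the credit-return protocol described above: the sender
# spends a credit per message; the receiver accumulates them and, once
# the high-water mark is reached, sends a "noop" back that releases all
# accumulated credits to the sender. Illustrative only.

class Sender:
    def __init__(self, peer_credits=8):
        self.credits = peer_credits

    def send(self, receiver):
        if self.credits == 0:
            return False               # stuck until a noop arrives
        self.credits -= 1
        receiver.receive(self)
        return True

    def on_noop(self, n):
        self.credits += n              # noop releases n credits


class Receiver:
    def __init__(self, hiw=4):
        self.held = 0                  # credits accumulated from the sender
        self.hiw = hiw                 # high-water mark

    def receive(self, sender):
        self.held += 1
        if self.held >= self.hiw:      # high-water mark reached:
            sender.on_noop(self.held)  # send the noop, returning credits
            self.held = 0


# With a working noop path, sending never stalls:
s, r = Sender(8), Receiver(4)
assert all(s.send(r) for _ in range(100))

# If the noop never fires (e.g. it is lost), the sender stalls after
# its initial 8 credits are spent:
s2, r2 = Sender(8), Receiver(hiw=10**9)
assert sum(s2.send(r2) for _ in range(100)) == 8
```

This mirrors the failure mode in the thread: once the noop stops arriving, the sender&apos;s credits are exhausted and stay that way.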

&lt;p&gt;Thanks,&lt;/p&gt;

&lt;p&gt;Serguei&#160;&lt;/p&gt;</comment>
                            <comment id="319589" author="sthiell" created="Tue, 30 Nov 2021 16:44:40 +0000"  >&lt;p&gt;It looks like the race that Chris Horn found in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-15234&quot; title=&quot;LNet high peer reference counts inconsistent with queue&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-15234&quot;&gt;&lt;del&gt;LU-15234&lt;/del&gt;&lt;/a&gt;&#160;could be the cause of our occasional stuck routers and lnet high ref count problems. We look forward to testing the patch when y&apos;all think it&apos;s ready on top of 2.12.7. &lt;img class=&quot;emoticon&quot; src=&quot;https://jira.whamcloud.com/images/icons/emoticons/smile.png&quot; height=&quot;16&quot; width=&quot;16&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/p&gt;</comment>
                            <comment id="324175" author="sthiell" created="Thu, 27 Jan 2022 19:38:56 +0000"  >&lt;p&gt;Hello! It looks like the patch I mentioned above has been abandoned, as it doesn&apos;t resolve this issue. I believe this is now our most impactful problem with 2.12, as it seems to start randomly. For example, this morning we noticed lots of errors of this type on an OSS:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Jan 27 11:33:36 fir-io1-s2 kernel: LustreError: 31477:0:(events.c:455:server_bulk_callback()) event type 5, status -103, desc ffffa1313e562400
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;(that&apos;s 10.0.10.102@o2ib7)&lt;/p&gt;

&lt;p&gt;And checking the ref counts on the routers, we saw a very high refcnt on router sh02-fir02 for this OSS (10.0.10.102@o2ib7):&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@sh02-hn01 sthiell.root]# clush -u5 -Lw @rtr_fir  -b cat /sys/kernel/debug/lnet/peers | awk &apos;$3 &amp;gt; 20 {print }&apos;
sh03-fir[01-08],sh02-fir[01-04]: nid                      refs state  last   max   rtr   min    tx   min queue
sh02-fir02: 10.0.10.102@o2ib7        8829    up   110     8 -8820 -8820     8   -87 0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
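&lt;p&gt;For post-processing saved peers dumps, the same filter as the awk one-liner above can be sketched in Python (a hedged sketch; in the raw file, without the clush hostname prefix, refs is the second column):&lt;/p&gt;

```python
# Minimal parser for /sys/kernel/debug/lnet/peers output, equivalent
# to the awk '$3 > 20' filter above (awk sees one extra leading field
# because clush prefixes each line with the hostname).
# Header layout: nid refs state last max rtr min tx min queue

def high_ref_peers(text, threshold=20):
    """Return (nid, refs) pairs for rows whose refs exceed threshold."""
    hits = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 10 or fields[0] == "nid":
            continue                 # skip the header and malformed rows
        try:
            refs = int(fields[1])
        except ValueError:
            continue
        if refs > threshold:
            hits.append((fields[0], refs))
    return hits


sample = """\
nid                      refs state  last   max   rtr   min    tx   min queue
10.0.10.102@o2ib7        8829    up   110     8 -8820 -8820     8   -87 0
10.0.10.101@o2ib7           4    up    12     8     8     8     8     6 0
"""
print(high_ref_peers(sample))  # [('10.0.10.102@o2ib7', 8829)]
```

The second sample row (a hypothetical healthy peer) is filtered out, while the stuck peer from the dump above is flagged.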

&lt;p&gt;Rebooting the router didn&apos;t help; its refcnt for 10.0.10.102@o2ib7 is increasing again as I write this. Sometimes rebooting the &quot;bad&quot; router does help, but unfortunately not always.&lt;br/&gt;
Any update on this issue? Thanks!&lt;/p&gt;</comment>
                            <comment id="324636" author="ssmirnov" created="Mon, 31 Jan 2022 20:16:08 +0000"  >&lt;p&gt;Hi Stephane,&lt;/p&gt;

&lt;p&gt;A couple more patches were added for &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-15234&quot; title=&quot;LNet high peer reference counts inconsistent with queue&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-15234&quot;&gt;&lt;del&gt;LU-15234&lt;/del&gt;&lt;/a&gt;: one fixes a potential cause of the reference leak, and another adds debug information to narrow down where the leak may be. The results should be available later this week.&lt;/p&gt;

&lt;p&gt;Thanks,&lt;/p&gt;

&lt;p&gt;Serguei.&lt;/p&gt;</comment>
                            <comment id="324663" author="sthiell" created="Tue, 1 Feb 2022 02:06:54 +0000"  >&lt;p&gt;Hi Serguei,&lt;/p&gt;

&lt;p&gt;OK &#8211; Thank you!&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                            <attachment id="41059" name="fir-io1-s1_lctl_ping_sync.log.gz" size="13867491" author="sthiell" created="Thu, 21 Oct 2021 01:19:43 +0000"/>
                            <attachment id="41116" name="fir-io1-s2_dk_net_dlmtrace_20211026.log.gz" size="8849758" author="sthiell" created="Tue, 26 Oct 2021 22:48:28 +0000"/>
                            <attachment id="41115" name="fir-io1-s2_kernel_sysrq-l_20211026.log.gz" size="15531" author="sthiell" created="Tue, 26 Oct 2021 22:47:30 +0000"/>
                            <attachment id="41320" name="image-2021-11-09-15-37-32-277.png" size="379041" author="sthiell" created="Tue, 9 Nov 2021 23:37:33 +0000"/>
                            <attachment id="38470" name="net_show_client_2.12.txt" size="2395" author="sthiell" created="Fri, 30 Apr 2021 23:40:18 +0000"/>
                            <attachment id="38469" name="net_show_client_2.13.txt" size="2457" author="sthiell" created="Fri, 30 Apr 2021 23:37:19 +0000"/>
                            <attachment id="38468" name="net_show_routers.txt" size="31327" author="sthiell" created="Fri, 30 Apr 2021 23:36:18 +0000"/>
                            <attachment id="38467" name="net_show_servers.txt" size="49713" author="sthiell" created="Fri, 30 Apr 2021 23:35:33 +0000"/>
                            <attachment id="38556" name="oak-io2-s1.dknet10.gz" size="5574004" author="sthiell" created="Mon, 10 May 2021 17:46:27 +0000"/>
                            <attachment id="38555" name="oak-io2-s1.dknet9.gz" size="4328863" author="sthiell" created="Mon, 10 May 2021 17:46:15 +0000"/>
                            <attachment id="38554" name="oak-io2-s1.kern.log" size="114504" author="sthiell" created="Mon, 10 May 2021 17:43:56 +0000"/>
                            <attachment id="38557" name="sh02-14n15.dknet1.gz" size="809062" author="sthiell" created="Mon, 10 May 2021 17:49:16 +0000"/>
                            <attachment id="38558" name="sh02-14n15.dknet2.gz" size="357957" author="sthiell" created="Mon, 10 May 2021 17:49:28 +0000"/>
                            <attachment id="38559" name="sh02-14n15.dknet3.gz" size="562564" author="sthiell" created="Mon, 10 May 2021 17:49:32 +0000"/>
                            <attachment id="41058" name="sh02-fir03-debug-lctl_ping_sync.log.gz" size="1699960" author="sthiell" created="Thu, 21 Oct 2021 01:13:53 +0000"/>
                            <attachment id="41118" name="sh02-fir03-kern_sysrq-l-t_20211026.log.gz" size="52477" author="sthiell" created="Tue, 26 Oct 2021 23:09:40 +0000"/>
                            <attachment id="41060" name="sh02-fir03.20211020.peers.txt" size="64310" author="sthiell" created="Thu, 21 Oct 2021 01:29:01 +0000"/>
                            <attachment id="38471" name="sh02-fir04-dknet.log.gz" size="6162933" author="sthiell" created="Sat, 1 May 2021 06:18:17 +0000"/>
                            <attachment id="38560" name="sh03-01n14.dknet.gz" size="3135384" author="sthiell" created="Mon, 10 May 2021 18:00:59 +0000"/>
                            <attachment id="38453" name="sh03-fir06-20210428.zip" size="10722221" author="sthiell" created="Thu, 29 Apr 2021 16:28:02 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i01tdj:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>