<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 03:16:35 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary, append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-15234] LNet high peer reference counts inconsistent with queue</title>
                <link>https://jira.whamcloud.com/browse/LU-15234</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;I believe that peer reference counts may not be decremented in some LNet error path, or that the size of the queue is not accurately reported by &quot;lctl get_param peers&quot;.&lt;/p&gt;

&lt;p&gt;The reference counts reported as &quot;refs&quot; by &quot;lctl get_param peers&quot; are increasing linearly with time. This is in contrast with &quot;queue&quot;, which periodically spikes but then drops back to 0. The output below shows 4 routers on ruby which have refs &amp;gt; 46,000 for a route to 172.19.2.24@o2ib100 even though the reported queue is 0. This is just a little over 6 days since the ruby routers were rebooted during an update.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@ruby1009:~]# pdsh -v -g router lctl get_param peers 2&amp;gt;/dev/null | awk &apos;$3 &amp;gt; 20 {print}&apos; | sed &apos;s/^.*://&apos; | sort -V -u
 172.19.2.24@o2ib100      46957    up     5     8 -46945 -46945     8   -13 0
 172.19.2.24@o2ib100      47380    up     1     8 -47368 -47368     8   -23 0
 172.19.2.24@o2ib100      48449    up    15     8 -48437 -48437     8   -17 0
 172.19.2.24@o2ib100      49999    up     3     8 -49987 -49987     8    -7 0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
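This refs-vs-queue inconsistency can be screened for mechanically. The following is a hypothetical sketch, not part of the original report; the field layout (nid, refs, state, last, max, rtr, min, tx, min, queue) is assumed from the sample output above:

```python
# Hypothetical helper: flag peers whose refcount is far above the
# reported queue depth, mirroring the awk filter in the command above.
# Field order is assumed from the sample "lctl get_param peers" output.
sample = """\
 172.19.2.24@o2ib100      46957    up     5     8 -46945 -46945     8   -13 0
 172.19.2.24@o2ib100      47380    up     1     8 -47368 -47368     8   -23 0
"""

def suspicious_peers(text, threshold=20):
    hits = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 10 and fields[0].count("@"):
            refs, queue = int(fields[1]), int(fields[9])
            if refs > threshold and queue == 0:
                hits.append((fields[0], refs))
    return hits

print(suspicious_peers(sample))
```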
&lt;p&gt;The ruby routers have an intermittent LNet communication problem (the fabric itself seems fine according to several tests, so the underlying issue is still under investigation).&lt;/p&gt;</description>
                <environment>lustre-2.12.7_2.llnl-2.ch6.x86_64&lt;br/&gt;
3.10.0-1160.45.1.1chaos.ch6.x86_64</environment>
        <key id="67186">LU-15234</key>
            <summary>LNet high peer reference counts inconsistent with queue</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="ssmirnov">Serguei Smirnov</assignee>
                                    <reporter username="ofaaland">Olaf Faaland</reporter>
                        <labels>
                            <label>llnl</label>
                    </labels>
                <created>Tue, 16 Nov 2021 00:05:34 +0000</created>
                <updated>Sat, 17 Dec 2022 02:28:08 +0000</updated>
                            <resolved>Tue, 25 Oct 2022 19:09:53 +0000</resolved>
                                                    <fixVersion>Lustre 2.16.0</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>10</watches>
                                                                            <comments>
                            <comment id="318303" author="ofaaland" created="Tue, 16 Nov 2021 00:07:31 +0000"  >&lt;p&gt;My local issue is TOSS5305&lt;/p&gt;</comment>
                            <comment id="318309" author="pjones" created="Tue, 16 Nov 2021 04:21:16 +0000"  >&lt;p&gt;Serguei&lt;/p&gt;

&lt;p&gt;Could you please advise?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="318362" author="ssmirnov" created="Tue, 16 Nov 2021 22:00:50 +0000"  >&lt;p&gt;Hi Olaf,&lt;/p&gt;

&lt;p&gt;From the graph it looks like the refcount is growing slowly and constantly. If it grows fast enough that an increment is likely within a reasonably short window (short enough that the debug log is not overwritten), could you please capture the net debug log for that window:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
lctl set_param debug=+net
lctl dk clear
---- wait for the increment ----
lctl dk &amp;gt; log.txt
lctl set_param debug=-net&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;If the increment window is too long for capturing a debug log to be practical, please provide the syslog instead.&lt;/p&gt;

&lt;p&gt;Before and after the debug window, please capture:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
lnetctl stats show
lnetctl peer show -v 4 --nid &amp;lt;peer nid that leaks refcount&amp;gt; &lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;I agree that there may be a problem with not decrementing the refcount on some error path. Hopefully the debug data can help narrow down which path it is.&lt;/p&gt;

&lt;p&gt;Thanks,&lt;/p&gt;

&lt;p&gt;Serguei.&lt;/p&gt;</comment>
                            <comment id="318364" author="ofaaland" created="Tue, 16 Nov 2021 23:14:28 +0000"  >&lt;p&gt;Hi Serguei, I&apos;ve attached debug logs and the peer and stats before-and-after output.&lt;br/&gt;
thanks&lt;br/&gt;
Olaf&lt;/p&gt;</comment>
                            <comment id="318560" author="ofaaland" created="Thu, 18 Nov 2021 21:37:56 +0000"  >&lt;p&gt;Hi Serguei,&lt;br/&gt;
Do you have any news, or need any additional information?&lt;br/&gt;
thanks,&lt;br/&gt;
Olaf&lt;/p&gt;</comment>
                            <comment id="318576" author="ssmirnov" created="Thu, 18 Nov 2021 23:34:04 +0000"  >&lt;p&gt;Hi Olaf,&lt;/p&gt;

&lt;p&gt;I haven&apos;t had a chance yet to properly process what you have provided. I should be able to give you an update tomorrow.&lt;/p&gt;

&lt;p&gt;Thanks,&lt;/p&gt;

&lt;p&gt;Serguei.&lt;/p&gt;</comment>
                            <comment id="318717" author="ssmirnov" created="Sat, 20 Nov 2021 00:41:39 +0000"  >&lt;p&gt;The refcount appears to be going up at the same rate as rtr_credits are going down (64 between the two &quot;peer show&quot; snapshots). Peer status changed to &quot;down&quot; as we likely didn&apos;t receive a router check ping. Nothing is received from 172.19.2.24@o2ib100 at the LNet level, but it appears that the peer is receiving at least some messages and is returning credits at the LND level.&lt;/p&gt;

&lt;p&gt;Could you please list&#160;&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
dead_router_check_interval
live_router_check_interval
router_ping_timeout
peer_timeout&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;from both nodes?&lt;/p&gt;

&lt;p&gt;Is 172.19.2.24@o2ib100 another router?&lt;/p&gt;

&lt;p&gt;Thanks,&lt;/p&gt;

&lt;p&gt;Serguei.&lt;/p&gt;</comment>
                            <comment id="318719" author="ofaaland" created="Sat, 20 Nov 2021 01:39:05 +0000"  >&lt;p&gt;Hi Serguei,&lt;/p&gt;

&lt;p&gt;Yes, 172.19.2.24 is orelic4, one of the IB-to-TCP &quot;RELIC&quot; routers that is still at Lustre 2.10.&lt;/p&gt;

&lt;p&gt;ruby1016:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;/sys/module/lnet/parameters/dead_router_check_interval:60
/sys/module/lnet/parameters/live_router_check_interval:60
/sys/module/lnet/parameters/router_ping_timeout:50
/sys/module/ko2iblnd/parameters/peer_timeout:180
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;orelic4:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;/sys/module/lnet/parameters/dead_router_check_interval:60
/sys/module/lnet/parameters/live_router_check_interval:60
/sys/module/lnet/parameters/router_ping_timeout:50
/sys/module/ko2iblnd/parameters/peer_timeout:180
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;thanks,&lt;br/&gt;
Olaf&lt;/p&gt;</comment>
                            <comment id="318900" author="ssmirnov" created="Mon, 22 Nov 2021 20:07:02 +0000"  >&lt;p&gt;It could be related to&#160;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12569&quot; title=&quot;IBLND_CREDITS_HIGHWATER does not check connection queue depth&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-12569&quot;&gt;&lt;del&gt;LU-12569&lt;/del&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The 2.10 version probably doesn&apos;t have the fix for it.&#160;&lt;/p&gt;

&lt;p&gt;Could you please provide, from both sides&#160;&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
lctl --version
lnetctl net show -v 4 &lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;to check what the credit-related settings are?&lt;/p&gt;

&lt;p&gt;Thanks,&lt;/p&gt;

&lt;p&gt;Serguei.&lt;/p&gt;</comment>
                            <comment id="318913" author="ofaaland" created="Mon, 22 Nov 2021 22:21:18 +0000"  >&lt;p&gt;Hi Serguei, attached as&lt;/p&gt;

&lt;p&gt;lctl.version.orelic4.1637616867.txt&lt;br/&gt;
lctl.version.ruby1016.1637616519.txt&lt;br/&gt;
lnetctl.net-show.orelic4.1637616889.txt&lt;br/&gt;
lnetctl.net-show.ruby1016.1637616206.txt&lt;/p&gt;

&lt;p&gt;along with module params for orelic4 since 2.10 lnetctl doesn&apos;t report as much with &quot;net show&quot;&lt;br/&gt;
ko2iblnd.parameters.orelic4.1637617473.txt&lt;br/&gt;
ksocklnd.parameters.orelic4.1637617487.txt&lt;br/&gt;
lnet.parameters.orelic4.1637617458.txt&lt;/p&gt;

&lt;p&gt;thanks&lt;br/&gt;
Olaf&lt;/p&gt;</comment>
                            <comment id="319143" author="hornc" created="Wed, 24 Nov 2021 21:13:45 +0000"  >&lt;p&gt;FYI, I think this may be caused by a bug in discovery. I noticed this same symptom on a router while I was doing some internal testing.&lt;/p&gt;

&lt;p&gt;The router had high reference counts for three peers (all Lustre servers):&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;nid00053:~ # awk &apos;{if ($2 &amp;gt; 1){print $0}}&apos; /sys/kernel/debug/lnet/peers
nid                      refs state  last   max   rtr   min    tx   min queue
10.13.100.57@o2ib11      2488    up    -1    16 -2359 -2359    16   -18 0
10.13.100.53@o2ib11      1917    up    -1    16 -1788 -1788    16    -5 0
10.13.100.55@o2ib11      2582    up    -1    16 -2453 -2453    16   -41 0
nid00053:~ #
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Here I&apos;m just locating the lnet_peer.lp_rtrq for each of these peers:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;crash_x86_64&amp;gt; p &amp;amp;the_lnet.ln_peer_tables[0].pt_peer_list
$7 = (struct list_head *) 0xffff8803ea46af90
crash_x86_64&amp;gt; p &amp;amp;the_lnet.ln_peer_tables[1].pt_peer_list
$8 = (struct list_head *) 0xffff8803e9298450
crash_x86_64&amp;gt; p &amp;amp;the_lnet.ln_peer_tables[2].pt_peer_list
$9 = (struct list_head *) 0xffff8803e9298750
crash_x86_64&amp;gt; list -H 0xffff8803ea46af90 -s lnet_peer.lp_primary_nid | egrep -B 1 -e 1407422296843321 -e 1407422296843317 -e 1407422296843319
ffff8803e9297400
  lp_primary_nid = 1407422296843321 &amp;lt;&amp;lt;&amp;lt; 10.13.100.57@o2ib11
crash_x86_64&amp;gt; list -H 0xffff8803e9298450 -s lnet_peer.lp_primary_nid | egrep -B 1 -e 1407422296843321 -e 1407422296843317 -e 1407422296843319
crash_x86_64&amp;gt; list -H 0xffff8803e9298750 -s lnet_peer.lp_primary_nid | egrep -B 1 -e 1407422296843321 -e 1407422296843317 -e 1407422296843319
ffff8803ea3e2e00
  lp_primary_nid = 1407422296843319 &amp;lt;&amp;lt;&amp;lt; 10.13.100.55@o2ib11
ffff8803864ab600
  lp_primary_nid = 1407422296843317 &amp;lt;&amp;lt;&amp;lt; 10.13.100.53@o2ib11
crash_x86_64&amp;gt; struct -o lnet_peer ffff8803e9297400 | grep lp_rtrq
  [ffff8803e9297470] struct list_head lp_rtrq;
crash_x86_64&amp;gt; struct -o lnet_peer ffff8803ea3e2e00 | grep lp_rtrq
  [ffff8803ea3e2e70] struct list_head lp_rtrq;
crash_x86_64&amp;gt; struct -o lnet_peer ffff8803864ab600 | grep lp_rtrq
  [ffff8803864ab670] struct list_head lp_rtrq;
crash_x86_64&amp;gt; list -H ffff8803e9297470 | wc -l
2389
crash_x86_64&amp;gt; list -H ffff8803ea3e2e70 | wc -l
2481
crash_x86_64&amp;gt; list -H ffff8803864ab670 | wc -l
1815
crash_x86_64&amp;gt;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;While trying to track down where the bottleneck was, I noticed that there are two @gni peers that seem to be stuck in discovery:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;crash_x86_64&amp;gt; struct -o lnet the_lnet | grep ln_dc
  [ffffffffa0307078] lnet_handler_t ln_dc_handler;
  [ffffffffa0307080] struct list_head ln_dc_request;
  [ffffffffa0307090] struct list_head ln_dc_working;
  [ffffffffa03070a0] struct list_head ln_dc_expired;
  [ffffffffa03070b0] wait_queue_head_t ln_dc_waitq;
  [ffffffffa03070c8] int ln_dc_state;
crash_x86_64&amp;gt; list -H ffffffffa0307090 -o 224
ffff8803e9297c00
ffff8803e91cc200
crash_x86_64&amp;gt; lnet_peer.lp_primary_nid ffff8803e9297c00
  lp_primary_nid = 3659174697238582
crash_x86_64&amp;gt; lnet_peer.lp_primary_nid ffff8803e91cc200
  lp_primary_nid = 3659174697238578
crash_x86_64&amp;gt; epython nid2str.py 3659174697238582
54@gni
crash_x86_64&amp;gt; epython nid2str.py 3659174697238578
50@gni
crash_x86_64&amp;gt;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
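The nid2str.py conversion used in the session above can be approximated with a short sketch. This is a hypothetical simplification, not the actual script: it assumes the high 32 bits of a NID hold the network (LND type in the upper 16 bits, net number in the lower 16), the low 32 bits hold the address, and its type table is deliberately partial:

```python
# Hypothetical stand-in for the nid2str.py helper used above.
# Assumed 64-bit NID layout: high 32 bits = (lnd_type * 2**16 + net_num),
# low 32 bits = address. The LND type table here is deliberately partial.
LND_NAMES = {2: "tcp", 5: "o2ib", 13: "gni"}   # socklnd, o2iblnd, gnilnd
IP_LNDS = {2, 5}                               # these encode an IPv4 address

def nid2str(nid):
    addr, net = nid % 2**32, nid // 2**32
    lnd, num = net // 2**16, net % 2**16
    name = LND_NAMES.get(lnd, str(lnd))
    if lnd in IP_LNDS:
        addr = ".".join(str(addr // 256**i % 256) for i in (3, 2, 1, 0))
    suffix = str(num) if num else ""
    return f"{addr}@{name}{suffix}"

print(nid2str(3659174697238582))   # 54@gni
print(nid2str(1407422296843321))   # 10.13.100.57@o2ib11
```

Both results match the values decoded in the crash session above.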

&lt;p&gt;These peers were last processed by the discovery thread &lt;em&gt;hours&lt;/em&gt; ago:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;crash_x86_64&amp;gt; lnet_peer.lp_last_queued ffff8803e9297c00
  lp_last_queued = 1637750512
crash_x86_64&amp;gt; lnet_peer.lp_last_queued ffff8803e91cc200
  lp_last_queued = 1637750211
crash_x86_64&amp;gt;

pollux-p4:~ # date -d @1637750512
Wed Nov 24 04:41:52 CST 2021
pollux-p4:~ # date -d @1637750211
Wed Nov 24 04:36:51 CST 2021
pollux-p4:~ #
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The router was dumped a little under 10 hours later:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;        DATE: Wed Nov 24 14:28:12 2021
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The stuck peers have a state that is inconsistent with being on the ln_dc_working queue:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;crash_x86_64&amp;gt; lnet_peer.lp_state ffff8803e9297c00
  lp_state = 338
crash_x86_64&amp;gt; lnet_peer.lp_state ffff8803e91cc200
  lp_state = 338
crash_x86_64&amp;gt;

*hornc@cflosbld09 fs4 $ lpst2str.sh 338
LNET_PEER_NO_DISCOVERY
LNET_PEER_DISCOVERED
LNET_PEER_DISCOVERING
LNET_PEER_NIDS_UPTODATE
*hornc@cflosbld09 fs4 $
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;These @gni peers have numerous messages on the lnet_peer.lp_dc_pendq:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;crash_x86_64&amp;gt; struct -o lnet_peer ffff8803e9297c00 | grep lp_dc_pendq
  [ffff8803e9297c20] struct list_head lp_dc_pendq;
crash_x86_64&amp;gt; struct -o lnet_peer ffff8803e91cc200 | grep lp_dc_pendq
  [ffff8803e91cc220] struct list_head lp_dc_pendq;
crash_x86_64&amp;gt; list -H ffff8803e9297c20 | wc -l
214
crash_x86_64&amp;gt; list -H ffff8803e91cc220 | wc -l
170
crash_x86_64&amp;gt;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;It is likely that those messages are what is consuming, and not releasing, the lpni_rtrcredits for the three @o2ib11 peers that show the high reference counts.&lt;/p&gt;

&lt;p&gt;I haven&apos;t yet figured out why the @gni peers are stuck in discovery.&lt;/p&gt;</comment>
                            <comment id="319149" author="hornc" created="Wed, 24 Nov 2021 22:40:33 +0000"  >&lt;p&gt;Olaf, does your 2.12.7 router have the fix from &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13883&quot; title=&quot;LNetPrimaryNID assumes lpni for a particular NID will not change through discovery but lnet_peer_add_nid() may allocate a new one for it.&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13883&quot;&gt;&lt;del&gt;LU-13883&lt;/del&gt;&lt;/a&gt; backported to it?&lt;/p&gt;</comment>
                            <comment id="319298" author="hornc" created="Sat, 27 Nov 2021 17:25:56 +0000"  >&lt;p&gt;I discovered a race between the discovery thread and other threads that are queueing a peer for discovery.&lt;/p&gt;

&lt;p&gt;When the discovery thread finishes processing a peer it calls lnet_peer_discovered() which clears the LNET_PEER_DISCOVERING bit from the peer state:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;        lp-&amp;gt;lp_state |= LNET_PEER_DISCOVERED;
        lp-&amp;gt;lp_state &amp;amp;= ~(LNET_PEER_DISCOVERING |
                          LNET_PEER_REDISCOVER);
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;At this point, the peer is on the lnet.ln_dc_working queue. When lnet_peer_discovered() returns, the lnet_peer.lp_lock spinlock is dropped, and the discovery thread acquires the lnet_net_lock/EX. This is where the race window exists:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;                        spin_unlock(&amp;amp;lp-&amp;gt;lp_lock);
&amp;lt;&amp;lt;&amp;lt; Race window &amp;gt;&amp;gt;&amp;gt;
                        lnet_net_lock(LNET_LOCK_EX);
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;If another thread queues this peer for discovery during this window, then the LNET_PEER_DISCOVERING bit is added back to the peer state, but since the peer is already on the lnet.ln_dc_working queue, it does &lt;em&gt;not&lt;/em&gt; get added to the lnet.ln_dc_request queue.&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;static int lnet_peer_queue_for_discovery(struct lnet_peer *lp)
...
        spin_lock(&amp;amp;lp-&amp;gt;lp_lock);
        if (!(lp-&amp;gt;lp_state &amp;amp; LNET_PEER_DISCOVERING))
                lp-&amp;gt;lp_state |= LNET_PEER_DISCOVERING;
        spin_unlock(&amp;amp;lp-&amp;gt;lp_lock);
        if (list_empty(&amp;amp;lp-&amp;gt;lp_dc_list)) {  &amp;lt;&amp;lt;&amp;lt; Peer is on ln_dc_working
                lnet_peer_addref_locked(lp);
                list_add_tail(&amp;amp;lp-&amp;gt;lp_dc_list, &amp;amp;the_lnet.ln_dc_request);
...
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;When the discovery thread acquires the lnet_net_lock/EX, it sees that the LNET_PEER_DISCOVERING bit has not been cleared, so it does not call lnet_peer_discovery_complete() which is responsible for sending messages on the peer&apos;s discovery pending queue.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;                        spin_unlock(&amp;amp;lp-&amp;gt;lp_lock);
&amp;lt;&amp;lt;&amp;lt; Race window &amp;gt;&amp;gt;&amp;gt;
                        lnet_net_lock(LNET_LOCK_EX);
...
                        if (!(lp-&amp;gt;lp_state &amp;amp; LNET_PEER_DISCOVERING))
                                lnet_peer_discovery_complete(lp);
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
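The interleaving can be sketched as a deterministic model. This is a hypothetical Python simplification, not the LNet code itself; the flag, queue, and function names are borrowed from the snippets above:

```python
# Deterministic model of the discovery-queue race described above.
# Peer state flags are modeled as a set, queues as lists; this is a
# hypothetical simplification of lnet_peer_discovered() and
# lnet_peer_queue_for_discovery(), not the real implementation.

class Peer:
    def __init__(self):
        self.state = set()
        self.on_dc_list = False   # models list_empty(lp_dc_list) being false
        self.dc_pendq = []        # messages awaiting discovery completion

def discovery_thread_finish(peer):
    # lnet_peer_discovered(): set DISCOVERED, clear DISCOVERING under lp_lock
    peer.state.add("DISCOVERED")
    peer.state.discard("DISCOVERING")
    # lp_lock is dropped here: this is the race window

def other_thread_queue(peer, msg):
    # lnet_peer_queue_for_discovery() running inside the race window
    peer.state.add("DISCOVERING")
    if not peer.on_dc_list:
        pass                      # peer is still on ln_dc_working, so it is
                                  # NOT re-added to ln_dc_request
    peer.dc_pendq.append(msg)

def discovery_thread_resume(peer):
    # after taking lnet_net_lock/EX: complete only if DISCOVERING is clear
    if "DISCOVERING" not in peer.state:
        peer.dc_pendq.clear()     # lnet_peer_discovery_complete() sends these
        peer.on_dc_list = False
        return True
    return False                  # peer stays stuck on ln_dc_working

p = Peer()
p.on_dc_list = True
discovery_thread_finish(p)
other_thread_queue(p, "msg1")     # lands in the race window
completed = discovery_thread_resume(p)
print(completed, len(p.dc_pendq))
```

In this interleaving discovery never completes, so the pending message stays queued, matching the accumulating lp_dc_pendq counts observed in the dump.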

&lt;p&gt;At this point, the peer is stuck on the lnet.ln_dc_working queue, and messages may continue to accumulate on the peer&apos;s lnet_peer.lp_dc_pendq.&lt;/p&gt;</comment>
                            <comment id="319355" author="ofaaland" created="Mon, 29 Nov 2021 18:03:04 +0000"  >&lt;p&gt;Hi Chris,&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Olaf, does your 2.12.7 router have the fix from &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13883&quot; title=&quot;LNetPrimaryNID assumes lpni for a particular NID will not change through discovery but lnet_peer_add_nid() may allocate a new one for it.&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13883&quot;&gt;&lt;del&gt;LU-13883&lt;/del&gt;&lt;/a&gt; backported to it?&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;No, it does not.&lt;/p&gt;</comment>
                            <comment id="319359" author="gerrit" created="Mon, 29 Nov 2021 18:12:51 +0000"  >&lt;p&gt;&quot;Chris Horn &amp;lt;chris.horn@hpe.com&amp;gt;&quot; uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/45670&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/45670&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-15234&quot; title=&quot;LNet high peer reference counts inconsistent with queue&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-15234&quot;&gt;&lt;del&gt;LU-15234&lt;/del&gt;&lt;/a&gt; lnet: Race on discovery queue&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 43d9fe70f33defb31e7402d35474b4ef39560657&lt;/p&gt;</comment>
                            <comment id="319392" author="hornc" created="Mon, 29 Nov 2021 20:43:50 +0000"  >&lt;p&gt;Thanks, Olaf.&lt;/p&gt;

&lt;blockquote&gt;&lt;p&gt;This is just a little over 6 days since the ruby routers were rebooted during an update.&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;From what I can tell, the discovery defect that I found has been there a long time, but maybe I am missing something that has caused this issue to start manifesting. Can you provide any additional detail about when you started seeing this issue? Was everything working fine until this update that you referenced in the description? What did the update entail?&lt;/p&gt;</comment>
                            <comment id="319403" author="ofaaland" created="Mon, 29 Nov 2021 22:10:19 +0000"  >&lt;p&gt;Hi Chris,&lt;/p&gt;

&lt;p&gt;I first documented the high refcounts 2021-09-17.&#160; I recall seeing it before then, but I&apos;m not sure how long before.&#160; Our systems were updated to our 2.12.7_2.llnl tag (the tag we&apos;re still on) about 2021-08-10.&#160;&lt;/p&gt;

&lt;p&gt;There was a new issue subsequent to that August update - we started seeing some router nodes reporting &quot;Timed out RDMA&quot; with some other routers for no reason we could find, like this:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@orelic1:~]# grep &apos;LNetError.*Timed out RDMA&apos; /var/log//conman/console.orelic4 | grep ^2021-08 | nidgrep | sort -V | uniq -c
      1 19.1.104@o2ib10
      1 172.19.1.101@o2ib100
    896 172.19.1.103@o2ib100
    674 172.19.1.104@o2ib100
      1 172.19.2.7@o2ib100
      1 172.19.2.8@o2ib100
      1 172.19.2.40@o2ib100
      1 172.19.2.43@o2ib100
      1 172.19.2.44@o2ib100
      1 172.19.2.46@o2ib100 &lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;where 172.19.1.&amp;#91;101-104&amp;#93; all have the same hardware, same LNet versions, and same role (i.e. were on the path to the same endpoints).&lt;/p&gt;

&lt;p&gt;The LNet related patches that were new to the _2.llnl tag, which was what we updated to in August, were:&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14627&quot; title=&quot;Lost ref on lnet_peer in discovery leads to LNetError: 24909:0:(peer.c:292:lnet_destroy_peer_locked()) ASSERTION( list_empty(&amp;amp;lp-&amp;gt;lp_peer_nets) ) failed:&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14627&quot;&gt;&lt;del&gt;LU-14627&lt;/del&gt;&lt;/a&gt; lnet: Allow delayed sends&lt;/li&gt;
	&lt;li&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14627&quot; title=&quot;Lost ref on lnet_peer in discovery leads to LNetError: 24909:0:(peer.c:292:lnet_destroy_peer_locked()) ASSERTION( list_empty(&amp;amp;lp-&amp;gt;lp_peer_nets) ) failed:&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14627&quot;&gt;&lt;del&gt;LU-14627&lt;/del&gt;&lt;/a&gt; lnet: Ensure ref taken when queueing for discovery&lt;/li&gt;
	&lt;li&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13972&quot; title=&quot;kiblnd can continue attempting to reconnect indefinitely.&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13972&quot;&gt;&lt;del&gt;LU-13972&lt;/del&gt;&lt;/a&gt; o2iblnd: Don&apos;t retry indefinitely&lt;/li&gt;
	&lt;li&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14488&quot; title=&quot;Support rdma_connect_locked()&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14488&quot;&gt;&lt;del&gt;LU-14488&lt;/del&gt;&lt;/a&gt; o2ib: Use rdma_connect_locked if it is defined&lt;/li&gt;
	&lt;li&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14588&quot; title=&quot;LNet: make config script aware of the ofed symbols &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14588&quot;&gt;&lt;del&gt;LU-14588&lt;/del&gt;&lt;/a&gt; o2ib: make config script aware of the ofed symbols&lt;/li&gt;
	&lt;li&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14733&quot; title=&quot;brw_bulk_ready() BRW bulk READ failed for RPC from 12345-192.168.128.126@o2ib18: -103&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14733&quot;&gt;&lt;del&gt;LU-14733&lt;/del&gt;&lt;/a&gt; o2iblnd: Avoid double posting invalidate&lt;/li&gt;
	&lt;li&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14733&quot; title=&quot;brw_bulk_ready() BRW bulk READ failed for RPC from 12345-192.168.128.126@o2ib18: -103&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14733&quot;&gt;&lt;del&gt;LU-14733&lt;/del&gt;&lt;/a&gt; o2iblnd: Move racy NULL assignment&lt;/li&gt;
&lt;/ul&gt;
</comment>
                            <comment id="319587" author="gerrit" created="Tue, 30 Nov 2021 16:25:20 +0000"  >&lt;p&gt;&quot;Chris Horn &amp;lt;chris.horn@hpe.com&amp;gt;&quot; uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/45681&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/45681&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-15234&quot; title=&quot;LNet high peer reference counts inconsistent with queue&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-15234&quot;&gt;&lt;del&gt;LU-15234&lt;/del&gt;&lt;/a&gt; lnet: Race on discovery queue&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_12&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: f7e0853c7da89724cb89a8b3bad972c661b55794&lt;/p&gt;</comment>
                            <comment id="319592" author="hornc" created="Tue, 30 Nov 2021 16:56:04 +0000"  >&lt;p&gt;Olaf, I pushed a backport of this patch to b2_12, just in case you want to try it and see if it resolves your issue.&lt;/p&gt;</comment>
                            <comment id="319602" author="ofaaland" created="Tue, 30 Nov 2021 17:40:13 +0000"  >&lt;p&gt;Thank you, Chris.&#160; I will try it.&lt;/p&gt;</comment>
                            <comment id="319604" author="ofaaland" created="Tue, 30 Nov 2021 17:40:45 +0000"  >&lt;p&gt;Serguei, before I do try Chris&apos; patch, can you or Amir review it (at least the patch against master)?&#160; Thank you.&lt;/p&gt;</comment>
                            <comment id="320362" author="ofaaland" created="Thu, 9 Dec 2021 01:14:03 +0000"  >&lt;p&gt;Serguei and Chris,&lt;br/&gt;
We applied the patch from &lt;a href=&quot;https://review.whamcloud.com/#/c/45681/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/#/c/45681/&lt;/a&gt; (version 3) to our 2.12.8 branch, and still see the climbing refcount and dropping min_rtr.  See attached&lt;br/&gt;
 &lt;span class=&quot;image-wrap&quot; style=&quot;&quot;&gt;&lt;a id=&quot;41672_thumb&quot; href=&quot;https://jira.whamcloud.com/secure/attachment/41672/41672_peer+status+orelic4+with+discovery+race+patch+v3.png&quot; title=&quot;peer status orelic4 with discovery race patch v3.png&quot; file-preview-type=&quot;image&quot; file-preview-id=&quot;41672&quot; file-preview-title=&quot;peer status orelic4 with discovery race patch v3.png&quot;&gt;&lt;img src=&quot;https://jira.whamcloud.com/secure/thumbnail/41672/_thumb_41672.png&quot; style=&quot;border: 0px solid black&quot; role=&quot;presentation&quot;/&gt;&lt;/a&gt;&lt;/span&gt; &lt;/p&gt;

&lt;p&gt;What is different is that we now see queue (reported by lctl get_param peers) climb along with refs. So I think we have progress, but some other issue remains.&lt;/p&gt;</comment>
                            <comment id="320421" author="hornc" created="Thu, 9 Dec 2021 17:48:22 +0000"  >&lt;p&gt;Olaf, would you be able to collect a crash dump, vmlinux, and Lustre kos from a router experiencing the high refcount issue, preferably from a router running the patch?&lt;/p&gt;</comment>
                            <comment id="320427" author="ofaaland" created="Thu, 9 Dec 2021 18:54:06 +0000"  >&lt;p&gt;Hi Chris, I need to check whether I can make a crash dump available.&lt;/p&gt;</comment>
                            <comment id="320443" author="ofaaland" created="Thu, 9 Dec 2021 22:05:13 +0000"  >&lt;p&gt;Chris, I can make a crash dump available to you.  Can you let me know a way I can send it to you privately?&lt;/p&gt;

&lt;p&gt;Serguei, if it would help you to have it also, please let me know a way to send it.&lt;/p&gt;

&lt;p&gt;thanks,&lt;br/&gt;
Olaf&lt;/p&gt;</comment>
                            <comment id="320543" author="hornc" created="Fri, 10 Dec 2021 21:02:12 +0000"  >&lt;blockquote&gt;&lt;p&gt;Chris, I can make a crash dump available to you. Can you let me know a way I can send it to you privately?&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;Yes, I need to check with our new IT org on the best way to do it. I&apos;ll let you know.&lt;/p&gt;</comment>
                            <comment id="320545" author="hornc" created="Fri, 10 Dec 2021 21:17:45 +0000"  >&lt;p&gt;Olaf, you can upload to this ftp:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;sftp -o Port=2222 lu15234@ftp.ext.hpe.com
pass: e$VS3mw_ 
&amp;gt; put &amp;lt;filename&amp;gt;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="320557" author="defazio" created="Fri, 10 Dec 2021 22:00:23 +0000"  >&lt;p&gt;Olaf is not officially working today. I&apos;ll send you the tarball he made yesterday, which includes the kernel dump.&lt;/p&gt;

&lt;p&gt;Sent. Hopefully it showed up on your side.&lt;/p&gt;</comment>
                            <comment id="320780" author="hornc" created="Mon, 13 Dec 2021 16:38:29 +0000"  >&lt;p&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=defazio&quot; class=&quot;user-hover&quot; rel=&quot;defazio&quot;&gt;defazio&lt;/a&gt; I got it. Thank you&lt;/p&gt;</comment>
                            <comment id="320782" author="hornc" created="Mon, 13 Dec 2021 16:42:22 +0000"  >&lt;p&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=defazio&quot; class=&quot;user-hover&quot; rel=&quot;defazio&quot;&gt;defazio&lt;/a&gt; &lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=ofaaland&quot; class=&quot;user-hover&quot; rel=&quot;ofaaland&quot;&gt;ofaaland&lt;/a&gt; can you provide the lustre-debuginfo package that matches kmod-lustre-2.12.8_3.llnl-1.ch6.x86_64.rpm ?&lt;/p&gt;</comment>
                            <comment id="320786" author="defazio" created="Mon, 13 Dec 2021 17:06:33 +0000"  >&lt;p&gt;I&apos;ve sent lustre-debuginfo-2.12.8_3.llnl-1.ch6.x86_64.rpm&lt;/p&gt;</comment>
                            <comment id="320799" author="hornc" created="Mon, 13 Dec 2021 19:59:58 +0000"  >&lt;p&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=defazio&quot; class=&quot;user-hover&quot; rel=&quot;defazio&quot;&gt;defazio&lt;/a&gt; &lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=ofaaland&quot; class=&quot;user-hover&quot; rel=&quot;ofaaland&quot;&gt;ofaaland&lt;/a&gt; can you provide the output of &apos;&lt;tt&gt;lnetctl global show&lt;/tt&gt;&apos; from one peer on each cluster? What I mean is, I believe you have something like:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;(Cluster A) &amp;lt;-&amp;gt; (Router Cluster B) &amp;lt;-&amp;gt; (Router Cluster C) &amp;lt;-&amp;gt; (Cluster D)
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;and I&apos;m looking for &apos;&lt;tt&gt;lnetctl global show&lt;/tt&gt;&apos; output from one peer in each Cluster A, B, C, and D. Also, could you let me know if you have any tuning changes in place on each cluster? By this I mean, if you are explicitly setting any lnet/ko2iblnd/ksocklnd kernel module parameters, or if you are doing any tuning by executing &apos;&lt;tt&gt;lnetctl set&lt;/tt&gt;&apos; commands, etc. Thanks.&lt;/p&gt;</comment>
                            <comment id="320807" author="hornc" created="Mon, 13 Dec 2021 21:35:26 +0000"  >&lt;p&gt;The fix I authored is only applicable to LNet peers that undergo discovery. In 2.12 LTS, router peers do not undergo discovery, so that explains why the fix didn&apos;t help with your issue.&lt;/p&gt;

&lt;p&gt;In the dump, we can see this peer with a high refcount (relative to other peers; there are a handful that have refcount between 10-20) and negative lpni_rtrcredits:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;crash_x86_64&amp;gt; epython lnet.py -p --nid 172.16.70.63@tcp -d
lnet_peer: ffff9c9bc20cb980
  lp_primary_nid: 172.16.70.63@tcp
  lp_state:
  lp_dc_pendq: ffff9c9bc20cb9a0(0)
  lp_dc_list: ffff9c9bc20cba18(0)
  lp_peer_nets: ffff9c9bc20cb990
    - lnet_peer_net: tcp(ffff9cbba3ff4cc0)
      - lpn_peer_nis: ffff9cbba3ff4cd0
        - lnet_peer_ni: ffff9c9bc3b18600
          - lpni_nid: 172.16.70.63@tcp
          - lpni_refcount: !!!!!!245!!!!!!
          - lpni_healthv: 1000
          - lpni_txcredits: 3
          - lpni_mintxcredits: 0
          - lpni_rtrcredits: -226
          - lpni_minrtrcredits: -226
          - lpni_rtrq: ffff9c9bc3b18650(226)
          - lpni_last_alive: 847
          - lpni_txq: ffff9c9bc3b18640(0)
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In the debug log extracted from the dump, we can see timeout errors for this peer:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00000800:00000100:21.0:1639088392.413370:0:16075:0:(socklnd_cb.c:2390:ksocknal_find_timed_out_conn()) Timeout receiving from 12345-172.16.70.63@tcp (172.16.70.63:988), state 4 wanted 0 left 0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;*hornc@cflosbld09 dec9-lu-15234-llnl $ grep 172.16.70.63@tcp dk.log.fmt | grep -c ksocknal_find_timed_out_conn
116
*hornc@cflosbld09 dec9-lu-15234-llnl $
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Your LND timeout is currently set to the b2_12 default, which is 5 seconds:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;crash_x86_64&amp;gt; lnet_lnd_timeout
lnet_lnd_timeout = $497 = 5
crash_x86_64&amp;gt;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I think we need to increase this timeout value. However, there are some quirks with how this value is set in b2_12. I&apos;m guessing that you have lnet_health_sensitivity=0. That setting results in lnet_transaction_timeout being set to 50, and lnet_retry_count being set to 0, but it doesn&apos;t update the lnet_lnd_timeout correctly (see the patches for &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13510&quot; title=&quot;Allow control over LND timeouts independent of lnet_transaction_timeout and retry_count&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13510&quot;&gt;&lt;del&gt;LU-13510&lt;/del&gt;&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;The easiest solution is to just explicitly set the lnet_transaction_timeout to some value not equal to 50. This needs to be done &lt;em&gt;after&lt;/em&gt; the lnet_health_sensitivity is set to 0.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;options lnet lnet_health_sensitivity=0
options lnet lnet_transaction_timeout=49
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;or at runtime (doesn&apos;t persist across reboots)&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;lnetctl set transaction_timeout 49
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;You might experiment with this value to find the smallest value that resolves the issue, but I wouldn&apos;t go any lower than 10 seconds, and I would avoid anything &amp;gt; 50. If you&apos;re still seeing timeouts/network errors with it set to 49, then you may have other issues with your network that warrant investigation (bad cables, etc.).&lt;/p&gt;

&lt;p&gt;The above assumes you are tuning all of your clusters the same way. If that isn&apos;t the case, and you provide the information I requested in my previous comment, then I can provide specific tuning guidance for each cluster.&lt;/p&gt;</comment>
                            <comment id="320809" author="hornc" created="Mon, 13 Dec 2021 21:43:05 +0000"  >&lt;p&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=ssmirnov&quot; class=&quot;user-hover&quot; rel=&quot;ssmirnov&quot;&gt;ssmirnov&lt;/a&gt; whamcloud might want to consider backporting &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13510&quot; title=&quot;Allow control over LND timeouts independent of lnet_transaction_timeout and retry_count&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13510&quot;&gt;&lt;del&gt;LU-13510&lt;/del&gt;&lt;/a&gt; patches to b2_12, or authoring a fix specific to b2_12 to address the issue with setting lnet_lnd_timeout correctly.&lt;/p&gt;</comment>
                            <comment id="320822" author="defazio" created="Tue, 14 Dec 2021 02:01:22 +0000"  >&lt;p&gt;Uploading file params_20211213.tar.gz&lt;/p&gt;</comment>
                            <comment id="320948" author="hornc" created="Wed, 15 Dec 2021 17:15:19 +0000"  >&lt;p&gt;Thanks &lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=defazio&quot; class=&quot;user-hover&quot; rel=&quot;defazio&quot;&gt;defazio&lt;/a&gt;, but I&apos;m a little confused by the information that you&apos;ve provided. The data in the tarball suggests that you have the following:&lt;/p&gt;

&lt;p&gt;orelic2 - Lustre 2.10 - local networks tcp0, o2ib100 w/routes to various other o2ib networks&lt;br/&gt;
zrelic2 - Lustre 2.10 - local networks tcp0, o2ib600 w/routes to various other o2ib networks&lt;br/&gt;
ruby1009 - Lustre 2.12 - local networks o2ib39, o2ib100 w/routes to o2ib600&lt;br/&gt;
zinc2 - Lustre 2.12 - local network o2ib600 w/routes to tcp0 and various other o2ib networks&lt;/p&gt;

&lt;p&gt;Is that right?&lt;/p&gt;

&lt;p&gt;The crash dump you provided has &quot;NODENAME: orelic4&quot;, but this node was running 2.12. Can you clarify?&lt;/p&gt;

&lt;p&gt;Also, we can see in the crash dump that the peer received a connection request from a node with ko2iblnd peer_credits=32, but I do not see that parameter specified anywhere in the tarball:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;crash_x86_64&amp;gt; dmesg | grep kiblnd_passive_connect
[  145.267098] LNetError: 437:0:(o2iblnd_cb.c:2554:kiblnd_passive_connect()) Can&apos;t accept conn from 172.19.1.54@o2ib100, queue depth too large:  32 (&amp;lt;=8 wanted)
[  185.922065] LNetError: 437:0:(o2iblnd_cb.c:2554:kiblnd_passive_connect()) Can&apos;t accept conn from 172.19.1.55@o2ib100, queue depth too large:  32 (&amp;lt;=8 wanted)
[  185.938289] LNetError: 437:0:(o2iblnd_cb.c:2554:kiblnd_passive_connect()) Skipped 3 previous similar messages
crash_x86_64&amp;gt;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;hornc@C02V50B9HTDG params_20211213 % grep -a &apos;ko2iblnd peer_credits&apos; *
hornc@C02V50B9HTDG params_20211213 % grep -a peer_credits *
ko2iblnd.parameters.orelic2.1639444262:/sys/module/ko2iblnd/parameters/peer_credits:8
ko2iblnd.parameters.orelic2.1639444262:/sys/module/ko2iblnd/parameters/peer_credits_hiw:0
ko2iblnd.parameters.orelic2.1639444262:/sys/module/ksocklnd/parameters/peer_credits:8
ko2iblnd.parameters.orelic2.1639444262:/sys/module/ksocklnd/parameters/peer_credits:8
ko2iblnd.parameters.orelic2.1639444262:              peer_credits: 0
ko2iblnd.parameters.orelic2.1639444262:              peer_credits: 8
ko2iblnd.parameters.orelic2.1639444262:              peer_credits: 8
ko2iblnd.parameters.orelic2.1639444262:              peer_credits: 0
ko2iblnd.parameters.orelic2.1639444262:              peer_credits: 8
ko2iblnd.parameters.orelic2.1639444262:              peer_credits: 8
ko2iblnd.parameters.orelic2.1639444262:              peer_credits: 0
ko2iblnd.parameters.orelic2.1639444262:              peer_credits: 8
ko2iblnd.parameters.orelic2.1639444262:              peer_credits: 8
ko2iblnd.parameters.zrelic2.1639443533:/sys/module/ko2iblnd/parameters/peer_credits:8
ko2iblnd.parameters.zrelic2.1639443533:/sys/module/ko2iblnd/parameters/peer_credits_hiw:0
ksocklnd.parameters.orelic2.1639444379:/sys/module/ksocklnd/parameters/peer_credits:8
ksocklnd.parameters.zrelic2.1639443594:/sys/module/ksocklnd/parameters/peer_credits:8
lnetctl-net-show.orelic2.1639444148:              peer_credits: 0
lnetctl-net-show.orelic2.1639444148:              peer_credits: 8
lnetctl-net-show.orelic2.1639444148:              peer_credits: 8
lnetctl-net-show.zrelic2.1639443299:              peer_credits: 0
lnetctl-net-show.zrelic2.1639443299:              peer_credits: 8
lnetctl-net-show.zrelic2.1639443299:              peer_credits: 8
lnetctl-net-show.zrelic2.1639443299~:              peer_credits: 0
lnetctl-net-show.zrelic2.1639443299~:              peer_credits: 8
lnetctl-net-show.zrelic2.1639443299~:              peer_credits: 8
hornc@C02V50B9HTDG params_20211213 %
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;What am I missing?&lt;/p&gt;</comment>
                            <comment id="320971" author="hornc" created="Wed, 15 Dec 2021 21:02:10 +0000"  >&lt;p&gt;In any case, getting back to the timeout issue.&lt;/p&gt;

&lt;p&gt;Lustre 2.10 has default LND timeouts of 50 seconds for both ksocklnd and ko2iblnd. You can see that in the parameters file for orelic2:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;ko2iblnd.parameters.orelic2.1639444262:/sys/module/ko2iblnd/parameters/timeout:50
ko2iblnd.parameters.orelic2.1639444262:/sys/module/ksocklnd/parameters/sock_timeout:50
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;So we probably want to get the 2.12 nodes to match that. I figured out that we can get exactly 50 with this set of parameters (note, the order is important):&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;options lnet lnet_retry_count=0 # Sets lnet_lnd_timeout = lnet_transaction_timeout (lnet_transaction_timeout should have default value of 50)
options lnet lnet_health_sensitivity=0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I would suggest putting this configuration in place on all your Lustre 2.12 nodes. If you are still seeing network timeout issues then I would suggest doing some investigation into the network to see if it is healthy.&lt;/p&gt;</comment>
                            <comment id="321057" author="ofaaland" created="Thu, 16 Dec 2021 23:18:14 +0000"  >&lt;p&gt;Hi Chris,&lt;/p&gt;

&lt;p&gt;Regarding the parameters, I&apos;m attaching them for an orelic node running Lustre 2.12, which is when we see the symptoms described in this issue.&#160; The orelic/zrelic nodes currently run lustre 2.10 because of the issues we&apos;ve seen, which is why you got the parameters for that configuration.&#160; orelic and zrelic are configured the same, except for the routes.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/41732/41732_orelic4-lustre212-20211216.tgz&quot; title=&quot;orelic4-lustre212-20211216.tgz attached to LU-15234&quot;&gt;orelic4-lustre212-20211216.tgz&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;After gathering these parameters, we put the configuration you suggest in place on orelic4 while it was running lustre 2.12 and verified that lnet_lnd_timeout was set to 50.&lt;/p&gt;

&lt;p&gt;The node first ran with lustre 2.12 and our stock settings (as in the attached tarball) and refs built up to about 590.&#160; We then set lnet_health_sensitivity=100, set lnet_retry_count=2, then set lnet_retry_count=0, then set lnet_health_sensitivity=0.&lt;/p&gt;

&lt;p&gt;After this we observed refs continued to climb, but much more slowly - the rate was probably 1/4 or less of the rate of climb before changing lnet_lnd_timeout.&lt;/p&gt;

&lt;p&gt;We&apos;ll make that change more widely and see how it goes.&lt;/p&gt;</comment>
                            <comment id="321234" author="hornc" created="Mon, 20 Dec 2021 20:08:54 +0000"  >&lt;p&gt;Hi Olaf,&lt;/p&gt;

&lt;p&gt;How&apos;s it going with the param changes? As I noted earlier, if you continue to see network/timeout errors after increasing the LND timeout, then you may have some other issue going on with your network.&lt;/p&gt;

&lt;p&gt;Also, can you clarify where the ko2iblnd peer_credits=32 is coming from that I asked about earlier?&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Also, we can see in crash dump that the peer received a connection request from node with ko2iblnd peer_credits=32, but I do not see that parameter specified anywhere in the tarball:

crash_x86_64&amp;gt; dmesg | grep kiblnd_passive_connect
[  145.267098] LNetError: 437:0:(o2iblnd_cb.c:2554:kiblnd_passive_connect()) Can&apos;t accept conn from 172.19.1.54@o2ib100, queue depth too large:  32 (&amp;lt;=8 wanted)
[  185.922065] LNetError: 437:0:(o2iblnd_cb.c:2554:kiblnd_passive_connect()) Can&apos;t accept conn from 172.19.1.55@o2ib100, queue depth too large:  32 (&amp;lt;=8 wanted)
[  185.938289] LNetError: 437:0:(o2iblnd_cb.c:2554:kiblnd_passive_connect()) Skipped 3 previous similar messages
crash_x86_64&amp;gt;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I only ask because ideally you would have peer_credits the same on all o2iblnd peers, although it is not fatal to have them different as long as the peers are able to negotiate to a lower value.&lt;/p&gt;</comment>
                            <comment id="321333" author="ofaaland" created="Wed, 22 Dec 2021 01:31:25 +0000"  >&lt;p&gt;Hi Chris,&lt;/p&gt;

&lt;p&gt;I ran out of time before I went on vacation, so I won&apos;t know for a couple weeks.&#160; I&apos;ll post here as soon as I&apos;ve made the change.&lt;/p&gt;

&lt;p&gt;Yes, those two peers are running Lustre 2.14, and may have ended up with different settings accidentally.&#160; I&apos;ll have to check.&lt;/p&gt;

&lt;p&gt;Thanks!&lt;/p&gt;</comment>
                            <comment id="321406" author="gerrit" created="Thu, 23 Dec 2021 07:19:54 +0000"  >&lt;p&gt;&quot;Oleg Drokin &amp;lt;green@whamcloud.com&amp;gt;&quot; merged in patch &lt;a href=&quot;https://review.whamcloud.com/45670/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/45670/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-15234&quot; title=&quot;LNet high peer reference counts inconsistent with queue&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-15234&quot;&gt;&lt;del&gt;LU-15234&lt;/del&gt;&lt;/a&gt; lnet: Race on discovery queue&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 852a4b264a984979dcef1fbd4685cab1350010ca&lt;/p&gt;</comment>
                            <comment id="321462" author="pjones" created="Thu, 23 Dec 2021 14:42:50 +0000"  >&lt;p&gt;Landed for 2.15&lt;/p&gt;</comment>
                            <comment id="321501" author="hornc" created="Thu, 23 Dec 2021 21:01:53 +0000"  >&lt;p&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=pjones&quot; class=&quot;user-hover&quot; rel=&quot;pjones&quot;&gt;pjones&lt;/a&gt; I&apos;m going to re-open this ticket until Olaf can verify that the tuning recommendations have alleviated his issue. It was probably a mistake for me to push that code change against this ticket as it turned out not to be the root cause of Olaf&apos;s problem. I&apos;m sorry about that.&lt;/p&gt;</comment>
                            <comment id="321502" author="hornc" created="Thu, 23 Dec 2021 21:02:59 +0000"  >&lt;p&gt;Alternatively, &lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=ofaaland&quot; class=&quot;user-hover&quot; rel=&quot;ofaaland&quot;&gt;ofaaland&lt;/a&gt; if you&apos;re okay with it we could open a new ticket to continue investigation into your issue.&lt;/p&gt;</comment>
                            <comment id="321848" author="ofaaland" created="Wed, 5 Jan 2022 19:59:32 +0000"  >&lt;p&gt;Chris, I&apos;m fine with either continuing in this ticket or opening a new one.&#160; I&apos;m updating tunings over the next couple of days.&#160; Thanks!&lt;/p&gt;</comment>
                            <comment id="321973" author="ofaaland" created="Thu, 6 Jan 2022 22:45:58 +0000"  >&lt;p&gt;Hi Chris and Serguei,&lt;/p&gt;

&lt;p&gt;&amp;gt; How&apos;s it going with the param changes?&lt;/p&gt;

&lt;p&gt;I changed the timeout as prescribed above, on all the systems (clients, routers, servers).&#160; I then rebooted orelic4 into an image with lustre 2.12.&#160; The changed timeout did not change the symptoms.&#160; I still see the climbing &quot;refs&quot; on the orelic4 node when I boot it into Lustre 2.12, sadly.&lt;/p&gt;

&lt;p&gt;&amp;gt; As I noted earlier, if you continue to see network/timeout errors after increasing the LND timeout, then you may have some other issue going on with your network.&lt;/p&gt;

&lt;p&gt;I don&apos;t think problems with the network (i.e. switches, cables, NICs, drivers) can explain this, because we don&apos;t see these issues when orelic4 (and other nodes in the orelic cluster) are running Lustre 2.10 - only when they are running 2.12.&#160; Do you have other ideas?&lt;/p&gt;

&lt;p&gt;I&apos;ve gathered information from the node (dmesg, lctl dk, module params, etc.) and also gathered a crash dump.&lt;/p&gt;</comment>
                            <comment id="322650" author="ssmirnov" created="Thu, 13 Jan 2022 23:14:51 +0000"  >&lt;p&gt;Olaf,&lt;/p&gt;

&lt;p&gt;I was wondering whether before I provide an instrumented build for debugging, in the meantime you could try the test making sure that o2iblnd parameters are consistent between orelic4 and the nodes it is talking to directly, specifically &lt;em&gt;peer_credits_hiw=4&lt;/em&gt; and&#160;&lt;em&gt;concurrent_sends=8&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Thanks,&lt;/p&gt;

&lt;p&gt;Serguei.&lt;/p&gt;</comment>
                            <comment id="322653" author="ofaaland" created="Fri, 14 Jan 2022 00:01:25 +0000"  >&lt;blockquote&gt;&lt;p&gt;I was wondering whether before I provide an instrumented build for debugging, in the meantime you could try the test making sure that o2iblnd parameters are consistent between orelic4 and the nodes it is talking to directly, specifically&#160;&lt;em&gt;peer_credits_hiw=4&lt;/em&gt;&#160;and&#160;&lt;em&gt;concurrent_sends=8&lt;/em&gt;&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;Yes, I&apos;ll check whether those parameters are consistent.&lt;/p&gt;</comment>
                            <comment id="323071" author="ofaaland" created="Tue, 18 Jan 2022 18:18:47 +0000"  >&lt;p&gt;Hi Serguei, orelic4 and all of the nodes it talks to directly over o2ib have &lt;em&gt;peer_credits_hiw=0&lt;/em&gt;&#160;and&#160;&lt;em&gt;concurrent_sends=0&lt;/em&gt; (we don&apos;t set those values).&lt;/p&gt;

</comment>
                            <comment id="324208" author="ssmirnov" created="Fri, 28 Jan 2022 00:33:55 +0000"  >&lt;p&gt;Hi Olaf,&lt;/p&gt;

&lt;p&gt;I prepared a patch that can be applied on top of 2.12.7-llnl&lt;/p&gt;

&lt;p&gt;This patch provides more detailed info on lpni refcounts. There are individual counts for each place in the code where the lpni refcount is incremented or decremented.&lt;/p&gt;

&lt;p&gt;After applying the patch, once the peer with excessive refcount is identified, you can use&#160;&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
lnetctl peer show -v 5 --nid &amp;lt;nid_of_peer_with_high_refcount&amp;gt;&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;command to display the detailed counts, which are dumped at the end of the output.&lt;/p&gt;

&lt;p&gt;This should help narrow down the issue a bit.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/42035/42035_debug_refcount_01.patch&quot; title=&quot;debug_refcount_01.patch attached to LU-15234&quot;&gt;debug_refcount_01.patch&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;Thanks,&lt;/p&gt;

&lt;p&gt;Serguei.&lt;/p&gt;</comment>
                            <comment id="324218" author="ofaaland" created="Fri, 28 Jan 2022 02:08:52 +0000"  >&lt;p&gt;Hi Serguei,&lt;/p&gt;

&lt;p&gt;Thank you, that looks good.&#160; We&apos;re at 2.12.8 these days, but the patch applies cleanly.&#160; Is there any reason not to push it to gerrit with &quot;fortestonly&quot;?&lt;/p&gt;</comment>
                            <comment id="324309" author="ssmirnov" created="Fri, 28 Jan 2022 15:56:15 +0000"  >&lt;p&gt;Hi Olaf,&lt;/p&gt;

&lt;p&gt;I didn&apos;t think I could push it to LLNL repo. Do you mean I should push it to b2_12 of Lustre?&lt;/p&gt;</comment>
                            <comment id="324321" author="ofaaland" created="Fri, 28 Jan 2022 17:32:20 +0000"  >&lt;p&gt;&amp;gt; Do you mean I should push it to b2_12 of Lustre?&lt;/p&gt;

&lt;p&gt;Yes, that&apos;s what I meant, with &quot;&lt;tt&gt;Test-Parameters:&lt;/tt&gt;&#160;&lt;tt&gt;fortestonly&lt;/tt&gt;&quot; and an appropriately limited set of tests.&#160; Would that be inappropriate?&#160;&lt;/p&gt;</comment>
                            <comment id="324336" author="gerrit" created="Fri, 28 Jan 2022 19:17:08 +0000"  >&lt;p&gt;&quot;Serguei Smirnov &amp;lt;ssmirnov@whamcloud.com&amp;gt;&quot; uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/46364&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/46364&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-15234&quot; title=&quot;LNet high peer reference counts inconsistent with queue&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-15234&quot;&gt;&lt;del&gt;LU-15234&lt;/del&gt;&lt;/a&gt; lnet: add debug info for lpni refcounts&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_12&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: d4abf0db289afad72c7d4ac468aec4e2c7c2f935&lt;/p&gt;</comment>
                            <comment id="324355" author="hornc" created="Fri, 28 Jan 2022 21:28:09 +0000"  >&lt;p&gt;I&apos;ve been reviewing the related code off and on and I have found one reference leak, though I doubt it is responsible for your issue because it would only be hit on ENOMEM error (which is probably rare), and this code path deals with resizing the ping buffer which should not happen very often. This ping code is suspicious though, because it is something that has changed from 2.10 -&amp;gt; 2.12.&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;static void
lnet_ping_router_locked(struct lnet_peer_ni *rtr)
{
        struct lnet_rc_data *rcd = NULL;
        time64_t now = ktime_get_seconds();
        time64_t secs;
        struct lnet_ni *ni;

        lnet_peer_ni_addref_locked(rtr); &amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt; Addref

        if (rtr-&amp;gt;lpni_ping_deadline != 0 &amp;amp;&amp;amp; /* ping timed out? */
            now &amp;gt;  rtr-&amp;gt;lpni_ping_deadline)
                lnet_notify_locked(rtr, 1, 0, now);

        /* Run any outstanding notifications */
        ni = lnet_get_next_ni_locked(rtr-&amp;gt;lpni_net, NULL);
        lnet_ni_notify_locked(ni, rtr);

        if (!lnet_isrouter(rtr) ||
            the_lnet.ln_mt_state != LNET_MT_STATE_RUNNING) {
                /* router table changed or router checker is shutting down */
                lnet_peer_ni_decref_locked(rtr);
                return;
        }

        rcd = rtr-&amp;gt;lpni_rcd;

        /*
         * The response to the router checker ping could&apos;ve timed out and
         * the mdh might&apos;ve been invalidated, so we need to update it
         * again.
         */
        if (!rcd || rcd-&amp;gt;rcd_nnis &amp;gt; rcd-&amp;gt;rcd_pingbuffer-&amp;gt;pb_nnis ||
            LNetMDHandleIsInvalid(rcd-&amp;gt;rcd_mdh))
                rcd = lnet_update_rc_data_locked(rtr);
        if (rcd == NULL)
                return; &amp;lt;&amp;lt;&amp;lt;&amp;lt;&amp;lt; Reference leak
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="324362" author="gerrit" created="Fri, 28 Jan 2022 21:36:30 +0000"  >&lt;p&gt;&quot;Chris Horn &amp;lt;chris.horn@hpe.com&amp;gt;&quot; uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/46367&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/46367&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-15234&quot; title=&quot;LNet high peer reference counts inconsistent with queue&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-15234&quot;&gt;&lt;del&gt;LU-15234&lt;/del&gt;&lt;/a&gt; lnet: ref leak in lnet_ping_router_locked&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_12&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: c28647196038c4c4bdae113c02e6898821fcaa8f&lt;/p&gt;</comment>
                            <comment id="324403" author="ofaaland" created="Sat, 29 Jan 2022 01:02:34 +0000"  >&lt;p&gt;Thanks Serguei.&#160; I&apos;ll be able to run with this patch Tuesday morning, and I&apos;ll get results right away.&lt;/p&gt;</comment>
                            <comment id="324407" author="ofaaland" created="Sat, 29 Jan 2022 01:27:41 +0000"  >&lt;p&gt;Thanks Chris.&#160; I&apos;m adding your patch to my stack so the Tuesday morning test will be with both patches.&lt;/p&gt;</comment>
                            <comment id="324771" author="ofaaland" created="Tue, 1 Feb 2022 19:41:40 +0000"  >&lt;p&gt;Hi Serguei and Chris, I&apos;ve uploaded orelic4.debug_refcount_01.tar.gz which has dmesg, lustre debug log (with our default flags, errors only), and lnetctl peer show for a peer with a high refcount.&#160; Thanks&lt;/p&gt;</comment>
                            <comment id="324789" author="hornc" created="Tue, 1 Feb 2022 22:08:36 +0000"  >&lt;p&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=faaland1&quot; class=&quot;user-hover&quot; rel=&quot;faaland1&quot;&gt;faaland1&lt;/a&gt; I&apos;ll let Serguei dig into the data from the debug patch, but I noticed in the lustre debug log you provided that a few nodes appear to be down/unreachable:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;hornc@C02V50B9HTDG pass1 % grep &apos;ADDR ERR&apos; dk.1.txt | awk &apos;{print $2}&apos; | sort -u
172.19.1.59@o2ib100:
172.19.1.91@o2ib100:
172.19.1.92@o2ib100:
172.19.2.26@o2ib100:
172.19.2.27@o2ib100:
hornc@C02V50B9HTDG pass1 %
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Is this expected? i.e. were these peers actually down during the time captured by the log?&lt;/p&gt;</comment>
                            <comment id="324792" author="ofaaland" created="Tue, 1 Feb 2022 22:29:29 +0000"  >&lt;p&gt;Hi Chris,&lt;/p&gt;

&lt;p&gt;172.19.1.59@o2ib100: This host was not down.  I noticed afterwards that it failed to reconnect with orelic4 after orelic4 was rebooted;  I haven&apos;t had a chance to look into it yet, so I don&apos;t know why.&lt;/p&gt;

&lt;p&gt;172.19.1.91@o2ib100: retired/expected&lt;br/&gt;
172.19.1.92@o2ib100: retired/expected&lt;br/&gt;
172.19.2.26@o2ib100: down/expected&lt;br/&gt;
172.19.2.27@o2ib100: down/expected&lt;/p&gt;</comment>
                            <comment id="324822" author="ssmirnov" created="Wed, 2 Feb 2022 03:32:39 +0000"  >&lt;p&gt;Hi Olaf,&lt;/p&gt;

&lt;p&gt;The debug data listing itemized lpni refcounts looks incorrect: the sum of &quot;decrement&quot; counts exceeds the sum of &quot;increment&quot; counts by a lot, whereas I expected the difference (increments minus decrements) to be no less than the reported total peer ni ref count. I wonder if the debug patch got applied correctly.&lt;/p&gt;

&lt;p&gt;Was the patch applied to LLNL repo or was lustre b2_12 used? I&apos;d like to check if the resulting code is missing some of the &quot;addref&quot; cases. Could you please provide the diff of the change and specify which commit was used as a base?&lt;/p&gt;

&lt;p&gt;Thanks,&lt;/p&gt;

&lt;p&gt;Serguei.&lt;/p&gt;</comment>
                            <comment id="324927" author="ofaaland" created="Wed, 2 Feb 2022 16:42:36 +0000"  >&lt;p&gt;Hi Serguei,&lt;/p&gt;

&lt;p&gt;I applied the patch to lustre&apos;s branch based on 2.12.8.  Here&apos;s what I built:&lt;br/&gt;
branch &lt;a href=&quot;https://github.com/LLNL/lustre/tree/debug-refcount-01&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://github.com/LLNL/lustre/tree/debug-refcount-01&lt;/a&gt; tag 2.12.8_6.llnl.olaf3&lt;/p&gt;

&lt;p&gt;There was a merge conflict because I had both Chris&apos; refcount leak patch and your patch on the same branch.  Maybe I goofed up the conflict resolution.&lt;/p&gt;

&lt;p&gt;thanks,&lt;br/&gt;
Olaf&lt;/p&gt;</comment>
                            <comment id="325003" author="hornc" created="Wed, 2 Feb 2022 21:35:46 +0000"  >&lt;p&gt;If the system was under load, then the counts might not match up simply due to activity on the router while the stats were being dumped.&lt;/p&gt;</comment>
                            <comment id="325007" author="ssmirnov" created="Wed, 2 Feb 2022 21:51:29 +0000"  >&lt;p&gt;I&apos;m going to update the debug patch to make use of atomic increments. I&apos;ll also rebase it on top of Chris&apos;s fix.&lt;/p&gt;</comment>
                            <comment id="325620" author="ssmirnov" created="Tue, 8 Feb 2022 18:14:41 +0000"  >&lt;p&gt;Hi,&lt;/p&gt;

&lt;p&gt;The debug patch has been updated. Please try the same test again and use&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
lnetctl peer show -v 5 --nid &amp;lt;nid_of_peer_with_high_refcount&amp;gt; &lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;to dump the debug data.&lt;/p&gt;

&lt;p&gt;Thanks,&lt;/p&gt;

&lt;p&gt;Serguei.&#160;&lt;/p&gt;</comment>
                            <comment id="326400" author="ofaaland" created="Tue, 15 Feb 2022 19:50:10 +0000"  >&lt;p&gt;Serguei,&lt;/p&gt;

&lt;p&gt;I&apos;ve attached 4 files - peer.show.172.16.70.*_at_tcp.orelic4.1644951836&lt;br/&gt;
peers .63,65 had climbing refcounts&lt;br/&gt;
peers .62,64 did not have climbing refcounts.&lt;br/&gt;
orelic4 was running your latest refcount debug patch using an atomic_t array.&lt;/p&gt;</comment>
                            <comment id="326570" author="ssmirnov" created="Thu, 17 Feb 2022 01:35:41 +0000"  >&lt;p&gt;Olaf,&lt;/p&gt;

&lt;p&gt;Now the debug counts add up and make more sense. However, so far I haven&apos;t been able to find any obvious leaks, so the most likely explanation is still messages just not getting finalized in LNet, and it is still unclear exactly why. I found a patch that may be related (&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-10428&quot; title=&quot;LNet events should generated without resource lock held&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-10428&quot;&gt;&lt;del&gt;LU-10428&lt;/del&gt;&lt;/a&gt;), but because I&apos;m not certain about it, I&apos;ll probably also add some more debugging. I&apos;ll let you know when the patches are ready.&lt;/p&gt;

&lt;p&gt;Thanks,&lt;/p&gt;

&lt;p&gt;Serguei.&lt;/p&gt;</comment>
                            <comment id="326572" author="ofaaland" created="Thu, 17 Feb 2022 01:50:39 +0000"  >&lt;p&gt;Hi Serguei,&lt;/p&gt;

&lt;p&gt;OK, thanks.&#160; If you want, you could base your patches on 2.14.&lt;/p&gt;

&lt;p&gt;-Olaf&lt;/p&gt;</comment>
                            <comment id="327665" author="gerrit" created="Mon, 28 Feb 2022 19:10:29 +0000"  >&lt;p&gt;&quot;Serguei Smirnov &amp;lt;ssmirnov@whamcloud.com&amp;gt;&quot; uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/46650&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/46650&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-15234&quot; title=&quot;LNet high peer reference counts inconsistent with queue&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-15234&quot;&gt;&lt;del&gt;LU-15234&lt;/del&gt;&lt;/a&gt; lnet: add mechanism for dumping lnd peer debug info&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_14&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: fa91a09fe507d17bc04568266589d7298c4e4025&lt;/p&gt;</comment>
                            <comment id="327890" author="ssmirnov" created="Wed, 2 Mar 2022 17:23:26 +0000"  >&lt;p&gt;Hi Olaf,&#160;&lt;/p&gt;

&lt;p&gt;The new patch is based on b2_14. It adds the ability to examine the lnd peer.&lt;/p&gt;

&lt;p&gt;If there&apos;s a peer with climbing refcounts&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
lnetctl debug peer --prim_nid=&amp;lt;peer nid&amp;gt; &lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;will dump the peer info to the log (the &quot;console&quot; flag, which should be enabled by default), to be retrieved with &quot;lctl dk&quot;.&lt;/p&gt;

&lt;p&gt;Note that the &quot;prim_nid&quot; parameter doesn&apos;t actually require the primary nid of the peer, but rather takes the nid of the specific lpni. I&apos;ll update the patch to make this clearer a bit later.&lt;/p&gt;

&lt;p&gt;The purpose of this is to check the idea that something on lnd level is preventing messages from being finalized.&lt;/p&gt;

&lt;p&gt;Thanks,&lt;/p&gt;

&lt;p&gt;Serguei.&lt;/p&gt;</comment>
                            <comment id="330269" author="ofaaland" created="Fri, 25 Mar 2022 17:29:13 +0000"  >&lt;p&gt;Hi Serguei,&lt;/p&gt;

&lt;p&gt;Sorry for the long delay.&lt;/p&gt;

&lt;p&gt;With 2.14 + this patch running on orelic4, I did not see the climbing peer refcounts, even though I still saw the symptoms described in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14026&quot; title=&quot;symptoms of message loss or corruption after upgrading routers to lustre 2.12.5&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14026&quot;&gt;LU-14026&lt;/a&gt; (reconnects between clients and targets, timed out messages on clients and servers, &quot;Lost connection to MGS&quot; on clients).&lt;/p&gt;

&lt;p&gt;So perhaps my suggestion to use 2.14 wasn&apos;t a good one.  I&apos;m wondering if there are multiple issues and 2.14 doesn&apos;t have the one that causes climbing refcounts.  Ideas?&lt;/p&gt;

&lt;p&gt;thanks,&lt;br/&gt;
Olaf&lt;/p&gt;</comment>
                            <comment id="336059" author="gerrit" created="Thu, 26 May 2022 00:27:29 +0000"  >&lt;p&gt;&quot;Gian-Carlo DeFazio &amp;lt;defazio1@llnl.gov&amp;gt;&quot; uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/47460&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/47460&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-15234&quot; title=&quot;LNet high peer reference counts inconsistent with queue&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-15234&quot;&gt;&lt;del&gt;LU-15234&lt;/del&gt;&lt;/a&gt; lnet: add mechanism for dumping lnd peer debug info&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_12&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 5dec282aff43c739e8fd422df9e0de8fd93e35ba&lt;/p&gt;</comment>
                            <comment id="336193" author="ofaaland" created="Fri, 27 May 2022 16:02:07 +0000"  >&lt;p&gt;Hi Serguei,&lt;/p&gt;

&lt;p&gt;Can you review Gian-Carlo&apos;s backport of your lnd peer debug info patch?&#160; We&apos;re seeing this climbing refcount issue more widely.&lt;/p&gt;

&lt;p&gt;thanks,&lt;br/&gt;
Olaf&lt;/p&gt;</comment>
                            <comment id="336199" author="ssmirnov" created="Fri, 27 May 2022 17:20:05 +0000"  >&lt;p&gt;Hi Olaf,&lt;/p&gt;

&lt;p&gt;Would it be possible to check which Mellanox FW versions are used? There was a recent investigation at one of the DDN sites which isolated xxx.30.xxxx FW version as problematic: there&apos;s a bug in this version which can cause &quot;stuck qp&quot; in IB layer. I&apos;d like to make sure we&apos;re not affected by the same problem here.&lt;/p&gt;

&lt;p&gt;Thanks,&lt;/p&gt;

&lt;p&gt;Serguei.&lt;/p&gt;</comment>
                            <comment id="336556" author="ofaaland" created="Wed, 1 Jun 2022 19:23:42 +0000"  >&lt;p&gt;Hi Serguei,&lt;/p&gt;

&lt;p&gt;Routers with this symptom recently have FW&#160;16.29.2002.&#160; We don&apos;t have any routers running xxx.30.xxxx.&lt;/p&gt;

&lt;p&gt;thanks&lt;/p&gt;</comment>
                            <comment id="337110" author="ofaaland" created="Thu, 9 Jun 2022 00:46:45 +0000"  >&lt;p&gt;Hi Serguei,&lt;/p&gt;

&lt;p&gt;We reproduced the issue on orelic2, with &lt;a href=&quot;https://review.whamcloud.com/47460&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/47460&lt;/a&gt;, under Lustre 2.12.8.&#160;&lt;/p&gt;

&lt;p&gt;There were 4 peers with high refcounts, with NIDs&#160; 172.16.70.6&amp;#91;2-5&amp;#93;@tcp.&#160; I captured the debug information multiple times for some of those peers, but I may not be able to identify which peer a set of debug output is for.&#160; I&apos;ll post that mapping if I find it. The debug information, as well as the output of &quot;lnetctl peer show --details&quot;, is attached.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/43990/43990_lnetctl.peer.show.orelic2.1654723542.txt&quot; title=&quot;lnetctl.peer.show.orelic2.1654723542.txt attached to LU-15234&quot;&gt;lnetctl.peer.show.orelic2.1654723542.txt&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt;&lt;br/&gt;
&lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/43991/43991_lnetctl.peer.show.orelic2.1654724780.txt&quot; title=&quot;lnetctl.peer.show.orelic2.1654724780.txt attached to LU-15234&quot;&gt;lnetctl.peer.show.orelic2.1654724780.txt&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt;&lt;br/&gt;
&lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/43992/43992_dk.orelic2.1654723678.txt&quot; title=&quot;dk.orelic2.1654723678.txt attached to LU-15234&quot;&gt;dk.orelic2.1654723678.txt&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt;&lt;br/&gt;
&lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/43993/43993_dk.orelic2.1654723686.txt&quot; title=&quot;dk.orelic2.1654723686.txt attached to LU-15234&quot;&gt;dk.orelic2.1654723686.txt&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt;&lt;br/&gt;
&lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/43994/43994_dk.orelic2.1654724730.txt&quot; title=&quot;dk.orelic2.1654724730.txt attached to LU-15234&quot;&gt;dk.orelic2.1654724730.txt&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt;&lt;br/&gt;
&lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/43995/43995_dk.orelic2.1654724740.txt&quot; title=&quot;dk.orelic2.1654724740.txt attached to LU-15234&quot;&gt;dk.orelic2.1654724740.txt&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt;&lt;br/&gt;
&lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/43996/43996_dk.orelic2.1654724745.txt&quot; title=&quot;dk.orelic2.1654724745.txt attached to LU-15234&quot;&gt;dk.orelic2.1654724745.txt&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt;&lt;br/&gt;
&lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/43997/43997_dk.orelic2.1654724751.txt&quot; title=&quot;dk.orelic2.1654724751.txt attached to LU-15234&quot;&gt;dk.orelic2.1654724751.txt&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;Thanks,&lt;br/&gt;
Olaf&lt;/p&gt;</comment>
                            <comment id="337673" author="ofaaland" created="Mon, 13 Jun 2022 22:00:03 +0000"  >&lt;p&gt;Hi Serguei,&lt;br/&gt;
I don&apos;t have a record of which peer NID was given as the argument, for the above debug sessions.  Do you need me to reproduce this and keep track of that?&lt;br/&gt;
thanks,&lt;br/&gt;
Olaf&lt;/p&gt;</comment>
                            <comment id="337676" author="ssmirnov" created="Mon, 13 Jun 2022 22:19:55 +0000"  >&lt;p&gt;Hi Olaf,&lt;/p&gt;

&lt;p&gt;So far, from looking at the logs you provided, I haven&apos;t seen any outputs with abnormal stats for any of the peers you dumped, which may mean that the problem is not reflected at the lnd level.&lt;/p&gt;

&lt;p&gt;If you do reproduce again, you could try using &quot;lnetctl peer show -v 4&quot; (vs. just &quot;lnetctl peer show&quot;). To reduce the amount of output this produces, you can use the &quot;--nid&quot; option to dump for a specific peer only.&lt;/p&gt;

&lt;p&gt;In the meantime I&apos;m looking at how instrumentation can be extended to yield more useful info.&lt;/p&gt;

&lt;p&gt;Thanks,&lt;/p&gt;

&lt;p&gt;Serguei.&lt;/p&gt;</comment>
                            <comment id="337899" author="ssmirnov" created="Thu, 16 Jun 2022 03:32:38 +0000"  >&lt;p&gt;Hi Olaf,&lt;/p&gt;

&lt;p&gt;After discussing with &lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=ashehata&quot; class=&quot;user-hover&quot; rel=&quot;ashehata&quot;&gt;ashehata&lt;/a&gt;, I wonder if we could revisit testing with the &quot;detailed peer refcount summary&quot; patch &lt;a href=&quot;https://review.whamcloud.com/46364&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/46364&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I&apos;d like to clarify the following:&lt;/p&gt;

&lt;p&gt;1) How do the &quot;detailed&quot; counts change over time (for a peer whose refcount is steadily increasing)? This means taking more than one snapshot of lnetctl output: e.g. at refcount 100 and again at refcount 500.&lt;/p&gt;

&lt;p&gt;2) The increasing peer refcount appears to be associated with a negative number of router credits, i.e. we&apos;re slow at routing messages from this peer. What happens if the corresponding route is removed from the peer?&lt;/p&gt;

&lt;p&gt;Not sure if it is easy enough to arrange, but for &quot;2&quot; it should be possible to remove the route dynamically using lnetctl. After the route is removed, we should stop receiving traffic from this peer. We would finish forwarding whatever messages we had queued up and rtr_credits should return to normal value. In order to avoid issues with &quot;symmetry&quot;, it would be best to remove the route from all peers. Then we can check what happened to the peer refcount: dump the &quot;detailed&quot; counts again and try to delete the peer using lnetctl (won&apos;t work if there&apos;s actually a leak). Maybe dump a core, too.&lt;/p&gt;

&lt;p&gt;Thanks,&lt;/p&gt;

&lt;p&gt;Serguei.&lt;/p&gt;</comment>
                            <comment id="338289" author="ofaaland" created="Tue, 21 Jun 2022 22:54:48 +0000"  >&lt;p&gt;Hi Serguei,&lt;/p&gt;

&lt;p&gt;I was able to gather detailed counts over time, remove the affected node from all routes so no messages should have been coming in to be routed, attempt to stop lnet, and obtain a crash dump.  The node that ran 2.12 with the debug patch was &quot;orelic2&quot;.&lt;/p&gt;

&lt;p&gt;The detailed counts and debug logs are attached:&#160; &lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/44111/44111_2022-jun-21.tgz&quot; title=&quot;2022-jun-21.tgz attached to LU-15234&quot;&gt;2022-jun-21.tgz&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;To provide context:&lt;br/&gt;
2022-06-21 14:36:56 LNet started with debug patch&lt;br/&gt;
2022-06-21 14:55:00 Removed routes on other clusters where gateway == orelic2. (time approximate)&lt;br/&gt;
2022-06-21 15:21:34 issued &quot;lnetctl lnet unconfigure&quot;&lt;br/&gt;
2022-06-21 15:26:21 crashed orelic2 to gather the dump&lt;/p&gt;

&lt;p&gt;The timestamps on the files in the tarball will tell you when counts, debug logs, etc. were gathered.&lt;/p&gt;

&lt;p&gt;Before removing routes, the refcounts continued to climb.&lt;br/&gt;
After removing routes, the refcounts plateaued at 82.&lt;br/&gt;
The &quot;lnetctl lnet unconfigure&quot; command hung.&lt;/p&gt;

&lt;p&gt;I&apos;ve also included debug logs for the period.  I changed the debug mask to -1 after removing routes but before issuing &quot;lnetctl lnet unconfigure&quot;.&lt;/p&gt;

&lt;p&gt;I can send you the crash dump.&lt;/p&gt;

&lt;p&gt;Thanks,&lt;br/&gt;
Olaf&lt;/p&gt;</comment>
                            <comment id="340077" author="ofaaland" created="Mon, 11 Jul 2022 16:55:25 +0000"  >&lt;p&gt;Hi Serguei, do you have any update on this?&lt;br/&gt;
Thanks,&lt;br/&gt;
Olaf&lt;/p&gt;</comment>
                            <comment id="340109" author="ssmirnov" created="Mon, 11 Jul 2022 21:14:43 +0000"  >&lt;p&gt;Hi Olaf,&lt;/p&gt;

&lt;p&gt;I examined the traces you provided. It still looks like some messages are just not getting finalized. One idea I have is that they might have gotten stuck in the resend queue somehow.&lt;/p&gt;

&lt;p&gt;Could you please give me access to the crash dump if you still have it, along with the debuginfo rpms?&lt;/p&gt;

&lt;p&gt;Thanks,&lt;/p&gt;

&lt;p&gt;Serguei.&lt;/p&gt;</comment>
                            <comment id="340123" author="ofaaland" created="Mon, 11 Jul 2022 23:37:24 +0000"  >&lt;p&gt;Hi Serguei,&lt;br/&gt;
I&apos;ve uploaded the dump and debuginfos via ftp.  Please confirm you received them.&lt;br/&gt;
thanks,&lt;br/&gt;
Olaf&lt;/p&gt;</comment>
                            <comment id="340183" author="ssmirnov" created="Tue, 12 Jul 2022 18:40:21 +0000"  >&lt;p&gt;Hi Olaf,&lt;/p&gt;

&lt;p&gt;I found these files&#160;&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
-rw-r--r-- &#160;1 sdsmirnov &#160;staff &#160; 469346936 12 Jul 11:36 kernel-debuginfo-3.10.0-1160.66.1.1chaos.ch6.x86_64.rpm
-rw-r--r-- &#160;1 sdsmirnov &#160;staff &#160; &#160;65354176 12 Jul 11:37 kernel-debuginfo-common-x86_64-3.10.0-1160.66.1.1chaos.ch6.x86_64.rpm
-rw-r--r-- &#160;1 sdsmirnov &#160;staff &#160; &#160;19370216 12 Jul 11:37 lustre-debuginfo-2.12.8_9.llnl.olaf1.toss5305-1.ch6_1.x86_64.rpm
-rw-r--r-- &#160;1 sdsmirnov &#160;staff &#160;1270395238 12 Jul 11:34 vmcore
-rw-r--r-- &#160;1 sdsmirnov &#160;staff &#160; &#160; &#160;148855 12 Jul 11:34 vmcore-dmesg.txt&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;and copied them over to my machine. I&apos;ll take a look and keep you updated.&lt;/p&gt;

&lt;p&gt;Thanks,&lt;/p&gt;

&lt;p&gt;Serguei.&lt;/p&gt;</comment>
                            <comment id="341508" author="ofaaland" created="Mon, 25 Jul 2022 22:55:54 +0000"  >&lt;p&gt;Hi Serguei,&lt;br/&gt;
Do you have any updates?&lt;br/&gt;
thanks,&lt;br/&gt;
Olaf&lt;/p&gt;</comment>
                            <comment id="341525" author="ssmirnov" created="Tue, 26 Jul 2022 03:35:10 +0000"  >&lt;p&gt;Hi Olaf,&lt;/p&gt;

&lt;p&gt;I had too many distractions and haven&apos;t finished looking at the core yet.&lt;/p&gt;

&lt;p&gt;Basically, I believe what I see in the core so far does confirm the idea that messages are not getting finalized, but I still haven&apos;t understood why. In the LNet layer, the number of queued messages on the problem peer looks consistent with the high refcount, but I still need to dig more at the LND level and examine the message queues there.&lt;/p&gt;

&lt;p&gt;Thanks,&lt;/p&gt;

&lt;p&gt;Serguei.&lt;/p&gt;</comment>
                            <comment id="342976" author="gerrit" created="Mon, 8 Aug 2022 22:26:48 +0000"  >&lt;p&gt;&quot;Serguei Smirnov &amp;lt;ssmirnov@whamcloud.com&amp;gt;&quot; uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/48163&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/48163&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-15234&quot; title=&quot;LNet high peer reference counts inconsistent with queue&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-15234&quot;&gt;&lt;del&gt;LU-15234&lt;/del&gt;&lt;/a&gt; lnet: test for race when completing discovery&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_12&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 0eb36b2ace98b0c57595098a3a6d9f5de8e6045c&lt;/p&gt;</comment>
                            <comment id="342983" author="ssmirnov" created="Mon, 8 Aug 2022 22:48:45 +0000"  >&lt;p&gt;Hi Olaf,&lt;/p&gt;

&lt;p&gt;While examining the core I found that messages causing the delay are waiting to be sent: they are listed on lp_dc_pendq of the destination peer.&lt;/p&gt;

&lt;p&gt;At the same time, the destination peer is not queued to be discovered, so it appears that there&apos;s no good reason for the messages to be delayed.&lt;/p&gt;

&lt;p&gt;I pushed a test patch in order to rule out a race condition which somehow enables a thread to queue a message for a peer which is not (or no longer) going to be discovered. The new patch attempts to recognize this situation on discovery completion, print an error, and handle any messages which are still pending. This should help locate the race condition if it is actually occurring. If this is the only cause, with this patch we should see the error message &quot;Peer X msg list not empty on disc comp&quot; and no further refcount increase.&lt;/p&gt;

&lt;p&gt;Otherwise, I&apos;ll have to look for other possible causes.&lt;/p&gt;

&lt;p&gt;Thanks,&lt;/p&gt;

&lt;p&gt;Serguei.&lt;/p&gt;</comment>
                            <comment id="343118" author="hornc" created="Tue, 9 Aug 2022 16:23:12 +0000"  >&lt;p&gt;Sounds like &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12739&quot; class=&quot;external-link&quot; rel=&quot;nofollow&quot;&gt;https://jira.whamcloud.com/browse/LU-12739&lt;/a&gt; ?&lt;/p&gt;</comment>
                            <comment id="343142" author="ssmirnov" created="Tue, 9 Aug 2022 20:07:48 +0000"  >&lt;p&gt;Chris,&lt;/p&gt;

&lt;p&gt;Yes indeed, it looks very much like &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12739&quot; title=&quot;Race with discovery thread completion and message queueing&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-12739&quot;&gt;&lt;del&gt;LU-12739&lt;/del&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I&apos;ll port these changes.&lt;/p&gt;

&lt;p&gt;Thanks,&lt;/p&gt;

&lt;p&gt;Serguei.&lt;/p&gt;</comment>
                            <comment id="343388" author="ssmirnov" created="Thu, 11 Aug 2022 17:14:36 +0000"  >&lt;p&gt;Hi Olaf,&lt;/p&gt;

&lt;p&gt;I ported Chris&apos;s fix for &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12739&quot; title=&quot;Race with discovery thread completion and message queueing&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-12739&quot;&gt;&lt;del&gt;LU-12739&lt;/del&gt;&lt;/a&gt; to b2_12: &lt;a href=&quot;https://review.whamcloud.com/#/c/48190/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/#/c/48190/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Please give this patch a try. It aims to eliminate a race condition with effects potentially similar to what is seen in the coredump you provided.&lt;/p&gt;

&lt;p&gt;Thanks,&lt;/p&gt;

&lt;p&gt;Serguei.&lt;/p&gt;</comment>
                            <comment id="344587" author="ofaaland" created="Wed, 24 Aug 2022 22:59:04 +0000"  >&lt;p&gt;Hi Serguei&lt;/p&gt;

&lt;p&gt;I tested 2.12.9 + change 48190 today and results so far are promising. I&apos;ll test it further and post here in the next couple of days.&lt;/p&gt;</comment>
                            <comment id="344659" author="ofaaland" created="Thu, 25 Aug 2022 16:00:14 +0000"  >&lt;p&gt;Hi Serguei,&lt;/p&gt;

&lt;p&gt;2.12.9 + change 48190 held up well overnight, which is far longer than we&apos;ve needed to wait for symptoms in the past.  If you can get someone to perform a second review on the patch in gerrit, that would be great.&lt;/p&gt;

&lt;p&gt;I&apos;ll deploy more widely and update here early next week.&lt;/p&gt;

&lt;p&gt;thanks,&lt;br/&gt;
Olaf&lt;/p&gt;</comment>
                            <comment id="345153" author="ofaaland" created="Wed, 31 Aug 2022 00:54:08 +0000"  >&lt;p&gt;Hi Serguei,&lt;/p&gt;

&lt;p&gt;2.12.9 + change 48190 appears to have resolved this issue on orelic, which has been a reliable reproducer.&lt;/p&gt;

&lt;p&gt;Olaf&lt;/p&gt;</comment>
                            <comment id="346404" author="ofaaland" created="Mon, 12 Sep 2022 17:59:53 +0000"  >&lt;p&gt;As far as I&apos;m concerned, this will be resolved when the patch lands to b2_12.&#160; Do you agree?&#160; If so, what is the plan for that?&lt;/p&gt;

&lt;p&gt;thanks&lt;/p&gt;</comment>
                            <comment id="346409" author="pjones" created="Mon, 12 Sep 2022 18:38:05 +0000"  >&lt;p&gt;Yes, I think that we can mark this ticket as a duplicate of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12739&quot; title=&quot;Race with discovery thread completion and message queueing&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-12739&quot;&gt;&lt;del&gt;LU-12739&lt;/del&gt;&lt;/a&gt; once 48190 has been merged to b2_12. It should be included in the next b2_12-next batch we test.&lt;/p&gt;</comment>
                            <comment id="346849" author="gerrit" created="Thu, 15 Sep 2022 22:52:32 +0000"  >&lt;p&gt;&quot;Serguei Smirnov &amp;lt;ssmirnov@whamcloud.com&amp;gt;&quot; uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/48566&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/48566&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-15234&quot; title=&quot;LNet high peer reference counts inconsistent with queue&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-15234&quot;&gt;&lt;del&gt;LU-15234&lt;/del&gt;&lt;/a&gt; lnet: add mechanism for dumping lnd peer debug info&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: dc704df0be48fc9f933e6f2c6fede3c5991a951a&lt;/p&gt;</comment>
                            <comment id="347175" author="pjones" created="Tue, 20 Sep 2022 12:21:22 +0000"  >&lt;p&gt;The &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12739&quot; title=&quot;Race with discovery thread completion and message queueing&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-12739&quot;&gt;&lt;del&gt;LU-12739&lt;/del&gt;&lt;/a&gt; fix has landed to b2_12 but perhaps this ticket should remain open to track the landing of &lt;a href=&quot;https://review.whamcloud.com/#/c/fs/lustre-release/+/48566/?&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/#/c/fs/lustre-release/+/48566/?&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="348107" author="ofaaland" created="Tue, 27 Sep 2022 21:16:33 +0000"  >&lt;p&gt;&amp;gt;&#160; The &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12739&quot; title=&quot;Race with discovery thread completion and message queueing&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-12739&quot;&gt;&lt;del&gt;LU-12739&lt;/del&gt;&lt;/a&gt; fix has landed to b2_12 but perhaps this ticket should remain open to track the landing of &lt;a href=&quot;https://review.whamcloud.com/#/c/fs/lustre-release/+/48566/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/#/c/fs/lustre-release/+/48566/&lt;/a&gt; ?&lt;/p&gt;

&lt;p&gt;No opinion from me.&lt;/p&gt;

&lt;p&gt;Thanks for getting this fixed.&lt;/p&gt;</comment>
                            <comment id="348457" author="pjones" created="Sat, 1 Oct 2022 06:42:56 +0000"  >&lt;p&gt;I think it is really a call for &lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=ssmirnov&quot; class=&quot;user-hover&quot; rel=&quot;ssmirnov&quot;&gt;ssmirnov&lt;/a&gt;. Do you still think that there is value in landing &lt;a href=&quot;https://review.whamcloud.com/#/c/fs/lustre-release/+/48566/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/#/c/fs/lustre-release/+/48566/&lt;/a&gt;, or do you intend to abandon it in light of the review comments?&lt;/p&gt;</comment>
                            <comment id="348497" author="ssmirnov" created="Mon, 3 Oct 2022 00:43:15 +0000"  >&lt;p&gt;No, I would prefer to address the comments and land this patch. Even though it does not fix anything for this ticket (it is a debugging enhancement), it was created as a result of investigating this issue.&lt;/p&gt;</comment>
                            <comment id="350720" author="gerrit" created="Tue, 25 Oct 2022 17:25:59 +0000"  >&lt;p&gt;&quot;Oleg Drokin &amp;lt;green@whamcloud.com&amp;gt;&quot; merged in patch &lt;a href=&quot;https://review.whamcloud.com/c/fs/lustre-release/+/48566/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/c/fs/lustre-release/+/48566/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-15234&quot; title=&quot;LNet high peer reference counts inconsistent with queue&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-15234&quot;&gt;&lt;del&gt;LU-15234&lt;/del&gt;&lt;/a&gt; lnet: add mechanism for dumping lnd peer debug info&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 950e59ced18d49e9fdd31c1e9de43b89a0bc1c1d&lt;/p&gt;</comment>
                            <comment id="350749" author="pjones" created="Tue, 25 Oct 2022 19:09:53 +0000"  >&lt;p&gt;Landed for 2.16&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10010">
                    <name>Duplicate</name>
                                            <outwardlinks description="duplicates">
                                        <issuelink>
            <issuekey id="56864">LU-12739</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="68047">LU-15453</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="44111" name="2022-jun-21.tgz" size="272962" author="ofaaland" created="Tue, 21 Jun 2022 22:45:17 +0000"/>
                            <attachment id="42035" name="debug_refcount_01.patch" size="19109" author="ssmirnov" created="Fri, 28 Jan 2022 00:31:23 +0000"/>
                            <attachment id="43992" name="dk.orelic2.1654723678.txt" size="6766" author="ofaaland" created="Thu, 9 Jun 2022 00:43:55 +0000"/>
                            <attachment id="43993" name="dk.orelic2.1654723686.txt" size="2232" author="ofaaland" created="Thu, 9 Jun 2022 00:43:55 +0000"/>
                            <attachment id="43994" name="dk.orelic2.1654724730.txt" size="27317" author="ofaaland" created="Thu, 9 Jun 2022 00:43:55 +0000"/>
                            <attachment id="43995" name="dk.orelic2.1654724740.txt" size="2427" author="ofaaland" created="Thu, 9 Jun 2022 00:43:55 +0000"/>
                            <attachment id="43996" name="dk.orelic2.1654724745.txt" size="2231" author="ofaaland" created="Thu, 9 Jun 2022 00:43:55 +0000"/>
                            <attachment id="43997" name="dk.orelic2.1654724751.txt" size="2233" author="ofaaland" created="Thu, 9 Jun 2022 00:43:55 +0000"/>
                            <attachment id="41391" name="dk.ruby1016.1637103254.txt.bz2" size="8995306" author="ofaaland" created="Tue, 16 Nov 2021 23:12:52 +0000"/>
                            <attachment id="41468" name="ko2iblnd.parameters.orelic4.1637617473.txt" size="1183" author="ofaaland" created="Mon, 22 Nov 2021 21:48:30 +0000"/>
                            <attachment id="41474" name="ksocklnd.parameters.orelic4.1637617487.txt" size="1345" author="ofaaland" created="Mon, 22 Nov 2021 21:48:30 +0000"/>
                            <attachment id="41473" name="lctl.version.orelic4.1637616867.txt" size="21" author="ofaaland" created="Mon, 22 Nov 2021 21:48:30 +0000"/>
                            <attachment id="41472" name="lctl.version.ruby1016.1637616519.txt" size="26" author="ofaaland" created="Mon, 22 Nov 2021 21:48:30 +0000"/>
                            <attachment id="41471" name="lnet.parameters.orelic4.1637617458.txt" size="1798" author="ofaaland" created="Mon, 22 Nov 2021 21:48:30 +0000"/>
                            <attachment id="41470" name="lnetctl.net-show.orelic4.1637616889.txt" size="1686" author="ofaaland" created="Mon, 22 Nov 2021 21:48:30 +0000"/>
                            <attachment id="41469" name="lnetctl.net-show.ruby1016.1637616206.txt" size="4885" author="ofaaland" created="Mon, 22 Nov 2021 21:48:30 +0000"/>
                            <attachment id="43990" name="lnetctl.peer.show.orelic2.1654723542.txt" size="1254504" author="ofaaland" created="Thu, 9 Jun 2022 00:43:17 +0000"/>
                            <attachment id="43991" name="lnetctl.peer.show.orelic2.1654724780.txt" size="1254535" author="ofaaland" created="Thu, 9 Jun 2022 00:43:30 +0000"/>
                            <attachment id="41732" name="orelic4-lustre212-20211216.tgz" size="2113" author="ofaaland" created="Thu, 16 Dec 2021 23:14:03 +0000"/>
                            <attachment id="42142" name="orelic4.debug_refcount_01.tar.gz" size="27301" author="ofaaland" created="Tue, 1 Feb 2022 19:39:36 +0000"/>
                            <attachment id="41710" name="params_20211213.tar.gz" size="5901" author="defazio" created="Tue, 14 Dec 2021 02:01:46 +0000"/>
                            <attachment id="41672" name="peer status orelic4 with discovery race patch v3.png" size="417279" author="ofaaland" created="Thu, 9 Dec 2021 01:13:49 +0000"/>
                            <attachment id="42378" name="peer.show.172.16.70.62_at_tcp.orelic4.1644951836" size="2246" author="ofaaland" created="Tue, 15 Feb 2022 19:44:21 +0000"/>
                            <attachment id="42381" name="peer.show.172.16.70.63_at_tcp.orelic4.1644951836" size="2233" author="ofaaland" created="Tue, 15 Feb 2022 19:44:22 +0000"/>
                            <attachment id="42380" name="peer.show.172.16.70.64_at_tcp.orelic4.1644951836" size="2244" author="ofaaland" created="Tue, 15 Feb 2022 19:44:21 +0000"/>
                            <attachment id="42379" name="peer.show.172.16.70.65_at_tcp.orelic4.1644951836" size="2235" author="ofaaland" created="Tue, 15 Feb 2022 19:44:21 +0000"/>
                            <attachment id="41394" name="peer.show.ruby1016.1637103254.txt" size="1100" author="ofaaland" created="Tue, 16 Nov 2021 23:12:47 +0000"/>
                            <attachment id="41392" name="peer.show.ruby1016.1637103865.txt" size="1102" author="ofaaland" created="Tue, 16 Nov 2021 23:12:47 +0000"/>
                            <attachment id="41395" name="stats.show.ruby1016.1637103254.txt" size="583" author="ofaaland" created="Tue, 16 Nov 2021 23:12:47 +0000"/>
                            <attachment id="41393" name="stats.show.ruby1016.1637103865.txt" size="583" author="ofaaland" created="Tue, 16 Nov 2021 23:12:47 +0000"/>
                            <attachment id="41377" name="toss-5305 queue 2021-11-15.png" size="63528" author="ofaaland" created="Tue, 16 Nov 2021 00:05:10 +0000"/>
                            <attachment id="41378" name="toss-5305 refs 2021-11-15.png" size="67614" author="ofaaland" created="Tue, 16 Nov 2021 00:05:10 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i029zz:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>