<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 03:11:23 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-14627] Lost ref on lnet_peer in discovery leads to LNetError: 24909:0:(peer.c:292:lnet_destroy_peer_locked()) ASSERTION( list_empty(&amp;lp-&gt;lp_peer_nets) ) failed:</title>
                <link>https://jira.whamcloud.com/browse/LU-14627</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;I am fairly certain this bug can also result in this assert:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;LNetError: 9873:0:(peer.c:305:lnet_destroy_peer_locked()) ASSERTION( lp-&amp;gt;lp_rtr_refcount == 0 ) failed:
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We saw this bug with HPE&apos;s Lustre 2.12, which differs quite a bit from the community 2.12 (mainly because we backported the multi-rail routing feature). I&apos;m pretty sure the bug is present in community 2.14+ and it may be present in older versions.&lt;/p&gt;

&lt;p&gt;TL;DR - A flaw in the discovery logic loses a reference on an lnet_peer object, which results in the object being improperly destroyed.&lt;/p&gt;

&lt;p&gt;MDS sends an OST_CONNECT RPC to 10.249.249.206@o2ib. The message is queued for discovery. This adds a reference to the lnet_peer object:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00000100:00000200:9.0:1618858502.715082:0:51842:0:(niobuf.c:85:ptl_send_buf()) Sending 520 bytes to portal 28, xid 1697249298880320, offset 0
00000400:00000200:9.0:1618858502.715083:0:51842:0:(lib-move.c:4905:LNetPut()) LNetPut -&amp;gt; 12345-10.249.249.206@o2ib
00000400:00000200:9.0:1618858502.715086:0:51842:0:(peer.c:1937:lnet_peer_queue_for_discovery()) Queue peer 10.249.249.206@o2ib: 0 &amp;lt;&amp;lt;&amp;lt;&amp;lt; Reference added
00000400:00000200:9.0:1618858502.715086:0:51842:0:(peer.c:2250:lnet_discover_peer_locked()) Discovery attempt # 1
00000400:00000200:9.0:1618858502.715087:0:51842:0:(peer.c:2291:lnet_discover_peer_locked()) non-blocking discovery
00000400:00000200:9.0:1618858502.715088:0:51842:0:(peer.c:2298:lnet_discover_peer_locked()) peer 10.249.249.206@o2ib NID 10.249.249.206@o2ib: 0. pending discovery
00000400:00000200:9.0:1618858502.715089:0:51842:0:(lib-move.c:2050:lnet_initiate_peer_discovery()) msg ffff90cd4773d5f0 delayed. 10.249.249.206@o2ib pending discovery
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Discovery thread wakes and begins processing the peer by sending a discovery ping (NB ping sent to 10.249.249.207@o2ib).&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00000400:00000200:22.0:1618858502.715088:0:43692:0:(peer.c:3353:lnet_peer_discovery_wait_for_work()) woken: 0
00000400:00000200:22.0:1618858502.715091:0:43692:0:(peer.c:3468:lnet_peer_discovery()) peer 10.249.249.206@o2ib(ffff90de2632b800) state 0x6171
00000400:00000010:22.0:1618858502.715093:0:43692:0:(api-ni.c:1671:lnet_ping_buffer_alloc()) alloc &apos;(pbuf)&apos;: 281 at ffff90ddf2942000 (tot 485268226).
00000400:00000010:22.0:1618858502.715093:0:43692:0:(lib-lnet.h:259:lnet_md_alloc()) slab-alloced &apos;md&apos; of size 136 at ffff90cd54ed27f8 &amp;lt;&amp;lt;&amp;lt; PING MD.
00000400:00000010:22.0:1618858502.715095:0:43692:0:(lib-lnet.h:535:lnet_rspt_alloc()) rspt alloc ffff90ddcde80360
...
00000400:00000200:22.0:1618858502.715114:0:43692:0:(lib-move.c:1897:lnet_handle_send()) TRACE: 10.249.248.3@o2ib(10.249.248.3@o2ib:&amp;lt;?&amp;gt;) -&amp;gt; 10.249.249.207@o2ib(10.249.249.206@o2ib:10.249.249.207@o2ib) &amp;lt;?&amp;gt; : GET try# 0
...
00000400:00000200:22.0:1618858502.715149:0:43692:0:(peer.c:3107:lnet_peer_send_ping()) peer 10.249.249.206@o2ib
00000400:00000200:22.0:1618858502.715150:0:43692:0:(peer.c:3487:lnet_peer_discovery()) peer 10.249.249.206@o2ib(ffff90de2632b800) state 0x4371 rc 0
*hornc@cflosbld08 hornc $ lpst2str.sh 0x4371
LNET_PEER_MULTI_RAIL
LNET_PEER_DISCOVERED
LNET_PEER_REDISCOVER
LNET_PEER_DISCOVERING
LNET_PEER_NIDS_UPTODATE
LNET_PEER_PING_SENT
LNET_PEER_FORCE_PUSH
*hornc@cflosbld08 hornc $
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
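
&lt;p&gt;For readers without the lpst2str.sh helper, the state words in these logs can be decoded with a short script. This is a minimal sketch, not the tool used above: the bit positions are inferred from the decoded outputs shown in this ticket (for example 0x4371 and 0x42f1), and the PeerState and decode names are made up for illustration. Bits that never appear in this ticket are left undefined and are ignored.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;
```python
import enum


class PeerState(enum.IntFlag):
    """Peer state bits, inferred from the lpst2str.sh output in this ticket."""
    LNET_PEER_MULTI_RAIL = 0x0001
    LNET_PEER_DISCOVERED = 0x0010
    LNET_PEER_REDISCOVER = 0x0020
    LNET_PEER_DISCOVERING = 0x0040
    LNET_PEER_DATA_PRESENT = 0x0080
    LNET_PEER_NIDS_UPTODATE = 0x0100
    LNET_PEER_PING_SENT = 0x0200
    LNET_PEER_PUSH_SENT = 0x0400
    LNET_PEER_PING_FAILED = 0x0800
    LNET_PEER_PUSH_FAILED = 0x1000
    LNET_PEER_FORCE_PING = 0x2000
    LNET_PEER_FORCE_PUSH = 0x4000


def decode(state):
    """Return the names of the known flags set in state, lowest bit first."""
    word = PeerState(state)
    return [flag.name for flag in PeerState if flag in word]
```
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Calling decode(0x4371) should list the same seven flags as the lpst2str.sh output above, in the same order.&lt;/p&gt;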

&lt;p&gt;With the ping sent, discovery is waiting for either a response or timeout. Meanwhile, an incoming discovery push from 10.249.249.206@o2ib puts the peer back on the discovery queue:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00000400:00000200:27.0:1618858538.614280:0:43825:0:(lib-move.c:4396:lnet_parse()) TRACE: 10.249.248.2@o2ib(10.249.248.2@o2ib) &amp;lt;- 10.249.249.206@o2ib : PUT - for me
...
00000400:00000200:27.0:1618858538.614298:0:43825:0:(peer.c:2094:lnet_peer_push_event()) peer 10.249.249.206@o2ib(ffff90de2632b800) is MR
00000400:00000200:27.0:1618858538.614301:0:43825:0:(peer.c:2171:lnet_peer_push_event()) Received Push 10.249.249.206@o2ib 3
00000400:00000200:27.0:1618858538.614302:0:43825:0:(peer.c:1937:lnet_peer_queue_for_discovery()) Queue peer 10.249.249.206@o2ib: -114 &amp;lt;&amp;lt;&amp;lt;&amp;lt; Peer already queued; no reference added
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Discovery thread wakes and begins processing the peer. It first merges the data in the push buffer:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;*hornc@cflosbld08 hornc $ lpst2str.sh 0x42f1
LNET_PEER_MULTI_RAIL
LNET_PEER_DISCOVERED
LNET_PEER_REDISCOVER
LNET_PEER_DISCOVERING
LNET_PEER_DATA_PRESENT
LNET_PEER_PING_SENT
LNET_PEER_FORCE_PUSH
*hornc@cflosbld08 hornc $
00000400:00000200:22.0:1618858538.614308:0:43692:0:(peer.c:3353:lnet_peer_discovery_wait_for_work()) woken: 0
00000400:00000200:22.0:1618858538.614316:0:43692:0:(peer.c:3468:lnet_peer_discovery()) peer 10.249.249.206@o2ib(ffff90de2632b800) state 0x42f1
00000400:00000200:22.0:1618858538.614326:0:43692:0:(peer.c:2776:lnet_peer_merge_data()) peer 10.249.249.206@o2ib (ffff90de2632b800): 0
00000400:00000200:22.0:1618858538.614327:0:43692:0:(peer.c:2985:lnet_peer_data_present()) peer 10.249.249.206@o2ib(ffff90de2632b800): 0. state = 0x4371
00000400:00000200:22.0:1618858538.614328:0:43692:0:(peer.c:3487:lnet_peer_discovery()) peer 10.249.249.206@o2ib(ffff90de2632b800) state 0x4371 rc 1
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Next, it sends a discovery PUSH to the peer (NB push sent to 10.249.249.207@o2ib):&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;*hornc@cflosbld08 hornc $ lpst2str.sh 0x4371
LNET_PEER_MULTI_RAIL
LNET_PEER_DISCOVERED
LNET_PEER_REDISCOVER
LNET_PEER_DISCOVERING
LNET_PEER_NIDS_UPTODATE
LNET_PEER_PING_SENT
LNET_PEER_FORCE_PUSH
*hornc@cflosbld08 hornc $
00000400:00000200:22.0:1618858538.614329:0:43692:0:(peer.c:3468:lnet_peer_discovery()) peer 10.249.249.206@o2ib(ffff90de2632b800) state 0x4371
00000400:00000010:22.0:1618858538.614330:0:43692:0:(lib-lnet.h:259:lnet_md_alloc()) slab-alloced &apos;md&apos; of size 136 at ffff90cd54ed3430 &amp;lt;&amp;lt;&amp;lt; PUSH MD.
00000400:00000010:22.0:1618858538.614333:0:43692:0:(lib-lnet.h:535:lnet_rspt_alloc()) rspt alloc ffff90ddcde803f0
...
00000400:00000200:22.0:1618858538.614351:0:43692:0:(lib-move.c:1897:lnet_handle_send()) TRACE: 10.249.248.2@o2ib(10.249.248.2@o2ib:&amp;lt;?&amp;gt;) -&amp;gt; 10.249.249.207@o2ib(10.249.249.206@o2ib:10.249.249.207@o2ib) &amp;lt;?&amp;gt; : PUT try# 0
...
00000400:00000200:22.0:1618858538.614463:0:43692:0:(peer.c:3241:lnet_peer_send_push()) peer 10.249.249.206@o2ib
00000400:00000200:22.0:1618858538.614463:0:43692:0:(peer.c:3487:lnet_peer_discovery()) peer 10.249.249.206@o2ib(ffff90de2632b800) state 0x771 rc 0
*hornc@cflosbld08 hornc $ lpst2str.sh 0x771
LNET_PEER_MULTI_RAIL
LNET_PEER_DISCOVERED
LNET_PEER_REDISCOVER
LNET_PEER_DISCOVERING
LNET_PEER_NIDS_UPTODATE
LNET_PEER_PING_SENT
LNET_PEER_PUSH_SENT
*hornc@cflosbld08 hornc $
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
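
&lt;p&gt;The transition from 0x4371 to 0x771 can be checked mechanically. The sketch below is illustrative only: the bit positions are inferred from the decoded states in this ticket, and the changed and ping_and_push_in_flight helpers are hypothetical names, not LNet functions.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;
```python
import enum


class PeerState(enum.IntFlag):
    """Peer state bits, inferred from the decoded states in this ticket."""
    LNET_PEER_MULTI_RAIL = 0x0001
    LNET_PEER_DISCOVERED = 0x0010
    LNET_PEER_REDISCOVER = 0x0020
    LNET_PEER_DISCOVERING = 0x0040
    LNET_PEER_DATA_PRESENT = 0x0080
    LNET_PEER_NIDS_UPTODATE = 0x0100
    LNET_PEER_PING_SENT = 0x0200
    LNET_PEER_PUSH_SENT = 0x0400
    LNET_PEER_PING_FAILED = 0x0800
    LNET_PEER_PUSH_FAILED = 0x1000
    LNET_PEER_FORCE_PING = 0x2000
    LNET_PEER_FORCE_PUSH = 0x4000


def changed(before, after):
    """Return (set_names, cleared_names) for the known flags that differ."""
    diff = PeerState(before ^ after)
    new = PeerState(after)
    set_names = [f.name for f in PeerState if f in diff and f in new]
    cleared_names = [f.name for f in PeerState if f in diff and f not in new]
    return set_names, cleared_names


def ping_and_push_in_flight(state):
    """True when both PING_SENT and PUSH_SENT are set in state."""
    both = PeerState.LNET_PEER_PING_SENT | PeerState.LNET_PEER_PUSH_SENT
    return both in PeerState(state)
```
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;With these definitions, changed(0x4371, 0x771) reports PUSH_SENT as newly set and FORCE_PUSH as cleared, and ping_and_push_in_flight(0x771) is true.&lt;/p&gt;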

&lt;p&gt;This is the first potential problem. I am not sure whether it is a valid state for the peer to have both PING_SENT and PUSH_SENT at the same time. I suspect it is not.&lt;/p&gt;

&lt;p&gt;The discovery ping eventually fails because o2iblnd cannot establish a connection with .207. The lnet_discovery_event_handler() sets the PING_FAILED state (via lnet_discovery_event_send()) and puts the peer back on the discovery queue:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00000800:00000100:29.0:1618858552.405824:0:147651:0:(o2iblnd_cb.c:3183:kiblnd_cm_callback()) 10.249.249.207@o2ib: ADDR ERROR -110
00000800:00000200:29.0:1618858552.405825:0:147651:0:(o2iblnd.c:419:kiblnd_unlink_peer_locked()) peer_ni[ffff90cdedce3e80] -&amp;gt; 10.249.249.207@o2ib (2)--
00000400:00000200:29.0:1618858552.405830:0:147651:0:(router.c:1720:lnet_notify()) 10.249.248.3@o2ib notifying 10.249.249.207@o2ib: down
00000800:00000100:29.0:1618858552.405833:0:147651:0:(o2iblnd_cb.c:2294:kiblnd_peer_connect_failed()) Deleting messages for 10.249.249.207@o2ib: connection failed
00000400:00000200:29.0:1618858552.405834:0:147651:0:(lib-msg.c:1011:lnet_is_health_check()) health check = 1, status = -113, hstatus = 2
00000400:00000200:29.0:1618858552.405835:0:147651:0:(lib-msg.c:860:lnet_health_check()) health check: 10.249.248.3@o2ib-&amp;gt;10.249.249.207@o2ib: GET: LOCAL_DROPPED
00000400:00000200:29.0:1618858552.405836:0:147651:0:(lib-msg.c:479:lnet_handle_local_failure()) ni 10.249.248.3@o2ib added to recovery queue. Health = 900
00000400:00000100:29.0:1618858552.405837:0:147651:0:(lib-msg.c:710:lnet_attempt_msg_resend()) msg 0@&amp;lt;0:0&amp;gt;-&amp;gt;10.249.249.207@o2ib exceeded retry count 2
00000400:00000200:29.0:1618858552.405838:0:147651:0:(peer.c:2556:lnet_discovery_event_handler()) Received event: 5
00000400:00000200:29.0:1618858552.405839:0:147651:0:(peer.c:2508:lnet_discovery_event_send()) Ping Send to 10.249.249.206@o2ib: 1  &amp;lt;&amp;lt;&amp;lt;&amp;lt; Clears PING_SENT, sets PING_FAILED
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Here we see a second problem. Note the &quot;GET: LOCAL_DROPPED&quot; from lnet_health_check() and the subsequent call to lnet_handle_local_failure(). The LOCAL_DROPPED health status lowers only the LNet health value of the local NI, even though the problem could be with the remote interface. This is why LNet is repeatedly trying to send to .207 even though the interface is not functioning properly. The good news is this is a known issue with a fix. The ticket for this is &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13571&quot; class=&quot;external-link&quot; rel=&quot;nofollow&quot;&gt;https://jira.whamcloud.com/browse/LU-13571&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Discovery thread wakes and processes the ping failure:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;*hornc@cflosbld08 hornc $ lpst2str.sh 0x6d71
LNET_PEER_MULTI_RAIL
LNET_PEER_DISCOVERED
LNET_PEER_REDISCOVER
LNET_PEER_DISCOVERING
LNET_PEER_NIDS_UPTODATE
LNET_PEER_PUSH_SENT
LNET_PEER_PING_FAILED
LNET_PEER_FORCE_PING
LNET_PEER_FORCE_PUSH
*hornc@cflosbld08 hornc $
00000400:00000200:21.0:1618858552.405845:0:43692:0:(peer.c:3353:lnet_peer_discovery_wait_for_work()) woken: 0
00000400:00000200:21.0:1618858552.405848:0:43692:0:(peer.c:3468:lnet_peer_discovery()) peer 10.249.249.206@o2ib(ffff90de2632b800) state 0x6d71
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;To process the ping failure, lnet_peer_ping_failed() is called and the lp_ping_mdh is unlinked. The unlink handler sees LNET_PEER_PUSH_SENT, so it assumes that the PUSH is what was unlinked:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00000400:00000200:21.0:1618858552.405850:0:43692:0:(peer.c:2556:lnet_discovery_event_handler()) Received event: 6
00000400:00000200:21.0:1618858552.405851:0:43692:0:(peer.c:2536:lnet_discovery_event_unlink()) Push Unlink for message to peer 10.249.249.206@o2ib &amp;lt;&amp;lt;&amp;lt;&amp;lt; Clears PUSH_SENT, sets PUSH_FAILED
00000400:00000200:21.0:1618858552.405853:0:43692:0:(lib-md.c:69:lnet_md_unlink()) Unlinking md ffff90cd54ed27f8 &amp;lt;&amp;lt;&amp;lt; PING MD
00000400:00000200:21.0:1618858552.405855:0:43692:0:(peer.c:3018:lnet_peer_ping_failed()) peer 10.249.249.206@o2ib:-113
00000400:00000200:21.0:1618858552.405856:0:43692:0:(peer.c:3487:lnet_peer_discovery()) peer 10.249.249.206@o2ib(ffff90de2632b800) state 0x7171 rc -113
00000400:00000200:21.0:1618858552.405858:0:43692:0:(peer.c:3270:lnet_peer_discovery_error()) Discovery error 10.249.249.206@o2ib: -113 &amp;lt;&amp;lt;&amp;lt;&amp;lt; Clears LNET_PEER_DISCOVERING, state is 0x7131
00000400:00000200:21.0:1618858552.405859:0:43692:0:(peer.c:1955:lnet_peer_discovery_complete()) Discovery complete. Dequeue peer 10.249.249.206@o2ib &amp;lt;&amp;lt;&amp;lt;&amp;lt; Reference dropped
*hornc@cflosbld08 hornc $ lpst2str.sh 0x7131
LNET_PEER_MULTI_RAIL
LNET_PEER_DISCOVERED
LNET_PEER_REDISCOVER
LNET_PEER_NIDS_UPTODATE
LNET_PEER_PUSH_FAILED
LNET_PEER_FORCE_PING
LNET_PEER_FORCE_PUSH
*hornc@cflosbld08 hornc $
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This is another potential problem. I do not know if it is valid to have a PUSH_FAILED state when discovery completes. Similarly, it seems wrong to have FORCE_PING/FORCE_PUSH set when discovery completes.&lt;/p&gt;

&lt;p&gt;As part of lnet_peer_discovery_complete(), any messages that were queued for discovery are sent/finalized (includes our OST_CONNECT that started this all off):&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00000100:00000200:21.0:1618858552.405863:0:43692:0:(events.c:63:request_out_callback()) @@@ type 5, status -113  req@ffff90bde43e5a00 x1697249298880320/t0(0) o8-&amp;gt;scratch-OST00bc-osc-MDT0000@10.249.249.206@o2ib:28/4 lens 520/544 e 0 to 0 dl 1618858552 ref 2 fl Rpc:N/0/ffffffff rc 0/-1 job:&apos;&apos;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;An MGS_CONNECT reply results in the peer being queued for discovery again, in the same manner as the earlier OST_CONNECT:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00010000:00000200:19.0:1618858552.405884:0:136340:0:(ldlm_lib.c:2986:target_send_reply_msg()) @@@ sending reply  req@ffff90cdf3d8a850 x1697492294959168/t0(0) o250-&amp;gt;69523420-f6a9-9da1-ae47-897c569694e9@10.249.249.206@o2ib:0/0 lens 520/416 e 0 to 0 dl 1618858838 ref 1 fl Interpret:/0/0 rc 0/0 job:&apos;&apos;
00000400:00000010:19.0:1618858552.405888:0:136340:0:(lib-lnet.h:259:lnet_md_alloc()) slab-alloced &apos;md&apos; of size 136 at ffff90cdf451caa0.
00000100:00000200:19.0:1618858552.405888:0:136340:0:(niobuf.c:85:ptl_send_buf()) Sending 416 bytes to portal 25, xid 1697492294959168, offset 0
00000400:00000200:19.0:1618858552.405890:0:136340:0:(lib-move.c:4905:LNetPut()) LNetPut -&amp;gt; 12345-10.249.249.206@o2ib
00000400:00000200:19.0:1618858552.405900:0:136340:0:(peer.c:1937:lnet_peer_queue_for_discovery()) Queue peer 10.249.249.206@o2ib: 0   &amp;lt;&amp;lt;&amp;lt;&amp;lt; Reference added
00000400:00000200:19.0:1618858552.405900:0:136340:0:(peer.c:2250:lnet_discover_peer_locked()) Discovery attempt # 1
00000400:00000200:19.0:1618858552.405901:0:136340:0:(peer.c:2291:lnet_discover_peer_locked()) non-blocking discovery
00000400:00000200:19.0:1618858552.405902:0:136340:0:(peer.c:2298:lnet_discover_peer_locked()) peer 10.249.249.206@o2ib NID 10.249.249.206@o2ib: 0. pending discovery
00000400:00000200:19.0:1618858552.405903:0:136340:0:(lib-move.c:2050:lnet_initiate_peer_discovery()) msg ffff90de1d222940 delayed. 10.249.249.206@o2ib pending discovery
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

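&lt;p&gt;Discovery thread wakes and processes the peer. PUSH_FAILED is set, so lnet_peer_push_failed() is called and an unlink of the push MD is queued:&lt;/p&gt;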
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;*hornc@cflosbld08 hornc $ lpst2str.sh 0x7171
LNET_PEER_MULTI_RAIL
LNET_PEER_DISCOVERED
LNET_PEER_REDISCOVER
LNET_PEER_DISCOVERING
LNET_PEER_NIDS_UPTODATE
LNET_PEER_PUSH_FAILED
LNET_PEER_FORCE_PING
LNET_PEER_FORCE_PUSH
*hornc@cflosbld08 hornc $
00000400:00000200:21.0:1618858552.405902:0:43692:0:(peer.c:3353:lnet_peer_discovery_wait_for_work()) woken: 0
00000400:00000200:21.0:1618858552.405906:0:43692:0:(peer.c:3468:lnet_peer_discovery()) peer 10.249.249.206@o2ib(ffff90de2632b800) state 0x7171
00000400:00000200:21.0:1618858552.405906:0:43692:0:(lib-md.c:65:lnet_md_unlink()) Queueing unlink of md ffff90cd54ed3430
00000400:00000200:21.0:1618858552.405907:0:43692:0:(peer.c:3146:lnet_peer_push_failed()) peer 10.249.249.206@o2ib
00000400:00000200:21.0:1618858552.405908:0:43692:0:(peer.c:3487:lnet_peer_discovery()) peer 10.249.249.206@o2ib(ffff90de2632b800) state 0x6171 rc -110
00000400:00000200:21.0:1618858552.405913:0:43692:0:(peer.c:3270:lnet_peer_discovery_error()) Discovery error 10.249.249.206@o2ib: -110  &amp;lt;&amp;lt;&amp;lt;&amp;lt; Clears LNET_PEER_DISCOVERING, state is 0x6131
00000400:00000200:21.0:1618858552.405915:0:43692:0:(peer.c:1955:lnet_peer_discovery_complete()) Discovery complete. Dequeue peer 10.249.249.206@o2ib 
*hornc@cflosbld08 hornc $ lpst2str.sh 0x6131
LNET_PEER_MULTI_RAIL
LNET_PEER_DISCOVERED
LNET_PEER_REDISCOVER
LNET_PEER_NIDS_UPTODATE
LNET_PEER_FORCE_PING
LNET_PEER_FORCE_PUSH
*hornc@cflosbld08 hornc $
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Monitor thread attempts to resend the PUSH:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00000400:00000200:11.0:1618858555.397794:0:43693:0:(lib-move.c:3156:lnet_resend_pending_msgs_locked()) resending &amp;lt;?&amp;gt;-&amp;gt;12345-10.249.249.207@o2ib: PUT recovery 0 try# 1
00000400:00000200:11.0:1618858555.397796:0:43693:0:(lib-move.c:2677:lnet_handle_send_case_locked()) Source ANY to MR:  10.249.249.207@o2ib local destination
00000400:00000200:11.0:1618858555.397798:0:43693:0:(lib-move.c:1659:lnet_get_best_ni()) compare ni 10.249.248.2@o2ib [c:509, d:10, s:12279791] with best_ni not seleced [c:-2147483648, d:-1, s:0]
00000400:00000200:11.0:1618858555.397800:0:43693:0:(lib-move.c:1659:lnet_get_best_ni()) compare ni 10.249.248.3@o2ib [c:509, d:12, s:7451] with best_ni 10.249.248.2@o2ib [c:509, d:10, s:12279791]
00000400:00000200:11.0:1618858555.397801:0:43693:0:(lib-move.c:1702:lnet_get_best_ni()) selected best_ni 10.249.248.2@o2ib
00000400:00000200:11.0:1618858555.397803:0:43693:0:(lib-move.c:1410:lnet_select_peer_ni()) n:[10.249.249.207@o2ib, 10.249.249.206@o2ib] h:[1000, 1000] r:[n, n] c:[8, 8] s:[67, 50963]
00000400:00000200:11.0:1618858555.397804:0:43693:0:(lib-move.c:1460:lnet_select_peer_ni()) sd_best_lpni = 10.249.249.207@o2ib
00000400:00000100:11.0:1618858555.397807:0:43693:0:(lib-move.c:937:lnet_post_send_locked()) Aborting message for 12345-10.249.249.207@o2ib: LNetM[DE]Unlink() already called on the MD/ME.
00000400:00020000:11.0:1618858555.414727:0:43693:0:(lib-move.c:3162:lnet_resend_pending_msgs_locked()) Error sending PUT to 12345-10.249.249.207@o2ib: -125
00000400:00000200:11.0:1618858555.429119:0:43693:0:(lib-msg.c:1011:lnet_is_health_check()) health check = 1, status = -125, hstatus = 5
00000400:00000200:11.0:1618858555.429120:0:43693:0:(lib-msg.c:860:lnet_health_check()) health check: 10.249.248.2@o2ib-&amp;gt;10.249.249.207@o2ib: PUT: LOCAL_ERROR
00000400:00000200:11.0:1618858555.429121:0:43693:0:(peer.c:2556:lnet_discovery_event_handler()) Received event: 5
00000400:00000200:11.0:1618858555.429122:0:43693:0:(peer.c:2508:lnet_discovery_event_send()) Push Send to 10.249.249.206@o2ib: 1   &amp;lt;&amp;lt;&amp;lt; Sets PUSH_FAILED, Peer placed back on discovery queue without reference taken BUG HERE
00000400:00000200:11.0:1618858555.429125:0:43693:0:(lib-md.c:69:lnet_md_unlink()) Unlinking md ffff90cd54ed3430 &amp;lt;&amp;lt;&amp;lt; PUSH MD
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
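
&lt;p&gt;The reference accounting traced so far can be replayed with a toy model. This is a simplified sketch, not LNet code: the Peer class, its starting refcount of 1, and the requeue_without_ref helper are hypothetical and only mirror the log annotations above (queueing returns 0 and adds a reference, returns -114, which is -EALREADY on Linux, when the peer is already queued, and discovery completion drops one reference).&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;
```python
import errno


class Peer:
    """Toy stand-in for an lnet_peer; not the kernel structure."""

    def __init__(self):
        self.refcount = 1    # hypothetical baseline reference
        self.queued = False


def queue_for_discovery(peer):
    """Queue the peer, taking a reference only if it was not already queued."""
    if peer.queued:
        return -errno.EALREADY
    peer.queued = True
    peer.refcount += 1
    return 0


def requeue_without_ref(peer):
    """Hypothetical model of the buggy path: queued, but no reference taken."""
    peer.queued = True


def discovery_complete(peer):
    """Dequeue the peer and drop the reference the queue is assumed to hold."""
    peer.queued = False
    peer.refcount -= 1
    return peer.refcount


def replay():
    """Replay the steps from this ticket; return the peer and refcount trace."""
    peer = Peer()
    trace = []
    queue_for_discovery(peer)      # OST_CONNECT queues the peer
    trace.append(peer.refcount)
    queue_for_discovery(peer)      # incoming push: already queued, no new ref
    trace.append(peer.refcount)
    discovery_complete(peer)       # ping failure processed
    trace.append(peer.refcount)
    queue_for_discovery(peer)      # MGS_CONNECT reply queues the peer again
    trace.append(peer.refcount)
    discovery_complete(peer)       # push failure processed
    trace.append(peer.refcount)
    requeue_without_ref(peer)      # failed resend from the monitor thread
    trace.append(peer.refcount)
    return peer, trace
```
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;After replay(), the model peer is back on the queue while its count sits at the baseline, so one more discovery_complete() call takes the count to zero even though the baseline holder still expects a valid object.&lt;/p&gt;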

&lt;p&gt;Discovery wakes, processes the failed push. Peer is not in DISCOVERING state, so lnet_peer_discovery_complete() is called. We drop an extra reference and trip the LBUG.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;*hornc@cflosbld08 hornc $ lpst2str.sh 0x7131
LNET_PEER_MULTI_RAIL
LNET_PEER_DISCOVERED
LNET_PEER_REDISCOVER
LNET_PEER_NIDS_UPTODATE
LNET_PEER_PUSH_FAILED
LNET_PEER_FORCE_PING
LNET_PEER_FORCE_PUSH
*hornc@cflosbld08 hornc $
00000400:00000200:22.0:1618858555.429128:0:43692:0:(peer.c:3353:lnet_peer_discovery_wait_for_work()) woken: 0
00000400:00000200:22.0:1618858555.429130:0:43692:0:(peer.c:3468:lnet_peer_discovery()) peer 10.249.249.206@o2ib(ffff90de2632b800) state 0x7131
00000400:00000200:22.0:1618858555.429131:0:43692:0:(peer.c:3146:lnet_peer_push_failed()) peer 10.249.249.206@o2ib
00000400:00000200:22.0:1618858555.429132:0:43692:0:(peer.c:3487:lnet_peer_discovery()) peer 10.249.249.206@o2ib(ffff90de2632b800) state 0x6131 rc -125
00000400:00000200:22.0:1618858555.429134:0:43692:0:(peer.c:3270:lnet_peer_discovery_error()) Discovery error 10.249.249.206@o2ib: -125
00000400:00000200:22.0:1618858555.429135:0:43692:0:(peer.c:1955:lnet_peer_discovery_complete()) Discovery complete. Dequeue peer 10.249.249.206@o2ib &amp;lt;&amp;lt;&amp;lt; Reference dropped -&amp;gt; LBUG
00000400:00000200:22.0:1618858555.429137:0:43692:0:(peer.c:293:lnet_destroy_peer_locked()) ffff90de2632b800 nid 10.249.249.206@o2ib
*hornc@cflosbld08 hornc $ lpst2str.sh 0x6131
LNET_PEER_MULTI_RAIL
LNET_PEER_DISCOVERED
LNET_PEER_REDISCOVER
LNET_PEER_NIDS_UPTODATE
LNET_PEER_FORCE_PING
LNET_PEER_FORCE_PUSH
*hornc@cflosbld08 hornc $
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</description>
                <environment></environment>
        <key id="63863">LU-14627</key>
            <summary>Lost ref on lnet_peer in discovery leads to LNetError: 24909:0:(peer.c:292:lnet_destroy_peer_locked()) ASSERTION( list_empty(&amp;lp-&gt;lp_peer_nets) ) failed:</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="2" iconUrl="https://jira.whamcloud.com/images/icons/priorities/critical.svg">Critical</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="hornc">Chris Horn</assignee>
                                    <reporter username="hornc">Chris Horn</reporter>
                        <labels>
                    </labels>
                <created>Tue, 20 Apr 2021 20:09:56 +0000</created>
                <updated>Mon, 30 May 2022 19:02:30 +0000</updated>
                            <resolved>Mon, 14 Jun 2021 19:33:56 +0000</resolved>
                                    <version>Lustre 2.14.0</version>
                    <version>Lustre 2.15.0</version>
                                    <fixVersion>Lustre 2.12.7</fixVersion>
                    <fixVersion>Lustre 2.15.0</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>9</watches>
                                                                            <comments>
                            <comment id="299317" author="hornc" created="Tue, 20 Apr 2021 21:25:40 +0000"  >&lt;p&gt;This is an abridged description of the bug which I will use to try and create a reproducer. May require instrumenting the code to get the timing to line up:&lt;/p&gt;

&lt;p&gt;peer queued for discovery&lt;br/&gt;
ping sent to peer&lt;br/&gt;
incoming push from peer -&amp;gt; peer put on discovery queue&lt;br/&gt;
push sent to peer&lt;br/&gt;
peer state:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;*hornc@cflosbld08 hornc $ lpst2str.sh 0x771
LNET_PEER_MULTI_RAIL
LNET_PEER_DISCOVERED
LNET_PEER_REDISCOVER
LNET_PEER_DISCOVERING
LNET_PEER_NIDS_UPTODATE
LNET_PEER_PING_SENT
LNET_PEER_PUSH_SENT
*hornc@cflosbld08 hornc $
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;ping fails -&amp;gt; PING_SENT cleared&lt;br/&gt;
peer state after processing ping failure:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;*hornc@cflosbld08 hornc $ lpst2str.sh 0x7131
LNET_PEER_MULTI_RAIL
LNET_PEER_DISCOVERED
LNET_PEER_REDISCOVER
LNET_PEER_NIDS_UPTODATE
LNET_PEER_PUSH_FAILED
LNET_PEER_FORCE_PING
LNET_PEER_FORCE_PUSH
*hornc@cflosbld08 hornc $
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;peer queued for discovery&lt;br/&gt;
discovery processes push &quot;failure&quot; -&amp;gt; unlinks MD&lt;br/&gt;
peer state:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;*hornc@cflosbld08 hornc $ lpst2str.sh 0x6131
LNET_PEER_MULTI_RAIL
LNET_PEER_DISCOVERED
LNET_PEER_REDISCOVER
LNET_PEER_NIDS_UPTODATE
LNET_PEER_FORCE_PING
LNET_PEER_FORCE_PUSH
*hornc@cflosbld08 hornc $
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;monitor thread:&lt;br/&gt;
tries to resend push&lt;br/&gt;
fails because MD already unlinked&lt;br/&gt;
reference lost when peer put back on discovery queue&lt;/p&gt;

&lt;p&gt;discovery thread processes peer and we hit the LBUG&lt;/p&gt;</comment>
                            <comment id="299527" author="gerrit" created="Thu, 22 Apr 2021 21:58:16 +0000"  >&lt;p&gt;Chris Horn (chris.horn@hpe.com) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/43416&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/43416&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14627&quot; title=&quot;Lost ref on lnet_peer in discovery leads to LNetError: 24909:0:(peer.c:292:lnet_destroy_peer_locked()) ASSERTION( list_empty(&amp;amp;lp-&amp;gt;lp_peer_nets) ) failed:&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14627&quot;&gt;&lt;del&gt;LU-14627&lt;/del&gt;&lt;/a&gt; lnet: Allow delayed sends&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 298b5bfc8defdc684732223bb1d91336d0dae650&lt;/p&gt;</comment>
                            <comment id="299528" author="gerrit" created="Thu, 22 Apr 2021 21:58:16 +0000"  >&lt;p&gt;-Chris Horn (chris.horn@hpe.com) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/43417&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/43417&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14627&quot; title=&quot;Lost ref on lnet_peer in discovery leads to LNetError: 24909:0:(peer.c:292:lnet_destroy_peer_locked()) ASSERTION( list_empty(&amp;amp;lp-&amp;gt;lp_peer_nets) ) failed:&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14627&quot;&gt;&lt;del&gt;LU-14627&lt;/del&gt;&lt;/a&gt; tests: Add test for discovery refcount loss&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 807900960972ee5b1a3eba241e2869c731d7e45f-&lt;/p&gt;

&lt;p&gt;This patch was squashed with &lt;a href=&quot;https://review.whamcloud.com/43418&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/43418&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="299529" author="gerrit" created="Thu, 22 Apr 2021 21:58:17 +0000"  >&lt;p&gt;Chris Horn (chris.horn@hpe.com) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/43418&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/43418&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14627&quot; title=&quot;Lost ref on lnet_peer in discovery leads to LNetError: 24909:0:(peer.c:292:lnet_destroy_peer_locked()) ASSERTION( list_empty(&amp;amp;lp-&amp;gt;lp_peer_nets) ) failed:&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14627&quot;&gt;&lt;del&gt;LU-14627&lt;/del&gt;&lt;/a&gt; lnet: Ensure ref taken when queueing for discovery&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 0ca766281e0099d5ff03944f499d45a8ba7dbcbb&lt;/p&gt;</comment>
                            <comment id="299617" author="gerrit" created="Fri, 23 Apr 2021 20:13:07 +0000"  >&lt;p&gt;Chris Horn (chris.horn@hpe.com) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/43425&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/43425&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14627&quot; title=&quot;Lost ref on lnet_peer in discovery leads to LNetError: 24909:0:(peer.c:292:lnet_destroy_peer_locked()) ASSERTION( list_empty(&amp;amp;lp-&amp;gt;lp_peer_nets) ) failed:&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14627&quot;&gt;&lt;del&gt;LU-14627&lt;/del&gt;&lt;/a&gt; tests: Create unload_modules_local&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 9c861d52354a939065a3897ea1cdc539fdd5a513&lt;/p&gt;</comment>
                            <comment id="299620" author="gerrit" created="Fri, 23 Apr 2021 20:38:11 +0000"  >&lt;p&gt;&amp;lt;deleted&amp;gt;&lt;/p&gt;</comment>
                            <comment id="300495" author="gerrit" created="Wed, 5 May 2021 02:49:06 +0000"  >&lt;p&gt;Oleg Drokin (green@whamcloud.com) merged in patch &lt;a href=&quot;https://review.whamcloud.com/43416/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/43416/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14627&quot; title=&quot;Lost ref on lnet_peer in discovery leads to LNetError: 24909:0:(peer.c:292:lnet_destroy_peer_locked()) ASSERTION( list_empty(&amp;amp;lp-&amp;gt;lp_peer_nets) ) failed:&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14627&quot;&gt;&lt;del&gt;LU-14627&lt;/del&gt;&lt;/a&gt; lnet: Allow delayed sends&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: ab14f3bc852e708100d21770c00235f95841708a&lt;/p&gt;</comment>
                            <comment id="301284" author="gerrit" created="Tue, 11 May 2021 22:54:23 +0000"  >&lt;p&gt;Oleg Drokin (green@whamcloud.com) merged in patch &lt;a href=&quot;https://review.whamcloud.com/43425/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/43425/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14627&quot; title=&quot;Lost ref on lnet_peer in discovery leads to LNetError: 24909:0:(peer.c:292:lnet_destroy_peer_locked()) ASSERTION( list_empty(&amp;amp;lp-&amp;gt;lp_peer_nets) ) failed:&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14627&quot;&gt;&lt;del&gt;LU-14627&lt;/del&gt;&lt;/a&gt; tests: Create unload_modules_local&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 32304d863ae98c641f541362f54e7b1f24b350a6&lt;/p&gt;</comment>
                            <comment id="303146" author="scadmin" created="Tue, 1 Jun 2021 10:52:21 +0000"  >&lt;p&gt;FYI we encountered something that looks like this with 2.12.6 client and servers.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;2021-05-29 12:04:53 [5935871.518144] LNetError: 911:0:(peer.c:282:lnet_destroy_peer_locked()) ASSERTION( list_empty(&amp;amp;lp-&amp;gt;lp_peer_nets) ) failed:
2021-05-29 12:04:53 [5935871.530347] LNetError: 911:0:(peer.c:282:lnet_destroy_peer_locked()) LBUG
2021-05-29 12:04:53 [5935871.538442] Pid: 911, comm: lnet_discovery 3.10.0-1160.21.1.el7.x86_64 #1 SMP Tue Mar 16 18:28:22 UTC 2021
2021-05-29 12:04:53 [5935871.549410] Call Trace:
2021-05-29 12:04:53 [5935871.553213]  [&amp;lt;ffffffffc055d7cc&amp;gt;] libcfs_call_trace+0x8c/0xc0 [libcfs]
2021-05-29 12:04:53 [5935871.561049]  [&amp;lt;ffffffffc055d87c&amp;gt;] lbug_with_loc+0x4c/0xa0 [libcfs]
2021-05-29 12:04:53 [5935871.568503]  [&amp;lt;ffffffffc078ffca&amp;gt;] lnet_destroy_peer_locked+0x24a/0x350 [lnet]
2021-05-29 12:04:53 [5935871.576898]  [&amp;lt;ffffffffc0790605&amp;gt;] lnet_peer_discovery_complete+0x2a5/0x350 [lnet]
2021-05-29 12:04:53 [5935871.585617]  [&amp;lt;ffffffffc0795340&amp;gt;] lnet_peer_discovery+0x6c0/0x1140 [lnet]
2021-05-29 12:04:53 [5935871.593632]  [&amp;lt;ffffffffac0c5da1&amp;gt;] kthread+0xd1/0xe0
2021-05-29 12:04:53 [5935871.599733]  [&amp;lt;ffffffffac795df7&amp;gt;] ret_from_fork_nospec_end+0x0/0x39
2021-05-29 12:04:53 [5935871.607198]  [&amp;lt;ffffffffffffffff&amp;gt;] 0xffffffffffffffff
2021-05-29 12:04:53 [5935871.613357] Kernel panic - not syncing: LBUG
2021-05-29 12:04:53 [5935871.618770] CPU: 2 PID: 911 Comm: lnet_discovery Kdump: loaded Tainted: P           OE  ------------ T 3.10.0-1160.21.1.el7.x86_64 #1
2021-05-29 12:04:53 [5935871.632856] Hardware name: Dell Inc. PowerEdge R740/06G98X, BIOS 2.10.0 11/12/2020
2021-05-29 12:04:53 [5935871.641552] Call Trace:
2021-05-29 12:04:53 [5935871.645136]  [&amp;lt;ffffffffac78305a&amp;gt;] dump_stack+0x19/0x1b
2021-05-29 12:04:53 [5935871.651390]  [&amp;lt;ffffffffac77c5b2&amp;gt;] panic+0xe8/0x21f
2021-05-29 12:04:53 [5935871.657283]  [&amp;lt;ffffffffc055d8cb&amp;gt;] lbug_with_loc+0x9b/0xa0 [libcfs]
2021-05-29 12:04:53 [5935871.664553]  [&amp;lt;ffffffffc078ffca&amp;gt;] lnet_destroy_peer_locked+0x24a/0x350 [lnet]
2021-05-29 12:04:53 [5935871.672756]  [&amp;lt;ffffffffc0790605&amp;gt;] lnet_peer_discovery_complete+0x2a5/0x350 [lnet]
2021-05-29 12:04:53 [5935871.681295]  [&amp;lt;ffffffffc0795340&amp;gt;] lnet_peer_discovery+0x6c0/0x1140 [lnet]
2021-05-29 12:04:53 [5935871.689117]  [&amp;lt;ffffffffac0c6e90&amp;gt;] ? wake_up_atomic_t+0x30/0x30
2021-05-29 12:04:53 [5935871.695985]  [&amp;lt;ffffffffc0794c80&amp;gt;] ? lnet_peer_merge_data+0xe00/0xe00 [lnet]
2021-05-29 12:04:53 [5935871.703954]  [&amp;lt;ffffffffac0c5da1&amp;gt;] kthread+0xd1/0xe0
2021-05-29 12:04:53 [5935871.709833]  [&amp;lt;ffffffffac0c5cd0&amp;gt;] ? insert_kthread_work+0x40/0x40
2021-05-29 12:04:53 [5935871.716901]  [&amp;lt;ffffffffac795df7&amp;gt;] ret_from_fork_nospec_begin+0x21/0x21
2021-05-29 12:04:53 [5935871.724393]  [&amp;lt;ffffffffac0c5cd0&amp;gt;] ? insert_kthread_work+0x40/0x40
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;It happened when mounting a new filesystem on ~100 nodes at once.&lt;/p&gt;

&lt;p&gt;We have a crash dump of the one node that LBUG&apos;d, if that would help.&lt;/p&gt;

&lt;p&gt;cheers,&lt;br/&gt;
robin&lt;/p&gt;</comment>
                            <comment id="303845" author="hornc" created="Tue, 8 Jun 2021 15:32:25 +0000"  >&lt;p&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=jpeyrard&quot; class=&quot;user-hover&quot; rel=&quot;jpeyrard&quot;&gt;jpeyrard&lt;/a&gt; The fix for this issue is &quot;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14627&quot; title=&quot;Lost ref on lnet_peer in discovery leads to LNetError: 24909:0:(peer.c:292:lnet_destroy_peer_locked()) ASSERTION( list_empty(&amp;amp;lp-&amp;gt;lp_peer_nets) ) failed:&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14627&quot;&gt;&lt;del&gt;LU-14627&lt;/del&gt;&lt;/a&gt; lnet: Ensure ref taken when queueing for discovery&quot; - &lt;a href=&quot;https://review.whamcloud.com/#/c/43418/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/#/c/43418/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&quot;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14627&quot; title=&quot;Lost ref on lnet_peer in discovery leads to LNetError: 24909:0:(peer.c:292:lnet_destroy_peer_locked()) ASSERTION( list_empty(&amp;amp;lp-&amp;gt;lp_peer_nets) ) failed:&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14627&quot;&gt;&lt;del&gt;LU-14627&lt;/del&gt;&lt;/a&gt; lnet: Allow delayed sends&quot; is a testing-only change.&lt;/p&gt;</comment>
                            <comment id="304014" author="gerrit" created="Wed, 9 Jun 2021 16:35:03 +0000"  >&lt;p&gt;Minh Diep (mdiep@whamcloud.com) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/43959&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/43959&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14627&quot; title=&quot;Lost ref on lnet_peer in discovery leads to LNetError: 24909:0:(peer.c:292:lnet_destroy_peer_locked()) ASSERTION( list_empty(&amp;amp;lp-&amp;gt;lp_peer_nets) ) failed:&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14627&quot;&gt;&lt;del&gt;LU-14627&lt;/del&gt;&lt;/a&gt; lnet: Allow delayed sends&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_12&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 7154ecd2f8faccacdfb697cf1455a904494d910c&lt;/p&gt;</comment>
                            <comment id="304015" author="gerrit" created="Wed, 9 Jun 2021 16:35:03 +0000"  >&lt;p&gt;Minh Diep (mdiep@whamcloud.com) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/43960&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/43960&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14627&quot; title=&quot;Lost ref on lnet_peer in discovery leads to LNetError: 24909:0:(peer.c:292:lnet_destroy_peer_locked()) ASSERTION( list_empty(&amp;amp;lp-&amp;gt;lp_peer_nets) ) failed:&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14627&quot;&gt;&lt;del&gt;LU-14627&lt;/del&gt;&lt;/a&gt; tests: Create unload_modules_local&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_12&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 20118700cb344c92c18d808e96a000acaa097358&lt;/p&gt;</comment>
                            <comment id="304452" author="gerrit" created="Mon, 14 Jun 2021 16:44:31 +0000"  >&lt;p&gt;Oleg Drokin (green@whamcloud.com) merged in patch &lt;a href=&quot;https://review.whamcloud.com/43418/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/43418/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14627&quot; title=&quot;Lost ref on lnet_peer in discovery leads to LNetError: 24909:0:(peer.c:292:lnet_destroy_peer_locked()) ASSERTION( list_empty(&amp;amp;lp-&amp;gt;lp_peer_nets) ) failed:&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14627&quot;&gt;&lt;del&gt;LU-14627&lt;/del&gt;&lt;/a&gt; lnet: Ensure ref taken when queueing for discovery&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 2ce6957b69370b0ce75725d1d91866bf55c07fa8&lt;/p&gt;</comment>
                            <comment id="304483" author="pjones" created="Mon, 14 Jun 2021 19:33:56 +0000"  >&lt;p&gt;Landed for 2.15&lt;/p&gt;</comment>
                            <comment id="304486" author="gerrit" created="Mon, 14 Jun 2021 19:44:11 +0000"  >&lt;p&gt;Minh Diep (mdiep@whamcloud.com) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/44001&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/44001&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14627&quot; title=&quot;Lost ref on lnet_peer in discovery leads to LNetError: 24909:0:(peer.c:292:lnet_destroy_peer_locked()) ASSERTION( list_empty(&amp;amp;lp-&amp;gt;lp_peer_nets) ) failed:&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14627&quot;&gt;&lt;del&gt;LU-14627&lt;/del&gt;&lt;/a&gt; lnet: Ensure ref taken when queueing for discovery&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_12&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 48df065a516c95259d18482c8c680f14bf0d8ff4&lt;/p&gt;</comment>
                            <comment id="305643" author="gerrit" created="Sun, 27 Jun 2021 10:56:31 +0000"  >&lt;p&gt;Oleg Drokin (green@whamcloud.com) merged in patch &lt;a href=&quot;https://review.whamcloud.com/43959/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/43959/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14627&quot; title=&quot;Lost ref on lnet_peer in discovery leads to LNetError: 24909:0:(peer.c:292:lnet_destroy_peer_locked()) ASSERTION( list_empty(&amp;amp;lp-&amp;gt;lp_peer_nets) ) failed:&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14627&quot;&gt;&lt;del&gt;LU-14627&lt;/del&gt;&lt;/a&gt; lnet: Allow delayed sends&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_12&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: fea5e60b97adef090e8475e6f03101c0c8521203&lt;/p&gt;</comment>
                            <comment id="305644" author="gerrit" created="Sun, 27 Jun 2021 10:57:09 +0000"  >&lt;p&gt;Oleg Drokin (green@whamcloud.com) merged in patch &lt;a href=&quot;https://review.whamcloud.com/43960/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/43960/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14627&quot; title=&quot;Lost ref on lnet_peer in discovery leads to LNetError: 24909:0:(peer.c:292:lnet_destroy_peer_locked()) ASSERTION( list_empty(&amp;amp;lp-&amp;gt;lp_peer_nets) ) failed:&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14627&quot;&gt;&lt;del&gt;LU-14627&lt;/del&gt;&lt;/a&gt; tests: Create unload_modules_local&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_12&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 347bbb765f4dfc509c395da4b2b699586dba7366&lt;/p&gt;</comment>
                            <comment id="305645" author="gerrit" created="Sun, 27 Jun 2021 10:57:33 +0000"  >&lt;p&gt;Oleg Drokin (green@whamcloud.com) merged in patch &lt;a href=&quot;https://review.whamcloud.com/44001/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/44001/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14627&quot; title=&quot;Lost ref on lnet_peer in discovery leads to LNetError: 24909:0:(peer.c:292:lnet_destroy_peer_locked()) ASSERTION( list_empty(&amp;amp;lp-&amp;gt;lp_peer_nets) ) failed:&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14627&quot;&gt;&lt;del&gt;LU-14627&lt;/del&gt;&lt;/a&gt; lnet: Ensure ref taken when queueing for discovery&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_12&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 86bf7454d441bc1322eff68106b1766e2f255e72&lt;/p&gt;</comment>
                            <comment id="332599" author="gerrit" created="Fri, 22 Apr 2022 03:08:08 +0000"  >&lt;p&gt;&quot;Andreas Dilger &amp;lt;adilger@whamcloud.com&amp;gt;&quot; uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/47116&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/47116&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14627&quot; title=&quot;Lost ref on lnet_peer in discovery leads to LNetError: 24909:0:(peer.c:292:lnet_destroy_peer_locked()) ASSERTION( list_empty(&amp;amp;lp-&amp;gt;lp_peer_nets) ) failed:&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14627&quot;&gt;&lt;del&gt;LU-14627&lt;/del&gt;&lt;/a&gt; utils: quiet spurious lustre_rmmod message&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: ee7aba2beabbf983ccffe8e4881e792943a15b09&lt;/p&gt;</comment>
                            <comment id="336330" author="gerrit" created="Mon, 30 May 2022 19:02:30 +0000"  >&lt;p&gt;&quot;Oleg Drokin &amp;lt;green@whamcloud.com&amp;gt;&quot; merged in patch &lt;a href=&quot;https://review.whamcloud.com/47116/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/47116/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14627&quot; title=&quot;Lost ref on lnet_peer in discovery leads to LNetError: 24909:0:(peer.c:292:lnet_destroy_peer_locked()) ASSERTION( list_empty(&amp;amp;lp-&amp;gt;lp_peer_nets) ) failed:&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14627&quot;&gt;&lt;del&gt;LU-14627&lt;/del&gt;&lt;/a&gt; utils: quiet spurious lustre_rmmod message&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 581d1afd00fe7d82d3a2be5497f0fcf4fde24a4a&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10010">
                    <name>Duplicate</name>
                                                                <inwardlinks description="is duplicated by">
                                                        </inwardlinks>
                                    </issuelinktype>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="63895">LU-14635</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="64974">LU-14810</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="61345">LU-14074</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i01snb:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>