<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 03:19:12 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-15541] Soft lockups in LNetPrimaryNID() and lnet_discover_peer_locked()</title>
                <link>https://jira.whamcloud.com/browse/LU-15541</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;We upgraded a lustre server cluster from lustre-2.12.7_2.llnl to lustre-2.12.8_6.llnl. Almost immediately after boot, clients begin reporting soft lockups on the console, with stacks like this:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;2022-02-08 09:43:10 [1644342190.528916] 
Call Trace:
 queued_spin_lock_slowpath+0xb/0xf
 _raw_spin_lock+0x30/0x40
 cfs_percpt_lock+0xc1/0x110 [libcfs]
 lnet_discover_peer_locked+0xa0/0x450 [lnet]
 ? wake_up_atomic_t+0x30/0x30
 LNetPrimaryNID+0xd5/0x220 [lnet]
 ptlrpc_connection_get+0x3e/0x450 [ptlrpc]
 target_handle_connect+0x12f1/0x2b90 [ptlrpc]
 ? enqueue_task_fair+0x208/0x6c0
 ? check_preempt_curr+0x80/0xa0
 ? ttwu_do_wakeup+0x19/0x100
 tgt_request_handle+0x4fa/0x1570 [ptlrpc]
 ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
 ? __getnstimeofday64+0x3f/0xd0
 ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
 ? ptlrpc_wait_event+0xb8/0x370 [ptlrpc]
 ? __wake_up_common_lock+0x91/0xc0
 ? sched_feat_set+0xf0/0xf0
 ptlrpc_main+0xc49/0x1c50 [ptlrpc]
 ? __switch_to+0xce/0x5a0
 ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
 kthread+0xd1/0xe0
 ? insert_kthread_work+0x40/0x40
 ret_from_fork_nospec_begin+0x21/0x21
 ? insert_kthread_work+0x40/0x40
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Some servers never exit recovery, and others do but seem to be unable to service requests.&lt;/p&gt;

&lt;p&gt;Seen during the same lustre server update where we saw &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-15539&quot; title=&quot;clients report mds_mds_connection in connect_flags after lustre update on servers&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-15539&quot;&gt;&lt;del&gt;LU-15539&lt;/del&gt;&lt;/a&gt; but appears to be a separate issue.&lt;/p&gt;

&lt;p&gt;Patch stacks are:&lt;br/&gt;
&lt;a href=&quot;https://github.com/LLNL/lustre/releases/tag/2.12.8_6.llnl&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://github.com/LLNL/lustre/releases/tag/2.12.8_6.llnl&lt;/a&gt;&lt;br/&gt;
&lt;a href=&quot;https://github.com/LLNL/lustre/releases/tag/2.12.7_2.llnl&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://github.com/LLNL/lustre/releases/tag/2.12.7_2.llnl&lt;/a&gt;&lt;/p&gt;</description>
                <environment>3.10.0-1160.45.1.1chaos.ch6.x86_64&lt;br/&gt;
lustre-2.12.7_2.llnl&lt;br/&gt;
3.10.0-1160.53.1.1chaos.ch6.x86_64&lt;br/&gt;
lustre-2.12.8_6.llnl&lt;br/&gt;
RHEL7.9&lt;br/&gt;
zfs-0.7.11-9.8llnl</environment>
        <key id="68602">LU-15541</key>
            <summary>Soft lockups in LNetPrimaryNID() and lnet_discover_peer_locked()</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="2" iconUrl="https://jira.whamcloud.com/images/icons/priorities/critical.svg">Critical</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="ssmirnov">Serguei Smirnov</assignee>
                                    <reporter username="ofaaland">Olaf Faaland</reporter>
                        <labels>
                            <label>llnl</label>
                            <label>topllnl</label>
                    </labels>
                <created>Thu, 10 Feb 2022 02:19:35 +0000</created>
                <updated>Wed, 2 Aug 2023 13:08:06 +0000</updated>
                                            <version>Lustre 2.12.7</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>11</watches>
                                                                            <comments>
                            <comment id="325789" author="ofaaland" created="Thu, 10 Feb 2022 02:22:54 +0000"  >&lt;p&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=behlendo&quot; class=&quot;user-hover&quot; rel=&quot;behlendo&quot;&gt;behlendo&lt;/a&gt; found this patch, which seems to fit the stack reported with the soft lockups&lt;br/&gt;
&lt;a href=&quot;https://review.whamcloud.com/#/c/43537/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/#/c/43537/&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="325790" author="ofaaland" created="Thu, 10 Feb 2022 02:45:32 +0000"  >&lt;p&gt;I&apos;ve attached I attached vmcore-dmesg.txt files from two servers where this occurred, while they were running 2.12.8_6.llnl and the later of the two kernels I mentioned.&lt;/p&gt;

&lt;p&gt;We reverted the servers back to the earlier kernel and lustre 2.12.7_2.llnl and saw the same symptoms.&lt;/p&gt;</comment>
                            <comment id="325799" author="ssmirnov" created="Thu, 10 Feb 2022 03:36:18 +0000"  >&lt;p&gt;Hi Olaf,&lt;/p&gt;

&lt;p&gt;With 2.12.7_2.llnl, does the client eventually recover, or is it stuck forever?&lt;/p&gt;

&lt;p&gt;If possible, I&apos;d recommend trying lustre-2.12.8_6.llnl with the change for &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-15234&quot; title=&quot;LNet high peer reference counts inconsistent with queue&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-15234&quot;&gt;&lt;del&gt;LU-15234&lt;/del&gt;&lt;/a&gt; &quot;lnet: Race on discovery queue&quot; reverted. It never landed on lustre b2_12; perhaps &lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=hornc&quot; class=&quot;user-hover&quot; rel=&quot;hornc&quot;&gt;hornc&lt;/a&gt;&#160;can comment on why it was abandoned.&lt;/p&gt;

&lt;p&gt;Thanks,&lt;/p&gt;

&lt;p&gt;Serguei.&lt;/p&gt;</comment>
                            <comment id="325873" author="hornc" created="Thu, 10 Feb 2022 15:59:07 +0000"  >&lt;p&gt;&amp;gt; Chris Horn can comment why it was abandoned.&lt;/p&gt;

&lt;p&gt;I only abandoned it because I found it was not actually the root cause of the issue in that ticket. I can revive the patch for b2_12 if you would like it landed there.&lt;/p&gt;</comment>
                            <comment id="325904" author="ofaaland" created="Thu, 10 Feb 2022 17:48:29 +0000"  >&lt;p&gt;Hi Serguei,&lt;/p&gt;

&lt;p&gt;&amp;gt; With 2.12.7_2.llnl, does the client eventually recover, or is it stuck forever?&lt;/p&gt;

&lt;p&gt;I think after &quot;mount -t lustre &amp;lt;lustre-target&amp;gt; &amp;lt;dir&amp;gt;&quot; the longest I waited was about 20 minutes.  At that point most of the 52 targets (16 MDTs, 36 OSTs) had exited recovery, but a couple had not.  Also at that point, nodes started crashing.&lt;/p&gt;

&lt;p&gt;&amp;gt; If possible, I&apos;d recommend trying lustre-2.12.8_6.llnl with reverted change for&#160;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-15234&quot; title=&quot;LNet high peer reference counts inconsistent with queue&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-15234&quot;&gt;&lt;del&gt;LU-15234&lt;/del&gt;&lt;/a&gt;&#160;&quot;lnet: Race on discovery queue&quot;.&lt;/p&gt;

&lt;p&gt;I tried that, and saw the same soft lockups.&lt;/p&gt;

&lt;p&gt;thanks,&lt;br/&gt;
Olaf&lt;/p&gt;</comment>
                            <comment id="325905" author="ofaaland" created="Thu, 10 Feb 2022 17:54:34 +0000"  >&lt;p&gt;I have crash dumps as well, if that helps.&lt;/p&gt;</comment>
                            <comment id="325914" author="ssmirnov" created="Thu, 10 Feb 2022 18:52:42 +0000"  >&lt;p&gt;Olaf,&lt;/p&gt;

&lt;p&gt;Are clients being upgraded at the same time and to the same version as the server?&lt;/p&gt;

&lt;p&gt;What was the last version that didn&apos;t cause lockups?&lt;/p&gt;

&lt;p&gt;I&apos;m trying to identify the change which could have introduced this problem. That would make it easier to see if the fix is already available.&lt;/p&gt;

&lt;p&gt;Thanks,&lt;/p&gt;

&lt;p&gt;Serguei.&#160;&lt;/p&gt;</comment>
                            <comment id="325919" author="ofaaland" created="Thu, 10 Feb 2022 19:18:37 +0000"  >&lt;p&gt;Hi Serguei,&lt;/p&gt;

&lt;p&gt;&amp;gt; Are clients being upgraded at the same time and to the same version as the server?&lt;/p&gt;

&lt;p&gt;No.  We have 10-15 compute clusters and 3 server clusters; each server cluster hosts one file system.  We typically update 3-5 clusters per day.  &lt;/p&gt;

&lt;p&gt;The &quot;old version&quot; is 2.12.7_2.llnl.  When this happened Tuesday, we were on the first day of our update cycle.  2 compute clusters were updated to 2.12.8_6.llnl at the same time that 1 server cluster (copper, where we saw the issue) was updated to 2.12.8_6.llnl.  The remainder of the clusters were still at 2.12.7_2.llnl.&lt;/p&gt;

&lt;p&gt;During the update attempt on Tuesday, we initially brought copper up with 2.12.8_6.llnl and saw these soft lockups and also &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-15539&quot; title=&quot;clients report mds_mds_connection in connect_flags after lustre update on servers&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-15539&quot;&gt;&lt;del&gt;LU-15539&lt;/del&gt;&lt;/a&gt;.  We then brought copper down and brought it back up with its old file system image, which had 2.12.7_2.llnl.  We saw these soft lockups again with 2.12.7_2.llnl.&lt;/p&gt;

&lt;p&gt;&amp;gt; What was the last version which doesn&apos;t cause lock ups?&lt;/p&gt;

&lt;p&gt;Unfortunately, I don&apos;t know.  It seems to me that the issue exists in both 2.12.7_2.llnl and 2.12.8_6.llnl.  I&apos;ll look back and see if I have console logs indicating this issue prior to this week.&lt;/p&gt;</comment>
                            <comment id="325920" author="hornc" created="Thu, 10 Feb 2022 19:19:25 +0000"  >&lt;blockquote&gt;
&lt;p&gt;behlendo found this patch, which seems to fit the stack reported with the soft lockups&lt;br/&gt;
&lt;a href=&quot;https://review.whamcloud.com/#/c/43537/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/#/c/43537/&lt;/a&gt;&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;FWIW, this patch is really just an optimization for a particular case where clients with discovery disabled are mounting a filesystem when one or more servers are down. If all of your servers are up/healthy when the clients mount, or if your clients have discovery enabled, then this patch would not really have an impact.&lt;/p&gt;</comment>
                            <comment id="325933" author="ssmirnov" created="Thu, 10 Feb 2022 21:10:42 +0000"  >&lt;p&gt;One of the following patches may help resolve the issue:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://review.whamcloud.com/35444&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/35444&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://review.whamcloud.com/42109&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/42109&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://review.whamcloud.com/42111&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/42111&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The first one (&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-10931&quot; title=&quot;failed peer discovery still taking too long&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-10931&quot;&gt;&lt;del&gt;LU-10931&lt;/del&gt;&lt;/a&gt;) is probably more relevant because the other two deal with the situation when remote peer is not ready to connect.&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;</comment>
                            <comment id="326247" author="ofaaland" created="Mon, 14 Feb 2022 18:05:45 +0000"  >&lt;p&gt;A little more detail.  &lt;/p&gt;

&lt;p&gt;At the time this happened, the file system was mounted by about 5,400 clients (based on count of exports reported by OSTs) and was running Lustre 2.12.7_2.llnl.  &lt;/p&gt;

&lt;p&gt;The process we followed is one we always use - we shut down the file system without umounting the clients (so I/O on clients would hang), then rebooted the server nodes.  After all the server nodes came up into the new image which had Lustre 2.12.8_6.llnl, we mounted the Lustre targets within the space of 30 seconds or a minute (first MGS, then all MDTs in parallel, then all OSTs in parallel).  The recovery_status procfile on the servers indicated they all entered recovery almost immediately.  So there were a large number of clients all reconnecting at the same time.&lt;/p&gt;
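&lt;p&gt;For illustration, the ordering was roughly like the sketch below (the pool/dataset names and mount points here are placeholders, not our actual configuration):&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# mount ordering sketch only; device names are placeholders
# MGS first
mount -t lustre mgspool/mgs /mnt/lustre/mgs
# then all MDTs, run in parallel (one mount per MDS node)
mount -t lustre mdtpool/mdt0 /mnt/lustre/mdt0
# then all OSTs, run in parallel (one mount per OSS node)
mount -t lustre ostpool/ost0 /mnt/lustre/ost0
# recovery can then be watched on the servers
lctl get_param *.*.recovery_status
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;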

&lt;p&gt;In some cases we saw the soft lockup console messages before we saw &quot;Denying connection for new client ... X recovered, Y in progress, and Z evicted&quot; messages, and in other cases we saw the soft lockup messages appear later. &lt;br/&gt;
 After trying different lustre versions and reverting a couple of patches, we stopped for the day.&lt;/p&gt;</comment>

&lt;p&gt;The next day, we umounted (with --force) all but 4 client nodes.  We then rebooted the Lustre server nodes and mounted the targets.  The 4 clients reconnected, the servers evicted the other 5,400-ish clients, and the servers exited recovery.  We did not see the soft lockups.  We then re-mounted all the clients, and did not see the soft lockups.&lt;/p&gt;

&lt;p&gt;So based on that experiment, it seems as if starting the servers with all the clients reconnecting at the same time triggered this issue.  But it is our normal process to leave the clients mounted during file system updates like this, and we have not seen soft lockups with this stack before.  So perhaps something else changed in the system between Tuesday and Wednesday, but if so it was fairly subtle (there were no clusters added or removed, no routes changed, no LNet or Lustre configuration changes in general).&lt;/p&gt;</comment>
                            <comment id="326249" author="ofaaland" created="Mon, 14 Feb 2022 18:11:11 +0000"  >&lt;p&gt;I checked past logs. Our servers have not reported soft lockups with stacks including LNetPrimaryNID() and lnet_discover_peer_locked() before the 2022 Feb 8th instance.&lt;/p&gt;

&lt;p&gt;The only similar soft lockup stack was on a client node in a compute cluster (catalyst220) running Lustre 2.12.7_2.llnl, in September 2021:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Call Trace:
 [&amp;lt;ffffffff98ba9ae6&amp;gt;] queued_spin_lock_slowpath+0xb/0xf
 [&amp;lt;ffffffff98bb8b00&amp;gt;] _raw_spin_lock+0x30/0x40
 [&amp;lt;ffffffffc0df7b51&amp;gt;] cfs_percpt_lock+0xc1/0x110 [libcfs]
 [&amp;lt;ffffffffc10637a0&amp;gt;] lnet_discover_peer_locked+0xa0/0x450 [lnet]
 [&amp;lt;ffffffff984cc540&amp;gt;] ? wake_up_atomic_t+0x30/0x30
 [&amp;lt;ffffffffc1063c25&amp;gt;] LNetPrimaryNID+0xd5/0x220 [lnet]
 [&amp;lt;ffffffffc15cf57e&amp;gt;] ptlrpc_connection_get+0x3e/0x450 [ptlrpc]
 [&amp;lt;ffffffffc15c344c&amp;gt;] ptlrpc_uuid_to_connection+0xec/0x1a0 [ptlrpc]
 [&amp;lt;ffffffffc1594292&amp;gt;] import_set_conn+0xb2/0x7e0 [ptlrpc]
 [&amp;lt;ffffffffc15949d3&amp;gt;] client_import_add_conn+0x13/0x20 [ptlrpc]
 [&amp;lt;ffffffffc1339e98&amp;gt;] class_add_conn+0x418/0x630 [obdclass]
 [&amp;lt;ffffffffc133bb31&amp;gt;] class_process_config+0x1a81/0x2830 [obdclass]
... &lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="326250" author="ofaaland" created="Mon, 14 Feb 2022 18:17:36 +0000"  >&lt;p&gt;So on the bright side, our servers and clients are up and our clusters are doing work again.&#160; But I don&apos;t have a reproducer.&#160; We will try to update this server cluster (copper) to lustre 2.12.8_6.llnl again in a week or two, so we&apos;ll see if the issue comes up again.&lt;/p&gt;</comment>
                            <comment id="326303" author="ofaaland" created="Mon, 14 Feb 2022 22:12:22 +0000"  >
&lt;blockquote&gt;&lt;blockquote&gt;&lt;p&gt;behlendo found this patch, which seems to fit the stack reported with the soft lockups&lt;/p&gt;&lt;/blockquote&gt;
&lt;blockquote&gt;&lt;p&gt;&lt;a href=&quot;https://review.whamcloud.com/#/c/43537/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/#/c/43537/&lt;/a&gt;&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;FWIW, this patch is really just an optimization for a particular case where clients with discovery disabled are mounting a filesystem when one or more servers are down.&#160;&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;Chris, why would this only apply when clients are performing a mount?  As long as discovery is off, couldn&apos;t lnet_peer_is_uptodate() return false on a server just like it could on a client?&lt;br/&gt;
&#160;&lt;/p&gt;</comment>
                            <comment id="326445" author="eaujames" created="Wed, 16 Feb 2022 11:44:21 +0000"  >&lt;p&gt;Hello,&lt;/p&gt;

&lt;p&gt;The CEA has seen this kind of symptom on servers with missing routes (asymmetrical routes), or on clients (at mount time) when a target is failed over to another node (with the original node not responding, lnet module unloaded).&lt;br/&gt;
At that time the CEA had an lnet credit issue (starvation) for the client mount cases. Mounting all the clients at the same time could result in lnet credit starvation.&lt;/p&gt;

&lt;p&gt;I have backported &lt;a href=&quot;https://review.whamcloud.com/#/c/43537/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/#/c/43537/&lt;/a&gt; because the CEA doesn&apos;t use multirail.&lt;br/&gt;
We also set drop_asym_route=1 on servers to protect against route misconfiguration between clients and servers.&lt;/p&gt;
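&lt;p&gt;For reference, a minimal sketch of that setting (this assumes the lnet_drop_asym_route module parameter; the exact knob may differ between versions):&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# persistent setting on the servers (sketch), e.g. in /etc/modprobe.d/lnet.conf
options lnet lnet_drop_asym_route=1

# or at runtime, assuming the parameter is exposed under /sys
echo 1 &amp;gt; /sys/module/lnet/parameters/lnet_drop_asym_route
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;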

&lt;p&gt;Maybe the &lt;a href=&quot;https://review.whamcloud.com/45898/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/45898/&lt;/a&gt; (&quot;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-10931&quot; title=&quot;failed peer discovery still taking too long&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-10931&quot;&gt;&lt;del&gt;LU-10931&lt;/del&gt;&lt;/a&gt; lnet: handle unlink before send completes&quot;) could fix that issue with discovery on.&lt;/p&gt;</comment>
                            <comment id="326505" author="ofaaland" created="Wed, 16 Feb 2022 18:16:26 +0000"  >&lt;p&gt;Thanks, Etienne&lt;/p&gt;</comment>
                            <comment id="337886" author="sarah" created="Wed, 15 Jun 2022 21:51:05 +0000"  >&lt;p&gt;similar one on 2.12.9 &lt;a href=&quot;https://testing.whamcloud.com/test_sets/c858c157-7ecd-4c98-bfe3-1da2ce125f8c&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://testing.whamcloud.com/test_sets/c858c157-7ecd-4c98-bfe3-1da2ce125f8c&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="348358" author="eaujames" created="Fri, 30 Sep 2022 08:56:50 +0000"  >&lt;p&gt;Hello,&lt;/p&gt;

&lt;p&gt;We observed the same kind of stack trace as Olaf when mounting new clients after losing a DDN controller (4 OSS, targets are mounted on the other controller):&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt; [&amp;lt;ffffffff98ba9ae6&amp;gt;] queued_spin_lock_slowpath+0xb/0xf
 [&amp;lt;ffffffff98bb8b00&amp;gt;] _raw_spin_lock+0x30/0x40
 [&amp;lt;ffffffffc0df7b51&amp;gt;] cfs_percpt_lock+0xc1/0x110 [libcfs]
 [&amp;lt;ffffffffc10637a0&amp;gt;] lnet_discover_peer_locked+0xa0/0x450 [lnet]
 [&amp;lt;ffffffff984cc540&amp;gt;] ? wake_up_atomic_t+0x30/0x30
 [&amp;lt;ffffffffc1063c25&amp;gt;] LNetPrimaryNID+0xd5/0x220 [lnet]
 [&amp;lt;ffffffffc15cf57e&amp;gt;] ptlrpc_connection_get+0x3e/0x450 [ptlrpc]
 [&amp;lt;ffffffffc15c344c&amp;gt;] ptlrpc_uuid_to_connection+0xec/0x1a0 [ptlrpc]
 [&amp;lt;ffffffffc1594292&amp;gt;] import_set_conn+0xb2/0x7e0 [ptlrpc]
 [&amp;lt;ffffffffc15949d3&amp;gt;] client_import_add_conn+0x13/0x20 [ptlrpc]
 [&amp;lt;ffffffffc1339e98&amp;gt;] class_add_conn+0x418/0x630 [obdclass]
 [&amp;lt;ffffffffc133bb31&amp;gt;] class_process_config+0x1a81/0x2830 [obdclass]
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I have done some testing on the master branch with discovery enabled, and I was able to reproduce this.&lt;br/&gt;
I can only reproduce it with the client behind an LNet router or with message drop rules on the OSS (lctl net_drop_add).&lt;/p&gt;
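&lt;p&gt;For reference, a sketch of the kind of drop rule used (the same mechanism the Lustre test scripts use; exact options may differ between versions):&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# on the OSS: drop LNet messages matching the rule (here: everything on tcp)
lctl net_drop_add -s *@tcp -d *@tcp -r 1
# ... mount the client and observe the behaviour ...
# then remove all drop rules
lctl net_drop_del -a
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;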

&lt;p&gt;The client parses the MGS configuration llog &quot;&amp;lt;fsname&amp;gt;-client&quot; to initialize all the lustre devices.&lt;br/&gt;
e.g.: MGS client osc configuration&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;#35 (224)marker  24 (flags=0x01, v2.15.51.0) lustrefs-OST0000 &apos;add osc&apos; Fri Sep 30 09:57:22 2022-       
#36 (080)add_uuid  nid=10.0.2.5@tcp(0x200000a000205)  0:  1:10.0.2.5@tcp                                
#37 (128)attach    0:lustrefs-OST0000-osc  1:osc  2:lustrefs-clilov_UUID                                
#38 (136)setup     0:lustrefs-OST0000-osc  1:lustrefs-OST0000_UUID  2:10.0.2.5@tcp                      
#39 (080)add_uuid  nid=10.0.2.4@tcp(0x200000a000204)  0:  1:10.0.2.4@tcp                                
#40 (104)add_conn  0:lustrefs-OST0000-osc  1:10.0.2.4@tcp                                               
#41 (128)lov_modify_tgts add 0:lustrefs-clilov  1:lustrefs-OST0000_UUID  2:0  3:1                       
#42 (224)END   marker  24 (flags=0x02, v2.15.51.0) lustrefs-OST0000 &apos;add osc&apos; Fri Sep 30 09:57:22 2022- 
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
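&lt;p&gt;(For reference, a dump like the one above can be printed on the MGS with something along these lines; the exact invocation may vary between versions.)&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# on the MGS: print the client configuration llog for the filesystem
lctl --device MGS llog_print lustrefs-client
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;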

&lt;p&gt;The records are parsed sequentially:&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;setup -&amp;gt; client_obd_setup(): initializes the device and the primary connection (10.0.2.5@tcp, client_import_add_conn())&lt;/li&gt;
	&lt;li&gt;add_conn -&amp;gt; client_import_add_conn(): initializes the failover connections (10.0.2.4@tcp, client_import_add_conn())&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;The issue here is that client_import_add_conn() calls LNetPrimaryNID() and does discovery to get the remote node&apos;s interfaces.&lt;br/&gt;
The discovery thread starts by pinging the node and takes transaction_timeout (+/- transaction_timeout/2) for it.&lt;br/&gt;
In our case, we lost 4 OSS with 2 unreachable failover nodes each: 50s * 4 * 2 = 400s (max time is 600s).&lt;/p&gt;

&lt;p&gt;On the client side (2.12.7 LTS) we do not have the &lt;a href=&quot;https://review.whamcloud.com/#/c/43537/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/#/c/43537/&lt;/a&gt; (&quot;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14566&quot; title=&quot;Skip discovery in LNetPrimaryNID when lnet_peer_discovery_disabled is set&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14566&quot;&gt;&lt;del&gt;LU-14566&lt;/del&gt;&lt;/a&gt; lnet: Skip discovery in LNetPrimaryNID if DD disabled&quot;), so to work around this issue we decrease transaction_timeout before mounting the client and then restore it: 2s * 4 * 2 = 16s (max time: 24s).&lt;/p&gt;
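&lt;p&gt;A rough sketch of that workaround (the values are examples only, the knob is assumed to be lnetctl&apos;s transaction_timeout, and the fsname/NID below are placeholders):&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# shrink the per-ping discovery wait before mounting the client
lnetctl set transaction_timeout 2
mount -t lustre mgsnode@tcp:/fsname /mnt/lustre
# restore the previous value afterwards
lnetctl set transaction_timeout 50
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;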

&lt;p&gt;With the &lt;a href=&quot;https://review.whamcloud.com/39613&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/39613&lt;/a&gt; (&quot;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-10360&quot; title=&quot;use Imperative Recovery logs for client-&amp;gt;MDT/OST connections&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-10360&quot;&gt;LU-10360&lt;/a&gt; mgc: Use IR for client-&amp;gt;MDS/OST connections&quot;), I am not sure whether we have to do discovery when parsing the configuration. But I have not played a lot with multirail, so someone will have to confirm or refute this.&lt;/p&gt;</comment>
                            <comment id="366098" author="eaujames" created="Thu, 16 Mar 2023 11:09:13 +0000"  >&lt;p&gt;The patches of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14668&quot; title=&quot;LNet: do discovery in the background&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14668&quot;&gt;&lt;del&gt;LU-14668&lt;/del&gt;&lt;/a&gt; seem to resolve this issue.&lt;/p&gt;

&lt;p&gt;client_import_add_conn() does not hang anymore because&#160;LNetPrimaryNID() does not wait for the end of node discovery (it does the discovery in the background).&lt;/p&gt;</comment>
                            <comment id="368106" author="pjones" created="Sat, 1 Apr 2023 15:08:16 +0000"  >&lt;p&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=hxing&quot; class=&quot;user-hover&quot; rel=&quot;hxing&quot;&gt;hxing&lt;/a&gt;&#160;could you please port the &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14668&quot; title=&quot;LNet: do discovery in the background&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14668&quot;&gt;&lt;del&gt;LU-14668&lt;/del&gt;&lt;/a&gt; patches to b2_15?&lt;/p&gt;</comment>
                            <comment id="368133" author="JIRAUSER17900" created="Mon, 3 Apr 2023 00:59:59 +0000"  >&lt;p&gt;OK, I&apos;ll Port them to b2_15.&lt;/p&gt;</comment>
                            <comment id="373243" author="ofaaland" created="Tue, 23 May 2023 22:11:10 +0000"  >&lt;p&gt;&amp;gt; OK, I&apos;ll Port them to b2_15.&lt;/p&gt;

&lt;p&gt;Is this still being done?&lt;/p&gt;

&lt;p&gt;thanks&lt;/p&gt;</comment>
                            <comment id="373246" author="ssmirnov" created="Tue, 23 May 2023 22:27:34 +0000"  >&lt;p&gt;Hi Olaf,&lt;/p&gt;

&lt;p&gt;Yes, there were some distractions so I started on this only late last week. I&apos;m still porting the patches. There&apos;s a chance I&apos;ll push the ports by the end of this week.&lt;/p&gt;

&lt;p&gt;Thanks,&lt;/p&gt;

&lt;p&gt;Serguei.&lt;/p&gt;</comment>
                            <comment id="374026" author="ssmirnov" created="Wed, 31 May 2023 19:14:16 +0000"  >&lt;p&gt;Here&apos;s the link to the &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14668&quot; title=&quot;LNet: do discovery in the background&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14668&quot;&gt;&lt;del&gt;LU-14668&lt;/del&gt;&lt;/a&gt; patch series ported to b2_15: &lt;a href=&quot;https://review.whamcloud.com/51135/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/51135/&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="374443" author="ofaaland" created="Sat, 3 Jun 2023 00:48:20 +0000"  >&lt;p&gt;Thank you, Serguei.  We&apos;ll add them to our stack and do some testing.  We haven&apos;t successfully reproduced the original issue, so we&apos;ll only be able to tell you if we have unexpected new symptoms with LNet; but that&apos;s a start.&lt;/p&gt;</comment>
                            <comment id="381090" author="pjones" created="Wed, 2 Aug 2023 13:08:06 +0000"  >&lt;p&gt;The patch series has merged to b2_15 for 2.15.4&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="64039">LU-14668</issuekey>
        </issuelink>
                            </outwardlinks>
                                                                <inwardlinks description="is related to">
                                                        </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="42295" name="vmcore-dmesg.copper1.txt" size="1016928" author="ofaaland" created="Thu, 10 Feb 2022 02:43:54 +0000"/>
                            <attachment id="42296" name="vmcore-dmesg.copper2.txt" size="582574" author="ofaaland" created="Thu, 10 Feb 2022 02:43:54 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i02hy7:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>