<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:49:50 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-12122] Deadlock with check_routers_before_use and discovery</title>
                <link>https://jira.whamcloud.com/browse/LU-12122</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;I found this issue when testing Amir&apos;s new patches (see &lt;a href=&quot;https://review.whamcloud.com/#/c/33651/9&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/#/c/33651/9&lt;/a&gt;), but I believe the issue exists in current master.&lt;/p&gt;

&lt;p&gt;LNetNIInit() calls lnet_monitor_thr_start() -&amp;gt; lnet_router_post_mt_start() -&amp;gt; lnet_wait_known_routerstate()&lt;/p&gt;

&lt;p&gt;lnet_wait_known_routerstate() will wait indefinitely until all gateways have been discovered.&lt;/p&gt;

&lt;p&gt;However, the discovery thread is not started until after lnet_monitor_thr_start() returns. Thus, LNet never finishes starting.&lt;/p&gt;

&lt;p&gt;Logs slowly fill with:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[7073564.123980] LNetError: 31952:0:(router.c:873:lnet_check_routers()) Failed to discover router 192.168.2.26@tcp4
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Reproduced on a simple three-node VM setup.&lt;/p&gt;

&lt;p&gt;The LNet configuration:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;sles15build01:/etc/modprobe.d # pdsh -w sles15build01,sles15c01,sles15s01 cat /etc/modprobe.d/lnet.conf | dshbak -c
----------------
sles15s01
----------------
options lnet networks=&quot;tcp(eth0)&quot;
options lnet routes=&quot;tcp4 192.168.2.26@tcp&quot;
options lnet lnet_peer_discovery_disabled=0
options lnet check_routers_before_use=1
----------------
sles15c01
----------------
options lnet ip2nets=&quot;tcp4(eth0) 192.168.*.*; tcp99(eth0) 192.168.*.*&quot;
options lnet routes=&quot;tcp 192.168.2.26@tcp4&quot;
options lnet lnet_peer_discovery_disabled=0
options lnet check_routers_before_use=1
----------------
sles15build01
----------------
options lnet ip2nets=&quot;tcp(eth0) 192.168.*.*; tcp4(eth1) 192.168.*.*; tcp99(eth1) 192.168.*.*&quot;
options lnet forwarding=enabled
options lnet lnet_peer_discovery_disabled=0
sles15build01:/etc/modprobe.d #
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Attempt to start LNet:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;sles15build01:/etc/modprobe.d # pdsh -w sles15build01,sles15c01,sles15s01
pdsh&amp;gt; modprobe lnet
pdsh&amp;gt; lctl net up
sles15build01: LNET configured
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;LNet can start on the node acting as a router, but it hangs indefinitely on the other two nodes.&lt;/p&gt;</description>
                <environment></environment>
        <key id="55260">LU-12122</key>
            <summary>Deadlock with check_routers_before_use and discovery</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="6" iconUrl="https://jira.whamcloud.com/images/icons/statuses/closed.png" description="The issue is considered finished, the resolution is correct. Issues which are closed can be reopened.">Closed</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="6">Not a Bug</resolution>
                                        <assignee username="wc-triage">WC Triage</assignee>
                                    <reporter username="hornc">Chris Horn</reporter>
                        <labels>
                    </labels>
                <created>Tue, 26 Mar 2019 18:40:58 +0000</created>
                <updated>Fri, 3 Jan 2020 00:42:58 +0000</updated>
                            <resolved>Tue, 26 Mar 2019 21:49:23 +0000</resolved>
                                    <version>Lustre 2.12.0</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>3</watches>
                                                                            <comments>
                            <comment id="244690" author="hornc" created="Tue, 26 Mar 2019 21:20:15 +0000"  >&lt;p&gt;While trying out a quick fix for this issue, I noticed that &apos;lctl net up&apos; will hang if check_routers_before_use is enabled but the routers don&apos;t have LNet loaded.&lt;br/&gt;
I think what ought to happen is that we attempt discovery of each router once and set each one up or down as appropriate,&lt;br/&gt;
rather than waiting forever.&lt;/p&gt;</comment>
                            <comment id="244692" author="hornc" created="Tue, 26 Mar 2019 21:49:15 +0000"  >&lt;p&gt;I must&apos;ve gotten mixed up when checking whether this bug exists in master. The switch from the router_checker_thread to the monitoring thread is in master, but this bug is only introduced by Amir&apos;s patch to replace the router pings with discovery under &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11299&quot; title=&quot;LNet: Router pinger&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11299&quot;&gt;&lt;del&gt;LU-11299&lt;/del&gt;&lt;/a&gt;. Since that patch is still under review, I will close this ticket and share my feedback in the code review.&lt;/p&gt;</comment>
                            <comment id="260477" author="mhaakddn" created="Thu, 2 Jan 2020 01:15:22 +0000"  >&lt;p&gt;This needs to be reopened.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11299&quot; title=&quot;LNet: Router pinger&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11299&quot;&gt;&lt;del&gt;LU-11299&lt;/del&gt;&lt;/a&gt; was committed and the bug does not appear to be addressed in the final version of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11299&quot; title=&quot;LNet: Router pinger&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11299&quot;&gt;&lt;del&gt;LU-11299&lt;/del&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;All of production went down because the Lustre servers could not resolve the status of the LNet routers.&lt;/p&gt;</comment>
                            <comment id="260495" author="pjones" created="Thu, 2 Jan 2020 14:30:24 +0000"  >&lt;p&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=mhaakddn&quot; class=&quot;user-hover&quot; rel=&quot;mhaakddn&quot;&gt;mhaakddn&lt;/a&gt; rather than reopening ancient similar tickets, please open a new ticket with the details of the incident for analysis.&lt;/p&gt;</comment>
                            <comment id="260507" author="hornc" created="Thu, 2 Jan 2020 20:23:50 +0000"  >&lt;p&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=mhaak&quot; class=&quot;user-hover&quot; rel=&quot;mhaak&quot;&gt;mhaak&lt;/a&gt; perhaps you experienced &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13001&quot; title=&quot;check_routers_before_use causes LNet to hang indefinitely if any router is down&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13001&quot;&gt;&lt;del&gt;LU-13001&lt;/del&gt;&lt;/a&gt;?&lt;/p&gt;</comment>
                            <comment id="260533" author="mhaakddn" created="Fri, 3 Jan 2020 00:42:58 +0000"  >&lt;p&gt;@Chris Horn,&lt;/p&gt;

&lt;p&gt;That would appear to be it.&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i00dyf:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>