<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:51:54 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-5485] first mount always fail with avoid_asym_router_failure</title>
                <link>https://jira.whamcloud.com/browse/LU-5485</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;We hit this on lola. The environment is quite simple: all clients are in o2ib1 and all servers are in o2ib0, the two networks are connected via two routers, and there are no other nodes in this cluster.&lt;/p&gt;

&lt;p&gt;We found that when we unload and reload the client modules, the first mount always fails and the second attempt succeeds. After digging into the source code, I think the scenario is as follows:&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;LNet is shut down on all client nodes, so there are no incoming or outgoing messages on network o2ib1, and the Router Checker (RC) on each router changes the status of its NI on o2ib1 to &quot;DOWN&quot; after a couple of minutes.&lt;/li&gt;
	&lt;li&gt;The RC on the servers pings the routers and learns that NI(o2ib1) is DOWN on all of them.&lt;/li&gt;
	&lt;li&gt;Before the next RC ping from the servers, a user tries to mount the Lustre client on the client nodes, and the server (MGS) handles the connect request and replies.&lt;/li&gt;
	&lt;li&gt;While sending this reply, LNet searches for routers and finds that all routers are DOWN for o2ib1 (out-of-date information), although the NI status on the routers is actually UP by now (the routers have received requests from clients on o2ib1, so they have changed NI(o2ib1) to UP).&lt;/li&gt;
	&lt;li&gt;The mount fails until the next time the RC pings the routers and gets up-to-date information from them.&lt;/li&gt;
&lt;/ul&gt;
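&lt;p&gt;The sequence above can be illustrated with a toy model. This is only a sketch of the timing problem, not LNet code: the Router and Server classes, their methods, and the router names are all hypothetical, and the router checker is reduced to a single cached status per router.&lt;/p&gt;

```python
# Toy model of the stale-status scenario; this is not LNet code.
# All class, method, and router names here are hypothetical.

class Router:
    def __init__(self, name):
        self.name = name
        self.ni_up = {"o2ib0": True, "o2ib1": True}

    def idle_timeout(self, net):
        # The router's checker saw no traffic on this network.
        self.ni_up[net] = False

    def receive_from(self, net):
        # Any traffic from the network marks the NI as UP again.
        self.ni_up[net] = True


class Server:
    def __init__(self, routers):
        self.routers = routers
        self.cached = {}  # router name to last known far-side NI status

    def rc_ping(self, net):
        # The server's router checker refreshes its cached view.
        for r in self.routers:
            self.cached[r.name] = r.ni_up[net]

    def can_reply(self):
        # A reply needs at least one router believed to be UP.
        return any(self.cached.values())


def first_and_second_mount():
    routers = [Router("r1"), Router("r2")]
    server = Server(routers)
    for r in routers:
        r.idle_timeout("o2ib1")   # clients have unloaded LNet
    server.rc_ping("o2ib1")       # server learns NI(o2ib1) is DOWN
    for r in routers:
        r.receive_from("o2ib1")   # mount request passes through the routers
    first = server.can_reply()    # stale view, so no usable route
    server.rc_ping("o2ib1")       # next RC ping refreshes the view
    second = server.can_reply()
    return first, second
```

&lt;p&gt;In this model the first reply has no usable route because the cached status predates the mount request, and the second one does once the checker has pinged again.&lt;/p&gt;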


&lt;p&gt;I think users have not hit this because they normally upgrade clients in a few batches, or check network status (lctl ping, etc.) before mounting the client, so the routers receive some traffic from the client network and keep the NI status alive.&lt;/p&gt;

&lt;p&gt;I don&apos;t have a good solution yet; I need more time to think about it and discuss it with Isaac.&lt;/p&gt;</description>
                <environment></environment>
        <key id="26010">LU-5485</key>
            <summary>first mount always fail with avoid_asym_router_failure</summary>
                <type id="4" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11310&amp;avatarType=issuetype">Improvement</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="liang">Liang Zhen</assignee>
                                    <reporter username="liang">Liang Zhen</reporter>
                        <labels>
                    </labels>
                <created>Thu, 14 Aug 2014 12:58:42 +0000</created>
                <updated>Mon, 27 Apr 2015 20:45:32 +0000</updated>
                            <resolved>Thu, 8 Jan 2015 13:54:57 +0000</resolved>
                                                    <fixVersion>Lustre 2.7.0</fixVersion>
                    <fixVersion>Lustre 2.5.4</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>13</watches>
                                                                            <comments>
                            <comment id="91697" author="liang" created="Fri, 15 Aug 2014 02:56:18 +0000"  >&lt;p&gt;Isaac, could you please comment?&lt;/p&gt;</comment>
                            <comment id="92655" author="isaac" created="Wed, 27 Aug 2014 20:35:06 +0000"  >&lt;p&gt;There used to be a similar problem with conventional router pingers (i.e. without the asymmetrical pinger) at ORNL. ORNL often boots a whole client cluster (including the routers that connect to the server cluster) all together, so when a client&apos;s request arrives at a server there&apos;s a chance that the server still considers all routers to the client cluster dead; the server then drops the reply because no route to the client is available.&lt;/p&gt;

&lt;p&gt;A possible solution is:&lt;br/&gt;
When a message arrives (in lnet_parse()) from a router, this is a good indication that the router is available. Check if our router status is up-to-date, in case the pinger hasn&apos;t been able to update it yet:&lt;/p&gt;
&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;If the router is down, mark it as up.&lt;/li&gt;
	&lt;li&gt;If the router&apos;s corresponding far-side NI is down, mark it as up too.&lt;/li&gt;
&lt;/ul&gt;
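&lt;p&gt;A minimal sketch of that idea, for discussion only. It is not the code in lnet_parse(): RouterEntry, on_message_from_router and usable_routers are hypothetical names, and it assumes the &quot;corresponding far-side NI&quot; is the one on the network the message originated from.&lt;/p&gt;

```python
# Hypothetical sketch of the proposed refresh; not actual LNet code.

class RouterEntry:
    def __init__(self, nid):
        self.nid = nid
        self.alive = False
        self.far_ni_up = {}  # remote network name to believed NI status


def on_message_from_router(entry, src_net):
    """Refresh cached state when a message arrives via this router.

    Returns True if any cached status was changed.
    """
    changed = False
    if not entry.alive:
        # A message just came through, so the router itself is up.
        entry.alive = True
        changed = True
    if not entry.far_ni_up.get(src_net, False):
        # Assumption: the far-side NI is the one on the source network.
        entry.far_ni_up[src_net] = True
        changed = True
    return changed


def usable_routers(entries, dst_net):
    # Routers that are alive and whose NI on dst_net is believed UP.
    return [e.nid for e in entries
            if e.alive and e.far_ni_up.get(dst_net, False)]
```

&lt;p&gt;With a refresh like this, a reply sent right after a request arrives would find the router it came through, without waiting for the next pinger round.&lt;/p&gt;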
</comment>
                            <comment id="93777" author="liang" created="Thu, 11 Sep 2014 12:37:06 +0000"  >&lt;p&gt;Following Isaac&apos;s suggestion, I am also trying to address this issue in &lt;a href=&quot;http://review.whamcloud.com/11748&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/11748&lt;/a&gt;&lt;br/&gt;
It&apos;s not ready for production yet; for now it&apos;s only for testing and discussion.&lt;br/&gt;
I may have a follow-on patch to reduce pings when a router has recent aliveness information.&lt;/p&gt;</comment>
                            <comment id="96139" author="simmonsja" created="Fri, 10 Oct 2014 17:52:21 +0000"  >&lt;p&gt;When we attempted to upgrade to 2.4 we had to turn off asym_router_failure in order to bring up our file system. Recently we upgraded to 2.5.3 and again we hit the issue of asym_router_failure breaking our systems. Currently we have it turned off in our system.&lt;/p&gt;</comment>
                            <comment id="97929" author="liang" created="Thu, 30 Oct 2014 12:43:02 +0000"  >&lt;p&gt;I think we should have a dedicated patch for this issue, instead of putting everything in &lt;a href=&quot;http://review.whamcloud.com/11748&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/11748&lt;/a&gt;&lt;br/&gt;
Here is the patch. Isaac, could you take a look?&lt;br/&gt;
&lt;a href=&quot;http://review.whamcloud.com/#/c/12453/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/12453/&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="98180" author="simmonsja" created="Mon, 3 Nov 2014 15:12:16 +0000"  >&lt;p&gt;Liang, does this patch need to be applied to both clients and servers?&lt;/p&gt;</comment>
                            <comment id="101065" author="gerrit" created="Tue, 9 Dec 2014 08:13:41 +0000"  >&lt;p&gt;Oleg Drokin (oleg.drokin@intel.com) merged in patch &lt;a href=&quot;http://review.whamcloud.com/12453/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/12453/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5485&quot; title=&quot;first mount always fail with avoid_asym_router_failure&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5485&quot;&gt;&lt;del&gt;LU-5485&lt;/del&gt;&lt;/a&gt; lnet: peer aliveness status and NI status&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: fb259fe85813e0f28ac7f7410689e3856ef26316&lt;/p&gt;</comment>
                            <comment id="102863" author="jlevi" created="Thu, 8 Jan 2015 13:54:57 +0000"  >&lt;p&gt;Patch landed to Master. If there is more work to be done in this ticket, please reopen.&lt;/p&gt;</comment>
                            <comment id="102869" author="simmonsja" created="Thu, 8 Jan 2015 14:43:46 +0000"  >&lt;p&gt;Mounting now works with ARF. Now ARF just doesn&apos;t work for us. That work can be completed under &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5758&quot; title=&quot;enabling avoid_asym_router_failure prvents the bring up of ORNL production systems&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5758&quot;&gt;&lt;del&gt;LU-5758&lt;/del&gt;&lt;/a&gt;.&lt;/p&gt;</comment>
                            <comment id="104803" author="gerrit" created="Tue, 27 Jan 2015 02:43:37 +0000"  >&lt;p&gt;Oleg Drokin (oleg.drokin@intel.com) merged in patch &lt;a href=&quot;http://review.whamcloud.com/12435/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/12435/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5485&quot; title=&quot;first mount always fail with avoid_asym_router_failure&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5485&quot;&gt;&lt;del&gt;LU-5485&lt;/del&gt;&lt;/a&gt; lnet: peer aliveness status and NI status&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_5&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 58c4cd80e197bd6e70d1638df796ae878baf844c&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10010">
                    <name>Duplicate</name>
                                            <outwardlinks description="duplicates">
                                        <issuelink>
            <issuekey id="27154">LU-5785</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="27994">LU-6060</issuekey>
        </issuelink>
                            </outwardlinks>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="27053">LU-5758</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzwtq7:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>15306</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                </customfields>
    </item>
</channel>
</rss>