<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:52:37 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-5570] Better router selection in LNet</title>
                <link>https://jira.whamcloud.com/browse/LU-5570</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;LNet chooses routers based on the bytes queued on each router. Meanwhile, it normally takes tens of seconds to detect a dead router (we see a failed completion event for an outstanding tx/rx, then close the connection and notify LNet that the peer is dead), which means it is still possible to queue more messages to a potentially dead router if all other live routers have long message queues.&lt;/p&gt;

&lt;p&gt;We may need to check the aliveness timestamp as part of router evaluation, and avoid choosing routers that have been inactive for a certain number of seconds as long as other active routers are available (it takes quite a long time to mark a router as dead, so we might prefer not to choose it even before it is marked dead).&lt;/p&gt;</description>
                <environment></environment>
        <key id="26255">LU-5570</key>
            <summary>Better router selection in LNet</summary>
                <type id="4" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11310&amp;avatarType=issuetype">Improvement</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="liang">Liang Zhen</assignee>
                                    <reporter username="liang">Liang Zhen</reporter>
                        <labels>
                    </labels>
                <created>Tue, 2 Sep 2014 07:13:29 +0000</created>
                <updated>Wed, 13 Feb 2019 07:37:13 +0000</updated>
                            <resolved>Wed, 13 Feb 2019 07:36:58 +0000</resolved>
                                                                        <due></due>
                            <votes>0</votes>
                                    <watches>12</watches>
                                                                            <comments>
                            <comment id="93174" author="liang" created="Thu, 4 Sep 2014 05:38:55 +0000"  >&lt;p&gt;Isaac, Amir, I just posted an experimental patch: &lt;a href=&quot;http://review.whamcloud.com/11748&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/11748&lt;/a&gt;&lt;br/&gt;
Could you take a look and give me some feedback? Thanks.&lt;/p&gt;</comment>
                            <comment id="93450" author="isaac" created="Mon, 8 Sep 2014 17:57:50 +0000"  >&lt;p&gt;How is it different from setting router_ping_timeout to a lower value (and thus detecting dead/congested routers earlier)?&lt;/p&gt;</comment>
                            <comment id="93498" author="liang" created="Tue, 9 Sep 2014 03:16:15 +0000"  >&lt;p&gt;If so, aren&apos;t we more likely to drop already-queued messages even when the router is still alive? I also think it would be helpful if users could see the last_alive of routers on clients/servers.&lt;/p&gt;</comment>
                            <comment id="93500" author="isaac" created="Tue, 9 Sep 2014 03:46:50 +0000"  >&lt;p&gt;In case of a false dead router due to low timeout, we don&apos;t abandon messages already queued to it, do we? I thought we just stop giving it more messages, and leave those already queued intact.&lt;/p&gt;</comment>
                            <comment id="93502" author="liang" created="Tue, 9 Sep 2014 04:12:55 +0000"  >&lt;p&gt;But when we finalise a message and return the credit, we may drop queued messages because the router is down? Hmm... without this patch, this only happens on the router, so you are correct, this is not an issue.&lt;br/&gt;
Anyway, I agree that changing the configuration may prevent the upper layer from delivering more messages to a potentially dead router, but it may take a while to recover the correct status even if it is a false dead (dead_router_check_interval). Also, users may not notice that they have to change their configuration. If we have aliveness status for routers, users don&apos;t have to change anything?&lt;/p&gt;</comment>
                            <comment id="93547" author="isaac" created="Tue, 9 Sep 2014 16:48:20 +0000"  >&lt;p&gt;The dead_router_check_interval can be changed at run time, so it is not a big deal really. My point is that the added code complexity seems to outweigh the benefits of the patch. If you want to go forward, why not also avoid unnecessary pings (e.g. no need to ping if last_alive is very recent)? Then the additional benefit of reduced pings would make it more worthwhile.&lt;/p&gt;</comment>
                            <comment id="93548" author="isaac" created="Tue, 9 Sep 2014 16:52:52 +0000"  >&lt;p&gt;Also, with aliveness for routers, it&apos;d be possible to fix &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5485&quot; title=&quot;first mount always fail with avoid_asym_router_failure&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5485&quot;&gt;&lt;del&gt;LU-5485&lt;/del&gt;&lt;/a&gt; as well. Better plan them all together.&lt;/p&gt;</comment>
                            <comment id="93763" author="liang" created="Thu, 11 Sep 2014 03:50:56 +0000"  >&lt;p&gt;I think we should avoid configuration changes whenever possible. We already have too many tunables, which is overkill and makes it very hard for users to get them all correct.&lt;/p&gt;

&lt;p&gt;I totally agree it&apos;s better to fix &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5485&quot; title=&quot;first mount always fail with avoid_asym_router_failure&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5485&quot;&gt;&lt;del&gt;LU-5485&lt;/del&gt;&lt;/a&gt; in this patch (sorry I missed your comment there), and RC ping reduction is also a very good idea. I will post a follow-on patch to implement ping reduction: it may require a few more changes to several different timestamps, so it would be clearer to have a separate patch. Thanks.&lt;/p&gt;</comment>
                            <comment id="93767" author="shadow" created="Thu, 11 Sep 2014 07:58:09 +0000"  >&lt;p&gt;I think we should not queue any messages when credits are negative. That way we queue only data that can actually be sent, and it is easy to put the other data on different routers.&lt;/p&gt;</comment>
                            <comment id="93775" author="liang" created="Thu, 11 Sep 2014 12:34:34 +0000"  >&lt;p&gt;In current LNet, if any router has positive credits, we queue the message to it rather than to a router with negative credits.&lt;/p&gt;</comment>
                            <comment id="93776" author="shadow" created="Thu, 11 Sep 2014 12:36:48 +0000"  >&lt;p&gt;That&apos;s not true; I have lots of crash dumps with negative credits per destination when a router is dead.&lt;/p&gt;</comment>
                            <comment id="102503" author="gerrit" created="Sun, 4 Jan 2015 18:33:04 +0000"  >&lt;p&gt;Oleg Drokin (oleg.drokin@intel.com) merged in patch &lt;a href=&quot;http://review.whamcloud.com/11748/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/11748/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5570&quot; title=&quot;Better router selection in LNet&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5570&quot;&gt;&lt;del&gt;LU-5570&lt;/del&gt;&lt;/a&gt; lnet: check router aliveness timestamp&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 339c7b2b784a528f41c432e9b90285d3445b7536&lt;/p&gt;</comment>
                            <comment id="102862" author="jlevi" created="Thu, 8 Jan 2015 13:54:14 +0000"  >&lt;p&gt;Patch landed to Master. If there is more work to be done in this ticket, please reopen the ticket.&lt;/p&gt;</comment>
                            <comment id="102946" author="gerrit" created="Fri, 9 Jan 2015 01:34:34 +0000"  >&lt;p&gt;Oleg Drokin (oleg.drokin@intel.com) uploaded a new patch: &lt;a href=&quot;http://review.whamcloud.com/13302&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/13302&lt;/a&gt;&lt;br/&gt;
Subject: Revert &quot;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5570&quot; title=&quot;Better router selection in LNet&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5570&quot;&gt;&lt;del&gt;LU-5570&lt;/del&gt;&lt;/a&gt; lnet: check router aliveness timestamp&quot;&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 2d8b8c9e0149b0fe860983cd2020d9781bd2e548&lt;/p&gt;</comment>
                            <comment id="102947" author="liang" created="Fri, 9 Jan 2015 01:36:27 +0000"  >&lt;p&gt;I just asked Oleg to revert it because this patch was not cleanly rebased; we also need Isaac to review it.&lt;/p&gt;</comment>
                            <comment id="102960" author="gerrit" created="Fri, 9 Jan 2015 03:11:13 +0000"  >&lt;p&gt;Oleg Drokin (oleg.drokin@intel.com) merged in patch &lt;a href=&quot;http://review.whamcloud.com/13302/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/13302/&lt;/a&gt;&lt;br/&gt;
Subject: Revert &quot;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5570&quot; title=&quot;Better router selection in LNet&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5570&quot;&gt;&lt;del&gt;LU-5570&lt;/del&gt;&lt;/a&gt; lnet: check router aliveness timestamp&quot;&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: bfaadd73b74da2aca82007ca78a6baf15ea2790c&lt;/p&gt;</comment>
                            <comment id="103158" author="gerrit" created="Mon, 12 Jan 2015 01:03:58 +0000"  >&lt;p&gt;Liang Zhen (liang.zhen@intel.com) uploaded a new patch: &lt;a href=&quot;http://review.whamcloud.com/13342&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/13342&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5570&quot; title=&quot;Better router selection in LNet&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5570&quot;&gt;&lt;del&gt;LU-5570&lt;/del&gt;&lt;/a&gt; lnet: check router aliveness timestamp&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 8650e97e5bede98f3bae16cf64a687ae1c07ef4b&lt;/p&gt;</comment>
                            <comment id="103159" author="simmonsja" created="Mon, 12 Jan 2015 01:23:46 +0000"  >&lt;p&gt;Do you think this could help with the ARF problems we have been having? Earlier comments seem to point to that.&lt;/p&gt;</comment>
                            <comment id="103178" author="liang" created="Mon, 12 Jan 2015 12:18:56 +0000"  >&lt;p&gt;James, probably not. I still need your help to sample the NI status on the router (see my last comment on &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5758&quot; title=&quot;enabling avoid_asym_router_failure prvents the bring up of ORNL production systems&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5758&quot;&gt;&lt;del&gt;LU-5758&lt;/del&gt;&lt;/a&gt;).&lt;/p&gt;</comment>
                            <comment id="238727" author="ashehata" created="Tue, 18 Dec 2018 02:56:07 +0000"  >&lt;p&gt;This won&apos;t be needed with the router rework I have done. LNet Health currently uses the legacy routing code. I reworked the routing code to bring it in line with Multi-Rail. These changes should resolve the issue described in this ticket.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="34447">LU-7734</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzwv2f:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>15533</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                </customfields>
    </item>
</channel>
</rss>