<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 03:14:24 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-14979] LNet: add tunable parameter to control max recovery interval duration</title>
                <link>https://jira.whamcloud.com/browse/LU-14979</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;The currently implemented recovery ping mechanism increases the timeout for the next scheduled recovery ping attempt exponentially (base 2) and caps it at 900 seconds. This hard-coded cap appears to be too high in many cases. Introduce a tunable parameter that limits the recovery ping timeout, and choose a reasonable default.&lt;/p&gt;</description>
                <environment></environment>
        <key id="65908">LU-14979</key>
            <summary>LNet: add tunable parameter to control max recovery interval duration</summary>
                <type id="4" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11310&amp;avatarType=issuetype">Improvement</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="cbordage">Cyril Bordage</assignee>
                                    <reporter username="ssmirnov">Serguei Smirnov</reporter>
                        <labels>
                            <label>lnet</label>
                            <label>lnet-health</label>
                    </labels>
                <created>Wed, 1 Sep 2021 20:02:16 +0000</created>
                <updated>Sat, 11 Jun 2022 15:30:22 +0000</updated>
                            <resolved>Sat, 11 Jun 2022 15:30:22 +0000</resolved>
                                                    <fixVersion>Lustre 2.16.0</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>4</watches>
                                                                            <comments>
                            <comment id="311866" author="hornc" created="Wed, 1 Sep 2021 20:20:36 +0000"  >&lt;p&gt;Can you add some detail about the cases where the value is too high? My hope was that resetting the interval when we received a message from an NI would be sufficient. Is that not working for some reason?&lt;/p&gt;</comment>
                            <comment id="311870" author="ssmirnov" created="Wed, 1 Sep 2021 21:24:20 +0000"  >&lt;p&gt;Chris,&lt;/p&gt;

&lt;p&gt;I set up a test for&#160;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14978&quot; title=&quot;LNet: balance peer NI selection if peer NI is added late&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14978&quot;&gt;&lt;del&gt;LU-14978&lt;/del&gt;&lt;/a&gt;: Node A with one NI, Node B with two NIs, all on the same net. I was using lnetctl ping to create traffic from A to B. Then I executed &quot;ifdown&quot; on the interface corresponding to one of B&apos;s NIs. (This was to simulate a hardware failure on node B.) Some lnetctl pings failed and the &quot;failed&quot; peer NI&apos;s health got decremented as seen by A. I left it alone for a few minutes, then brought the &quot;failed&quot; interface on B back up. A didn&apos;t realize that B had both NIs healthy until it got around to sending the next recovery ping. In my opinion, the delay was too long. Unless I initiate a ping from B to A and it uses the recovered interface, A has no idea that B has both NIs back until the 900-second timeout expires.&lt;/p&gt;</comment>
                            <comment id="311874" author="hornc" created="Wed, 1 Sep 2021 22:07:11 +0000"  >&lt;p&gt;Okay, that makes sense and is working like I would expect. It might be interesting to see whether this is an issue in an environment where Node A is a Lustre server and B is a Lustre client (and vice versa) and there is actual I/O going on (or maybe even just idle client traffic). I think that with I/O going on things might recover more quickly, while the idle client case might still take a while to recover the NI. But if the client is idle, maybe it doesn&apos;t really matter: once I/O starts we may again recover quickly.&lt;/p&gt;</comment>
                            <comment id="312896" author="gerrit" created="Wed, 15 Sep 2021 16:19:13 +0000"  >&lt;p&gt;&quot;Cyril Bordage &amp;lt;cbordage@whamcloud.com&amp;gt;&quot; uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/44927&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/44927&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14979&quot; title=&quot;LNet: add tunable parameter to control max recovery interval duration&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14979&quot;&gt;&lt;del&gt;LU-14979&lt;/del&gt;&lt;/a&gt; lnet: set max recovery interval duration&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: f3bea849cd255b6bcd8c379904795e3d8d6ffde8&lt;/p&gt;</comment>
                            <comment id="337389" author="gerrit" created="Sat, 11 Jun 2022 05:31:42 +0000"  >&lt;p&gt;&quot;Oleg Drokin &amp;lt;green@whamcloud.com&amp;gt;&quot; merged in patch &lt;a href=&quot;https://review.whamcloud.com/44927/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/44927/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14979&quot; title=&quot;LNet: add tunable parameter to control max recovery interval duration&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14979&quot;&gt;&lt;del&gt;LU-14979&lt;/del&gt;&lt;/a&gt; lnet: set max recovery interval duration&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 4027395fe463b6ea11084ff2af43ba0732ad0ddb&lt;/p&gt;</comment>
                            <comment id="337491" author="pjones" created="Sat, 11 Jun 2022 15:30:22 +0000"  >&lt;p&gt;Landed for 2.16&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                    <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i023dj:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>