<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:29:10 UTC 2024

It is possible to restrict the fields returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary, append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-2898] More timely notification of clients in case of eviction</title>
                <link>https://jira.whamcloud.com/browse/LU-2898</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;There have been periodic complaints about Lustre clients not really knowing when they have been evicted from a server node, since an eviction can only be discovered when an RPC is sent.&lt;br/&gt;
Frequently this would be caught by the periodic ping, but with pinging being dialed back to happen more rarely, it increasingly turns into the case of an application initiating an RPC and suddenly discovering an eviction that happened quite a while ago.&lt;/p&gt;

&lt;p&gt;As such, we probably need a better way of notifying clients of their eviction so that they can reconnect more eagerly and with less damage to whatever might be running in userspace.&lt;/p&gt;</description>
                <environment></environment>
        <key id="17749">LU-2898</key>
            <summary>More timely notification of clients in case of eviction</summary>
                <type id="4" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11310&amp;avatarType=issuetype">Improvement</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="wc-triage">WC Triage</assignee>
                                    <reporter username="green">Oleg Drokin</reporter>
                        <labels>
                    </labels>
                <created>Sun, 3 Mar 2013 23:51:19 +0000</created>
                <updated>Mon, 22 Jan 2024 15:57:16 +0000</updated>
                                                                                <due></due>
                            <votes>0</votes>
                                    <watches>7</watches>
                                                                            <comments>
                            <comment id="53243" author="green" created="Sun, 3 Mar 2013 23:59:21 +0000"  >&lt;p&gt;Fujitsu, as the first site to disable pinging in most cases, hit this especially often, so they created a patch to avert the issue: it makes servers notify the MGS of an eviction, and the MGS in turn sends messages to the affected clients to contact their servers and reconnect as needed (sort of like reverse imperative recovery, I guess).&lt;br/&gt;
The contributed patch against FEFS is here: &lt;a href=&quot;http://review.whamcloud.com/#change,5457&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#change,5457&lt;/a&gt; (it is not directly applicable to the master tree, but gives an idea of how they did it).&lt;/p&gt;

&lt;p&gt;I imagine it might have been much easier to just send a specially crafted ldlm callback to let the client know we are evicting it (this would require far fewer infrastructure changes), but that would not handle the case of severed communication between a particular server and client where the MGS connectivity of both remains unaffected.&lt;/p&gt;

&lt;p&gt;Additionally, since the case Fujitsu outlines as most severe is that of a newly started application, a possible workaround is to run &quot;df&quot; before a new job starts from whatever job-scheduling framework is in place, but there is still a feeling that this case should be handled more transparently inside Lustre.&lt;/p&gt;</comment>
                            <comment id="53267" author="rread" created="Mon, 4 Mar 2013 11:42:55 +0000"  >&lt;p&gt;My first thought was that this does seem like a special case of imperative recovery, but limited to a specific client, and we could call it &quot;imperative reconnect.&quot; &lt;img class=&quot;emoticon&quot; src=&quot;https://jira.whamcloud.com/images/icons/emoticons/smile.png&quot; height=&quot;16&quot; width=&quot;16&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;  But perhaps a simpler ldlm callback is sufficient since if there is a network split the client wouldn&apos;t be able to reconnect anyway.&lt;/p&gt;

&lt;p&gt;Do we understand why these seemingly idle clients are being evicted in the first place? Is there an issue there?&lt;/p&gt;</comment>
                            <comment id="53303" author="green" created="Mon, 4 Mar 2013 19:26:10 +0000"  >&lt;p&gt;There might be many reasons for a reconnect, I guess, and all of them are valid one way or another, like a one-off AST loss or such.&lt;/p&gt;</comment>
                            <comment id="53305" author="rread" created="Mon, 4 Mar 2013 21:14:45 +0000"  >&lt;p&gt;I agree that reconnects are probably valid, but I&apos;m not sure all evicts are necessarily valid or unavoidable. If they are occurring frequently then we should at least try to find out what is causing them. &lt;/p&gt;</comment>
                            <comment id="53974" author="nozaki" created="Thu, 14 Mar 2013 01:18:18 +0000"  >&lt;p&gt;Hi, Robert.&lt;/p&gt;

&lt;p&gt;I&apos;ve often seen lots of clients evicted while a server is recovering. It appears that a server cannot keep up with the great number of incoming reconnect requests, about 90k * (target-disks).&lt;br/&gt;
As a result, clients whose reconnect requests haven&apos;t been handled by the server are evicted.&lt;/p&gt;</comment>
                            <comment id="53977" author="rread" created="Thu, 14 Mar 2013 02:04:55 +0000"  >&lt;p&gt;I see. Well, that&apos;s not ideal, but at least we know what the reason is. &lt;img class=&quot;emoticon&quot; src=&quot;https://jira.whamcloud.com/images/icons/emoticons/smile.png&quot; height=&quot;16&quot; width=&quot;16&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/p&gt;

&lt;p&gt;BTW, if the clients are not pinging, how did they all know to reconnect to the recovering server?&lt;/p&gt;</comment>
                            <comment id="53986" author="nozaki" created="Thu, 14 Mar 2013 03:38:05 +0000"  >&lt;p&gt;Recovering servers try to retrieve clients&apos; information from last_rcvd files and see if they&apos;ve been connected. Next, the servers send callback pings to those clients in order to make them reconnect.&lt;br/&gt;
This is the basic recovery behavior in FEFS, though lots of trivial&lt;img class=&quot;emoticon&quot; src=&quot;https://jira.whamcloud.com/images/icons/emoticons/help_16.png&quot; height=&quot;16&quot; width=&quot;16&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt; functions are included in it.&lt;/p&gt;

&lt;p&gt;Oh, and I want you to know one thing: when we handle a large system like K, pings often eat up LNet resources such as credits&lt;img class=&quot;emoticon&quot; src=&quot;https://jira.whamcloud.com/images/icons/emoticons/help_16.png&quot; height=&quot;16&quot; width=&quot;16&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt; ... I&apos;m not so good there, though ... so I think you&apos;ll need a measure against that problem. That is why we restrict the number of callback-ping retries to 5.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="16898">LU-2467</issuekey>
        </issuelink>
                            </outwardlinks>
                                                                <inwardlinks description="is related to">
                                                        </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                    <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzvk4f:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>6985</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>