<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:14:50 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-1239] cascading client evictions</title>
                <link>https://jira.whamcloud.com/browse/LU-1239</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;I recently found the following scenario, which can lead to cascading client reconnects, lock timeouts, and evictions.&lt;/p&gt;

&lt;p&gt;1. The MDS is overloaded with enqueues, which consume all the threads on the MDS_REQUEST portal. &lt;br/&gt;
2. An RPC times out on one client, which causes that client to reconnect. The client still has locks to cancel, and the MDS is waiting for them. &lt;br/&gt;
3. The client sends MDS_CONNECT, but there is no free thread to handle it. &lt;br/&gt;
4. Other clients are waiting for their enqueue completions and ping the MDS to check that it is still alive. PING is also sent to the MDS_REQUEST portal. Although PING is a high-priority RPC, the service has no high-priority handler (srv_hpreq_handler == NULL), so a second thread is not reserved for high-priority RPCs on such services:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;static int ptlrpc_server_allow_normal(struct ptlrpc_service *svc, int force)
{
#ifndef __KERNEL__
        if (1) /* always allow to handle normal request for liblustre */
                return 1;
#endif
        if (force ||
            svc-&amp;gt;srv_n_active_reqs &amp;lt; svc-&amp;gt;srv_threads_running - 2)
                return 1;

        if (svc-&amp;gt;srv_n_active_reqs &amp;gt;= svc-&amp;gt;srv_threads_running - 1)
                return 0;

        return svc-&amp;gt;srv_n_active_hpreq &amp;gt; 0 || svc-&amp;gt;srv_hpreq_handler == NULL;
}
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;5. With no thread to handle pings, the RPCs of the other clients time out as well. &lt;br/&gt;
6. Once one LDLM lock times out, an enqueue completes and an MDS_CONNECT can be handled. However, that client is likely to still have an enqueue RPC being processed on the MDS, so it gets EBUSY and retries only after some delay, while the other clients try to reconnect and consume the MDS threads with enqueues again. &lt;br/&gt;
This is being discussed in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7&quot; title=&quot;Reconnect server-&amp;gt;client connection&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7&quot;&gt;&lt;del&gt;LU-7&lt;/del&gt;&lt;/a&gt;, but it is not the main issue here.&lt;/p&gt;
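
&lt;p&gt;The reservation gap described in step 4 can be illustrated with a minimal user-space sketch. It is a model under stated assumptions, not the patch under review: struct svc_sketch, allow_normal_current() and allow_normal_reserved() are hypothetical names, only the kernel, non-forced path of ptlrpc_server_allow_normal() is modelled, and the second function is just one possible reading of reserving an extra thread.&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;
```c
/* Hypothetical, simplified stand-in for struct ptlrpc_service. */
struct svc_sketch {
        int threads_running;   /* service threads currently running */
        int n_active_reqs;     /* requests being handled right now */
        int n_active_hpreq;    /* high-priority requests being handled */
        int has_hpreq_handler; /* models srv_hpreq_handler != NULL */
};

/* Same decision as the quoted function (kernel build, force == 0). */
static int allow_normal_current(const struct svc_sketch *s)
{
        /* At least three threads are idle: always allow. */
        if (s->threads_running - 2 > s->n_active_reqs)
                return 1;

        /* Only one thread is idle: never give it to a normal request. */
        if (s->n_active_reqs >= s->threads_running - 1)
                return 0;

        /* Two threads are idle: a service without a high-priority
         * handler still lets a normal request take one of them. */
        return s->n_active_hpreq > 0 || !s->has_hpreq_handler;
}

/* Assumed variant: hold the second idle thread back on every service,
 * whether or not a high-priority handler is registered. */
static int allow_normal_reserved(const struct svc_sketch *s)
{
        if (s->threads_running - 2 > s->n_active_reqs)
                return 1;

        if (s->n_active_reqs >= s->threads_running - 1)
                return 0;

        return s->n_active_hpreq > 0;
}
```
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;With 8 running threads, 6 active requests and no high-priority request in flight, allow_normal_current() returns 1 for a service without a high-priority handler, so a normal enqueue may take the second-to-last thread. allow_normal_reserved() returns 0 in the same state and keeps that thread free.&lt;/p&gt;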

&lt;p&gt;Fixes: &lt;br/&gt;
1) Reserve an extra thread on services that expect PINGs. &lt;br/&gt;
2) Make CONNECTs high-priority RPCs.&lt;br/&gt;
3) &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7&quot; title=&quot;Reconnect server-&amp;gt;client connection&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7&quot;&gt;&lt;del&gt;LU-7&lt;/del&gt;&lt;/a&gt; to address (6).&lt;/p&gt;</description>
                <environment></environment>
        <key id="13626">LU-1239</key>
            <summary>cascading client evictions</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="wc-triage">WC Triage</assignee>
                                    <reporter username="vitaly_fertman">Vitaly Fertman</reporter>
                        <labels>
                    </labels>
                <created>Tue, 20 Mar 2012 11:41:35 +0000</created>
                <updated>Tue, 17 Mar 2015 17:54:32 +0000</updated>
                            <resolved>Wed, 25 Jul 2012 10:56:07 +0000</resolved>
                                                    <fixVersion>Lustre 2.3.0</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>6</watches>
                                                                            <comments>
                            <comment id="31608" author="vitaly_fertman" created="Tue, 20 Mar 2012 15:03:10 +0000"  >&lt;p&gt;&lt;a href=&quot;http://review.whamcloud.com/2355&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/2355&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="33563" author="adilger" created="Thu, 5 Apr 2012 16:32:16 +0000"  >&lt;p&gt;What version of Lustre hit this problem, and what kind of workload blocked all of the MDS threads?&lt;/p&gt;</comment>
                            <comment id="33613" author="nrutman" created="Thu, 5 Apr 2012 18:52:17 +0000"  >&lt;p&gt;Lustre 2.1 server, Lustre 1.8.6 clients.  &lt;br/&gt;
4000 nodes simultaneously trying to mkdir -p the same directory in the lustre root dir and create a (distinct) file in that dir.&lt;/p&gt;

&lt;blockquote&gt;&lt;p&gt;The MDS_CONNECT and OST_CONNECT RPCs should only be high priority if they are reconnects, not if they are initial connects. Please add a check for MSG_CONNECT_RECONNECT instead of just making all CONNECT requests high priority.&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;Why can&apos;t they all be high priority?  They should take 0 time to process on the MDS relative to anything else, and we want a responsive client mount command.  And it&apos;s simpler code.&lt;/p&gt;</comment>
                            <comment id="39090" author="nrutman" created="Fri, 18 May 2012 19:08:48 +0000"  >&lt;p&gt;This patch has been sitting here for two months with no review - what should we do with it?&lt;/p&gt;</comment>
                            <comment id="39091" author="nrutman" created="Fri, 18 May 2012 19:11:22 +0000"  >&lt;p&gt;sorry, one month.  I didn&apos;t realize it had gone through some revisions.&lt;/p&gt;</comment>
                            <comment id="42244" author="pjones" created="Wed, 25 Jul 2012 10:56:07 +0000"  >&lt;p&gt;Landed for 2.3&lt;/p&gt;</comment>
                            <comment id="42622" author="morrone" created="Thu, 2 Aug 2012 18:49:53 +0000"  >&lt;p&gt;Is this patch suitable for 2.1?  I think we&apos;re seeing the same cascading client evictions there.&lt;/p&gt;</comment>
                            <comment id="42637" author="spitzcor" created="Fri, 3 Aug 2012 01:05:31 +0000"  >&lt;p&gt;Chris, yes the patch is suitable for 2.1.  Cray initially found this bug on 2.1 and Vitaly developed the fix for Xyratex&apos;s 2.1+patches: &lt;a href=&quot;https://github.com/Xyratex/lustre-stable/commit/afcf3cf1091c67d076ef36dc0d73cd649f84421e&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://github.com/Xyratex/lustre-stable/commit/afcf3cf1091c67d076ef36dc0d73cd649f84421e&lt;/a&gt;.&lt;/p&gt;</comment>
                            <comment id="48203" author="nrutman" created="Wed, 21 Nov 2012 15:27:23 +0000"  >&lt;p&gt;Xyratex &lt;a href=&quot;http://jira-nss.xy01.xyratex.com:8080/browse/MRP-455&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;MRP-455&lt;/a&gt;&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="10080">LU-7</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzv6d3:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>4565</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>