<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:49:14 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-5183] If Adaptive Timeout is set with at_max = 600, does ldlm_timeout take effect or is it overruled?</title>
                <link>https://jira.whamcloud.com/browse/LU-5183</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Hi &lt;/p&gt;

&lt;p&gt;I&apos;d like an explanation of which timeout value is being exceeded when these evictions occur. What does the &quot;227 seconds&quot; refer to, that is, which timeout is being applied: ldlm_timeout, obd_timeout, /proc/sys/lustre/timeout, at_min, or at_max?&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;May 14 05:37:59 dc2oss15 kernel: : Lustre: dc2-OST009c: haven&apos;t heard from client ac9ef944-83f6-a453-c821-f0067101d2ca (at 149.165.229.28@tcp) in 227 seconds. I think it&apos;s dead, and I am evicting it. exp ffff880bb8ff2800, cur 1400060279 expire 1400060129 last 1400060052
May 14 05:37:59 dc2oss15 kernel: : Lustre: Skipped 9 previous similar messages
May 14 05:38:02 dc2oss12 kernel: : Lustre: dc2-OST007b: haven&apos;t heard from client ac9ef944-83f6-a453-c821-f0067101d2ca (at 149.165.229.28@tcp) in 227 seconds. I think it&apos;s dead, and I am evicting it. exp ffff88169eecb400, cur 1400060282 expire 1400060132 last 1400060055
May 14 05:38:02 dc2oss12 kernel: : Lustre: Skipped 8 previous similar messages
May 14 05:37:53 dc2oss04 kernel: : Lustre: dc2-OST0021: haven&apos;t heard from client ac9ef944-83f6-a453-c821-f0067101d2ca (at 149.165.229.28@tcp) in 227 seconds. I think it&apos;s dead, and I am evicting it. exp ffff880bd683c400, cur 1400060273 expire 1400060123 last 1400060046
May 14 05:37:53 dc2oss04 kernel: : Lustre: Skipped 8 previous similar messages
May 14 05:37:58 dc2oss05 kernel: : Lustre: dc2-OST002c: haven&apos;t heard from client ac9ef944-83f6-a453-c821-f0067101d2ca (at 149.165.229.28@tcp) in 227 seconds. I think it&apos;s dead, and I am evicting it. exp ffff88154cc64000, cur 1400060278 expire 1400060128 last 1400060051
May 14 05:37:58 dc2oss05 kernel: : Lustre: Skipped 9 previous similar messages
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In particular, I&apos;m interested in knowing whether the ldlm_timeout values, which are 20s for OSTs and 6s for the MDT, are in play given that we have adaptive timeouts enabled (at_max = 600) and /proc/sys/lustre/timeout=100.&lt;/p&gt;

&lt;p&gt;Should we consider increasing the ldlm_timeouts if they are in fact being used? Should we consider setting at_min to 60-70s to allow time for slow client responses?&lt;/p&gt;

&lt;p&gt;If so, how do those settings help, and what difference do they make?&lt;/p&gt;

&lt;p&gt;See sections 2.2.2 and 2.2.8 in Cory Spitz&apos;s paper here:&lt;br/&gt;
&lt;a href=&quot;https://cug.org/5-publications/proceedings_attendee_lists/CUG11CD/pages/1-program/final_program/Wednesday/12A-Spitz-Paper.pdf&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://cug.org/5-publications/proceedings_attendee_lists/CUG11CD/pages/1-program/final_program/Wednesday/12A-Spitz-Paper.pdf&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Thank You,&lt;br/&gt;
                   Manish&lt;/p&gt;</description>
                <environment>Lustre Server 2.1.6 &lt;br/&gt;
Lustre Client 1.8.9</environment>
        <key id="25126">LU-5183</key>
            <summary>If Adaptive Timeout is set with at_max = 600, does ldlm_timeout take effect or is it overruled?</summary>
                <type id="9" iconUrl="https://jira.whamcloud.com/images/icons/issuetypes/undefined.png">Question/Request</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="3">Duplicate</resolution>
                                        <assignee username="emoly.liu">Emoly Liu</assignee>
                                    <reporter username="manish">Manish Patel</reporter>
                        <labels>
                    </labels>
                <created>Thu, 12 Jun 2014 15:53:45 +0000</created>
                <updated>Mon, 21 Jul 2014 13:48:40 +0000</updated>
                            <resolved>Mon, 21 Jul 2014 13:48:40 +0000</resolved>
                                                                        <due></due>
                            <votes>0</votes>
                                    <watches>3</watches>
                                                                            <comments>
                            <comment id="86590" author="pjones" created="Fri, 13 Jun 2014 17:59:25 +0000"  >&lt;p&gt;Emoly&lt;/p&gt;

&lt;p&gt;Could you please help on this one?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="86652" author="emoly.liu" created="Mon, 16 Jun 2014 03:38:54 +0000"  >&lt;p&gt;Hi Manish,&lt;/p&gt;

&lt;p&gt;On the first question, the &quot;227 seconds&quot; in the eviction message is related to obd_timeout. The relevant code is:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;                expire_time = cfs_time_current_sec() - PING_EVICT_TIMEOUT;
...  
                        &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (expire_time &amp;gt; exp-&amp;gt;exp_last_request_time) {
                                class_export_get(exp);
                                cfs_spin_unlock(&amp;amp;obd-&amp;gt;obd_dev_lock);
                                 LCONSOLE_WARN(&lt;span class=&quot;code-quote&quot;&gt;&quot;%s: haven&apos;t heard from client %s&quot;&lt;/span&gt;
                                              &lt;span class=&quot;code-quote&quot;&gt;&quot; (at %s) in %ld seconds. I think&quot;&lt;/span&gt;
                                              &lt;span class=&quot;code-quote&quot;&gt;&quot; it&apos;s dead, and I am evicting&quot;&lt;/span&gt;
                                              &lt;span class=&quot;code-quote&quot;&gt;&quot; it. exp %p, cur %ld expire %ld&quot;&lt;/span&gt;
                                              &lt;span class=&quot;code-quote&quot;&gt;&quot; last %ld\n&quot;&lt;/span&gt;,
                                              obd-&amp;gt;obd_name,
                                              obd_uuid2str(&amp;amp;exp-&amp;gt;exp_client_uuid),
                                              obd_export_nid2str(exp),
                                              (&lt;span class=&quot;code-object&quot;&gt;long&lt;/span&gt;)(cfs_time_current_sec() -
                                                     exp-&amp;gt;exp_last_request_time),
                                              exp, (&lt;span class=&quot;code-object&quot;&gt;long&lt;/span&gt;)cfs_time_current_sec(),
                                              (&lt;span class=&quot;code-object&quot;&gt;long&lt;/span&gt;)expire_time,
                                              (&lt;span class=&quot;code-object&quot;&gt;long&lt;/span&gt;)exp-&amp;gt;exp_last_request_time);
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;and&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;#define PING_INTERVAL max(obd_timeout / 4, 1U)
/* Client may skip 1 ping; we must wait at least 2.5. But &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; multiple
 * failover targets the client only pings one server at a time, and pings
 * can be lost on a loaded network. Since eviction has serious consequences,
 * and there&apos;s no urgent need to evict a client just because it&apos;s idle, we
 * should be very conservative here. */
#define PING_EVICT_TIMEOUT (PING_INTERVAL * 6)
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;From the log above we can see that PING_EVICT_TIMEOUT is 150 seconds, and that the time between the last request received from the client and the OST&apos;s ping eviction check is 227 seconds.&lt;br/&gt;
For example,&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;May 14 05:37:59 dc2oss15 kernel: : Lustre: dc2-OST009c: haven&apos;t heard from client ac9ef944-83f6-a453-c821-f0067101d2ca (at 149.165.229.28@tcp) in 227 seconds. I think it&apos;s dead, and I am evicting it. exp ffff880bb8ff2800, cur 1400060279 expire 1400060129 last 1400060052&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;PING_EVICT_TIMEOUT = cur - expire = 1400060279 - 1400060129 = 150, and the reported silence is cur - last = 1400060279 - 1400060052 = 227.&lt;/p&gt;
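&lt;p&gt;The arithmetic can be checked with a minimal sketch in C. The function names below are hypothetical stand-ins, not Lustre functions; they only mirror the two macros and the eviction check quoted above, and the value obd_timeout = 100 is taken from the /proc/sys/lustre/timeout setting reported for this system.&lt;/p&gt;

```c
/* Hypothetical helpers (not Lustre code) mirroring the quoted macros. */

/* PING_INTERVAL: max(obd_timeout / 4, 1U) */
unsigned int ping_interval(unsigned int obd_timeout)
{
        unsigned int quarter = obd_timeout / 4;

        return quarter > 1U ? quarter : 1U;
}

/* PING_EVICT_TIMEOUT: PING_INTERVAL * 6 */
unsigned int ping_evict_timeout(unsigned int obd_timeout)
{
        return ping_interval(obd_timeout) * 6;
}

/* Seconds of silence printed in the console message: cur - last. */
long silence_seconds(long cur, long last)
{
        return cur - last;
}

/* Eviction condition from the quoted check: expire_time > last request
 * time, where expire_time = cur - PING_EVICT_TIMEOUT. Returns 1 or 0. */
int should_evict(long cur, long last, unsigned int obd_timeout)
{
        long expire_time = cur - (long)ping_evict_timeout(obd_timeout);

        return expire_time > last;
}
```

&lt;p&gt;Calling should_evict(1400060279, 1400060052, 100) with the values from the first log line returns 1, because the computed expire time 1400060129 is later than the last request time.&lt;/p&gt;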

&lt;p&gt;As for the other question about adaptive timeout and ldlm_timeout, I need to check the code and the document, and then give a reply.&lt;/p&gt;</comment>
                            <comment id="87206" author="manish" created="Fri, 20 Jun 2014 20:23:49 +0000"  >&lt;p&gt;Hi Emoly,&lt;/p&gt;

&lt;p&gt;If the 227 seconds is related to obd_timeout, which setting needs to be changed so that a client can stay silent for up to 300 seconds before being evicted? In other words, how can PING_EVICT_TIMEOUT be increased to 300 seconds?&lt;/p&gt;

&lt;p&gt;On the second question, please let me know if you have any updates on ldlm_timeout and adaptive timeouts after looking at the code.&lt;/p&gt;

&lt;p&gt;Thank you,&lt;br/&gt;
                  Manish&lt;/p&gt;</comment>
                            <comment id="87257" author="emoly.liu" created="Mon, 23 Jun 2014 03:05:23 +0000"  >&lt;p&gt;Hi Manish,&lt;br/&gt;
According to the following code, obd_timeout is /proc/sys/lustre/timeout. &lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;        {
                .ctl_name = OBD_TIMEOUT,
                .procname = &lt;span class=&quot;code-quote&quot;&gt;&quot;timeout&quot;&lt;/span&gt;,
                .data     = &amp;amp;obd_timeout,
                .maxlen   = sizeof(&lt;span class=&quot;code-object&quot;&gt;int&lt;/span&gt;),
                .mode     = 0644,
                .proc_handler = &amp;amp;proc_set_timeout
        },
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;So increasing /proc/sys/lustre/timeout should increase PING_EVICT_TIMEOUT.&lt;/p&gt;
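&lt;p&gt;As a rough guide to which value to write, here is a minimal sketch in C. Both function names are hypothetical, and the arithmetic assumes only the PING_INTERVAL and PING_EVICT_TIMEOUT macros quoted earlier in this ticket. Under that assumption, a timeout of 200 gives a PING_EVICT_TIMEOUT of 300 seconds.&lt;/p&gt;

```c
/* Hypothetical helper: PING_EVICT_TIMEOUT for a given obd_timeout,
 * i.e. max(obd_timeout / 4, 1U) * 6. */
unsigned int evict_timeout_for(unsigned int obd_timeout)
{
        unsigned int interval = obd_timeout / 4;

        if (interval == 0)
                interval = 1;
        return interval * 6;
}

/* Hypothetical helper: an obd_timeout whose eviction window is at least
 * target seconds. For targets above 6 seconds this is the smallest such
 * value. */
unsigned int min_obd_timeout_for(unsigned int target)
{
        unsigned int interval = (target + 5) / 6;   /* ceil(target / 6) */

        if (interval == 0)
                interval = 1;
        return interval * 4;
}
```

&lt;p&gt;For example, min_obd_timeout_for(300) returns 200, and evict_timeout_for(200) returns 300, while the current setting of 100 yields 150.&lt;/p&gt;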


&lt;p&gt;On the second question: ldlm_timeout is a static timeout, namely the time a server waits for a client to reply to an initial lock cancellation request. It should be smaller than obd_timeout, which the following check enforces:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;        &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (ldlm_timeout &amp;gt;= obd_timeout)
                ldlm_timeout = max(obd_timeout / 3, 1U);
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
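&lt;p&gt;To see what this check does with concrete numbers, here is a minimal stand-alone sketch in C. The function name is hypothetical; it only wraps the two quoted lines so they can be evaluated outside the kernel.&lt;/p&gt;

```c
/* Hypothetical stand-alone version of the quoted check: if ldlm_timeout is
 * not smaller than obd_timeout, reset it to max(obd_timeout / 3, 1U). */
unsigned int clamp_ldlm_timeout(unsigned int ldlm_timeout,
                                unsigned int obd_timeout)
{
        if (ldlm_timeout >= obd_timeout) {
                unsigned int third = obd_timeout / 3;

                ldlm_timeout = third > 1U ? third : 1U;
        }
        return ldlm_timeout;
}
```

&lt;p&gt;With obd_timeout = 100, the values 20 and 6 mentioned in this ticket pass through unchanged, whereas a value of 120 would be reset to 33.&lt;/p&gt;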

&lt;p&gt;ldlm_timeout is still in play even when adaptive timeouts are enabled.&lt;br/&gt;
In the OST code,&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;&lt;span class=&quot;code-keyword&quot;&gt;static&lt;/span&gt; inline &lt;span class=&quot;code-object&quot;&gt;int&lt;/span&gt; prolong_timeout(struct ptlrpc_request *req)
{       
        struct ptlrpc_service *svc = req-&amp;gt;rq_rqbd-&amp;gt;rqbd_service;

        &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (AT_OFF)
                &lt;span class=&quot;code-keyword&quot;&gt;return&lt;/span&gt; obd_timeout / 2; 
        
        &lt;span class=&quot;code-keyword&quot;&gt;return&lt;/span&gt; max(at_est2timeout(at_get(&amp;amp;svc-&amp;gt;srv_at_estimate)), ldlm_timeout);
}       
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
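&lt;p&gt;The effect is that ldlm_timeout acts as a floor under the adaptive estimate. A minimal sketch in C, under stated assumptions: the function name is hypothetical, at_off stands in for AT_OFF, and at_timeout stands in for the result of at_est2timeout(at_get(...)), which is passed in here rather than computed.&lt;/p&gt;

```c
/* Hypothetical stand-alone model of the quoted prolong_timeout().
 * at_off:     nonzero when adaptive timeouts are disabled (AT_OFF)
 * at_timeout: timeout derived from the adaptive estimate, supplied by the
 *             caller instead of being read from the service */
unsigned int prolong_timeout_model(int at_off, unsigned int at_timeout,
                                   unsigned int obd_timeout,
                                   unsigned int ldlm_timeout)
{
        if (at_off)
                return obd_timeout / 2;

        /* max(adaptive timeout, ldlm_timeout) */
        return at_timeout > ldlm_timeout ? at_timeout : ldlm_timeout;
}
```

&lt;p&gt;For instance, with adaptive timeouts on, an adaptive value of 10 and ldlm_timeout = 20 give 20, while an adaptive value of 45 gives 45. With adaptive timeouts off and obd_timeout = 100 the result is 50.&lt;/p&gt;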
&lt;p&gt;In the mdt_init0() code,&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;        /* Reduce the initial timeout on an MDS because it doesn&apos;t need such
         * a &lt;span class=&quot;code-object&quot;&gt;long&lt;/span&gt; timeout as an OST does. Adaptive timeouts will adjust &lt;span class=&quot;code-keyword&quot;&gt;this&lt;/span&gt;
         * value appropriately. */
        &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (ldlm_timeout == LDLM_TIMEOUT_DEFAULT)
                ldlm_timeout = MDS_LDLM_TIMEOUT_DEFAULT;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;As Cory&apos;s paper shows, Cray configures both the minimum adaptive timeout, at_min, and ldlm_timeout to 70 seconds to allow Lustre to &#8220;ride through&#8221; the re-route for the Gemini feature. And yes, you can set at_min to 60-70s to allow time for slow client responses.&lt;/p&gt;</comment>
                            <comment id="89517" author="pjones" created="Fri, 18 Jul 2014 17:58:40 +0000"  >&lt;p&gt;Emoly&lt;/p&gt;

&lt;p&gt;DDN have suggested that we include this material in the Lustre manual. Could you please create an LUDOC ticket to track that?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="89599" author="emoly.liu" created="Mon, 21 Jul 2014 01:26:21 +0000"  >&lt;p&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LUDOC-250&quot; title=&quot;More explanation of several kinds of timeout setting&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LUDOC-250&quot;&gt;LUDOC-250&lt;/a&gt; has been created to track the Lustre manual update.&lt;/p&gt;</comment>
                            <comment id="89612" author="pjones" created="Mon, 21 Jul 2014 13:48:40 +0000"  >&lt;p&gt;Closing ticket as the remaining doc work will be handled under &lt;a href=&quot;https://jira.whamcloud.com/browse/LUDOC-250&quot; title=&quot;More explanation of several kinds of timeout setting&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LUDOC-250&quot;&gt;LUDOC-250&lt;/a&gt;&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="25664">LUDOC-250</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzwoav:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>14384</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                </customfields>
    </item>
</channel>
</rss>