<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:15:49 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary, append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-1344] Evicted Clients</title>
                <link>https://jira.whamcloud.com/browse/LU-1344</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Customer reported a number of clients were evicted.  All clients had difficulties communicating with OSTs on a single OSS.  Johann has looked at the client logs but I did not have server logs at the time.  I now have the server logs.  I have attached them to this ticket.  I need recommendations on how to prevent this from happening in the future.&lt;/p&gt;

&lt;p&gt;Should I consider changing the OBD timeout from the default 100s?&lt;/p&gt;

&lt;p&gt;Should I consider reducing the number of OST service threads (default 256)?&lt;/p&gt;</description>
                <environment>CentOS 5.5 on Lustre servers&lt;br/&gt;
RHEL 6.1 on clients</environment>
        <key id="14183">LU-1344</key>
            <summary>Evicted Clients</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="5">Cannot Reproduce</resolution>
                                        <assignee username="liang">Liang Zhen</assignee>
                                    <reporter username="dnelson@ddn.com">Dennis Nelson</reporter>
                        <labels>
                    </labels>
                <created>Thu, 26 Apr 2012 12:43:09 +0000</created>
                <updated>Fri, 19 Oct 2012 10:45:47 +0000</updated>
                            <resolved>Fri, 19 Oct 2012 10:45:47 +0000</resolved>
                                    <version>Lustre 1.8.6</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>3</watches>
                                                                            <comments>
                            <comment id="35512" author="johann" created="Thu, 26 Apr 2012 13:14:58 +0000"  >&lt;p&gt;The server logs confirm that there are network problems.&lt;br/&gt;
Liang (our lnet expert) is having another look at the logs just in case.&lt;/p&gt;</comment>
                            <comment id="35515" author="cliffw" created="Thu, 26 Apr 2012 13:21:34 +0000"  >&lt;p&gt;obd timeout is now automatically adjusted, there should be no need to change this. &lt;br/&gt;
You can check for the file &apos;timeouts&apos; - there is one for various services under /proc/fs/lustre. &lt;br/&gt;
That file provides a history and will show if there are timeout issues. &lt;/p&gt;

&lt;p&gt;The errors appear to show a network/other failure rather than a server overload. The client evictions&lt;br/&gt;
track to errors on the server. &lt;br/&gt;
There are only LustreErrors in the server logs. Are there any indications of a network failure?&lt;br/&gt;
What was the load on the server when the clients dropped connections? &lt;br/&gt;
I would suggest upgrading to Lustre 1.8.7 as there are improvements in that release. &lt;/p&gt;</comment>
                            <comment id="35521" author="dnelson@ddn.com" created="Thu, 26 Apr 2012 13:48:56 +0000"  >&lt;p&gt;So, I understand that obd timeout is mostly deprecated with the introduction of adaptive timeouts.  That is the reason it is still set to the default.  A coworker pointed out the following passage from the manual and suggested that it might be helpful to increase it:&lt;/p&gt;

&lt;p&gt;In previous Lustre versions, the static obd_timeout (/proc/sys/lustre/timeout) value was used as the maximum completion time for all RPCs; this value also affected the client-server ping interval and initial recovery timer. Now, with adaptive timeouts, obd_timeout is only used for the ping interval and initial recovery estimate. When a client reconnects during recovery, the server uses the client&apos;s timeout value to reset the recovery wait period; i.e., the server learns how long the client had been willing to wait, and takes this into account when adjusting the recovery period.&lt;/p&gt;

&lt;p&gt;I found that there is some sar data being collected and cpu idle time never went below 90%.  The reason I am thinking that this might be load related is based on the following server entry:&lt;/p&gt;

&lt;p&gt;Apr 20 20:05:14 lfs-oss-1-13 kernel: Lustre: Service thread pid 32236 completed after 278.00s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).&lt;/p&gt;

&lt;p&gt;Also, the clients did not have any issues with OSTs on any of the other servers during this time.&lt;/p&gt;

&lt;p&gt;All kernel errors are directed to /var/log/kern.log.  You got that entire file.  /var/log/messages is empty of any messages at all after the last boot 12 days ago.  I have not seen any logs on the server that indicate there was a network issue at the time.&lt;/p&gt;
</comment>
                            <comment id="35543" author="pjones" created="Fri, 27 Apr 2012 01:48:35 +0000"  >&lt;p&gt;Liang is reviewing the logs&lt;/p&gt;</comment>
                            <comment id="38084" author="dnelson@ddn.com" created="Thu, 3 May 2012 10:19:02 +0000"  >&lt;p&gt;Any update on the review of the logs?&lt;/p&gt;</comment>
                            <comment id="44356" author="kitwestneat" created="Fri, 7 Sep 2012 09:49:26 +0000"  >&lt;p&gt;We haven&apos;t seen this since, we can close it. &lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                            <attachment id="11246" name="lustre-client-errors.txt" size="18918" author="dnelson@ddn.com" created="Thu, 26 Apr 2012 12:43:09 +0000"/>
                            <attachment id="11247" name="lustre-mds-errors.txt" size="16515" author="dnelson@ddn.com" created="Thu, 26 Apr 2012 12:43:09 +0000"/>
                            <attachment id="11248" name="lustre-server-errors.txt" size="864902" author="dnelson@ddn.com" created="Thu, 26 Apr 2012 12:43:09 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                    <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzv39z:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>4030</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>