<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:09:16 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-667] Experiencing sluggish, intermittently unresponsive, and OOM killed MDS nodes</title>
                <link>https://jira.whamcloud.com/browse/LU-667</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;We are experiencing major MDS problems that are greatly affecting the stability of our Lustre filesystem.  There have been no changes to the fundamental configuration or setup of our storage cluster that we can point the finger at.&lt;/p&gt;

&lt;p&gt;The general symptoms are that the load on the active MDS node is unusually high and filesystem access hangs intermittently.  Logged into the active MDS node, we noticed that the command line also intermittently hangs.  We noticed that the ptlrpcd process was pegged at 100%+ cpu usage followed by ~50% cpu usage for the kiblnd_sd_* processes.  Furthermore, the iowait time is less than 1% while system time ranges from 25%-80%.  It appears that the active MDS is spinning as quickly as it can dealing with some kind of RPC traffic coming in over the IB lnd.  So far we haven&apos;t been able to isolate the traffic involved.  In one isolation step we took offline all the Lnet routers feeding in from the compute clusters, and the MDS was still churning vigorously in ptlrpcd and kiblnd processes.  Another symptom we are seeing now is that when an MDS node becomes active and starts trying to serve clients, we can watch the node rather quickly consume all available memory via Slab allocations and then die an OOM death.  Some other observations:&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;OSTs evicting the mdtlov over the IB path&lt;/li&gt;
	&lt;li&gt;fun &apos;sluggish network&apos; log messages like:  ===&amp;gt; Sep  7 12:39:21 amds1 LustreError: 13000:0:(import.c:357:ptlrpc_invalidate_import()) scratch2-OST0053_UUID: RPCs in &quot;Unregistering&quot; phase found (0). Network is sluggish? Waiting them to error out. &amp;lt;===&lt;/li&gt;
	&lt;li&gt;MGS evicting itself over localhost connection: ==&amp;gt; Sep  7 12:39:14 amds1 Lustre: MGS: client 8337bacb-6b62-b0f0-261d-53678e2e56a9 (at 0@lo) evicted, unresponsive for 227s &amp;lt;==&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;At this point we have been through 3 or more MDS failover sequences and we also rebooted all the StorP Lustre servers and restarted the filesystem cleanly to see if that would clean things up.&lt;/p&gt;

&lt;p&gt;We have syslog and Lustre debug message logs from various phases of debugging this.  I&apos;m not sure at this point what logs will be the most useful, but after I submit this issue I&apos;ll attach some files.&lt;/p&gt;</description>
                <environment>StorP Storage Cluster: Dell R710 servers (20 OSS, 2 MDS), IB direct connected DDN99k storage on OSSes, FC direct attached DDN EF3000 storage on MDS, 24GB per server, dual socket 8 core Nehalem. StorP is dual-homed for Lustre clients with DDR IB and 10 Gig Ethernet via Chelsio T3 adapters.  StorP is configured for failover MDS and OSS pairs with multipath.&lt;br/&gt;
&lt;br/&gt;
StorP is running TOSS 1.4-2 (chaos 4.4-2) which includes:&lt;br/&gt;
lustre-1.8.5.0-3chaos_2.6.18_105chaos.ch4.4&lt;br/&gt;
lustre-modules-1.8.5.0-3chaos_2.6.18_105chaos.ch4.4&lt;br/&gt;
chaos-kernel-2.6.18-105chaos&lt;br/&gt;
&lt;br/&gt;
Multiple compute clusters interconnect to StorP via a set of IB(client)-to-IB(server) lnet routers and a set of IB(client)-to-10gig(server) lnet routers.  The IB-to-IB lnet routers deal with &amp;lt;300 Lustre client nodes.  The IB-to-10gig routers deal with ~2700 Lustre client nodes.</environment>
        <key id="11701">LU-667</key>
            <summary>Experiencing sluggish, intermittently unresponsive, and OOM killed MDS nodes</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="1" iconUrl="https://jira.whamcloud.com/images/icons/priorities/blocker.svg">Blocker</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="2">Won&apos;t Fix</resolution>
                                        <assignee username="cliffw">Cliff White</assignee>
                                    <reporter username="jbogden">jbogden</reporter>
                        <labels>
                            <label>o2iblnd</label>
                    </labels>
                <created>Wed, 7 Sep 2011 16:36:15 +0000</created>
                <updated>Thu, 5 Jan 2012 18:28:11 +0000</updated>
                            <resolved>Thu, 5 Jan 2012 18:28:11 +0000</resolved>
                                                                        <due></due>
                            <votes>0</votes>
                                    <watches>10</watches>
                                                                            <comments>
                            <comment id="20017" author="pjones" created="Wed, 7 Sep 2011 17:54:26 +0000"  >&lt;p&gt;Cliff &lt;/p&gt;

&lt;p&gt;Could you please help out with this.&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="20018" author="cliffw" created="Wed, 7 Sep 2011 18:02:28 +0000"  >&lt;p&gt;We need to know the exact version of Lustre you are running, and any patches applied.&lt;br/&gt;
You should certainly have errors in syslog. If you have done a clean full filesystem restart,&lt;br/&gt;
were you able to observe the start of the high MDS load? Logs from around that time might&lt;br/&gt;
have useful information.&lt;br/&gt;
In addition, the MDS records some timeout data in the &apos;timeouts&apos; files in /proc:&lt;br/&gt;
find /proc/fs/lustre -name timeouts&lt;br/&gt;
will show you the list; if you cat those files they will indicate which services are slow.&lt;br/&gt;
Look for large values in the &apos;worst&apos; column.&lt;/p&gt;</comment>
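                            <!--
                            A minimal shell sketch of the timeout inspection described in the comment above,
                            assuming a 1.8-era Lustre MDS where each service exports a "timeouts" file under
                            /proc/fs/lustre; exact paths and the column layout may differ between releases.

                                # list every timeouts file exported by the Lustre services on this node
                                find /proc/fs/lustre -name timeouts

                                # print each file with its path so large values in the 'worst' column stand out
                                for f in $(find /proc/fs/lustre -name timeouts); do
                                    echo "== $f =="
                                    cat "$f"
                                done
                            -->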
                            <comment id="20055" author="jbogden" created="Thu, 8 Sep 2011 12:55:11 +0000"  >&lt;p&gt;Complete set of syslogs for node &quot;amds1&quot;, which shows a complete cycle: boot -&amp;gt; take over Lustre MDS duties -&amp;gt; ptlrpcd and kiblnd continually ramp up load on the node -&amp;gt; node OOMs itself to death.&lt;/p&gt;</comment>
                            <comment id="20056" author="jbogden" created="Thu, 8 Sep 2011 12:58:30 +0000"  >&lt;p&gt;/proc/meminfo snapshot as amds1 allocates itself to death&lt;/p&gt;</comment>
                            <comment id="20057" author="jbogden" created="Thu, 8 Sep 2011 12:59:15 +0000"  >&lt;p&gt;/proc/slabinfo snapshot at the same time as the /proc/meminfo snapshot&lt;/p&gt;</comment>
                            <comment id="20060" author="jbogden" created="Thu, 8 Sep 2011 13:06:52 +0000"  >&lt;p&gt;Cliff, I just attached three files representative of the behavior we are seeing on the MDS node named &apos;amds1&apos; (even though the meminfo and slabinfo files were slightly misnamed as amds2).  We observed that the high MDS load started ramping up as soon as an MDS node booted, started up Lustre services for MDS duties, and finished reestablishing connectivity with Lustre clients.&lt;/p&gt;

&lt;p&gt;As best I can tell, the lustre-1.8.5.0-3chaos version we are running seems to be Lustre-1.8.5.0-3 + five patches:&lt;br/&gt;
ff2ef0c &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-337&quot; title=&quot;Processes stuck in sync_page on lustre client&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-337&quot;&gt;&lt;del&gt;LU-337&lt;/del&gt;&lt;/a&gt; Fix alloc mask in alloc_qinfo()&lt;br/&gt;
f9e0e36 &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-234&quot; title=&quot;OOM killer causes node hang&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-234&quot;&gt;&lt;del&gt;LU-234&lt;/del&gt;&lt;/a&gt; OOM killer causes node hang.&lt;br/&gt;
09eb8f9 &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-286&quot; title=&quot;racer: general protection fault: 0000 [1] SMP RIP: __wake_up_common+60}&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-286&quot;&gt;&lt;del&gt;LU-286&lt;/del&gt;&lt;/a&gt; racer: general protection fault.&lt;br/&gt;
f5a9068 &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-274&quot; title=&quot;Client delayed file status (cache meta-data) causing job failures&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-274&quot;&gt;&lt;del&gt;LU-274&lt;/del&gt;&lt;/a&gt; Update LVB from disk when glimpse callback return error&lt;br/&gt;
c4d695f Add IP to error message when peer&apos;s IB port is not privileged. &lt;/p&gt;

&lt;p&gt;I&apos;ll attempt to get clarification on the version details.&lt;/p&gt;

&lt;p&gt;We (or at least I) didn&apos;t know about the &apos;timeouts&apos; proc entries so I don&apos;t have data from them.  But we&apos;ll watch them and see if they are useful.&lt;/p&gt;</comment>
                            <comment id="20066" author="morrone" created="Thu, 8 Sep 2011 13:25:43 +0000"  >&lt;p&gt;I am a little confused.  Are there a number of typos in that last comment?  I am not aware of any tagged release numbered &quot;1.8.5.0-3&quot;.&lt;/p&gt;

&lt;p&gt;1.8.5.0-3chaos is 44 patches on top of 1.8.5.&lt;/p&gt;

&lt;p&gt;But the patch stack that you mention is EXACTLY the patch stack in the range 1.8.5.0-3chaos..1.8.5.0-5chaos.&lt;/p&gt;

&lt;p&gt;So it sounds like you are actually running 1.8.5.0-5chaos, not 1.8.5.0-3chaos.  Is that correct?&lt;/p&gt;</comment>
                            <comment id="20069" author="jbogden" created="Thu, 8 Sep 2011 13:31:58 +0000"  >&lt;p&gt;Chris,&lt;/p&gt;

&lt;p&gt;That is probably my bad in pulling the wrong changelog details.  Here is exactly what we are running:&lt;/p&gt;

&lt;p&gt;[root@amds1 ~]# rpm -qa | egrep &apos;lustre|chaos-kern&apos;&lt;br/&gt;
lustre-1.8.5.0-3chaos_2.6.18_105chaos.ch4.4&lt;br/&gt;
lustre-modules-1.8.5.0-3chaos_2.6.18_105chaos.ch4.4&lt;br/&gt;
chaos-kernel-2.6.18-105chaos&lt;/p&gt;

&lt;p&gt;Jeff&lt;/p&gt;</comment>
                            <comment id="20079" author="morrone" created="Thu, 8 Sep 2011 14:00:29 +0000"  >&lt;p&gt;According to Joe Mervini in a &lt;a href=&quot;http://jira.whamcloud.com/browse/LU-659?focusedCommentId=20039&amp;amp;page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-20039&quot; class=&quot;external-link&quot; rel=&quot;nofollow&quot;&gt;comment&lt;/a&gt; in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-659&quot; title=&quot;Experiencing heavy IO load, client eviction and RPC timeouts after upgrade to lustre-1.8.5.0-5 (chaos release)&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-659&quot;&gt;&lt;del&gt;LU-659&lt;/del&gt;&lt;/a&gt;, the workload that triggered this issue was:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;the MDS was getting buried by multiple user jobs (same user) were running chgrp recursively on a directory with ~4.5M files/directories&lt;/p&gt;&lt;/blockquote&gt;

</comment>
                            <comment id="20086" author="jbogden" created="Thu, 8 Sep 2011 14:13:51 +0000"  >&lt;p&gt;We have a good update about this issue.  We seem to finally have stabilized our MDS functions on this Lustre filesystem.  We believe that the root cause was almost pathologically bad usage of the filesystem by a single user.  The user was running serial batch jobs on the compute cluster connected via the IB&amp;lt;-&amp;gt;IB routers.  The user&apos;s directory tree looked like /gscratch2/joeuser/projectname/XXXX, where XXXX were directories that contained a tree associated with each serial batch job.  At the end of each batch job script the user does:&lt;/p&gt;

&lt;p&gt;  chgrp -R othergroup /gscratch2/joeuser/projectname&lt;br/&gt;
  chmod -R g=u-w /gscratch2/joeuser/projectname&lt;/p&gt;

&lt;p&gt;When we finally stumbled upon this, the user had 17 jobs concurrently doing chmod/chgrp on that directory tree.  The /gscratch2/joeuser/projectname directory tree contains about 4.5 million files.&lt;/p&gt;

&lt;p&gt;So what we think was happening was just obscene Lustre DLM thrashing.  I don&apos;t have a Lustre debug trace to prove it, but it makes sense from what I understand of how DLM goes about things.&lt;/p&gt;

&lt;p&gt;So maybe this isn&apos;t a bug per se, but it does raise several questions I can think of (I&apos;m sure there are others as well).  In no particular order:&lt;/p&gt;

&lt;ul&gt;
	&lt;li&gt;Is there a way to protect the MDS from allocating itself to death under conditions like this?  I&apos;m guessing the ptlrpcd and kiblnd processes just slab allocated the node to death trying to deal with all the DLM rpc traffic.&lt;/li&gt;
	&lt;li&gt;Is there a way to detect that behavior like this is occurring by inspecting sources of information on an MDS node?  I&apos;m guessing that we could probably figure it out now by poring through a Lustre debug log trace with at least &apos;dlmtrace&apos; and &apos;rpctrace&apos; enabled in lnet.debug (see the sketch after this comment).  But that isn&apos;t really an optimal production path for a normal systems administrator to have to follow.&lt;/li&gt;
	&lt;li&gt;Is there a way to protect the MDS from what essentially is a denial of service behavior from a small number of clients?&lt;/li&gt;
&lt;/ul&gt;

</comment>
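                            <!--
                            A hedged sketch of enabling the dlmtrace and rpctrace debugging mentioned in the
                            comment above. The flag names come from the comment; the /proc path and the lctl
                            subcommand are assumptions based on 1.8-era nodes and may differ on other releases.

                                # add dlmtrace and rpctrace to the lnet debug mask
                                echo "+dlmtrace +rpctrace" > /proc/sys/lnet/debug

                                # later, dump the kernel debug buffer to a file for offline inspection
                                lctl dk /tmp/lustre-debug.log
                            -->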
                            <comment id="20090" author="jbogden" created="Thu, 8 Sep 2011 14:20:31 +0000"  >&lt;p&gt;I didn&apos;t explicitly state that when we shot that user&apos;s jobs in the head and prevented any new jobs of the user from running, we were able to stabilize the MDS behavior.  We aren&apos;t quite sure why the MDS didn&apos;t stabilize when we initially took down the IB-to-IB routers.  Subsequent to the IB-to-IB router shutdown, we did a full restart of all the Lustre server nodes, and that may have cleaned out some cruft that was confusing the issue initially...&lt;/p&gt;
</comment>
                            <comment id="20099" author="green" created="Thu, 8 Sep 2011 22:31:24 +0000"  >&lt;p&gt;I am afraid there is no easy way to tell bad from good traffic in ldlm and act accordingly.&lt;/p&gt;

&lt;p&gt;Any client behavior that results in a lot of blocked MDS threads could potentially lead to this sort of DoS.&lt;br/&gt;
The proper way to fix this is a bit of a redesign of the processing architecture to allow MDS threads to do other work once lock blocking is encountered; otherwise, an effectively infinite number of MDT threads would be needed.&lt;br/&gt;
Another particular 1.8 scenario is a large job opening the same file with the O_CREAT flag set, which would end up causing MDS threads to block on the same lock; once the thread pool is exhausted, unrelated jobs would take a hit. It&apos;s even worse in 1.8, as the same condition is triggered by having a lot of threads open files with O_CREAT in the same dir.&lt;/p&gt;</comment>
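                            <!--
                            A hypothetical shell illustration of the 1.8 pattern described in the comment above:
                            many processes creating files with O_CREAT in one shared directory, which can pile
                            MDS threads onto the same directory lock. The directory name and the process count
                            are made up for illustration only.

                                SHARED=/mnt/lustre/shared_dir   # hypothetical shared directory on a Lustre mount
                                mkdir -p "$SHARED"
                                for rank in $(seq 1 512); do
                                    # each simulated job rank creates its own file in the same directory (open with O_CREAT)
                                    ( : > "$SHARED/output.$rank" ) &
                                done
                                wait
                            -->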
                            <comment id="25947" author="cliffw" created="Thu, 5 Jan 2012 18:28:11 +0000"  >&lt;p&gt;I am going to close this issue, as there is not a fix at this time. Please re-open if you have further data or questions. &lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                            <attachment id="10416" name="amds1.syslog.gz" size="39363" author="jbogden" created="Thu, 8 Sep 2011 12:55:11 +0000"/>
                            <attachment id="10417" name="amds2-meminfo" size="777" author="jbogden" created="Thu, 8 Sep 2011 12:58:30 +0000"/>
                            <attachment id="10418" name="amds2-slabinfo" size="18978" author="jbogden" created="Thu, 8 Sep 2011 12:59:15 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                    <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10040" key="com.atlassian.jira.plugin.system.customfieldtypes:labels">
                        <customfieldname>Epic</customfieldname>
                        <customfieldvalues>
                                        <label>hang</label>
            <label>lnet</label>
            <label>metadata</label>
            <label>server</label>
            <label>timeout</label>
    
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzvhyv:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>6560</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>