<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:18:21 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-8528] MDT lock callback timer expiration and evictions under light load</title>
                <link>https://jira.whamcloud.com/browse/LU-8528</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Running a 1000-node user job (call it &quot;ben&quot;) on the jade cluster results in &apos;lock callback timer expired&apos; messages on the MDS console, and transactions begin taking a very long time or failing entirely when the client is evicted.&lt;/p&gt;

&lt;p&gt;The first lock timeouts are seen within 5 minutes of starting the job.&lt;/p&gt;

&lt;p&gt;After the MDS stops responding, the node is still up and debug logs can be dumped; I&apos;ll attach some.&lt;/p&gt;

&lt;p&gt;There is no evidence of network issues; the fabric in the compute cluster appears clean, the router nodes and compute nodes report no peers down, and initially the clients report good connections to the server.  Network monitoring tools also indicate no network issues.&lt;/p&gt;</description>
                <environment>servers: cider cluster&lt;br/&gt;
rhel 6.8 derivative&lt;br/&gt;
lustre-2.5.5-8chaos_2.6.32_642.3.1.1chaos.ch5.5.x86_64.x86_64&lt;br/&gt;
&lt;br/&gt;
clients: jade cluster&lt;br/&gt;
rhel 7.2 derivative&lt;br/&gt;
kernel-3.10.0-327.28.2.1chaos.ch6.x86_64&lt;br/&gt;
lustre-2.5.5-9chaos.2.ch6.x86_64</environment>
        <key id="39066">LU-8528</key>
            <summary>MDT lock callback timer expiration and evictions under light load</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="2" iconUrl="https://jira.whamcloud.com/images/icons/priorities/critical.svg">Critical</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="6">Not a Bug</resolution>
                                        <assignee username="yong.fan">nasf</assignee>
                                    <reporter username="ofaaland">Olaf Faaland</reporter>
                        <labels>
                            <label>llnl</label>
                    </labels>
                <created>Tue, 23 Aug 2016 18:50:24 +0000</created>
                <updated>Tue, 29 Nov 2016 00:32:50 +0000</updated>
                            <resolved>Mon, 29 Aug 2016 15:37:30 +0000</resolved>
                            <due></due>
                            <votes>0</votes>
                                    <watches>5</watches>
                                                                            <comments>
                            <comment id="162881" author="ofaaland" created="Tue, 23 Aug 2016 18:53:02 +0000"  >&lt;p&gt;I overlooked setting the priority.  This is critical; it is preventing us from doing early user testing of a new, large cluster.&lt;/p&gt;</comment>
                            <comment id="162882" author="ofaaland" created="Tue, 23 Aug 2016 18:53:37 +0000"  >&lt;p&gt;Running mdtest at this scale, and larger scale, does not trigger this issue.&lt;/p&gt;</comment>
                            <comment id="162883" author="ofaaland" created="Tue, 23 Aug 2016 18:58:53 +0000"  >&lt;p&gt;For reference when looking at the attached logs: the job was started on 8/23 at 10:26am.&lt;br/&gt;
I&apos;ll attach lustre debug logs from a few randomly selected clients and from the mds.  I&apos;ll also attach stacks from the selected clients and from the mds.&lt;br/&gt;
I don&apos;t know the details of what I/O the job attempts to do; we&apos;ll look into that.&lt;/p&gt;</comment>
                            <comment id="162884" author="ofaaland" created="Tue, 23 Aug 2016 19:06:45 +0000"  >&lt;p&gt;From another cluster which also mounts lscratchf, &apos;df&apos; started taking &amp;gt;3 minutes to return at 10:30am and started responding normally only after the jade compute nodes had been powered off and the MDS had been crashed and recovered.  That test runs every 5 minutes; at 10:25am &apos;df&apos; took &amp;lt; 3 minutes.&lt;/p&gt;</comment>
                            <comment id="162910" author="ofaaland" created="Tue, 23 Aug 2016 21:14:19 +0000"  >&lt;p&gt;jade2074 is the node with nid 192.168.136.42@o2ib24 which is called out in the MDS console log.&lt;/p&gt;

&lt;p&gt;I&apos;ve attached the console log from jade2074.&lt;/p&gt;

&lt;p&gt;Unfortunately I don&apos;t have stacks or a lustre debug log from jade2074, and the node has been rebooted.&lt;/p&gt;</comment>
                            <comment id="162913" author="ofaaland" created="Tue, 23 Aug 2016 21:19:14 +0000"  >&lt;p&gt;This is reproducible on this cluster/filesystem.  We&apos;re working on reproducing the issue on a smaller cluster where we can leave the nodes in the bad state as long as necessary to gather data.&lt;/p&gt;</comment>
                            <comment id="163001" author="ofaaland" created="Wed, 24 Aug 2016 14:58:10 +0000"  >&lt;p&gt;Hi, can someone take a look at this?&lt;br/&gt;
thanks,&lt;br/&gt;
Olaf&lt;/p&gt;</comment>
                            <comment id="163017" author="pjones" created="Wed, 24 Aug 2016 15:56:48 +0000"  >&lt;p&gt;Fan Yong&lt;/p&gt;

&lt;p&gt;Can you please assist with this ticket?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="163079" author="ofaaland" created="Wed, 24 Aug 2016 20:38:30 +0000"  >&lt;p&gt;Fan,&lt;/p&gt;

&lt;p&gt;I&apos;m about to run another test at a different scale, but expect it will likely produce the same symptoms.  Is there anything in particular you want me to gather?  Right now, I&apos;m planning on collecting this:&lt;/p&gt;
&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;console logs from clients and servers that report anything lustre related after the job starts&lt;/li&gt;
	&lt;li&gt;lctl dk with +rpctrace, and stacks, from:
	&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
		&lt;li&gt;the mds&lt;/li&gt;
		&lt;li&gt;the first client the mdt mentions in lock callback timer expired message&lt;/li&gt;
		&lt;li&gt;the first client that complains in its log about anything lustre related after the job starts&lt;/li&gt;
	&lt;/ul&gt;
	&lt;/li&gt;
	&lt;li&gt;list of downed peers from router nodes&lt;/li&gt;
	&lt;li&gt;contents of /proc/fs/lustre/mdc/*/state from the clients that report anything in console logs&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;The mdt lustre debug logs will likely be too big to send you, so you&apos;ll need to tell me what you would like me to extract.&lt;/p&gt;

&lt;p&gt;I can gather different information - let me know what is helpful.&lt;/p&gt;

&lt;p&gt;thanks,&lt;br/&gt;
Olaf&lt;/p&gt;</comment>
                            <comment id="163099" author="ofaaland" created="Wed, 24 Aug 2016 22:16:56 +0000"  >&lt;p&gt;I forgot to specify above, but the client nodes and router nodes appear to be stable the entire time.  I haven&apos;t observed any nodes OOM or crash, so I see no reason the mdt wouldn&apos;t be receiving responses to requests sent to the clients.&lt;/p&gt;</comment>
                            <comment id="163140" author="ofaaland" created="Thu, 25 Aug 2016 15:57:06 +0000"  >&lt;p&gt;Tested the job at a slightly different scale on 8/24 and got the same symptoms.&lt;br/&gt;
I notice when looking at the stacks on the MDS that it has many threads in this stack:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[&amp;lt;ffffffffa0cf3dd5&amp;gt;] ptlrpc_wait_event+0x2d5/0x2e0 [ptlrpc]
[&amp;lt;ffffffffa0cfe31f&amp;gt;] ptlrpc_main+0x94f/0x1af0 [ptlrpc]
[&amp;lt;ffffffff810a6cbe&amp;gt;] kthread+0x9e/0xc0
[&amp;lt;ffffffff8100c2ca&amp;gt;] child_rip+0xa/0x20
[&amp;lt;ffffffffffffffff&amp;gt;] 0xffffffffffffffff
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;which looks to me like they are idle, waiting for requests to arrive.  Am I mistaken?&lt;br/&gt;
This confuses me because metadata operations, such as statfs or stat of the filesystem root, on any client mounting the filesystem (not just those participating in the ben job) return after long delays (several minutes) or possibly never (meaning I gave up).  I&apos;d thought this was because the service threads were busy, but that seems not to be the case.&lt;/p&gt;</comment>
                            <comment id="163141" author="ofaaland" created="Thu, 25 Aug 2016 15:59:30 +0000"  >&lt;p&gt;I attached data from the Aug 24 run I mentioned.  It includes job_stats data from the MDS for the job, as well as consoles, stacks, and lustre debug logs.  It&apos;s called 08-24.for_intel.tgz&lt;/p&gt;</comment>
                            <comment id="163146" author="ofaaland" created="Thu, 25 Aug 2016 16:28:07 +0000"  >&lt;p&gt;This same job run on 100 nodes x 18 cores/node produces the same warning messages on the console of the MDS, but does not cause the file system to hang (at least not consistently).  At 1000 nodes x 18 cores/node it does consistently cause the file system to hang.&lt;/p&gt;</comment>
                            <comment id="163158" author="yong.fan" created="Thu, 25 Aug 2016 17:29:58 +0000"  >&lt;p&gt;Where can I get the source code that you are testing, lustre-2.5.5-8chaos and lustre-2.5.5-9chaos? &lt;a href=&quot;https://github.com/LLNL/lustre&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://github.com/LLNL/lustre&lt;/a&gt;? I cannot find related tags.&lt;/p&gt;</comment>
                            <comment id="163159" author="ofaaland" created="Thu, 25 Aug 2016 17:42:06 +0000"  >&lt;p&gt;Hello Fan,&lt;br/&gt;
There&apos;s a repo visible to Intel folks that Chris pushes to (that I believe is hosted by Intel).  Peter and Andreas know the details; others might as well.&lt;br/&gt;
thanks,&lt;br/&gt;
Olaf&lt;/p&gt;</comment>
                            <comment id="163164" author="ofaaland" created="Thu, 25 Aug 2016 18:14:42 +0000"  >&lt;p&gt;The user reports he has successfully run the same job at 1024 nodes on Sierra, several times within the last few months.  The Sierra cluster is at lustre-2.5.5-6chaos_2.6.32_573.26.1.1chaos.ch5.4.x86_64.x86_64.  Sierra is running a RHEL 6.7 derivative (the same one as the lustre servers cider*).&lt;/p&gt;</comment>
                            <comment id="163204" author="ofaaland" created="Fri, 26 Aug 2016 00:39:47 +0000"  >&lt;p&gt;Hello Fan,&lt;/p&gt;

&lt;p&gt;I have a probable reproducer:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;$ cat ~/projects/toss-3380/mkdir_script
#!/usr/bin/bash
mkdir -p /p/lscratchf/faaland1/mkdirp/${SLURM_JOBID}/a/b/c/d/e/f/g/$(hostname)/$$
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Run this way on jade (the compute cluster described above), it reproduces the symptoms of the problem, except that after the compute nodes are rebooted the server recovers on its own:&lt;br/&gt;
srun -ppbatch -N 100 -n 800 ~/projects/toss-3380/mkdir_script&lt;/p&gt;

&lt;p&gt;Run this way on jade (the compute cluster described above), it does not produce any symptoms; the directories are created quickly and the job succeeds:&lt;br/&gt;
srun -ppbatch -N 100 -n 100 ~/projects/toss-3380/mkdir_script&lt;/p&gt;</comment>
                            <comment id="163205" author="ofaaland" created="Fri, 26 Aug 2016 01:08:10 +0000"  >&lt;p&gt;We have another compute cluster, catalyst, which mounts the same filesystem (cider/lsf/lscratchf).&lt;/p&gt;

&lt;p&gt;On catalyst I do &lt;em&gt;not&lt;/em&gt; reproduce the problem even at 100 nodes x 8 processes per node:&lt;br/&gt;
srun -ppbatch -N 100 -n 800 ~/projects/toss-3380/mkdir_script&lt;/p&gt;

&lt;p&gt;There are many differences between catalyst and jade, but two of them are:&lt;br/&gt;
OS: jade runs RHEL 7.2, catalyst runs RHEL 6.8&lt;br/&gt;
SELinux: jade has SELinux enforcing, catalyst has it disabled&lt;/p&gt;</comment>
                            <comment id="163332" author="jhammond" created="Sat, 27 Aug 2016 15:45:08 +0000"  >&lt;p&gt;Hi Olaf, would it be possible to disable SELinux on jade and run again?&lt;/p&gt;</comment>
                            <comment id="163380" author="yong.fan" created="Mon, 29 Aug 2016 08:32:35 +0000"  >&lt;p&gt;Olaf, have you tried with SELinux disabled as John suggested? It is suspected that SELinux caused your trouble. Even if there might be other reasons, we still suggest you disable SELinux on Jade, because your current system (Lustre-2.5.5 based) does not support SELinux. Please refer to &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5560&quot; title=&quot;SELinux support on the client side&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5560&quot;&gt;&lt;del&gt;LU-5560&lt;/del&gt;&lt;/a&gt; for details.&lt;/p&gt;</comment>
                            <comment id="163409" author="ofaaland" created="Mon, 29 Aug 2016 15:24:10 +0000"  >&lt;p&gt;John, Fan,&lt;br/&gt;
Yes, after disabling SELinux on jade, the job runs successfully, repeatedly.  Sorry for the delay responding, I was out Friday.  You can mark this issue resolved.&lt;/p&gt;</comment>
                            <comment id="163410" author="yong.fan" created="Mon, 29 Aug 2016 15:37:30 +0000"  >&lt;p&gt;That is related to SELinux.&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                            <attachment id="22759" name="08-24.for_intel.tgz" size="1062555" author="ofaaland" created="Thu, 25 Aug 2016 15:57:25 +0000"/>
                            <attachment id="22729" name="cider-mds1.console.1471978512" size="15173" author="ofaaland" created="Tue, 23 Aug 2016 20:51:05 +0000"/>
                            <attachment id="22733" name="console.jade2074" size="18309" author="ofaaland" created="Tue, 23 Aug 2016 21:14:19 +0000"/>
                            <attachment id="22730" name="dk.jade2119.1471973342" size="89708" author="ofaaland" created="Tue, 23 Aug 2016 20:52:49 +0000"/>
                            <attachment id="22731" name="ps.ef.jade2119.1471973574" size="117886" author="ofaaland" created="Tue, 23 Aug 2016 20:52:49 +0000"/>
                            <attachment id="22732" name="stacks.cider-mds1.1471973508" size="465580" author="ofaaland" created="Tue, 23 Aug 2016 20:54:45 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                    <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzylv3:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>