<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:15:24 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-1297] LBUG: LustreError: 3982:0:(dcache.c:175:ll_ddelete()) ASSERTION( ((de)-&gt;d_count) == 1 ) failed:</title>
                <link>https://jira.whamcloud.com/browse/LU-1297</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;LBUG: LustreError: 3982:0:(dcache.c:175:ll_ddelete()) ASSERTION( ((de)-&amp;gt;d_count) == 1 ) failed:&lt;/p&gt;

&lt;p&gt;Several clients are locking up and crashing.  Looking at the logs from three different clients, I see three different errors, namely:&lt;/p&gt;

&lt;p&gt;Apr  9 19:08:35 usrs389 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;1821598.590661&amp;#93;&lt;/span&gt; LustreError: 3982:0:(dcache.c:175:ll_ddelete()) ASSERTION( ((de)-&amp;gt;d_count) == 1 ) failed:&lt;br/&gt;
Apr  9 19:08:35 usrs389 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;1821598.638614&amp;#93;&lt;/span&gt; LustreError: 3982:0:(dcache.c:175:ll_ddelete()) LBUG&lt;/p&gt;

&lt;p&gt;Apr  9 15:49:45 usrs397 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;1809559.519814&amp;#93;&lt;/span&gt; WARNING: at fs/dcache.c:1367 d_set_d_op+0x8f/0xc0()&lt;/p&gt;

&lt;p&gt;Apr  9 19:06:55 usrs398 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;1827812.880503&amp;#93;&lt;/span&gt; kernel BUG at fs/dcache.c:440!&lt;br/&gt;
Apr  9 19:06:55 usrs398 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;1827812.901832&amp;#93;&lt;/span&gt; invalid opcode: 0000 &lt;a href=&quot;#1&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;1&lt;/a&gt; SMP&lt;/p&gt;


&lt;p&gt;Customer reports the following:&lt;/p&gt;

&lt;p&gt;Yesterday one of our heavy users ran into several problems:&lt;/p&gt;

&lt;p&gt;1) Twice java.io.File.mkdirs() returned false when creating a directory in the Lustre file system (he was never able to replicate it).&lt;br/&gt;
2) A number of times a file that he had previously opened and written to (on a single client) came back as not found when attempting to open.&lt;br/&gt;
3) Two clients locked up with LBUG in /var/log/messages&lt;br/&gt;
4) One client completely crashed.&lt;/p&gt;

&lt;p&gt;For issue #2, the logic was of the form:&lt;/p&gt;

&lt;p&gt;java.io.RandomAccessFile raf = new RandomAccessFile(&amp;lt;path&amp;gt;, &quot;rw&quot;)&lt;br/&gt;
raf.seek(&amp;lt;end of file&amp;gt;)&lt;br/&gt;
raf.write(&amp;lt;buffer&amp;gt;)&lt;br/&gt;
raf.close()&lt;/p&gt;

&lt;p&gt;This sequence was executed successfully a number of times (by the same program) and then failed on the first line with a FileNotFoundException.&lt;/p&gt;
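The customer's sequence above, expanded into a runnable sketch. This assumes the class meant is java.io.RandomAccessFile (the report writes "RandomFile", which does not exist in java.io), and it writes to a temporary file rather than the customer's Lustre path, since the real path was not given:

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

public class AppendSketch {
    public static void main(String[] args) throws IOException {
        // Stand-in for the customer's path, which was on a Lustre mount.
        File f = File.createTempFile("lu1297", ".dat");
        byte[] buffer = "payload".getBytes();

        // The sequence from the report: open "rw", seek to EOF, write, close.
        // On the affected clients the open itself eventually threw
        // FileNotFoundException for a file that had just been written.
        RandomAccessFile raf = new RandomAccessFile(f, "rw");
        raf.seek(raf.length());
        raf.write(buffer);
        raf.close();

        System.out.println(f.length()); // 7: "payload" appended to an empty file
    }
}
```

On a healthy file system this always succeeds; the reported failure mode was the constructor throwing FileNotFoundException despite the file existing, consistent with a stale client-side dcache entry.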

&lt;p&gt;I will attach the logs from the clients that had LBUGs.&lt;/p&gt;


&lt;p&gt;Today, the customer reports:&lt;/p&gt;

&lt;p&gt;I have uploaded netconsole output from 6 additional clients. All of these clients were used in yesterday&apos;s run by the heavy user and had not been rebooted. All of the problems were triggered by executing the following:&lt;/p&gt;

&lt;p&gt;$ echo 3 &amp;gt; /proc/sys/vm/drop_caches&lt;/p&gt;

&lt;p&gt;That command on usrs392 ended up hanging (using 100% of the CPU).&lt;/p&gt;

&lt;p&gt;I executed &quot;echo t &amp;gt; /proc/sysrq-trigger&quot; on usrs392 which crashed the machine and dumped out the stack traces in the output.&lt;/p&gt;

&lt;p&gt;That command on usrs{388,393,394,399,400} either segfaulted or was killed. Additionally, I decided to rerun the command on usrs{394,399,400} to see what happened; it hung on all three, using 100% of a single CPU. I managed to execute &quot;echo l &amp;gt; /proc/sysrq-trigger&quot;, which shows the stack trace for CPU12, the really active CPU. usrs{399,400} both locked up before I could run any further tests on them and had to be rebooted. I still have usrs{388,393} in the corrupted state if you need me to run any commands on them (they have not yet been rebooted).&lt;br/&gt;
&lt;br/&gt;
It appears that the file/directory cache in the kernel is being corrupted by the lustre client which is causing these infinite loops when trying to drop the caches.&lt;br/&gt;
&lt;br/&gt;
I looked at the MDS log.  It includes the following:&lt;br/&gt;
&lt;br/&gt;
Apr  9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 2542:0:(mdt_handler.c:887:mdt_getattr_name_lock()) header@ffff8805e8828080[0x0, 1, &lt;span class=&quot;error&quot;&gt;&amp;#91;0x200000bf5:0x10a0:0x0&amp;#93;&lt;/span&gt; hash lru]{&lt;br/&gt;
Apr  9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) header@ffff8805e8828080[0x0, 2, &lt;span class=&quot;error&quot;&gt;&amp;#91;0x200000bf5:0x10a0:0x0&amp;#93;&lt;/span&gt; hash lru]{
Apr  9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) ....mdt@ffff8805e88280d8mdt-object@ffff8805e8828080(ioepoch=0 flags=0x0, epochcount=0, writecount=0)
Apr  9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) ....cmm@ffff8805fa4dd1c0[local]
Apr  9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) ....mdd@ffff8805e9610b40mdd-object@ffff8805e9610b40(open_count=0, valid=0, cltime=0, flags=0)
Apr  9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) ....osd-ldiskfs@ffff8805e9610c00osd-ldiskfs-object@ffff8805e9610c00(i:(null):0/0)[plain]
Apr  9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) } header@ffff8805e8828080&lt;br/&gt;
Apr  9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) Parent doesn&apos;t exist!&lt;br/&gt;
Apr  9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) header@ffff8805e97902c0[0x0, 1, &lt;span class=&quot;error&quot;&gt;&amp;#91;0x200000bf5:0x10a1:0x0&amp;#93;&lt;/span&gt; hash lru]{
Apr  9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) ....mdt@ffff8805e9790318mdt-object@ffff8805e97902c0(ioepoch=0 flags=0x0, epochcount=0, writecount=0)
Apr  9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) ....cmm@ffff8805ed3297c0[local]
Apr  9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) ....mdd@ffff8805e927c900mdd-object@ffff8805e927c900(open_count=0, valid=0, cltime=0, flags=0)
Apr  9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) ....osd-ldiskfs@ffff8805e927ccc0osd-ldiskfs-object@ffff8805e927ccc0(i:(null):0/0)[plain]
Apr  9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) } header@ffff8805e97902c0&lt;br/&gt;
Apr  9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) Parent doesn&apos;t exist!&lt;br/&gt;
Apr  9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) header@ffff8805e8828080[0x0, 2, &lt;span class=&quot;error&quot;&gt;&amp;#91;0x200000bf5:0x10a0:0x0&amp;#93;&lt;/span&gt; hash lru]{
Apr  9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) ....mdt@ffff8805e88280d8mdt-object@ffff8805e8828080(ioepoch=0 flags=0x0, epochcount=0, writecount=0)
Apr  9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) ....cmm@ffff8805fa4dd1c0[local]
Apr  9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) ....mdd@ffff8805e9610b40mdd-object@ffff8805e9610b40(open_count=0, valid=0, cltime=0, flags=0)
Apr  9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) ....osd-ldiskfs@ffff8805e9610c00osd-ldiskfs-object@ffff8805e9610c00(i:(null):0/0)[plain]
Apr  9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) } header@ffff8805e8828080&lt;br/&gt;
Apr  9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) Parent doesn&apos;t exist!&lt;br/&gt;
&lt;br/&gt;
&lt;br/&gt;
The following files are attached:&lt;br/&gt;
&lt;br/&gt;
MDS.messages:  /var/log/messages from the MDS.  The problem starts at Apr  9, 18:05&lt;br/&gt;
&lt;br/&gt;
usrs389.messages, usrs397.messages, usrs398.messages: each shows a different LBUG or kernel assert.&lt;br/&gt;
&lt;br/&gt;
usrs392.netconsole shows the output from sysrq-trigger-T, though the machine may have crashed before it was completed.&lt;br/&gt;
&lt;br/&gt;
&lt;p&gt;usrs{388,393,394,399,400}.netconsole: these clients crashed or were killed when running sysrq-trigger-T.&lt;/p&gt;

&lt;p&gt;Please let me know if you need additional information.&lt;/p&gt;



</description>
                <environment>Lustre servers are running 2.6.32-220.el6, with Lustre 2.1.1.rc4.&lt;br/&gt;
Lustre clients are running 2.6.38.2, with special code created for this release, with &lt;a href=&quot;http://review.whamcloud.com/#change,2170&quot;&gt;http://review.whamcloud.com/#change,2170&lt;/a&gt;. </environment>
        <key id="13926">LU-1297</key>
            <summary>LBUG: LustreError: 3982:0:(dcache.c:175:ll_ddelete()) ASSERTION( ((de)-&gt;d_count) == 1 ) failed:</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="2" iconUrl="https://jira.whamcloud.com/images/icons/priorities/critical.svg">Critical</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="laisiyao">Lai Siyao</assignee>
                                    <reporter username="rspellman">Roger Spellman</reporter>
                        <labels>
                    </labels>
                <created>Tue, 10 Apr 2012 12:34:26 +0000</created>
                <updated>Tue, 17 Apr 2012 15:02:46 +0000</updated>
                            <resolved>Tue, 17 Apr 2012 15:02:46 +0000</resolved>
                                    <version>Lustre 2.1.1</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>4</watches>
                                                                            <comments>
                            <comment id="34475" author="pjones" created="Tue, 10 Apr 2012 17:08:57 +0000"  >&lt;p&gt;Lai&lt;/p&gt;

&lt;p&gt;Could you please look into this one?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="34492" author="laisiyao" created="Tue, 10 Apr 2012 22:43:19 +0000"  >&lt;p&gt;Hi Roger, the FC15 kernel Lustre supports is kernel-2.6.38.6-26.rc1.fc15, while kernel-2.6.38.2-9.fc15 is from the FC15 beta release; I can&apos;t even find a place to download it now. Could you ask the customer to upgrade to kernel-2.6.38.6-26.rc1.fc15 and test again?&lt;/p&gt;</comment>
                            <comment id="34523" author="pcpiela" created="Wed, 11 Apr 2012 10:20:50 +0000"  >&lt;p&gt;I don&apos;t think it is feasible to ask the customer to upgrade their kernel unless we can make a strong case that the symptoms map to a known kernel issue.&lt;/p&gt;</comment>
                            <comment id="34536" author="laisiyao" created="Wed, 11 Apr 2012 11:38:01 +0000"  >&lt;p&gt;Hmm, I got source rpm of this kernel, will check code and try to compile and test.&lt;/p&gt;</comment>
                            <comment id="34579" author="laisiyao" created="Thu, 12 Apr 2012 03:37:04 +0000"  >&lt;p&gt;The cause of the ll_ddelete() ASSERT is that dget_dlock() is not called under dentry-&amp;gt;d_lock in ll_splice_alias(). This is fine for kernels &amp;lt; 2.6.38 because dput() takes dcache_lock, but newer kernels remove dcache_lock, so d_lock must now be held to update dentry-&amp;gt;d_count.&lt;/p&gt;

&lt;p&gt;All the crashes look to have the same cause, and the d_set_d_op() warning message has been fixed too. Could you update to &lt;a href=&quot;http://review.whamcloud.com/#change,2170&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#change,2170&lt;/a&gt; Patch Set 8 and test again?&lt;/p&gt;</comment>
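The locking rule described in the comment above can be illustrated with a userspace analogy. This is not the actual Lustre patch, and the names (DentrySketch, takeRef, dgetDlock) are illustrative stand-ins for dentry, dget_dlock() and d_lock: once the global dcache_lock is gone, the reference count may only be bumped while the per-object lock is held.

```java
import java.util.concurrent.locks.ReentrantLock;

// Userspace analogy of the ll_splice_alias() fix (illustrative names,
// not the real kernel code): with no global dcache_lock, d_count may
// only be updated while the per-dentry d_lock is held.
public class DentrySketch {
    private final ReentrantLock dLock = new ReentrantLock(); // stands in for d_lock
    private int dCount = 1;                                  // stands in for d_count

    // dget_dlock() analogue: the caller must already hold dLock.
    private void dgetDlock() {
        assert dLock.isHeldByCurrentThread();
        dCount++;
    }

    // Corrected pattern: take the per-dentry lock around the bump,
    // instead of relying on a global lock that no longer exists.
    public int takeRef() {
        dLock.lock();
        try {
            dgetDlock();
            return dCount;
        } finally {
            dLock.unlock();
        }
    }

    public static void main(String[] args) {
        DentrySketch de = new DentrySketch();
        System.out.println(de.takeRef()); // 2
    }
}
```

Without the lock, two concurrent callers can both read dCount == 1 and both write 2, losing a reference; that lost count is what trips the ASSERTION( ((de)->d_count) == 1 ) later.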
                            <comment id="34942" author="pcpiela" created="Tue, 17 Apr 2012 13:24:42 +0000"  >&lt;p&gt;Both customer and Terascala have confirmed the fix. This issue can be closed.&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                            <attachment id="11113" name="MDS.messages" size="91777" author="rspellman" created="Tue, 10 Apr 2012 12:34:26 +0000"/>
                            <attachment id="11118" name="usrs388.netconsole" size="4509" author="rspellman" created="Tue, 10 Apr 2012 12:34:26 +0000"/>
                            <attachment id="11114" name="usrs389.messages" size="28485" author="rspellman" created="Tue, 10 Apr 2012 12:34:26 +0000"/>
                            <attachment id="11117" name="usrs392.netconsole" size="47410" author="rspellman" created="Tue, 10 Apr 2012 12:34:26 +0000"/>
                            <attachment id="11119" name="usrs393.netconsole" size="4509" author="rspellman" created="Tue, 10 Apr 2012 12:34:26 +0000"/>
                            <attachment id="11120" name="usrs394.netconsole" size="13365" author="rspellman" created="Tue, 10 Apr 2012 12:34:26 +0000"/>
                            <attachment id="11115" name="usrs397.messages" size="34197" author="rspellman" created="Tue, 10 Apr 2012 12:34:26 +0000"/>
                            <attachment id="11116" name="usrs398.messages" size="4759" author="rspellman" created="Tue, 10 Apr 2012 12:34:26 +0000"/>
                            <attachment id="11121" name="usrs399.netconsole" size="4789" author="rspellman" created="Tue, 10 Apr 2012 12:34:26 +0000"/>
                            <attachment id="11122" name="usrs400.netconsole" size="4509" author="rspellman" created="Tue, 10 Apr 2012 12:34:26 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzvh3b:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>6418</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10021"><![CDATA[2]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>