<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:45:14 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-11593] LFSCK crashed all OSS servers</title>
                <link>https://jira.whamcloud.com/browse/LU-11593</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Hi&lt;/p&gt;

&lt;p&gt;This is Manish from NASA. We were seeing hangs on the &quot;ls&quot; command, so we ran the &quot;lfsck&quot; command on all OSS nodes to clear up stale entries; after a long run, the lfsck command crashed all the servers.&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
PID: 25536 &#160;TASK: ffff881bc2abaf70 &#160;CPU: 9 &#160; COMMAND: &lt;span class=&quot;code-quote&quot;&gt;&quot;lfsck&quot;&lt;/span&gt;
&#160;#0 [ffff88151015fbc8] machine_kexec at ffffffff8105b64b
&#160;#1 [ffff88151015fc28] __crash_kexec at ffffffff81105342
&#160;#2 [ffff88151015fcf8] panic at ffffffff81689aad
&#160;#3 [ffff88151015fd78] lbug_with_loc at ffffffffa08938cb [libcfs]
&#160;#4 [ffff88151015fd98] lfsck_layout_slave_prep at ffffffffa0cfa6fd [lfsck]
&#160;#5 [ffff88151015fdf0] lfsck_master_engine at ffffffffa0cd4624 [lfsck]
&#160;#6 [ffff88151015fec8] kthread at ffffffff810b1131
&#160;#7 [ffff88151015ff50] ret_from_fork at ffffffff816a14f7&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;I will upload the complete crash dump to the FTP site soon.&lt;/p&gt;

&lt;p&gt;Thank You,&lt;/p&gt;

&lt;p&gt;&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160;Manish&lt;/p&gt;</description>
                <environment>CentOS Linux release 7.5.1804 (Core)&lt;br/&gt;
Kernel Version 3.10.0-693.21.1.el7.20180508.x86_64.lustre2105&lt;br/&gt;
e2fsprogs-libs-1.44.3.wc1-0.el7.x86_64&lt;br/&gt;
e2fsprogs-1.44.3.wc1-0.el7.x86_64&lt;br/&gt;
e2fsprogs-static-1.44.3.wc1-0.el7.x86_64&lt;br/&gt;
e2fsprogs-devel-1.44.3.wc1-0.el7.x86_64&lt;br/&gt;
e2fsprogs-debuginfo-1.44.3.wc1-0.el7.x86_64</environment>
        <key id="53876">LU-11593</key>
            <summary>LFSCK crashed all OSS servers</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="3" iconUrl="https://jira.whamcloud.com/images/icons/statuses/inprogress.png" description="This issue is being actively worked on at the moment by the assignee.">In Progress</status>
                    <statusCategory id="4" key="indeterminate" colorName="inprogress"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="laisiyao">Lai Siyao</assignee>
                                    <reporter username="manishpatel">Manish</reporter>
                        <labels>
                    </labels>
                <created>Thu, 1 Nov 2018 16:39:08 +0000</created>
                <updated>Mon, 14 Jun 2021 21:23:49 +0000</updated>
                                            <version>Lustre 2.10.5</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>8</watches>
                    <comments>
                            <comment id="236156" author="pjones" created="Thu, 1 Nov 2018 17:09:34 +0000"  >&lt;p&gt;Lai&lt;/p&gt;

&lt;p&gt;Could you please advise?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="236168" author="adilger" created="Thu, 1 Nov 2018 21:51:35 +0000"  >&lt;p&gt;The &lt;tt&gt;lfsck_layout_slave_prep()&lt;/tt&gt; function has two different &lt;tt&gt;LASSERT()&lt;/tt&gt; checks in it, but it isn&apos;t clear from the above stack trace which one was triggered.  Are there some lines on the console before the stack trace that show what the actual problem was?  Based on the &lt;tt&gt;lfsck_layout_slave_prep()&lt;/tt&gt; function this was in, it looks like it was at the start of the MDS-&amp;gt;OSS &lt;tt&gt;layout&lt;/tt&gt; scanning phase, so it may have already repaired the corrupted LMA structures on the MDS.&lt;/p&gt;

&lt;p&gt;If the previous LFSCK run has repaired most of the issues with the files, it is not strictly necessary to continue with LFSCK prior to returning the filesystem to service.  You might consider running a &quot;find -uid 0&quot; (or similar, maybe in parallel across each top-level directory from a separate client) on the filesystem to ensure that there are no files that cause the MDS or client to crash when accessed, but cleaning up orphan OST objects is probably a secondary concern at this point.  Lustre clients can handle corrupt/unknown file layouts and missing OST objects fairly well (returning an error when such files are accessed).&lt;/p&gt;</comment>
                            <comment id="236173" author="manishpatel" created="Thu, 1 Nov 2018 22:21:29 +0000"  >&lt;p&gt;Here is the stack trace from vmcore-dmesg that I see on one of the nodes.&lt;/p&gt;

&lt;p&gt;&#160;&#160;&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
[216597.870804] LustreError: 29043:0:(lfsck_layout.c:5144:lfsck_layout_slave_prep()) ASSERTION( !llsd-&amp;gt;llsd_rbtree_valid ) failed:
[216597.870893] LustreError: 29045:0:(lfsck_layout.c:5144:lfsck_layout_slave_prep()) ASSERTION( !llsd-&amp;gt;llsd_rbtree_valid ) failed:
[216597.870896] LustreError: 29045:0:(lfsck_layout.c:5144:lfsck_layout_slave_prep()) LBUG
[216597.870897] Pid: 29045, comm: lfsck 3.10.0-693.21.1.el7.20180508.x86_64.lustre2105 #1 SMP Mon Aug 27 23:04:41 UTC 2018
[216597.870898] Call Trace:
[216597.870916]&#160; [&amp;lt;ffffffff8103a1f2&amp;gt;] save_stack_trace_tsk+0x22/0x40
[216597.870926]&#160; [&amp;lt;ffffffffa086d7cc&amp;gt;] libcfs_call_trace+0x8c/0xc0 [libcfs]
[216597.870931]&#160; [&amp;lt;ffffffffa086d87c&amp;gt;] lbug_with_loc+0x4c/0xa0 [libcfs]
[216597.870954]&#160; [&amp;lt;ffffffffa10e06fd&amp;gt;] lfsck_layout_slave_prep+0x51d/0x590 [lfsck]
[216597.870961]&#160; [&amp;lt;ffffffffa10ba624&amp;gt;] lfsck_master_engine+0x184/0x1360 [lfsck]
[216597.870965]&#160; [&amp;lt;ffffffff810b1131&amp;gt;] kthread+0xd1/0xe0
[216597.870968]&#160; [&amp;lt;ffffffff816a14f7&amp;gt;] ret_from_fork+0x77/0xb0
[216597.870989]&#160; [&amp;lt;ffffffffffffffff&amp;gt;] 0xffffffffffffffff
[216597.870990] Kernel panic - not syncing: LBUG
[216597.870992] CPU: 11 PID: 29045 Comm: lfsck Tainted: G &#160; &#160; &#160; &#160; &#160; OE&#160; ------------ &#160; 3.10.0-693.21.1.el7.20180508.x86_64.lustre2105 #1
[216597.870994] Call Trace:
[216597.870999]&#160; [&amp;lt;ffffffff8168f4b8&amp;gt;] dump_stack+0x19/0x1b
[216597.871003]&#160; [&amp;lt;ffffffff81689aa2&amp;gt;] panic+0xe8/0x21f
[216597.871008]&#160; [&amp;lt;ffffffffa086d8cb&amp;gt;] lbug_with_loc+0x9b/0xa0 [libcfs]
[216597.871017]&#160; [&amp;lt;ffffffffa10e06fd&amp;gt;] lfsck_layout_slave_prep+0x51d/0x590 [lfsck]
[216597.871025]&#160; [&amp;lt;ffffffffa10ba624&amp;gt;] lfsck_master_engine+0x184/0x1360 [lfsck]
[216597.871032]&#160; [&amp;lt;ffffffffa10ba4a0&amp;gt;] ? lfsck_master_oit_engine+0x1190/0x1190 [lfsck]
[216597.871034]&#160; [&amp;lt;ffffffff810b1131&amp;gt;] kthread+0xd1/0xe0
[216597.871036]&#160; [&amp;lt;ffffffff810b1060&amp;gt;] ? insert_kthread_work+0x40/0x40
[216597.871038]&#160; [&amp;lt;ffffffff816a14f7&amp;gt;] ret_from_fork+0x77/0xb0
[216597.871039]&#160; [&amp;lt;ffffffff810b1060&amp;gt;] ? insert_kthread_work+0x40/0x40
[216597.871698] LustreError: 29049:0:(lfsck_layout.c:5144:lfsck_layout_slave_prep()) ASSERTION( !llsd-&amp;gt;llsd_rbtree_valid ) failed:
[216597.871700] LustreError: 29049:0:(lfsck_layout.c:5144:lfsck_layout_slave_prep()) LBUG
[216597.871701] Pid: 29049, comm: lfsck 3.10.0-693.21.1.el7.20180508.x86_64.lustre2105 #1 SMP Mon Aug 27 23:04:41 UTC 2018
[216597.871702] Call Trace:
[216597.871710]&#160; [&amp;lt;ffffffff8103a1f2&amp;gt;] save_stack_trace_tsk+0x22/0x40
[216597.871716]&#160; [&amp;lt;ffffffffa086d7cc&amp;gt;] libcfs_call_trace+0x8c/0xc0 [libcfs]
[216597.871721]&#160; [&amp;lt;ffffffffa086d87c&amp;gt;] lbug_with_loc+0x4c/0xa0 [libcfs]
[216597.871733]&#160; [&amp;lt;ffffffffa10e06fd&amp;gt;] lfsck_layout_slave_prep+0x51d/0x590 [lfsck]
[216597.871741]&#160; [&amp;lt;ffffffffa10ba624&amp;gt;] lfsck_master_engine+0x184/0x1360 [lfsck]
[216597.871743]&#160; [&amp;lt;ffffffff810b1131&amp;gt;] kthread+0xd1/0xe0
[216597.871745]&#160; [&amp;lt;ffffffff816a14f7&amp;gt;] ret_from_fork+0x77/0xb0

&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;I hope this helps. I have already uploaded the complete stack trace to the FTP site.&lt;/p&gt;

&lt;p&gt;Thank You,&lt;/p&gt;

&lt;p&gt;&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; Manish&#160;&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;</comment>
                            <comment id="236244" author="manishpatel" created="Fri, 2 Nov 2018 16:54:34 +0000"  >&lt;p&gt;Hi Lai,&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;Just wanted to follow up: are there any updates on this issue? We will be getting the lfsck patch from ticket &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11584&quot; title=&quot;kernel BUG at ldiskfs.h:1907!&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11584&quot;&gt;&lt;del&gt;LU-11584&lt;/del&gt;&lt;/a&gt;, and if this issue can be addressed in that patch as well, it would help us avoid one more outage for this bug.&lt;/p&gt;

&lt;p&gt;Thank You,&lt;/p&gt;

&lt;p&gt;&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; Manish&lt;/p&gt;</comment>
                            <comment id="236316" author="laisiyao" created="Mon, 5 Nov 2018 11:22:20 +0000"  >&lt;p&gt;I&apos;m still reviewing the code to understand this, and it doesn&apos;t look to be the same issue as &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11584&quot; title=&quot;kernel BUG at ldiskfs.h:1907!&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-11584&quot;&gt;&lt;del&gt;LU-11584&lt;/del&gt;&lt;/a&gt;.&lt;/p&gt;</comment>
                            <comment id="236416" author="laisiyao" created="Tue, 6 Nov 2018 12:13:15 +0000"  >&lt;p&gt;Manish, do you remember how you started &apos;lfsck&apos; on the servers? Did you run &apos;lfsck&apos; only on the OSS nodes, but not on the MDS? And what&apos;s the exact command?&lt;/p&gt;</comment>
                            <comment id="236479" author="mhanafi" created="Tue, 6 Nov 2018 18:39:50 +0000"  >&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
lctl lfsck_start -o -r &lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;This is likely related to &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-11625&quot; class=&quot;external-link&quot; rel=&quot;nofollow&quot;&gt;https://jira.whamcloud.com/browse/LU-11625&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="237436" author="laisiyao" created="Mon, 26 Nov 2018 09:12:54 +0000"  >&lt;p&gt;Mahmoud, does your system have multiple MDTs? Did you also run &apos;lctl lfsck_start -o -r&apos; on the MDS?&lt;/p&gt;</comment>
                            <comment id="304500" author="jwallior" created="Mon, 14 Jun 2021 21:23:49 +0000"  >&lt;p&gt;We just hit this one on 2.10.8. &lt;br/&gt;
I was running `lctl lfsck_start -M MDT-0000 -o -r` on MDS0&lt;/p&gt;

&lt;p&gt;We have 2 MDS nodes and 2 MDTs, and we are running 1 MDT on each MDS.&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                    <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i005gn:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10021"><![CDATA[2]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>