<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:24:04 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-9196] MDS server for Atlas file system crashed due to memory exhaustion.</title>
                <link>https://jira.whamcloud.com/browse/LU-9196</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Our MDS server crashed due to memory exhaustion. Examination of the system logs shows nothing out of the ordinary, except that an OI scrub had started on the MDS server:&lt;/p&gt;

&lt;p&gt;Lustre: atlas1-MDT0000-o: trigger OI scrub by RPC for the &lt;span class=&quot;error&quot;&gt;&amp;#91;0x20003f37b:0x3e5:0x0&amp;#93;&lt;/span&gt; with flags 0x4a, rc = 0&lt;/p&gt;

&lt;p&gt;Some time after that we encountered the crash captured in the attached vmcore-dmesg log.&lt;/p&gt;

</description>
                <environment>RHEL 6.8 running an unpatched Lustre 2.8 server using ldiskfs.</environment>
        <key id="44596">LU-9196</key>
            <summary>MDS server for Atlas file system crashed due to memory exhaustion.</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="yong.fan">nasf</assignee>
                                    <reporter username="simmonsja">James A Simmons</reporter>
                        <labels>
                    </labels>
                <created>Wed, 8 Mar 2017 19:01:08 +0000</created>
                <updated>Tue, 5 Jun 2018 16:36:51 +0000</updated>
                            <resolved>Tue, 5 Jun 2018 16:36:51 +0000</resolved>
                                    <version>Lustre 2.8.0</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>3</watches>
                                                                            <comments>
                            <comment id="187503" author="simmonsja" created="Wed, 8 Mar 2017 19:01:58 +0000"  >&lt;p&gt;The OI scrub was triggered at&#160;02:06, 05:12, 08:26, 11:36, 14:47 during the day.&lt;/p&gt;</comment>
                            <comment id="187520" author="yujian" created="Wed, 8 Mar 2017 20:16:50 +0000"  >&lt;p&gt;Hi Nasf,&lt;br/&gt;
Could you please advise? Thank you.&lt;/p&gt;</comment>
                            <comment id="187627" author="yong.fan" created="Thu, 9 Mar 2017 09:58:15 +0000"  >&lt;p&gt;According to the logs, there are several issues:&lt;/p&gt;

&lt;p&gt;1) OI scrub was triggered because of OI inconsistency, including the following three FIDs:&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;0x1000:0x15c5020:0x0&amp;#93;&lt;/span&gt;&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;0x20003f37b:0x3e5:0x0&amp;#93;&lt;/span&gt;&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;0x20003f47a:0xd8e0:0x0&amp;#93;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;Could you please find the files or objects/inodes corresponding to these FIDs via &quot;lfs fid2path&quot;? The FID &lt;span class=&quot;error&quot;&gt;&amp;#91;0x1000:0x15c5020:0x0&amp;#93;&lt;/span&gt; is an IGIF, which is somewhat abnormal. Please dump that inode (#4096) via debugfs.&lt;/p&gt;
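
&lt;p&gt;As a minimal sketch only (the client mount point /mnt/atlas1 and MDT device /dev/dm-5 are assumptions inferred from the logs; substitute the real paths), the lookup could be done roughly like this:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# On a Lustre client, map the normal-sequence FIDs to pathnames:
lfs fid2path /mnt/atlas1 &apos;[0x20003f37b:0x3e5:0x0]&apos;
lfs fid2path /mnt/atlas1 &apos;[0x20003f47a:0xd8e0:0x0]&apos;

# On the MDS, dump the IGIF inode #4096 from the backing ldiskfs (read-only):
debugfs -c -R &apos;stat &amp;lt;4096&amp;gt;&apos; /dev/dm-5
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;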

&lt;p&gt;2) LBUG() during osd_object_release().&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;&amp;lt;0&amp;gt;[2339122.255892] LustreError: 16352:0:(osd_handler.c:1610:osd_object_release()) LBUG
&amp;lt;4&amp;gt;[2339122.264739] Pid: 16352, comm: mdt01_381
&amp;lt;4&amp;gt;[2339122.269446] 
&amp;lt;4&amp;gt;[2339122.269446] Call Trace:
&amp;lt;4&amp;gt;[2339122.274687]  [&amp;lt;ffffffffa05b4875&amp;gt;] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
&amp;lt;4&amp;gt;[2339122.282893]  [&amp;lt;ffffffffa05b4e77&amp;gt;] lbug_with_loc+0x47/0xb0 [libcfs]
&amp;lt;4&amp;gt;[2339122.290225]  [&amp;lt;ffffffffa0ea93e8&amp;gt;] osd_object_release+0x88/0x90 [osd_ldiskfs]
&amp;lt;4&amp;gt;[2339122.298765]  [&amp;lt;ffffffffa074d6fd&amp;gt;] lu_object_put+0x16d/0x3b0 [obdclass]
&amp;lt;4&amp;gt;[2339122.306500]  [&amp;lt;ffffffffa102abc7&amp;gt;] mdt_getattr_name_lock+0x5f7/0x1900 [mdt]
&amp;lt;4&amp;gt;[2339122.314609]  [&amp;lt;ffffffffa102c3f2&amp;gt;] mdt_intent_getattr+0x292/0x470 [mdt]
&amp;lt;4&amp;gt;[2339122.322330]  [&amp;lt;ffffffffa101d93e&amp;gt;] mdt_intent_policy+0x4be/0xc70 [mdt]
&amp;lt;4&amp;gt;[2339122.329981]  [&amp;lt;ffffffffa091c0c7&amp;gt;] ldlm_lock_enqueue+0x127/0x990 [ptlrpc]
&amp;lt;4&amp;gt;[2339122.337912]  [&amp;lt;ffffffffa0946307&amp;gt;] ldlm_handle_enqueue0+0x807/0x14d0 [ptlrpc]
&amp;lt;4&amp;gt;[2339122.346565]  [&amp;lt;ffffffffa09b9a71&amp;gt;] ? tgt_lookup_reply+0x31/0x190 [ptlrpc]
&amp;lt;4&amp;gt;[2339122.354501]  [&amp;lt;ffffffffa09cbbe1&amp;gt;] tgt_enqueue+0x61/0x230 [ptlrpc]
&amp;lt;4&amp;gt;[2339122.361753]  [&amp;lt;ffffffffa09cc69c&amp;gt;] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
&amp;lt;4&amp;gt;[2339122.369877]  [&amp;lt;ffffffffa09796f1&amp;gt;] ptlrpc_main+0xd21/0x1800 [ptlrpc]
&amp;lt;4&amp;gt;[2339122.377324]  [&amp;lt;ffffffffa09789d0&amp;gt;] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
&amp;lt;4&amp;gt;[2339122.384747]  [&amp;lt;ffffffff810a640e&amp;gt;] kthread+0x9e/0xc0
&amp;lt;4&amp;gt;[2339122.390625]  [&amp;lt;ffffffff8100c28a&amp;gt;] child_rip+0xa/0x20
&amp;lt;4&amp;gt;[2339122.396587]  [&amp;lt;ffffffff810a6370&amp;gt;] ? kthread+0x0/0xc0
&amp;lt;4&amp;gt;[2339122.402556]  [&amp;lt;ffffffff8100c280&amp;gt;] ? child_rip+0x0/0x20
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;It seems that the inode&apos;s nlink attribute is invalid (it is marked as zero, but nobody destroyed the inode). We hit similar trouble in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-8992&quot; title=&quot;osd_object_release() LBUG&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-8992&quot;&gt;&lt;del&gt;LU-8992&lt;/del&gt;&lt;/a&gt;, but from the given logs alone I cannot say whether they are the same issue.&lt;/p&gt;

&lt;p&gt;3) Many mdt_getattr_internal() failures like the following:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;&amp;lt;3&amp;gt;[2230069.055437] LustreError: 16356:0:(mdt_handler.c:893:mdt_getattr_internal()) atlas1-MDT0000: getattr error for [0x2003863b4:0xc7fd:0x0]: rc = -2
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;This may be normal, caused by a racing unlink operation from another client. Let&apos;s ignore these failures for now.&lt;/p&gt;

&lt;p&gt;4) Some directories are full:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;&amp;lt;4&amp;gt;[401753.420890] LDISKFS-fs warning (device dm-5): ldiskfs_dx_add_entry: Directory (ino: 438307289) index full, reach max htree level :2
&amp;lt;4&amp;gt;[401753.434706] LDISKFS-fs warning (device dm-5): ldiskfs_dx_add_entry: Large directory feature is not enabled on this filesystem
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;This is because ldiskfs only supports a two-level htree directory index by default. If too many entries are inserted into a single directory, the index is exhausted even though there is still free space on disk.&lt;/p&gt;
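
&lt;p&gt;As a side note, a sketch only: on a system whose e2fsprogs already supports the large_dir feature (this needs to be verified for a RHEL 6.8 / Lustre 2.8 installation), a three-level htree can be enabled per device; /dev/dm-5 is taken from the warnings above.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# With the MDT unmounted, allow a 3-level htree on this device:
tune2fs -O large_dir /dev/dm-5

# Confirm the feature is now listed:
dumpe2fs -h /dev/dm-5 | grep -i features
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;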

&lt;p&gt;5) Out of memory.&lt;br/&gt;
Currently I am not sure what exhausted the RAM. The primary suspect is NOT the OI scrub; instead, it may be related to improper cleanup of the page cache after the directory index became full.&lt;/p&gt;</comment>
                            <comment id="192648" author="yong.fan" created="Wed, 19 Apr 2017 08:26:24 +0000"  >&lt;p&gt;Is there any further feedback, any logs, or a reproducer?&lt;/p&gt;</comment>
                            <comment id="193028" author="simmonsja" created="Fri, 21 Apr 2017 15:28:29 +0000"  >&lt;p&gt;We haven&apos;t seen this problem since. As for the OI problems you saw we are running lfsck to clean those up. Once lfsck is done I will report if everything is fixed.&lt;/p&gt;</comment>
                            <comment id="204614" author="yong.fan" created="Mon, 7 Aug 2017 05:54:17 +0000"  >&lt;p&gt;Any update? Thanks!&lt;/p&gt;</comment>
                            <comment id="215246" author="yujian" created="Mon, 4 Dec 2017 17:42:42 +0000"  >&lt;p&gt;Hi James,&lt;/p&gt;

&lt;blockquote&gt;&lt;p&gt;We haven&apos;t seen this problem since. As for the OI problems you saw we are running lfsck to clean those up. Once lfsck is done I will report if everything is fixed.&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;Is everything fixed after running lfsck?&lt;/p&gt;</comment>
                            <comment id="229105" author="yong.fan" created="Tue, 5 Jun 2018 16:36:51 +0000"  >&lt;p&gt;The main issues should have been fixed via LFSCK. Please reopen this ticket if you have more questions.&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                            <attachment id="25776" name="vmcore-dmesg.txt" size="524288" author="simmonsja" created="Wed, 8 Mar 2017 19:00:55 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzz6bz:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>