<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:20:32 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-8787] zpool containing MDT0000 out of space</title>
                <link>https://jira.whamcloud.com/browse/LU-8787</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;On a DNE file system, MDT0000 ran out of space while one or more other MDTs were in recovery.&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;   2016-10-31 18:26:53 [20537.964631] Lustre: Skipped 1 previous similar message
   2016-10-31 18:26:58 [20542.793836] LustreError: 31561:0:(osd_handler.c:223:osd_trans_start()) lsh-MDT0000: failed to start transaction due to ENOSPC. Metadata overhead is underestimated or grant_ratio is too low.
   2016-10-31 18:26:58 [20542.815473] LustreError: 31561:0:(osd_handler.c:223:osd_trans_start()) Skipped 39 previous similar messages
   2016-10-31 18:26:58 [20542.827434] LustreError: 31561:0:(llog_cat.c:744:llog_cat_cancel_records()) lsh-OST0009-osc-MDT0000: fail to cancel 1 of 1 llog-records: rc = -28
   2016-10-31 18:26:58 [20542.843771] LustreError: 31561:0:(osp_sync.c:1031:osp_sync_process_committed()) lsh-OST0009-osc-MDT0000: can&apos;t cancel record: -28
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Obviously the first step is to increase the capacity of the pool.  However, after that is done, is further action required?  Should I run lfsck, or do anything else?&lt;/p&gt;</description>
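                <!--
                For reference, a minimal sketch (not taken from this ticket) of checking and growing
                the MDT's backing zpool; the pool name "mdt0pool" and the device names are placeholders:

                    zpool list -v mdt0pool                        # overall capacity and per-vdev usage
                    zfs list -o space mdt0pool                    # space used/available per dataset
                    zpool add mdt0pool mirror /dev/sdx /dev/sdy   # grow the pool by adding another mirror vdev
                    zpool set autoexpand=on mdt0pool              # or, after enlarging the existing devices,
                    zpool online -e mdt0pool /dev/sdx             # expand a vdev in place

                Which of the last three applies depends on how the pool was built; adding a vdev is the
                usual route when spare devices are available.
                -->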
                <environment>Lustre: Build Version: 2.8.0_5.chaos</environment>
                <key id="41229">LU-8787</key>
                <summary>zpool containing MDT0000 out of space</summary>
                <type id="9" iconUrl="https://jira.whamcloud.com/images/icons/issuetypes/undefined.png">Question/Request</type>
                <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                <status id="6" iconUrl="https://jira.whamcloud.com/images/icons/statuses/closed.png" description="The issue is considered finished, the resolution is correct. Issues which are closed can be reopened.">Closed</status>
                <statusCategory id="3" key="done" colorName="success"/>
                <resolution id="10000">Done</resolution>
                <assignee username="yong.fan">nasf</assignee>
                <reporter username="ofaaland">Olaf Faaland</reporter>
                <labels>
                    <label>llnl</label>
                </labels>
                <created>Tue, 1 Nov 2016 23:43:22 +0000</created>
                <updated>Thu, 2 Nov 2017 22:02:52 +0000</updated>
                <resolved>Thu, 2 Nov 2017 22:02:52 +0000</resolved>
                <due></due>
                <votes>0</votes>
                <watches>4</watches>
                <comments>
                            <comment id="171999" author="pjones" created="Wed, 2 Nov 2016 12:45:32 +0000"  >&lt;p&gt;Fan Yong&lt;/p&gt;

&lt;p&gt;Could you please look into this issue?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="172037" author="ofaaland" created="Wed, 2 Nov 2016 15:51:32 +0000"  >&lt;p&gt;I find that update_log_dir is taking 1.1T out of the 1.4T available.  Shall I create a separate ticket for that?  It seems far too large to me, but maybe I&apos;m wrong.&lt;/p&gt;</comment>
                            <comment id="172046" author="ofaaland" created="Wed, 2 Nov 2016 16:29:32 +0000"  >&lt;p&gt;There are 158 files in update_log_dir.&lt;br/&gt;
68  size&amp;gt;10GB&lt;br/&gt;
29  10GB &amp;gt; size &amp;gt;= 1GB&lt;br/&gt;
7    1GB &amp;gt; size &amp;gt;= 1M&lt;br/&gt;
44  size &amp;lt; 1M&lt;/p&gt;</comment>
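                <!--
                For reference, a rough way to reproduce a size histogram like the one above with GNU
                find and awk; the path /mnt/mdt0/update_log_dir is hypothetical and assumes the MDT
                backing store has been made browsable for inspection, which may not hold on every setup:

                    find /mnt/mdt0/update_log_dir -maxdepth 1 -type f -printf '%s\n' | awk '
                        { if ($1 >= 10 * 2^30) a++;        # >= 10GB
                          else if ($1 >= 2^30) b++;        # 1GB to 10GB
                          else if ($1 >= 2^20) c++;        # 1MB to 1GB
                          else                 d++ }       # < 1MB
                        END { printf ">=10GB %d\n1-10GB %d\n1MB-1GB %d\n<1MB %d\n", a, b, c, d }'
                -->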
                            <comment id="172086" author="ofaaland" created="Wed, 2 Nov 2016 21:21:48 +0000"  >&lt;p&gt;Created a separate ticket &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-8794&quot; title=&quot;update_log_dir consuming 1.1TB on MDT0000&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-8794&quot;&gt;&lt;del&gt;LU-8794&lt;/del&gt;&lt;/a&gt; for the large amount of space occupied by update_log_dir.  &lt;/p&gt;

&lt;p&gt;This ticket is only for the procedure to be followed when an MDT fills up, since it could happen in production and we need to know the procedure for recovering.&lt;/p&gt;

&lt;p&gt;thanks,&lt;br/&gt;
Olaf&lt;/p&gt;</comment>
                            <comment id="172102" author="yong.fan" created="Thu, 3 Nov 2016 03:16:27 +0000"  >&lt;p&gt;In the current DNE implementation, cross-MDT operations are recorded in detail as llog records under update_log_dir for recovery purposes. For most use cases the llog is append-only, so if there are many cross-MDT operations the llog can become huge. If you can describe the operations that ran before the out-of-space condition, that may help us judge the issue. Better still, any Lustre kernel debug logs from MDT0000 would help.&lt;/p&gt;</comment>
                            <comment id="172161" author="di.wang" created="Thu, 3 Nov 2016 14:11:39 +0000"  >&lt;p&gt;You can delete update_log* manually, as we did on &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-8753&quot; title=&quot;Recovery already passed deadline with DNE&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-8753&quot;&gt;&lt;del&gt;LU-8753&lt;/del&gt;&lt;/a&gt; (for which I still need further logs).&lt;br/&gt;
There is also another ticket, &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-8714&quot; title=&quot;too many update logs during soak-test.&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-8714&quot;&gt;LU-8714&lt;/a&gt;, about deleting update logs efficiently; I will try to work out a patch.&lt;/p&gt;</comment>
                            <comment id="172257" author="ofaaland" created="Fri, 4 Nov 2016 01:47:51 +0000"  >&lt;p&gt;Di,&lt;br/&gt;
Do you expect that the file system is still in a consistent state, and after deleting update_log* I don&apos;t need to do anything else?  That&apos;s my main question for this ticket.&lt;br/&gt;
thanks,&lt;br/&gt;
Olaf&lt;/p&gt;</comment>
                            <comment id="172259" author="ofaaland" created="Fri, 4 Nov 2016 01:49:04 +0000"  >&lt;p&gt;nasf,&lt;br/&gt;
I unfortunately don&apos;t know the workload at the time the MDT filled up, nor can I get debug logs.  The problem was discovered after the servers had been rebooted.&lt;/p&gt;</comment>
                            <comment id="172260" author="yong.fan" created="Fri, 4 Nov 2016 02:06:03 +0000"  >&lt;blockquote&gt;
&lt;p&gt;Do you expect that the file system is still in a consistent state, and after deleting update_log* I don&apos;t need to do anything else? That&apos;s my main question for this ticket.&lt;br/&gt;
...&lt;br/&gt;
I unfortunately don&apos;t know the workload at the time the MDT filled up, nor can I get debug logs. The problem was discovered after the servers had been rebooted.&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;Generally, the llog growing after the reboot means there is something to be recovered. But as long as recovery completed successfully after the reboot, your namespace should be in a consistent state even if you removed the llogs, unless there was already some inconsistency before the reboot (which should be a very rare case with a ZFS backend).&lt;/p&gt;</comment>
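                <!--
                A small sketch of checking that recovery did complete before removing the llogs, as
                this comment assumes; the target name lsh-MDT0000 comes from the log excerpt above,
                and the same check would be run for every MDT in the file system:

                    lctl get_param mdt.*.recovery_status          # run on each MDS
                    # "status: COMPLETE" means recovery finished; anything else (e.g. RECOVERING)
                    # means the update llogs may still be needed
                -->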
                            <comment id="172272" author="di.wang" created="Fri, 4 Nov 2016 04:47:25 +0000"  >&lt;p&gt;Olaf: &lt;br/&gt;
It might be affected, but as nasf said the chance is normally low. In any case, I would suggest you run lfsck to check and fix consistency after you delete the update logs.&lt;/p&gt;</comment>
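                <!--
                A minimal sketch of the lfsck run suggested here, assuming a 2.8+ lctl; the device
                name lsh-MDT0000 is taken from the log excerpt above:

                    lctl lfsck_start -M lsh-MDT0000 -t all               # start LFSCK (add -A to cover all MDTs in a DNE setup)
                    lctl get_param -n mdd.lsh-MDT0000.lfsck_namespace    # poll until "status: completed"
                    lctl get_param -n mdd.lsh-MDT0000.lfsck_layout
                -->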
                            <comment id="175082" author="yong.fan" created="Sat, 26 Nov 2016 09:07:16 +0000"  >&lt;p&gt;Olaf,&lt;br/&gt;
Have you tried removing the huge llogs as Wangdi suggested? Any feedback on that?&lt;br/&gt;
Thanks&lt;/p&gt;</comment>
                            <comment id="212713" author="ofaaland" created="Thu, 2 Nov 2017 22:02:52 +0000"  >&lt;p&gt;The basic advice that we should delete the update logs and then run lfsck is a sufficient answer.&lt;/p&gt;

&lt;p&gt;This occurred during DNE2 testing with Lustre 2.8, which we have decided not to pursue any further. Instead we will test DNE2 when we start testing Lustre 2.10.x. So we will only be able to test this advice if we encounter the problem again, and in that case we will file a new ticket.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="41010">LU-8753</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                    <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzyu3r:</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>