<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:22:52 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-9055] MDS crash due to changelog being full</title>
                <link>https://jira.whamcloud.com/browse/LU-9055</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Hello, &lt;/p&gt;

&lt;p&gt;We enabled changelogs on our MDS server to prepare for using robinhood. After making the change we began setting up robinhood, got it working well for reporting, and then looked at setting up an HSM scenario where, for one test user, we&apos;d move files older than a given age to an alternate filesystem. There were no errors in the lustre logs relating to robinhood until we enabled the robinhood policy. At the exact moment we enabled the policy, this error started being written to the lustre log: &lt;/p&gt;

&lt;p&gt;Jan 18 14:02:57 mds1 kernel: Lustre: 14021:0:(llog_cat.c:817:llog_cat_process_or_fork()) catlog 0x6:10 crosses index zero &lt;/p&gt;

&lt;p&gt;This message repeated every 10 minutes (I presume driven by robinhood) for 4 days, until the MDS system crashed. After powering the MDS back up we saw: &lt;/p&gt;

&lt;p&gt;Jan 22 20:37:59 mds1 kernel: LDISKFS-fs warning (device dm-2): ldiskfs_multi_mount_protect: MMP interval 42 higher than expected, please wait. &lt;br/&gt;
Jan 22 20:37:59 mds1 kernel: &lt;br/&gt;
Jan 22 20:38:41 mds1 kernel: LDISKFS-fs (dm-2): recovery complete &lt;br/&gt;
Jan 22 20:38:41 mds1 kernel: LDISKFS-fs (dm-2): mounted filesystem with ordered data mode. quota=on. Opts: &lt;br/&gt;
Jan 22 20:38:42 mds1 kernel: Lustre: MGS: Connection restored to MGC172.16.3.231@o2ib_0 (at 0@lo)&lt;br/&gt;
Jan 22 20:38:42 mds1 kernel: Lustre: 2956:0:(llog_cat.c:924:llog_cat_reverse_process()) catalog 0x6:10 crosses index zero &lt;br/&gt;
Jan 22 20:38:42 mds1 kernel: Lustre: blizzard-MDD0000: changelog on &lt;br/&gt;
Jan 22 20:38:42 mds1 kernel: Lustre: blizzard-MDT0000: Will be in recovery for at least 5:00, or until 26 clients reconnect &lt;br/&gt;
Jan 22 20:38:42 mds1 kernel: Lustre: blizzard-MDT0000: Denying connection for new client f95bf979-9378-6cb7-798c-b5dfbec3221a(at 172.16.3.106@o2ib), waiting for 26 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 5:00 &lt;br/&gt;
Jan 22 20:44:42 mds1 kernel: Lustre: blizzard-MDT0000: recovery is timed out, evict stale exports &lt;br/&gt;
Jan 22 20:44:42 mds1 kernel: Lustre: blizzard-MDT0000: disconnecting 1 stale clients &lt;br/&gt;
Jan 22 20:44:42 mds1 kernel: Lustre: 3256:0:(llog_cat.c:93:llog_cat_new_log()) blizzard-MDD0000: there are no more free slots in catalog &lt;/p&gt;

&lt;p&gt;And then, a bit later, the MDS would crash. This happened 3 more times on subsequent attempts to start the MDS server. We ran the following commands to turn off the changelogs and attempt to clear them: &lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;root@mds1 ~&amp;#93;&lt;/span&gt;# lctl --device blizzard-MDT0000 changelog_deregister cl1 &lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;root@mds1 ~&amp;#93;&lt;/span&gt;# lctl get_param mdd.blizzard-MDT0000.changelog_users &lt;br/&gt;
mdd.blizzard-MDT0000.changelog_users= &lt;br/&gt;
current index: 12720211297 &lt;br/&gt;
ID index &lt;/p&gt;

&lt;p&gt;The shell we launched the command from eventually exited, but lctl continued running. The lctl command has now been running for 70+ hrs ... I have no idea if it&apos;s normal for it to run this long. I also saw the following in /tmp/lustre-log1485199531.6179, which was written about 2 hrs after the above command was run and hasn&apos;t been written to since: &lt;/p&gt;

&lt;p&gt;llog_cat.c:llog_cat_process_cb() processing log 0x21f25:1:0 at index 3513 of catalog 0x6:10 &lt;br/&gt;
llog_cat.c:llog_cat_cleanup() cancel plain log at index 3513 of catalog 0x6:10 &lt;br/&gt;
llog_cat.c:llog_cat_process_cb() processing log 0x21f26:1:0 at index 3514 of catalog 0x6:10 &lt;/p&gt;

&lt;p&gt;Let me know if you need more information!&lt;/p&gt;</description>
                <environment>RHEL 6.7, 2.6.32-573.12.1.el6_lustre.x86_64, o2ib, redhat ofed, 165TB spread over 7OSS/OST</environment>
        <key id="43359">LU-9055</key>
            <summary>MDS crash due to changelog being full</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="2" iconUrl="https://jira.whamcloud.com/images/icons/priorities/critical.svg">Critical</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="bfaccini">Bruno Faccini</assignee>
                                    <reporter username="cbc">christopher coffey</reporter>
                        <labels>
                    </labels>
                <created>Thu, 26 Jan 2017 14:55:26 +0000</created>
                <updated>Wed, 1 Mar 2017 12:00:32 +0000</updated>
                                            <version>Lustre 2.8.0</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>4</watches>
                                                                            <comments>
                            <comment id="182267" author="bfaccini" created="Thu, 26 Jan 2017 15:48:50 +0000"  >&lt;p&gt;Hello,&lt;br/&gt;
According to your description, it is unclear to me whether you had been running with ChangeLogs enabled and no reader (only RobinHood?) registered for a long period of time, and/or whether a period of very heavy Lustre activity may have occurred.&lt;br/&gt;
My best guess is that there is an orphan LLOG entry in the ChangeLog catalog that has caused a wrap and then a full situation...&lt;/p&gt;

&lt;p&gt;Can you debugfs the MDT device and dump the changelog_catalog and changelog_users files?&lt;/p&gt;
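
&lt;p&gt;For reference, a read-only dump could be done along these lines (a sketch only; the dm-2 device name is taken from the MMP warning in your log and may differ on your system, and the /tmp output paths are arbitrary): &lt;/p&gt;

&lt;p&gt;# open the MDT device read-only and copy both files out for analysis &lt;br/&gt;
debugfs -c -R &apos;dump /changelog_catalog /tmp/changelog_catalog&apos; /dev/dm-2 &lt;br/&gt;
debugfs -c -R &apos;dump /changelog_users /tmp/changelog_users&apos; /dev/dm-2 &lt;/p&gt;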

&lt;p&gt;Also, the last msgs you have provided seem to indicate that the ChangeLogs cleanup is on-going. But if you don&apos;t care about the current ChangeLogs content and want to start from a clean state, you can umount/stop the MDT, remount it as ldiskfs, then move/backup+remove the changelog_catalog and changelog_users files to a safe place for further analysis, and then restart/remount the MDT.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="15097">LU-1586</issuekey>
        </issuelink>
                            </outwardlinks>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="32827">LU-7340</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzz1u7:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10023"><![CDATA[4]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>