<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:04:44 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-6954] LustreError: 12934:0:(mdd_device.c:305:mdd_changelog_llog_init()) fsrzb-MDD0000: changelog init failed: rc = -5 </title>
                <link>https://jira.whamcloud.com/browse/LU-6954</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;lustre-2.5.4-4chaos_2.6.32_504.16.2.1chaos.ch5.3.x86_64.x86_64&lt;/p&gt;

&lt;p&gt;The MDS service on both porter and stout fails to start.  The ZFS pool imports on both systems with no problem, and the MGS device mounts with no problem, but the MDT fails to mount on both systems.  Doing a &quot;writeconf&quot; on the stout MDS did not help.  The following messages were reported on the stout-mds1 console: &lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;2015-08-02 16:38:26 Lustre: Lustre: Build Version: 2.5.4-4chaos-4chaos--PRISTINE-2.6.32-504.16.2.1chaos.ch5.3.x86_64
2015-08-02 16:38:27 Lustre: MGC172.21.1.99@o2ib200: Connection restored to MGS (at 0@lo)
2015-08-02 16:38:28 Lustre: MGS: Logs for fs fsrzb were removed by user request.  All servers must be restarted in order to regenerate the logs.
2015-08-02 16:38:30 LustreError: 11-0: fsrzb-MDT0000-lwp-MDT0000: Communicating with 0@lo, operation mds_connect failed with -11.
2015-08-02 16:38:31 Lustre: 12934:0:(llog_cat.c:718:llog_cat_reverse_process()) catalog 0x2:10 crosses index zero
2015-08-02 16:38:31 LustreError: 12934:0:(mdd_device.c:305:mdd_changelog_llog_init()) fsrzb-MDD0000: changelog init failed: rc = -5
2015-08-02 16:38:31 LustreError: 12934:0:(mdd_device.c:380:mdd_changelog_init()) fsrzb-MDD0000: changelog setup during init failed: rc = -5
2015-08-02 16:38:31 LustreError: 12934:0:(mdd_device.c:963:mdd_prepare()) fsrzb-MDD0000: failed to initialize changelog: rc = -5
2015-08-02 16:38:31 Lustre: fsrzb-MDT0000: Unable to start target: -5
2015-08-02 16:38:31 Lustre: Failing over fsrzb-MDT0000
2015-08-02 16:38:32 Lustre: server umount fsrzb-MDT0000 complete
2015-08-02 16:38:32 LustreError: 12934:0:(obd_mount.c:1331:lustre_fill_super()) Unable to mount  (-5)
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;A workaround was found to allow the MDT to mount:&lt;br/&gt;
Mount the MDT via ZPL&lt;br/&gt;
Delete the changelog_catalog and changelog_users files&lt;br/&gt;
Unmount&lt;br/&gt;
Mount the MDT via lustre in the normal manner&lt;/p&gt;</description>
                <environment>lustre-2.5.4-4chaos_2.6.32_504.16.2.1chaos.ch5.3.x86_64.x86_64</environment>
        <key id="31350">LU-6954</key>
            <summary>LustreError: 12934:0:(mdd_device.c:305:mdd_changelog_llog_init()) fsrzb-MDD0000: changelog init failed: rc = -5 </summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="2" iconUrl="https://jira.whamcloud.com/images/icons/priorities/critical.svg">Critical</priority>
                        <status id="6" iconUrl="https://jira.whamcloud.com/images/icons/statuses/closed.png" description="The issue is considered finished, the resolution is correct. Issues which are closed can be reopened.">Closed</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="bfaccini">Bruno Faccini</assignee>
                                    <reporter username="ofaaland">Olaf Faaland</reporter>
                        <labels>
                            <label>llnl</label>
                    </labels>
                <created>Tue, 4 Aug 2015 17:09:54 +0000</created>
                <updated>Fri, 19 Feb 2016 19:10:57 +0000</updated>
                            <resolved>Fri, 19 Feb 2016 19:10:57 +0000</resolved>
                                    <version>Lustre 2.5.4</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>6</watches>
                                                                            <comments>
                            <comment id="123221" author="bfaccini" created="Tue, 4 Aug 2015 17:15:37 +0000"  >&lt;p&gt;Hello Olaf,&lt;br/&gt;
Did you keep a copy of the changelog_catalog and changelog_users files that you can provide?&lt;/p&gt;</comment>
                            <comment id="123227" author="ofaaland" created="Tue, 4 Aug 2015 17:27:02 +0000"  >&lt;p&gt;Bruno,&lt;br/&gt;
Sorry, yes, attached now.&lt;/p&gt;</comment>
                            <comment id="123269" author="bfaccini" created="Tue, 4 Aug 2015 22:30:22 +0000"  >&lt;p&gt;Thanks Olaf, and here is what I can tell after analyzing the changelog_catalog file you have provided.&lt;br/&gt;
I am not able to confirm whether the Lustre v2.5.4 build you are running contains the patch from &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4528&quot; title=&quot;osd_trans_exec_op()) ASSERTION( oti-&amp;gt;oti_declare_ops_rb[rb] &amp;gt; 0 ) failed: rb = 0&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4528&quot;&gt;&lt;del&gt;LU-4528&lt;/del&gt;&lt;/a&gt; (&lt;a href=&quot;http://review.whamcloud.com/#/c/10108/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/10108/&lt;/a&gt;, Commit 7c243a561ffe8503a6abf5c4cafef0c3566192bc). Can you check this for me?&lt;br/&gt;
If it does, and since your changelog_catalog had just reached its end and was about to loop back, I think you likely encountered the same kind of regression described in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6556&quot; title=&quot;changelog catalog corruption if all possible records is define &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6556&quot;&gt;&lt;del&gt;LU-6556&lt;/del&gt;&lt;/a&gt;.&lt;/p&gt;</comment>
                            <comment id="123270" author="morrone" created="Tue, 4 Aug 2015 22:39:00 +0000"  >&lt;p&gt;I find it exceptionally unlikely that two different filesystems had both independently &quot;just reached&quot; the end of the changelog_catalog and were just about to loop back at exactly the same time.  More explanation is needed.&lt;/p&gt;</comment>
                            <comment id="123274" author="bfaccini" created="Tue, 4 Aug 2015 23:23:57 +0000"  >&lt;p&gt;Hello Chris, thanks for pointing out that two filesystems are affected. I should have read the description text for this ticket more carefully, sorry about that.&lt;/p&gt;

&lt;p&gt;But if the same symptoms and messages occurred for both filesystem failures, I can already confirm that the &quot;crosses index zero&quot; message indicates a Catalog loop-back. I will also need the changelog_catalog file from the 2nd filesystem to analyze it.&lt;br/&gt;
We may have ended up in a situation where both Catalogs have looped back, with only the 1st one just doing so ... and this since the last filesystem restarts.&lt;/p&gt;

&lt;p&gt;And BTW, I have double-checked the 1st Catalog you provided and I can also confirm that it shows the same corruption (Catalog records written past the normal end, leading to a Catalog size &amp;gt; header+bitmap+records) as was found for &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6556&quot; title=&quot;changelog catalog corruption if all possible records is define &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6556&quot;&gt;&lt;del&gt;LU-6556&lt;/del&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;As for my &quot;just reached&quot; comment, it may stem from the fact that, for a reason not yet explained, bits at the beginning of the bitmap have been cleared (or perhaps were never set).&lt;/p&gt;</comment>
                            <comment id="123277" author="ofaaland" created="Wed, 5 Aug 2015 00:40:33 +0000"  >&lt;p&gt;Bruno,&lt;/p&gt;

&lt;p&gt;I confirmed that both filesystems produced the same sequence of error messages when attempting to start the MDT, including the &quot;crosses index zero&quot; and &quot;changelog init failed&quot; messages, same rc&apos;s.&lt;/p&gt;

&lt;p&gt;We do have the &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4528&quot; title=&quot;osd_trans_exec_op()) ASSERTION( oti-&amp;gt;oti_declare_ops_rb[rb] &amp;gt; 0 ) failed: rb = 0&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4528&quot;&gt;&lt;del&gt;LU-4528&lt;/del&gt;&lt;/a&gt; patch in our build.&lt;/p&gt;

&lt;p&gt;I&apos;ll attach the second changelog_catalog.  The one you&apos;ve already seen is from porter.&lt;/p&gt;</comment>
                            <comment id="124118" author="ofaaland" created="Fri, 14 Aug 2015 01:07:37 +0000"  >&lt;p&gt;Bruno,&lt;/p&gt;

&lt;p&gt;Do you have any updates on this?  I see that the stout catalog file contains 67272 records, and it looks like the bitmap has only 64767 bits for tracking the status of the non-llog_log_hdr records.  So it does seem to me that the changelog_catalog file is corrupt.&lt;/p&gt;

&lt;p&gt;The records that appear after that have indices in the range 12197 - 14701, which seems odd.  The code in llog_osd_prev_block() appears to me to assume that the records within a block have monotonically increasing indices, since only lrt_index is generally checked before deciding whether to read another block from disk or not.  Am I correct that increasing indices are a requirement?&lt;/p&gt;

&lt;p&gt;thanks,&lt;br/&gt;
Olaf&lt;/p&gt;</comment>
                            <comment id="124209" author="bfaccini" created="Sat, 15 Aug 2015 01:45:34 +0000"  >&lt;p&gt;Olaf,&lt;br/&gt;
Thanks for providing the second ChangeLog catalog file for stout.&lt;br/&gt;
You are correct in your analysis of its already looped-back and corrupted content, which is again the same as for porter, and the same as what was already investigated in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6556&quot; title=&quot;changelog catalog corruption if all possible records is define &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6556&quot;&gt;&lt;del&gt;LU-6556&lt;/del&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;So the likely scenario for both filesystems is that, before you started running with the patch for &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4528&quot; title=&quot;osd_trans_exec_op()) ASSERTION( oti-&amp;gt;oti_declare_ops_rb[rb] &amp;gt; 0 ) failed: rb = 0&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4528&quot;&gt;&lt;del&gt;LU-4528&lt;/del&gt;&lt;/a&gt;, both ChangeLog catalogs had already looped back, but the patch then caused new records to be appended at the end of each file instead of updating the corresponding records in place. The corruption was detected at the next filesystem restart.&lt;/p&gt;

&lt;p&gt;Am I right to suspect that the problem occurred during the 2nd restart of each filesystem since upgrading to a build with the &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4528&quot; title=&quot;osd_trans_exec_op()) ASSERTION( oti-&amp;gt;oti_declare_ops_rb[rb] &amp;gt; 0 ) failed: rb = 0&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4528&quot;&gt;&lt;del&gt;LU-4528&lt;/del&gt;&lt;/a&gt; patch?&lt;/p&gt;

</comment>
                            <comment id="124611" author="ofaaland" created="Wed, 19 Aug 2015 17:14:01 +0000"  >&lt;p&gt;Bruno,&lt;/p&gt;

&lt;p&gt;Yes, it is likely we encountered this error on the 2nd restart after upgrading with &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4528&quot; title=&quot;osd_trans_exec_op()) ASSERTION( oti-&amp;gt;oti_declare_ops_rb[rb] &amp;gt; 0 ) failed: rb = 0&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4528&quot;&gt;&lt;del&gt;LU-4528&lt;/del&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I am looking at the code in llog_osd_write_rec() and your patch for &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6556&quot; title=&quot;changelog catalog corruption if all possible records is define &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6556&quot;&gt;&lt;del&gt;LU-6556&lt;/del&gt;&lt;/a&gt;.  I see that in the old code, lgi_off is set to la_size unconditionally.&lt;/p&gt;

&lt;p&gt;thanks,&lt;br/&gt;
Olaf &lt;/p&gt;</comment>
                            <comment id="124653" author="ofaaland" created="Wed, 19 Aug 2015 20:49:31 +0000"  >&lt;p&gt;Bruno,&lt;/p&gt;

&lt;p&gt;I can see how we could end up with the changelog_catalog file corruption if, before we upgraded to &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4528&quot; title=&quot;osd_trans_exec_op()) ASSERTION( oti-&amp;gt;oti_declare_ops_rb[rb] &amp;gt; 0 ) failed: rb = 0&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4528&quot;&gt;&lt;del&gt;LU-4528&lt;/del&gt;&lt;/a&gt; code, our changelog_catalog was already wrapped around, so that lgh_last_idx == 12196 and changelog_catalog size == 4,153,280.  I think this is what you are saying happened.&lt;/p&gt;

&lt;p&gt;However, in the &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4528&quot; title=&quot;osd_trans_exec_op()) ASSERTION( oti-&amp;gt;oti_declare_ops_rb[rb] &amp;gt; 0 ) failed: rb = 0&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4528&quot;&gt;&lt;del&gt;LU-4528&lt;/del&gt;&lt;/a&gt; patch, and in the previous code it applied to, I don&apos;t see anything implementing changelog_catalog wrap-around, that is, setting lgh_last_idx in some way other than incrementing it or setting it to 0 when creating changelog_catalog for the first time.  Do you?&lt;/p&gt;

&lt;p&gt;thanks,&lt;br/&gt;
Olaf&lt;/p&gt;
</comment>
                            <comment id="124658" author="ofaaland" created="Wed, 19 Aug 2015 22:05:51 +0000"  >&lt;p&gt;I guess what I&apos;m really asking is, did changelog_catalogs wrap around prior to the &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4528&quot; title=&quot;osd_trans_exec_op()) ASSERTION( oti-&amp;gt;oti_declare_ops_rb[rb] &amp;gt; 0 ) failed: rb = 0&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4528&quot;&gt;&lt;del&gt;LU-4528&lt;/del&gt;&lt;/a&gt; patch?  Some comment made me think so, maybe I misunderstood.&lt;/p&gt;

&lt;p&gt;thanks,&lt;br/&gt;
Olaf&lt;/p&gt;</comment>
                            <comment id="124661" author="ofaaland" created="Wed, 19 Aug 2015 23:00:00 +0000"  >&lt;p&gt;Bruno,&lt;/p&gt;

&lt;p&gt;Looks to me like this code in llog_cat_new_log() implemented wrap-around before &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4528&quot; title=&quot;osd_trans_exec_op()) ASSERTION( oti-&amp;gt;oti_declare_ops_rb[rb] &amp;gt; 0 ) failed: rb = 0&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4528&quot;&gt;&lt;del&gt;LU-4528&lt;/del&gt;&lt;/a&gt;.  Please confirm I&apos;m not misreading.&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;        bitmap_size = LLOG_BITMAP_SIZE(llh);

        index = (cathandle-&amp;gt;lgh_last_idx + 1) % bitmap_size;
...
        cathandle-&amp;gt;lgh_last_idx = index;
        llh-&amp;gt;llh_tail.lrt_index = index;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;thanks,&lt;br/&gt;
Olaf&lt;/p&gt;</comment>
                            <comment id="124663" author="bfaccini" created="Wed, 19 Aug 2015 23:35:18 +0000"  >&lt;p&gt;Olaf,&lt;br/&gt;
In short, Catalog wrap-around worked before the &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4528&quot; title=&quot;osd_trans_exec_op()) ASSERTION( oti-&amp;gt;oti_declare_ops_rb[rb] &amp;gt; 0 ) failed: rb = 0&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4528&quot;&gt;&lt;del&gt;LU-4528&lt;/del&gt;&lt;/a&gt; patch and no longer works after it. In addition, a Catalog that had already wrapped around will become corrupted, with new records being written past the Catalog&apos;s expected normal end-of-file size.&lt;/p&gt;</comment>
                            <comment id="135502" author="bfaccini" created="Tue, 8 Dec 2015 15:39:52 +0000"  >&lt;p&gt;Hello Olaf,&lt;br/&gt;
Do you agree that this ticket can be closed as a dup of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6556&quot; title=&quot;changelog catalog corruption if all possible records is define &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6556&quot;&gt;&lt;del&gt;LU-6556&lt;/del&gt;&lt;/a&gt;?&lt;br/&gt;
Thanks again and in advance for your help and answer.&lt;/p&gt;</comment>
                            <comment id="135614" author="ofaaland" created="Wed, 9 Dec 2015 01:37:20 +0000"  >&lt;p&gt;Hi Bruno,&lt;br/&gt;
Yes, I agree this should be closed as a dup of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6556&quot; title=&quot;changelog catalog corruption if all possible records is define &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6556&quot;&gt;&lt;del&gt;LU-6556&lt;/del&gt;&lt;/a&gt;.  Thank you.&lt;br/&gt;
-Olaf&lt;/p&gt;</comment>
                            <comment id="136125" author="jfc" created="Fri, 11 Dec 2015 23:53:52 +0000"  >&lt;p&gt;Thanks Bruno and Olaf.&lt;br/&gt;
~ jfc.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="29826">LU-6556</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="30350">LU-6634</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="18568" name="changelog_catalog" size="4161088" author="ofaaland" created="Tue, 4 Aug 2015 17:26:31 +0000"/>
                            <attachment id="18570" name="changelog_catalog.stout" size="4313600" author="ofaaland" created="Wed, 5 Aug 2015 00:40:57 +0000"/>
                            <attachment id="18567" name="changelog_users" size="8448" author="ofaaland" created="Tue, 4 Aug 2015 17:26:20 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzxjr3:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>