<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:05:13 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-7010] &quot;Local llog found corrupted&quot; during DNE2 recovery</title>
                <link>https://jira.whamcloud.com/browse/LU-7010</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Recent recovery issues in Maloo show the following:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;00000040:00020000:1.0:1439669743.543046:0:5740:0:(llog.c:489:llog_process_thread()) Local llog found corrupted
00000040:00100000:1.0:1439669743.545890:0:5740:0:(llog.c:167:llog_cancel_rec()) Canceling 1 in log 0x1:1024
00000040:00100000:1.0:1439669743.546205:0:5740:0:(llog.c:167:llog_cancel_rec()) Canceling 64838 in log 0x1:1024
00000040:00100000:1.0:1439669743.546229:0:5740:0:(llog.c:167:llog_cancel_rec()) Canceling 64864 in log 0x1:1024
00000040:00100000:1.0:1439669743.546242:0:5740:0:(llog.c:167:llog_cancel_rec()) Canceling 64896 in log 0x1:1024
00000040:00100000:1.0:1439669743.546254:0:5740:0:(llog.c:167:llog_cancel_rec()) Canceling 64897 in log 0x1:1024
00000040:00100000:1.0:1439669743.546267:0:5740:0:(llog.c:167:llog_cancel_rec()) Canceling 64899 in log 0x1:1024
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;As far as I can see, DNE2 &apos;update recovery&apos; may return -EIO if some update fails to apply. That causes the whole llog processing to stop and all remaining updates to be cancelled. After that, recovery stops with various errors.&lt;/p&gt;

&lt;p&gt;Here is an example, test_70b:&lt;br/&gt;
&lt;a href=&quot;https://testing.hpdd.intel.com/test_sets/1a8282a6-43d8-11e5-a4bc-5254006e85c2&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://testing.hpdd.intel.com/test_sets/1a8282a6-43d8-11e5-a4bc-5254006e85c2&lt;/a&gt;&lt;/p&gt;</description>
                <environment></environment>
        <key id="31496">LU-7010</key>
            <summary>&quot;Local llog found corrupted&quot; during DNE2 recovery</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="2" iconUrl="https://jira.whamcloud.com/images/icons/priorities/critical.svg">Critical</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="3">Duplicate</resolution>
                                        <assignee username="di.wang">Di Wang</assignee>
                                    <reporter username="tappro">Mikhail Pershin</reporter>
                        <labels>
                    </labels>
                <created>Sun, 16 Aug 2015 07:38:19 +0000</created>
                <updated>Mon, 21 Sep 2015 05:25:05 +0000</updated>
                            <resolved>Mon, 21 Sep 2015 05:25:05 +0000</resolved>
                                                    <fixVersion>Lustre 2.8.0</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>4</watches>
                                                                            <comments>
                            <comment id="127284" author="pjones" created="Mon, 14 Sep 2015 20:39:00 +0000"  >&lt;p&gt;Di is taking care of this one&lt;/p&gt;</comment>
                            <comment id="127305" author="di.wang" created="Tue, 15 Sep 2015 05:25:38 +0000"  >&lt;p&gt;Hmm, it seems to me that this corruption is related to the unlanded patch &lt;a href=&quot;http://review.whamcloud.com/#/c/14912/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/14912/&lt;/a&gt;. That patch did change something inside llog, so the failure is quite possibly specific to it.&lt;br/&gt;
Mike, did you see this failure happen on other patches? If not, I will close this ticket.&lt;/p&gt;</comment>
                            <comment id="127332" author="tappro" created="Tue, 15 Sep 2015 13:20:11 +0000"  >&lt;p&gt;I see that this problem is not related to that patch; it already exists in the code. See, llog_process_thread() has old code for corruption handling:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;	&lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (unlikely(rc == -EIO &amp;amp;&amp;amp; loghandle-&amp;gt;lgh_obj != NULL)) {
		/* something bad happened to the processing of a local
		 * llog file, probably I/O error or the log got corrupted..
		 * to be able to &lt;span class=&quot;code-keyword&quot;&gt;finally&lt;/span&gt; release the log we discard any
		 * remaining bits in the header */
		CERROR(&lt;span class=&quot;code-quote&quot;&gt;&quot;Local llog found corrupted\n&quot;&lt;/span&gt;);
		&lt;span class=&quot;code-keyword&quot;&gt;while&lt;/span&gt; (index &amp;lt;= last_index) {
			&lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (ext2_test_bit(index, LLOG_HDR_BITMAP(llh)) != 0)
				llog_cancel_rec(lpi-&amp;gt;lpi_env, loghandle, index);
			index++;
		}
		rc = 0;
	}
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
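
&lt;p&gt;To make the effect of the branch above concrete, here is a minimal Python sketch. It is a toy model, not Lustre code: process_llog, failing_update and the dict used as a bitmap are hypothetical names made up for illustration. It shows what happens when -EIO comes from a record callback rather than from reading the log:&lt;/p&gt;

```python
import errno


def process_llog(bitmap, callback):
    """Toy model of the -EIO branch quoted above (not the real llog code).

    bitmap:   dict mapping record index to True while the record is live.
    callback: called once per record; returns 0 or a negative errno.
    Returns (rc, cancelled_indices).
    """
    cancelled = []
    indices = sorted(bitmap)
    for pos, index in enumerate(indices):
        rc = callback(index)
        if rc == -errno.EIO:
            # The handler cannot tell an unreadable log from a callback
            # that merely failed: every remaining live record is cancelled
            # and the error is swallowed.
            for rest in indices[pos:]:
                if bitmap[rest]:
                    bitmap[rest] = False
                    cancelled.append(rest)
            return 0, cancelled
        if rc != 0:
            # Any other error stops processing but keeps the records.
            return rc, cancelled
    return 0, cancelled


def failing_update(index):
    # Hypothetical update-replay callback: record 2 cannot be applied.
    return -errno.EIO if index == 2 else 0


live = {1: True, 2: True, 3: True, 4: True}
rc, cancelled = process_llog(live, failing_update)
print(rc, cancelled)  # 0 [2, 3, 4]
```

&lt;p&gt;In this toy model a callback that returns a different error, for example -EINVAL, stops processing but leaves records 3 and 4 in the log, so they could still be replayed later.&lt;/p&gt;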

&lt;p&gt;Meanwhile, update_recovery.c introduced new callbacks that may return -EIO. Technically, that doesn&apos;t mean the llog itself is corrupted, so we shouldn&apos;t cancel all the other llog records. I think we shouldn&apos;t use the EIO error code there at all.&lt;/p&gt;</comment>
                            <comment id="127426" author="di.wang" created="Wed, 16 Sep 2015 00:09:37 +0000"  >&lt;p&gt;Mike: I understand the corruption-checking code is already there. What I mean is that the cause of this corruption is probably related to the change, since this &quot;Local llog found corrupted&quot; error seems to happen on every run of that patch, and I did not see it on other patches. Am I missing something? Do you know why it returned -EIO? Could you please explain here? Thanks.&lt;/p&gt;</comment>
                            <comment id="127446" author="tappro" created="Wed, 16 Sep 2015 06:30:45 +0000"  >&lt;p&gt;It does not happen only with that patch; in fact, I don&apos;t even see how it can be related. Meanwhile, it looks like a duplicate of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6844&quot; title=&quot;replay-single test 70b failure: &amp;#39;rundbench load on * failed!&amp;#39;&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6844&quot;&gt;&lt;del&gt;LU-6844&lt;/del&gt;&lt;/a&gt;; some reports there show the same problem with a corrupted log. E.g. check report &lt;a href=&quot;https://testing.hpdd.intel.com/test_sets/1a8282a6-43d8-11e5-a4bc-5254006e85c2&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://testing.hpdd.intel.com/test_sets/1a8282a6-43d8-11e5-a4bc-5254006e85c2&lt;/a&gt;; the MDS1 console log contains:&lt;br/&gt;
20:15:58:LustreError: 5740:0:(llog.c:489:llog_process_thread()) Local llog found corrupted&lt;/p&gt;

&lt;p&gt;I am not sure about the reason, but I see that -EIO from llog callbacks will also cause this message and the llog cancelling. Maybe the update llog processing callback returns -EIO?&lt;/p&gt;</comment>
                            <comment id="127461" author="di.wang" created="Wed, 16 Sep 2015 10:00:13 +0000"  >&lt;p&gt;IMHO, this EIO usually comes from llog_osd_next_block(), which means the local llog is indeed corrupted. Hmm, &lt;a href=&quot;https://testing.hpdd.intel.com/test_sets/1a8282a6-43d8-11e5-a4bc-5254006e85c2&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://testing.hpdd.intel.com/test_sets/1a8282a6-43d8-11e5-a4bc-5254006e85c2&lt;/a&gt; is also from patch 14912; am I missing something?&lt;br/&gt;
Besides, there is an obvious mistake in 14912 that might corrupt the update llog.&lt;/p&gt;</comment>
                            <comment id="127462" author="di.wang" created="Wed, 16 Sep 2015 10:42:39 +0000"  >&lt;p&gt;Hmm, I checked the failures in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6844&quot; title=&quot;replay-single test 70b failure: &amp;#39;rundbench load on * failed!&amp;#39;&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6844&quot;&gt;&lt;del&gt;LU-6844&lt;/del&gt;&lt;/a&gt;, and I do not think they are related. Most failures there are due to &quot;No space left&quot;.&lt;/p&gt;</comment>
                            <comment id="127504" author="tappro" created="Wed, 16 Sep 2015 15:48:20 +0000"  >&lt;p&gt;what mistake in 14912 do you mean, could you explain, about llh_size? If some of the reports in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6844&quot; title=&quot;replay-single test 70b failure: &amp;#39;rundbench load on * failed!&amp;#39;&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6844&quot;&gt;&lt;del&gt;LU-6844&lt;/del&gt;&lt;/a&gt; also come from patch 14912, then it could be the reason, I agree. Let&apos;s wait for the updated patch first.&lt;/p&gt;</comment>
                            <comment id="127518" author="di.wang" created="Wed, 16 Sep 2015 17:03:37 +0000"  >&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;what mistake in 14912 do you mean, could you explain, about llh_size? 
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Yes, it cannot use llh_size to calculate the write offset, because for the update log llh_size is still &amp;gt; 0 even though the update records are NOT fixed size. So if the write offset is wrong, the new write will ruin the llog anyway.&lt;/p&gt;</comment>
                            <comment id="127945" author="di.wang" created="Mon, 21 Sep 2015 05:24:21 +0000"  >&lt;p&gt;This is clearly caused by patch &lt;a href=&quot;http://review.whamcloud.com/#/c/14912/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/14912/&lt;/a&gt;, which has not landed yet. I will close this one to avoid duplicate effort.&lt;/p&gt;</comment>
                            <comment id="127946" author="di.wang" created="Mon, 21 Sep 2015 05:25:05 +0000"  >&lt;p&gt;Duplicate of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6556&quot; title=&quot;changelog catalog corruption if all possible records is define &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6556&quot;&gt;&lt;del&gt;LU-6556&lt;/del&gt;&lt;/a&gt;&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="31497">LU-7011</issuekey>
        </issuelink>
                            </outwardlinks>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="26901">LU-5716</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzxklb:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>