<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:05:14 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-7011] Kernel part of llog subsystem can do self-repairing in some cases</title>
                <link>https://jira.whamcloud.com/browse/LU-7011</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;While working on the &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6696&quot; title=&quot;ASSERTION( rc == 0 || rc == LLOG_PROC_BREAK ) failed: 0 changes, 0 in progress, 0 in flight: -5&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6696&quot;&gt;&lt;del&gt;LU-6696&lt;/del&gt;&lt;/a&gt; ticket, a tool to repair corrupted llog catalogs was introduced. The same job could be done in kernel code to repair llogs online where possible.&lt;/p&gt;</description>
                <environment></environment>
        <key id="31497">LU-7011</key>
            <summary>Kernel part of llog subsystem can do self-repairing in some cases</summary>
                <type id="4" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11310&amp;avatarType=issuetype">Improvement</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="tappro">Mikhail Pershin</assignee>
                                    <reporter username="tappro">Mikhail Pershin</reporter>
                        <labels>
                    </labels>
                <created>Sun, 16 Aug 2015 11:34:09 +0000</created>
                <updated>Tue, 20 Oct 2020 14:32:58 +0000</updated>
                                            <version>Lustre 2.8.0</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>8</watches>
                                                                            <comments>
                            <comment id="124249" author="tappro" created="Sun, 16 Aug 2015 11:40:26 +0000"  >&lt;p&gt;I think we need both the tool and online repair. Let&apos;s start with the tool for now.&lt;/p&gt;</comment>
                            <comment id="127265" author="adilger" created="Mon, 14 Sep 2015 18:24:18 +0000"  >&lt;p&gt;Tool is &lt;a href=&quot;http://review.whamcloud.com/15245&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/15245&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="129222" author="adilger" created="Fri, 2 Oct 2015 23:36:04 +0000"  >&lt;p&gt;It makes more sense to have the llog code repair, skip, and/or clear broken records rather than using an external tool.  If the external tool can detect and fix these problems (after the user&apos;s MDS has crashed and they have waited all night to figure out the problem and run the tool), why not just add enough checks into the llog processing to clean it up immediately?  That avoids the MDS downtime, and avoids the need for the user to even know that an llog repair tool exists and that they need to run it.&lt;/p&gt;</comment>
                            <comment id="129227" author="di.wang" created="Fri, 2 Oct 2015 23:56:26 +0000"  >&lt;p&gt;Mike, just curious: will this patch check/repair both the catalog and the plain logs? Thanks&lt;/p&gt;</comment>
                            <comment id="129242" author="tappro" created="Sat, 3 Oct 2015 19:28:30 +0000"  >&lt;p&gt;Andreas, I agree in general; the difference is that we are more restricted inside the kernel, e.g. we can&apos;t just do the repair in the current context, but have to start a separate repair thread that accesses that llog exclusively. I mean that the tool is much simpler to implement than auto-repair; there are no problems with concurrent access, transactions, etc. Meanwhile, I agree that auto-repair is preferable, and I am going to implement at least some basic checks/repairs.&lt;/p&gt;

&lt;p&gt;Di, it is possible only for fixed-size llogs; in fact, the catalog is the only real example we have now.&lt;/p&gt;</comment>
                            <comment id="129276" author="adilger" created="Mon, 5 Oct 2015 03:30:04 +0000"  >&lt;p&gt;It seems possible to do at least some basic repair of variable-sized llog records.  For example, if a corrupt llog record is found (i.e. hdr len != tail len), one option would be to scan the rest of the chunk for potential matching llog hdr/tail pairs that allow resyncing the stream.  A second option (easier to implement, but recovers fewer logs) would be to jump to the start of the next llog chunk and clear the records between the corrupt chunk and the start of the new chunk.&lt;/p&gt;</comment>
                            <comment id="129283" author="di.wang" created="Mon, 5 Oct 2015 06:47:29 +0000"  >&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Di, it is possible only for fixed-size llogs; in fact, the catalog is the only real example we have now.
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Yes, it seems that except for the catalog, all of the important plain logs are variable size: the change log, config log, update log, and unlink log are all not fixed size. Unfortunately, most of the corruption seems to happen in the plain logs, at least in DNE testing. Actually, most cases are a header and tail that do not match each other (lrh_len != tail_len or lrh_idx != tail_index); some of those checks even use LASSERT, which we should probably change to CERROR.&lt;/p&gt;</comment>
                            <comment id="129686" author="tappro" created="Wed, 7 Oct 2015 14:03:23 +0000"  >&lt;p&gt;In fact, I think we should find the root cause of these bad-tail issues; this is not normal behavior, and something is definitely wrong there. I mean it is not some sort of corruption due to disk problems, etc., but an issue in our code that causes it.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="30548">LU-6696</issuekey>
        </issuelink>
                            </outwardlinks>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="31496">LU-7010</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                    <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzxklj:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>