<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:30:16 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-3019] Files are corrupted after OSS unmount. </title>
                <link>https://jira.whamcloud.com/browse/LU-3019</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Our small Lustre system recently encountered a strange problem. &lt;br/&gt;
Lustre version &amp;#8211; 2.1.4, precompiled binaries from the Whamcloud site (we have a CentOS5-compatible system, so we must stick to 2.1.x), upgraded from 2.0.0.1 several weeks earlier. The system contained one MDS and one OSS (a single OST of size 8Tb). &lt;/p&gt;

&lt;p&gt;After the addition of a second, empty OSS (also 8Tb), a steady, slow migration of files began. Then the problem arose. When the second OSS was unmounted (on one particular occasion), files our users were working with at that moment, and many other unrelated, randomly scattered files (untouched for months; perhaps those being migrated when the unmount happened?) became corrupted.&lt;/p&gt;

&lt;p&gt;These files appear to be of zero size and can only be deleted.&lt;/p&gt;

&lt;p&gt;If we access one of these files, on the new OSS we observe messages like this:&lt;br/&gt;
&lt;tt&gt;kernel: LustreError: 18600:0:(ldlm_resource.c:1090:ldlm_resource_get()) lvbo_init failed for resource 1501826: rc -2&lt;/tt&gt;&lt;br/&gt;
(No new messages in MDS or old OSS logs.)&lt;/p&gt;

&lt;p&gt;Even a graceful unmount &amp;#8212; MDS, then both OSSs, then MGS &amp;#8212; leads to file corruption (or is this the wrong order in which to stop the system?).&lt;/p&gt;

&lt;p&gt;After a check according to section 27.2 &quot;Recovering from Corruption in the Lustre File System&quot; of the manual, we have millions of (harmless?) messages: &lt;/p&gt;

&lt;p&gt;&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;1&amp;#93;&lt;/span&gt; zero-length orphan objid 0:8371035&lt;/tt&gt;&lt;/p&gt;

&lt;p&gt;and hundreds of this kind: &lt;/p&gt;

&lt;p&gt;&lt;tt&gt;Failed to find fid &lt;span class=&quot;error&quot;&gt;&amp;#91;0x20000a041:0x3f99:0x0&amp;#93;&lt;/span&gt;: DB_NOTFOUND: No matching key/data pair found&lt;/tt&gt;&lt;br/&gt;
&lt;tt&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;0&amp;#93;&lt;/span&gt;: MDS FID &lt;span class=&quot;error&quot;&gt;&amp;#91;0x20000a041:0x3f99:0x0&amp;#93;&lt;/span&gt; object 0:984246 deleted?&lt;/tt&gt;&lt;/p&gt;

&lt;p&gt;After correcting these errors with lfsck -l -c, I checked the filesystem one more time and received many more errors of the same type. (The system was only mounted and unmounted to perform the check; no other accesses except some reads &amp;#8211; cd/ls/cat &amp;#8211; were done.)&lt;/p&gt;
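For reference, the read-only check sequence from manual section 27.2 can be sketched as below. The commands are only assembled and printed, never executed; the MDS device name is a hypothetical placeholder (only /dev/md3 appears in this report), and the flags follow the Lustre 2.x operations manual.

```shell
# Sketch of the lfsck workflow from manual section 27.2 (Lustre 2.x era).
# Built as strings and printed so nothing touches real devices.
mdsdev=/dev/sda1   # assumption: MDS backing device (hypothetical)
ostdev=/dev/md3    # OST device shown in the debugfs output in this report
mnt=/lustre0       # client mount point from this report

cmds="e2fsck -n -v --mdsdb /tmp/mdsdb $mdsdev
e2fsck -n -v --mdsdb /tmp/mdsdb --ostdb /tmp/ostdb $ostdev
lfsck -n -v --mdsdb /tmp/mdsdb --ostdb /tmp/ostdb $mnt"
printf '%s\n' "$cmds"
```

The first pass on the MDS builds the database of MDS objects, the second pass cross-references each OST against it, and the final lfsck run on a client reports (with -n) or repairs (with -l -c, as used above) the inconsistencies.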

&lt;p&gt;Both OSSs are in failout mode. &lt;br/&gt;
For historical reasons the old OSS is OSS1 (half a year ago we had to migrate from a degrading OSS), so the new OSS became OSS0.&lt;/p&gt;

&lt;p&gt;For one of the corrupted files, /lustre0/users/kglukhov/Calcs/Abinit/SPS/para/fo+Cr_par/ Sn2P2S6o_DS3_WFK:&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;root@compute-0-7 fo+Cr_par&amp;#93;&lt;/span&gt;# lfs getstripe Sn2P2S6o_DS3_WFK&lt;br/&gt;
Sn2P2S6o_DS3_WFK&lt;br/&gt;
lmm_stripe_count:   1&lt;br/&gt;
lmm_stripe_size:    1048576&lt;br/&gt;
lmm_stripe_offset:  0&lt;br/&gt;
        obdidx           objid          objid            group&lt;br/&gt;
             0         1500007       0x16e367                0&lt;/p&gt;

&lt;p&gt;On the corresponding (new) OSS node:&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;root@lustre-compute-0-3 mnt&amp;#93;&lt;/span&gt;# debugfs -c -R &quot;stat O/0/d$((1500007 % 32))/1500007&quot; /dev/md3&lt;br/&gt;
debugfs 1.42.6.wc2 (10-Dec-2012)&lt;br/&gt;
/dev/md3: catastrophic mode - not reading inode or group bitmaps&lt;br/&gt;
O/0/d7/1500007: File not found by ext2_lookup&lt;/p&gt;
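The d7 subdirectory in that debugfs path is derived from the object id: on an ldiskfs OST, objects are hashed into 32 d* directories by objid mod 32. A minimal sketch of the derivation, using only shell arithmetic:

```shell
# Derive the on-disk object path used in the debugfs invocation above:
# OST objects live under O/group/d(objid mod 32)/objid.
objid=1500007   # from the lfs getstripe output
group=0         # last column of the lfs getstripe output
subdir=$((objid % 32))
objpath="O/${group}/d${subdir}/${objid}"
echo "$objpath"   # prints O/0/d7/1500007
```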

&lt;p&gt;Can you help us understand what is going on and how to tackle it? &lt;/p&gt;

&lt;p&gt;I attach the output of all commands of the first check (per manual section 27.2), the lfsck output for the first check (-n), the error fixing (-l -c), and the second full check.&lt;/p&gt;</description>
                <environment>Rocks 5.3 cluster distributive (CentOS5 based).</environment>
        <key id="18063">LU-3019</key>
            <summary>Files are corrupted after OSS unmount. </summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="5">Cannot Reproduce</resolution>
                                        <assignee username="wc-triage">WC Triage</assignee>
                                    <reporter username="indrekis">Oleg Farenyuk</reporter>
                        <labels>
                    </labels>
                <created>Fri, 22 Mar 2013 14:11:57 +0000</created>
                <updated>Thu, 9 Jan 2020 06:58:17 +0000</updated>
                            <resolved>Thu, 9 Jan 2020 06:58:17 +0000</resolved>
                                    <version>Lustre 2.1.4</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>3</watches>
                                                                            <comments>
                            <comment id="54733" author="indrekis" created="Mon, 25 Mar 2013 00:31:39 +0000"  >&lt;p&gt;It turned out that the utility simply could not fix those errors, so the messages from lfsck were all about the same files. After removing the defective files manually, these messages were gone.&lt;/p&gt;

&lt;p&gt;Then I&apos;ve made an additional experiment.&lt;/p&gt;

&lt;p&gt;1. Following section 14.4 of the manual (Regenerating Lustre Configuration Logs), unmounted everything and regenerated the configuration logs.&lt;br/&gt;
2. Mounted the file system.&lt;br/&gt;
3. Wrote 20Gb of files to Lustre.&lt;br/&gt;
4. Checked - no new defective files appeared.&lt;br/&gt;
5. An hour later, started to copy another 20Gb of files. While copying, unmounted OSS0 for 30 seconds. &lt;/p&gt;

&lt;p&gt;Result: several hundred of the files being copied were damaged, and a &lt;b&gt;few dozen files copied an hour earlier were damaged too&lt;/b&gt;! Fortunately, unlike last time, no older files were affected.&lt;/p&gt;
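Since the damaged files show up as zero size, they can presumably be enumerated with a size filter. The sketch below demonstrates the idea on a throwaway directory; on the real system one would point find (or lfs find) at the Lustre mount point, e.g. /lustre0, instead.

```shell
# Hedged sketch: enumerate zero-length regular files, the signature of the
# damaged files described here. The directory and file names are made up
# purely for the demonstration.
tmpdir=$(mktemp -d)
: > "$tmpdir/damaged"          # zero-length file stands in for a damaged one
printf 'data' > "$tmpdir/ok"   # non-empty file, should not be counted
damaged_count=$(find "$tmpdir" -type f -size 0 | wc -l | tr -d ' ')
echo "$damaged_count"          # prints 1
rm -rf "$tmpdir"
```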

&lt;p&gt;(Damaged means they show zero size and can only be deleted; any other operation fails.)&lt;/p&gt;</comment>
                            <comment id="55045" author="indrekis" created="Thu, 28 Mar 2013 19:00:03 +0000"  >&lt;p&gt;Switching both OSSs to failover mode reduced the likelihood of damage by 4-5 orders of magnitude, though a (simulated) outage of an OSS and/or clients sometimes still leads to file corruption &amp;#8211; the files become inaccessible and cannot be repaired by lfsck.&lt;/p&gt;</comment>
                            <comment id="57909" author="mnnguyen" created="Wed, 8 May 2013 15:47:01 +0000"  >&lt;p&gt;We encountered the same problem after restarting our Lustre filesystem, which was on 2.1.4. Our experience seems to be very similar to Oleg&apos;s description of his Lustre system. Many files untouched for months became corrupted (zero size) on many OSTs that were cleanly unmounted. I hope there is a way to recover users&apos; files.&lt;/p&gt;</comment>
                            <comment id="260846" author="adilger" created="Thu, 9 Jan 2020 06:58:17 +0000"  >&lt;p&gt;Close old bug&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                            <attachment id="12418" name="chklog_after_fix.gz" size="1333550" author="indrekis" created="Fri, 22 Mar 2013 14:11:57 +0000"/>
                            <attachment id="12419" name="chklog_before_fix.gz" size="1333551" author="indrekis" created="Fri, 22 Mar 2013 14:11:57 +0000"/>
                            <attachment id="12420" name="chklog_fixing.gz" size="1326004" author="indrekis" created="Fri, 22 Mar 2013 14:11:57 +0000"/>
                            <attachment id="12417" name="first_check_logs.tar.gz" size="1640" author="indrekis" created="Fri, 22 Mar 2013 14:11:57 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzvlzb:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>7342</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10023"><![CDATA[4]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>