<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:17:47 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-1569] Many Files missing and others have no info (uid/gid/permissions)</title>
                <link>https://jira.whamcloud.com/browse/LU-1569</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;We have had a catastrophic failure of one of our Lustre filesystems. We are not sure of the exact cause, but in our current state running lfsck on it gives TONS of errors like:&lt;br/&gt;
	Failed to find fid &lt;span class=&quot;error&quot;&gt;&amp;#91;0xc900ec:0xd7e33ef4:0x0&amp;#93;&lt;/span&gt;: DB_NOTFOUND: No matching key/data pair found&lt;/p&gt;

&lt;p&gt;And when we run a find on various users&apos; directories, we find many &quot;No such file&quot; errors:&lt;br/&gt;
	find: ./mrtoeppe/CFD_run_archive/Turbine2DecDet/mcfd_tec.bin.830: No such file or directory&lt;br/&gt;
	which in an ls listing shows up like:&lt;/p&gt;

&lt;p&gt;?--------- ? ?        ?        ?            ? mcfd_tec.bin.660&lt;br/&gt;
?--------- ? ?        ?        ?            ? mcfd_tec.bin.670&lt;br/&gt;
?--------- ? ?        ?        ?            ? mcfd_tec.bin.680&lt;/p&gt;
</description>
                <environment>CentOS release 5.7 (Final)&lt;br/&gt;
Linux nas-0-1.local 2.6.18-194.17.1.el5_lustre.1.8.5 #1 SMP Tue Nov 16 17:59:07 MST 2010 x86_64 x86_64 x86_64 GNU/Linux&lt;br/&gt;
</environment>
        <key id="15046">LU-1569</key>
            <summary>Many Files missing and others have no info (uid/gid/permissions)</summary>
                <type id="3" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11318&amp;avatarType=issuetype">Task</type>
                                            <priority id="2" iconUrl="https://jira.whamcloud.com/images/icons/priorities/critical.svg">Critical</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="4">Incomplete</resolution>
                                        <assignee username="wc-triage">WC Triage</assignee>
                                    <reporter username="brianandrus">Brian Andrus</reporter>
                        <labels>
                    </labels>
                <created>Tue, 26 Jun 2012 14:42:02 +0000</created>
                <updated>Wed, 6 Nov 2013 17:40:29 +0000</updated>
                            <resolved>Wed, 6 Nov 2013 17:40:29 +0000</resolved>
                                    <version>Lustre 1.8.7</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>2</watches>
                                                                            <comments>
                            <comment id="41156" author="cliffw" created="Tue, 26 Jun 2012 14:49:55 +0000"  >&lt;p&gt;Have you successfully run &apos;fsck -fy&apos; on all devices? Are you using the latest version of e2fsprogs, available at &lt;a href=&quot;http://downloads.whamcloud.com/public/e2fsprogs/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://downloads.whamcloud.com/public/e2fsprogs/&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="41157" author="brianandrus" created="Tue, 26 Jun 2012 14:53:09 +0000"  >&lt;p&gt;Initially our lustre filesystem (/work) had one of the osts disconnect (there are 10 each 7.8TB OSTs) and not reconnect. This put /work in read-only mode.&lt;br/&gt;
I attempted to reconnect work_ost9, but it failed with Transport Endpoint shutdown errors (odd for an OST I think).&lt;br/&gt;
I took the entire system down and ran fsck on each ost and the mdt. There were numerous errors on work_ost9 as well as a few errors on the mdt and 2 other OSTs.&lt;br/&gt;
Upon remount, we found that there were almost no files newer than December 12, 2011.&lt;br/&gt;
We also found many files were corrupt and did not show proper UID/GID/Permissions.&lt;br/&gt;
There WERE files accessed and written to by users during this time.&lt;br/&gt;
It seemed there may have been issues with the LAST_ID and/or CATALOG on the MDT, so those were removed and the system brought back online. This made the LAST_ID entries on the MDT match those listed on the OSTs.&lt;br/&gt;
Upon remounting (read only now), we found the corrupt entries were still there and there were still no files newer than December.&lt;br/&gt;
I took down the filesystem again and ran e2fsck to create the MDT and OST databases. I then brought the filesystem back up and ran lfsck in read only.&lt;br/&gt;
This produced many of the &quot;No matching key/data pair found&quot; errors.&lt;br/&gt;
I ran lfsck without &quot;-n&quot; with the same result.&lt;/p&gt;

&lt;p&gt;Currently /work is mounted read only so users that do still have data intact can copy it to a clean filesystem.&lt;/p&gt;</comment>
                            <comment id="41159" author="cliffw" created="Tue, 26 Jun 2012 15:25:47 +0000"  >&lt;p&gt;Okay, thanks.&lt;/p&gt;</comment>
                            <comment id="41160" author="cliffw" created="Tue, 26 Jun 2012 15:49:11 +0000"  >&lt;p&gt;First, as explained in the Lustre Manual, lustre-logs which are auto-dumped must be pre-processed on site to be useful, so we can&apos;t do much with what you attached. What we need in this case are the &lt;em&gt;system&lt;/em&gt; logs (typically /var/log/messages) for all OSTs and the MDS/MGS for the period 12 hrs before you had the initial outage to the present time. Please do not filter the logs unless you need to remove IPs for security.  &lt;/p&gt;
</comment>
                            <comment id="41161" author="cliffw" created="Tue, 26 Jun 2012 15:56:58 +0000"  >&lt;p&gt;Please run lfs getstripe on one of the missing files, get the list of stripe objects, and check the OSTs to determine if the data actually exists on the OST disk. Debugfs will work for this.&lt;br/&gt;
The lfs getstripe command should return a list of obdidx (OST index) and object IDs; for an example objid of 818855,&lt;br/&gt;
the following debugfs command should tell you if the data is there:&lt;/p&gt;

&lt;p&gt;$ debugfs -c -R &quot;stat O/0/d$((818855 % 32))/818855&quot; /dev/&amp;lt;your OST device&amp;gt; &lt;/p&gt;</comment>
                            <comment id="41162" author="brianandrus" created="Tue, 26 Jun 2012 16:15:44 +0000"  >&lt;p&gt;Here is a quick check on one file that is showing in an ls, but missing info:&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;root@nas-0-1 hale&amp;#93;&lt;/span&gt;# ls -l|grep gempak$&lt;br/&gt;
?---------  ? ?    ?              ?            ? gempak&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;root@nas-0-1 hale&amp;#93;&lt;/span&gt;# lfs getstripe ./gempak&lt;br/&gt;
./gempak&lt;br/&gt;
lmm_stripe_count:   1&lt;br/&gt;
lmm_stripe_size:    1048576&lt;br/&gt;
lmm_stripe_offset:  0&lt;br/&gt;
        obdidx           objid          objid            group&lt;br/&gt;
             0         5703559       0x570787                0&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;root@nas-0-1 hale&amp;#93;&lt;/span&gt;#  debugfs -c -R &quot;stat O/0/d$((5703559 % 32))/5703559&quot; /dev/VG_hamming/work_ost0&lt;br/&gt;
debugfs 1.41.12.2.ora1 (14-Aug-2010)&lt;br/&gt;
/dev/VG_hamming/work_ost0: catastrophic mode - not reading inode or group bitmaps&lt;br/&gt;
O/0/d7/5703559: File not found by ext2_lookup&lt;/p&gt;</comment>
                            <comment id="41163" author="brianandrus" created="Tue, 26 Jun 2012 16:16:54 +0000"  >&lt;p&gt;Tar file of /var/log/messages for MGS and OSSes&lt;/p&gt;</comment>
                            <comment id="41166" author="cliffw" created="Tue, 26 Jun 2012 18:23:32 +0000"  >&lt;p&gt;Thanks - did you keep any logs/output from the first fsck you did after the initial failure? Please attach if so.&lt;/p&gt;</comment>
                            <comment id="41197" author="brianandrus" created="Wed, 27 Jun 2012 11:03:08 +0000"  >&lt;p&gt;The only log I have is the output from lfsck, but it is 7.9GB.&lt;br/&gt;
I can gzip it and try to upload it if you think it will help.&lt;/p&gt;</comment>
                            <comment id="41207" author="brianandrus" created="Wed, 27 Jun 2012 13:42:14 +0000"  >&lt;p&gt;Attached output from running lctl df on all the dump logs that were generated (lustre.log)&lt;/p&gt;</comment>
                            <comment id="41230" author="cliffw" created="Wed, 27 Jun 2012 22:17:45 +0000"  >&lt;p&gt;We need the fsck data, not the lfsck.&lt;/p&gt;</comment>
                            <comment id="41235" author="brianandrus" created="Thu, 28 Jun 2012 00:32:48 +0000"  >&lt;p&gt;That I do not have. I do know there are many files in lost+found on the backing filesystem. I have not examined them yet since it is now mounted as lustre (albeit read-only).&lt;/p&gt;</comment>
                            <comment id="41503" author="cliffw" created="Thu, 5 Jul 2012 14:35:07 +0000"  >&lt;p&gt;Have you run the lost+found recovery script?&lt;/p&gt;</comment>
                            <comment id="41713" author="adilger" created="Wed, 11 Jul 2012 15:14:35 +0000"  >&lt;p&gt;That would be &quot;ll_recover_lost_found_objs&quot;, which should be installed on all the OSTs.  You need to mount the OST locally using &quot;-t ldiskfs&quot; instead of as &quot;-t lustre&quot; to run this tool.  It will rebuild the corrupted object directories and move all the objects from lost+found back into their proper location.&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                            <attachment id="11655" name="lustre-log.tgz" size="75330" author="brianandrus" created="Tue, 26 Jun 2012 14:42:02 +0000"/>
                            <attachment id="11661" name="lustre.log" size="724759" author="brianandrus" created="Wed, 27 Jun 2012 13:42:14 +0000"/>
                            <attachment id="11657" name="messages.tgz" size="144390" author="brianandrus" created="Tue, 26 Jun 2012 16:16:54 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10040" key="com.atlassian.jira.plugin.system.customfieldtypes:labels">
                        <customfieldname>Epic</customfieldname>
                        <customfieldvalues>
                                        <label>server</label>
    
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzv33r:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>4002</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                </customfields>
    </item>
</channel>
</rss>