<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:43:52 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-11437] Recovering files in .lustre/lost+found/MDT0000/*</title>
                <link>https://jira.whamcloud.com/browse/LU-11437</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;We recently suffered an outage on some of our servers, and after recovery we started an lfsck run with all the options (namespace, layout, etc.). During the run a large number of files ended up in /lustre/atlas2/.lustre/lost+found/MDT0000/* as well as on the MDT on the MDS server, to the point that it crashed the MDS, since we don&apos;t have large directory support. To prevent this, all zero-size files are being deleted. At the same time, user files that were accessible after the recovery are no longer accessible after the lfsck launch; their data has ended up in lost+found. We are trying to recover this data, but we could not easily find documentation on how to do that for lost+found files. I have looked at sanity-lfsck for information, but there doesn&apos;t seem to be a clear answer on how to figure out the original location of the files in lost+found. We even attempted to use ll_decode_linkea, but the parent returned was &quot;lost+found&quot; itself instead of the original directory. As for sanity-lfsck, none of the tests actually determine the original location of such displaced files by using tools; they use the path originally given when creating the test file. So we are looking for pointers on how to recover these files.&lt;/p&gt;</description>
                <environment>ORNL&apos;s Atlas file system running patched Lustre 2.8.2, with the same version on its clients</environment>
        <key id="53436">LU-11437</key>
            <summary>Recovering files in .lustre/lost+found/MDT0000/*</summary>
                <type id="9" iconUrl="https://jira.whamcloud.com/images/icons/issuetypes/undefined.png">Question/Request</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="laisiyao">Lai Siyao</assignee>
                                    <reporter username="simmonsja">James A Simmons</reporter>
                        <labels>
                    </labels>
                <created>Thu, 27 Sep 2018 15:24:07 +0000</created>
                <updated>Fri, 5 Oct 2018 16:58:08 +0000</updated>
                                            <version>Lustre 2.8.0</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>7</watches>
                                                                            <comments>
                            <comment id="234115" author="pjones" created="Fri, 28 Sep 2018 20:13:21 +0000"  >&lt;p&gt;Lai&lt;/p&gt;

&lt;p&gt;Could you please advise&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="234126" author="fan.yong" created="Sat, 29 Sep 2018 14:22:03 +0000"  >&lt;p&gt;There are two kinds of &quot;lost+found&quot; in Lustre:&lt;/p&gt;

&lt;p&gt;1) One is the backend &quot;lost+found&quot;, which is specific to the ldiskfs backend. The &lt;tt&gt;e2fsck&lt;/tt&gt; tool puts backend orphans (inodes that no name entry references) into the backend &quot;/lost+found&quot; directory. The backend lost+found directory and its sub-items are invisible to Lustre clients. You have to mount the server as &quot;ldiskfs&quot; if you want to check the backend &quot;lost+found&quot;.&lt;/p&gt;

&lt;p&gt;2) The other is the Lustre global &quot;lost+found&quot; directory. It is visible to Lustre clients under the directory $mount_point/.lustre/lost+found. The LFSCK links Lustre orphans into the Lustre &quot;lost+found&quot; directory. There are several kinds of Lustre orphans, distinguished by the infix in their names under the Lustre &quot;lost+found&quot; directory:&lt;/p&gt;

&lt;p&gt;&quot;C&quot;:           Multiple OST-objects claim the same MDT-object and the same slot in the layout EA. Then the LFSCK will create new MDT-object(s) to hold the conflict OST-object(s).&lt;/p&gt;

&lt;p&gt;&quot;N&quot;:           The orphan OST-object does not know which one was the real parent MDT-object, so the LFSCK uses new FID for its parent MDT-object.&lt;/p&gt;

&lt;p&gt;&quot;R&quot;:           The orphan OST-object knows its parent MDT-object FID, but does not know the position (the file name) in the layout.&lt;/p&gt;

&lt;p&gt;&quot;D&quot;:           The MDT-object is a directory; it may know its parent, but because there is no valid linkEA, the LFSCK cannot know where to put it back into the namespace.&lt;/p&gt;

&lt;p&gt;&quot;O&quot;:           The MDT-object has no linkEA, and there is no name entry that references the MDT-object.&lt;/p&gt;

&lt;p&gt;&quot;P&quot;:           The orphan object to be created was a parent directory of some MDT-object whose linkEA shows that the @orphan object is missing.&lt;/p&gt;
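
&lt;p&gt;To see which of the kinds listed above are present in a large directory, the infixes can be tallied from the entry names. This is only a sketch: the name pattern is an assumption inferred from names such as &lt;tt&gt;[0x20042c912:0x14cd4:0x0]-R-0&lt;/tt&gt; seen in this ticket (FID in brackets, then the infix, then a number), and &lt;tt&gt;parse_orphan_name&lt;/tt&gt; and &lt;tt&gt;count_by_infix&lt;/tt&gt; are hypothetical helpers, not Lustre tools.&lt;/p&gt;

```python
import re
from collections import Counter

# Assumed name layout, inferred from samples in this ticket:
#   [seq:oid:ver]-INFIX-NUMBER   e.g. [0x20042c912:0x14cd4:0x0]-R-0
ORPHAN_NAME = re.compile(
    r"^(\[0x[0-9a-fA-F]+:0x[0-9a-fA-F]+:0x[0-9a-fA-F]+\])-([A-Z])-(\d+)$"
)


def parse_orphan_name(name):
    """Return (fid, infix, number), or None if the name does not match."""
    match = ORPHAN_NAME.match(name)
    if match is None:
        return None
    return match.group(1), match.group(2), int(match.group(3))


def count_by_infix(names):
    """Count entries per infix; names that do not match go under 'unparsed'."""
    counts = Counter()
    for name in names:
        parsed = parse_orphan_name(name)
        counts[parsed[1] if parsed else "unparsed"] += 1
    return counts
```

&lt;p&gt;The names could come from &lt;tt&gt;os.scandir()&lt;/tt&gt; on the lost+found directory, which yields entries one at a time rather than building the whole listing in memory. Anything reported as &quot;unparsed&quot; means the assumed pattern does not cover that name and should be checked by hand.&lt;/p&gt;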

&lt;p&gt;So please describe in detail what kinds of orphans you hit, then we can analyze how to proceed with the next step.&lt;/p&gt;</comment>
                            <comment id="234170" author="simmonsja" created="Mon, 1 Oct 2018 17:06:12 +0000"  >&lt;p&gt;3793 -R- files&lt;br/&gt;
~12 million -N- files&lt;/p&gt;

&lt;p&gt;Most of the -N- files are empty, and since there are so many we are deleting them to keep our MDS from crashing.&lt;/p&gt;</comment>
                            <comment id="234274" author="simmonsja" created="Wed, 3 Oct 2018 13:59:08 +0000"  >&lt;p&gt;Any advice?&lt;/p&gt;</comment>
                            <comment id="234283" author="adilger" created="Wed, 3 Oct 2018 15:43:58 +0000"  >&lt;p&gt;James, has LFSCK ever been run on this filesystem in the past?  What is the default stripe count on the filesystem? I&apos;m wondering if the zero-length files are potentially objects that were part of files that were smaller than (stripe_count x 1MB), so they were never modified by clients, and do not have a parent (MDT) FID stored on them?&lt;/p&gt;

&lt;p&gt;Is the MDS still crashing after you have removed the zero-length files?  Do you have a stack trace from the crashes? How many files are left after the zero-length ones are removed?  Can you please provide a sample of the filenames?  &lt;/p&gt;

&lt;p&gt;For OST objects to end up in the Lustre &lt;tt&gt;lost+found&lt;/tt&gt;, it would mean that there was corruption of the MDT that resulted in inodes being erased, since they no longer have a LOV EA pointing to them. Separately, there may be inodes in the underlying ext4 &lt;tt&gt;lost+found&lt;/tt&gt; directory that still point to data on the OSTs, but lost their filenames because of directory corruption on the MDT. &lt;/p&gt;

&lt;p&gt;I&apos;m guessing you don&apos;t have any device-level backups of the MDT?  I recommend taking periodic backups of the MDT via &quot;dd&quot; (eg. daily or as often as is practical) to allow recovery in cases like this. While doing the backup from an LVM snapshot is preferred, doing a raw-disk backup of the live MDT is still useful in cases like this. It can either be restored directly in case of serious corruption, or potentially used to recover files that were corrupted. &lt;/p&gt;</comment>
                            <comment id="234318" author="fan.yong" created="Thu, 4 Oct 2018 00:30:54 +0000"  >&lt;p&gt;Usually, the Lustre orphan under global &quot;lost+found&quot; with the name format &quot;&amp;#42;&amp;#45;N&amp;#45;&amp;#42;&quot; is for the pre-created OST-object. There are two possible cases:&lt;/p&gt;

&lt;p&gt;1) The pre-created OST-object had not been assigned to any MDT-object. In that case, removing the orphan from the global &quot;lost+found&quot; will NOT affect the system.&lt;/p&gt;

&lt;p&gt;2) Or, even though it had been assigned to some MDT-object before the corruption, it had never been modified after the assignment. That is why it does not know which MDT-object was its parent. In that case, the related MDT-object was not handled during the (first-stage) layout LFSCK scanning on the MDT, so it is quite possible that the related MDT-object is lost or has become an invisible orphan (under the backend &quot;lost+found&quot;). In that case, removing the empty orphans is also safe.&lt;/p&gt;</comment>
                            <comment id="234376" author="hanleyja" created="Thu, 4 Oct 2018 17:45:51 +0000"  >&lt;p&gt;After removing the &lt;tt&gt;-N-&lt;/tt&gt;&#160;entries (which were all empty), we&apos;re no longer encountering the&#160;ldiskfs_dx_add_entry messages (because we are now below the &lt;tt&gt;Large directory feature&lt;/tt&gt; threshold).&#160; I think this is the first time we&apos;ve run an &lt;tt&gt;lfsck&lt;/tt&gt; on this file system with the &lt;tt&gt;--orphan&lt;/tt&gt; flag, though I&apos;m pretty sure we&apos;ve run &lt;tt&gt;lfsck&lt;/tt&gt; against it once or twice in the past.&lt;/p&gt;

&lt;p&gt;Andreas, the default stripe count on the FS is 4; with our large number of small files, your statement makes sense.&lt;/p&gt;

&lt;p&gt;For the remaining files, they fit into 2 categories (1539 of these):&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;Just file names, but no further information. So the MDT has a record for this file, but it&apos;s not associated with any valid objects?&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;&#160;&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
# lfs path2fid &lt;span class=&quot;code-quote&quot;&gt;&apos;/lustre/atlas2/.lustre/lost+found/MDT0000/[0x20042c912:0x14cd4:0x0]-R-0&apos;&lt;/span&gt;
[0x20042c912:0x14cd4:0x0]
# lfs fid2path /lustre/atlas2/ &lt;span class=&quot;code-quote&quot;&gt;&apos;[0x20042c912:0x14cd4:0x0]&apos;&lt;/span&gt;
/lustre/atlas2/.lustre/lost+found/MDT0000/[0x20042c912:0x14cd4:0x0]-R-0
# lfs getstripe &lt;span class=&quot;code-quote&quot;&gt;&apos;/lustre/atlas2/.lustre/lost+found/MDT0000/[0x20042c912:0x14cd4:0x0]-R-0&apos;&lt;/span&gt;
/lustre/atlas2/.lustre/lost+found/MDT0000/[0x20042c912:0x14cd4:0x0]-R-0
lmm_stripe_count: 4
lmm_stripe_size: 1048576
lmm_pattern: 40000001
lmm_layout_gen: 0
lmm_stripe_offset: 0
 obdidx objid objid group
 0 0 0 0
 0 0 0 0
 0 0 0 0
 483 64774796 0x3dc628c 0
# stat &lt;span class=&quot;code-quote&quot;&gt;&apos;/lustre/atlas2/.lustre/lost+found/MDT0000/[0x20042c912:0x14cd4:0x0]-R-0&apos;&lt;/span&gt;
stat: cannot stat &#8216;/lustre/atlas2/.lustre/lost+found/MDT0000/[0x20042c912:0x14cd4:0x0]-R-0&#8217;: No such file or directory&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;I can then take that object and check it with debugfs against that OST:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
O/0/d12/64774796: File not found by ext2_lookup&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&#160;&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;The other files appear to have real data in them, along with ownership that (generally) looks correct:&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;&#160;&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
# lfs path2fid &lt;span class=&quot;code-quote&quot;&gt;&apos;/lustre/atlas2/.lustre/lost+found/MDT0000/[0x200380a26:0x12ecf:0x0]-R-0&apos;&lt;/span&gt;
[0x200380a26:0x12ecf:0x0]
# lfs fid2path /lustre/atlas2/ [0x200380a26:0x12ecf:0x0]
/lustre/atlas2/.lustre/lost+found/MDT0000/[0x200380a26:0x12ecf:0x0]-R-0
# lfs getstripe &lt;span class=&quot;code-quote&quot;&gt;&apos;/lustre/atlas2/.lustre/lost+found/MDT0000/[0x200380a26:0x12ecf:0x0]-R-0&apos;&lt;/span&gt;
/lustre/atlas2/.lustre/lost+found/MDT0000/[0x200380a26:0x12ecf:0x0]-R-0
lmm_stripe_count: 3
lmm_stripe_size: 1048576
lmm_pattern: 40000001
lmm_layout_gen: 1
lmm_stripe_offset: 284
 obdidx objid objid group
 284 49165778 0x2ee35d2 0
 0 0 0 0
 271 50018306 0x2fb3802 0
# ls -l &lt;span class=&quot;code-quote&quot;&gt;&apos;/lustre/atlas2/.lustre/lost+found/MDT0000/[0x200380a26:0x12ecf:0x0]-R-0&apos;&lt;/span&gt;
-r-------- 1 root root 76 Sep 14 08:36 /lustre/atlas2/.lustre/lost+found/MDT0000/[0x200380a26:0x12ecf:0x0]-R-0
# file &lt;span class=&quot;code-quote&quot;&gt;&apos;/lustre/atlas2/.lustre/lost+found/MDT0000/[0x200380a26:0x12ecf:0x0]-R-0&apos;&lt;/span&gt;
/lustre/atlas2/.lustre/lost+found/MDT0000/[0x200380a26:0x12ecf:0x0]-R-0: ASCII text &lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
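&lt;p&gt;The all-zero rows in the &lt;tt&gt;lfs getstripe&lt;/tt&gt; output above can be picked out mechanically. A minimal sketch, assuming that a row with objid 0 means the OST object for that stripe is missing and that the table layout matches the output pasted here; &lt;tt&gt;missing_stripes&lt;/tt&gt; is a hypothetical helper, not part of &lt;tt&gt;lfs&lt;/tt&gt;.&lt;/p&gt;

```python
def missing_stripes(getstripe_output):
    """Return (stripe_count, indices of stripes whose objid column is 0).

    Assumes the object table starts after a header line beginning with
    "obdidx objid" and that every table row has exactly four columns.
    """
    rows = []
    in_table = False
    for line in getstripe_output.splitlines():
        fields = line.split()
        if fields[:2] == ["obdidx", "objid"]:
            in_table = True
            continue
        if in_table:
            if len(fields) != 4:
                break  # first line that is not a table row ends the table
            rows.append(fields)
    # Column 1 is the decimal objid; 0 is treated as "no object" (assumption).
    missing = [i for i, row in enumerate(rows) if int(row[1]) == 0]
    return len(rows), missing
```

&lt;p&gt;Fed the two outputs above, this reports stripes 0, 1 and 2 of 4 as missing for the first file and stripe 1 of 3 for the second.&lt;/p&gt;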
&lt;p&gt;&#160;&lt;/p&gt;</comment>
                            <comment id="234436" author="dustb100" created="Fri, 5 Oct 2018 13:27:40 +0000"  >&lt;p&gt;Are there any updates on this?&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;We need to give an update to our users about the status of their files...&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;Thanks,&lt;/p&gt;

&lt;p&gt;Dustin&lt;/p&gt;</comment>
                            <comment id="234447" author="adilger" created="Fri, 5 Oct 2018 16:55:20 +0000"  >&lt;p&gt;The files like &lt;tt&gt;[0x20042c912:0x14cd4:0x0]-R-0&lt;/tt&gt; shown above appear to have some, but not all, of the OST objects.  It is entirely possible that the missing objects were among the &lt;tt&gt;-N-&lt;/tt&gt; objects that were removed, but were never accessed/modified by clients, as in the case of sparse files.  It is also possible that some of the &lt;tt&gt;-R-&lt;/tt&gt; files are just abandoned objects that were left over from past issues and are only being discovered now.&lt;/p&gt;

&lt;p&gt;In terms of moving forward, there are a couple of options:&lt;/p&gt;
&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;files with missing objects will return &lt;tt&gt;-ENOENT&lt;/tt&gt; (&quot;&lt;tt&gt;no such file or directory&lt;/tt&gt;&quot;) when trying to access the missing objects (e.g. &quot;&lt;tt&gt;ls -l&lt;/tt&gt;&quot; or &lt;tt&gt;stat()&lt;/tt&gt; from GNU &lt;tt&gt;rm&lt;/tt&gt;), but accessing the available objects will return data.&lt;/li&gt;
	&lt;li&gt;I &lt;em&gt;think&lt;/em&gt; the recovered files in &lt;tt&gt;lost+found&lt;/tt&gt; will have ownership based on the OST objects, which would allow you to determine the file ownership and create a per-user &lt;tt&gt;lost+found&lt;/tt&gt; directory and move their files therein so that users can decide what to do with their files.&lt;/li&gt;
	&lt;li&gt;something like &quot;&lt;tt&gt;dd if=/lustre/atlas2/.lustre/lost+found/MDT0000/[0x20042c912:0x14cd4:0x0]-R-0 bs=1M conv=sync,noerror&lt;/tt&gt;&quot; can be used to access the sparse files to determine their content, and whether it is worthwhile to recover the files or not.&lt;/li&gt;
	&lt;li&gt;running &quot;&lt;tt&gt;lctl lfsck_start -r -A -t layout -c&lt;/tt&gt;&quot; (note the &quot;&lt;tt&gt;-c&lt;/tt&gt;&quot; option) could be used to create the missing OST objects, but will leave &quot;holes&quot; in the file where the missing objects were replaced.  That simplifies file access, but will take some time to run.  Using the &quot;&lt;tt&gt;dd&lt;/tt&gt;&quot; command above may tell you that the files with missing objects are not worthwhile to recover and they could just be deleted, or the output could be saved into a new file and then the files can be deleted, &lt;b&gt;instead of&lt;/b&gt; running the &lt;tt&gt;lfsck_start&lt;/tt&gt; command.&lt;/li&gt;
	&lt;li&gt;to delete the files with missing objects, use &quot;&lt;tt&gt;unlink /path/to/filename&lt;/tt&gt;&quot; instead of &lt;tt&gt;rm&lt;/tt&gt; since it avoids the &lt;tt&gt;stat()&lt;/tt&gt; that &lt;tt&gt;rm&lt;/tt&gt; does.&lt;/li&gt;
&lt;/ul&gt;
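
&lt;p&gt;The per-user sorting suggested above could start from something like the following. This is a sketch under assumptions: it relies on the ownership shown by &lt;tt&gt;lstat()&lt;/tt&gt; being meaningful for the recovered files, and &lt;tt&gt;group_by_owner&lt;/tt&gt; is a hypothetical helper name. Files with missing objects are expected to fail &lt;tt&gt;lstat()&lt;/tt&gt;, so they are collected separately instead of aborting the scan.&lt;/p&gt;

```python
import os
from collections import defaultdict


def group_by_owner(paths):
    """Return ({uid: [paths]}, [paths where lstat() failed]).

    Paths that cannot be stat'ed (for example files with missing OST
    objects, which return ENOENT) are set aside for manual inspection.
    """
    by_uid = defaultdict(list)
    unreadable = []
    for path in paths:
        try:
            info = os.lstat(path)
        except OSError:
            unreadable.append(path)
            continue
        by_uid[info.st_uid].append(path)
    return dict(by_uid), unreadable
```

&lt;p&gt;The first result gives the list of files to move for each uid; the second is the set that would need the &lt;tt&gt;dd&lt;/tt&gt; check or &lt;tt&gt;unlink&lt;/tt&gt; treatment described above. If deletion is scripted, &lt;tt&gt;os.unlink()&lt;/tt&gt; removes the name without a prior &lt;tt&gt;stat()&lt;/tt&gt;.&lt;/p&gt;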
</comment>
                    </comments>
                    <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i0036v:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                </customfields>
    </item>
</channel>
</rss>