<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 03:33:10 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-17164] Old files not accessible anymore with lma incompat=2 and no lov</title>
                <link>https://jira.whamcloud.com/browse/LU-17164</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Hello!&lt;br/&gt;
On our Oak filesystem, currently running 2.12.8+patches (very close to 2.12.9), a few old files, last modified in March 2020, can no longer be accessed. From a client:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@oak-cli01 ~]# ls -l /oak/stanford/groups/khavari/users/dfporter/before_2021_projects/genome/fastRepEnrich_hg38/fastRepEnrich/fastRE_Setup
ls: cannot access &apos;/oak/stanford/groups/khavari/users/dfporter/before_2021_projects/genome/fastRepEnrich_hg38/fastRepEnrich/fastRE_Setup/pseudogenome.fasta&apos;: No such file or directory
ls: cannot access &apos;/oak/stanford/groups/khavari/users/dfporter/before_2021_projects/genome/fastRepEnrich_hg38/fastRepEnrich/fastRE_Setup/repnames.bed&apos;: No such file or directory
total 0
-????????? ? ? ? ?            ? pseudogenome.fasta
-????????? ? ? ? ?            ? repnames.bed
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;We found them with no &lt;tt&gt;trusted.lov&lt;/tt&gt;, just a &lt;tt&gt;trusted.lma&lt;/tt&gt; and ACLs (system.posix_acl_access), owned by root:root with 0000 permissions (note that I have since changed the ownership/permissions, which is reflected in the debugfs output below, so the ctime has been updated too):&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;oak-MDT0000&amp;gt; debugfs:  stat ROOT/stanford/groups/khavari/users/dfporter/before_2021_projects/genome/fastRepEnrich_hg38/fastRepEnrich/fastRE_Setup/pseudogenome.fasta
Inode: 745295211   Type: regular    Mode:  0440   Flags: 0x0
Generation: 392585436    Version: 0x00000000:00000000
User:     0   Group:     0   Project:     0   Size: 0
File ACL: 0
Links: 1   Blockcount: 0
Fragment:  Address: 0    Number: 0    Size: 0
 ctime: 0x651b1517:a256cacc -- Mon Oct  2 12:08:07 2023
 atime: 0x649be9dc:d980bf08 -- Wed Jun 28 01:05:48 2023
 mtime: 0x5e7ae8a2:437ca450 -- Tue Mar 24 22:14:10 2020
crtime: 0x649be9dc:d980bf08 -- Wed Jun 28 01:05:48 2023
Size of extra inode fields: 32
Extended attributes:
  lma: fid=[0x2f800028cf:0x944c:0x0] compat=0 incompat=2
  system.posix_acl_access:
    user::r--
    group::rwx
    group:3352:rwx
    mask::r--
    other::---
BLOCKS:

oak-MDT0000&amp;gt; debugfs:  stat ROOT/stanford/groups/khavari/users/dfporter/before_2021_projects/genome/fastRepEnrich_hg38/fastRepEnrich/fastRE_Setup/repnames.bed
Inode: 745295212   Type: regular    Mode:  0440   Flags: 0x0
Generation: 392585437    Version: 0x00000000:00000000
User:     0   Group:     0   Project:     0   Size: 0
File ACL: 0
Links: 1   Blockcount: 0
Fragment:  Address: 0    Number: 0    Size: 0
 ctime: 0x651b1517:a256cacc -- Mon Oct  2 12:08:07 2023
 atime: 0x649be9dc:d980bf08 -- Wed Jun 28 01:05:48 2023
 mtime: 0x5e7ae8ad:07654c1c -- Tue Mar 24 22:14:21 2020
crtime: 0x649be9dc:d980bf08 -- Wed Jun 28 01:05:48 2023
Size of extra inode fields: 32
Extended attributes:
  lma: fid=[0x2f800028cf:0x953d:0x0] compat=0 incompat=2
  system.posix_acl_access:
    user::r--
    group::rwx
    group:3352:rwx
    mask::r--
    other::---
BLOCKS:
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Note also that the crtime is recent because we migrated this MDT (MDT0000) to new hardware in June 2023 using a backup/restore method, but we verified yesterday that these files were already in this state before the migration (we still have access to the old storage array). So we know it&apos;s not something we introduced during the migration. Just in case you notice the crtime and ask &lt;img class=&quot;emoticon&quot; src=&quot;https://jira.whamcloud.com/images/icons/emoticons/smile.png&quot; height=&quot;16&quot; width=&quot;16&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/p&gt;

&lt;p&gt;Timeline as we understand it:&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;March 2020: these files were likely created, or at least last modified; this was on Lustre 2.10.8&lt;/li&gt;
	&lt;li&gt;October 2020: we upgraded from 2.10.8 to 2.12.5&lt;/li&gt;
	&lt;li&gt;June 2022: we recorded SATTR changelog events for those FIDs, but on oak-MDT0002; we don&apos;t know why, as the files are stored on MDT0000.&lt;/li&gt;
	&lt;li&gt;June 2023: we performed an MDT backup/restore to new hardware, but we confirmed this didn&apos;t introduce the problem&lt;/li&gt;
	&lt;li&gt;October 2023: our users notice and report the problem.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Changelog events on those FIDs (we log them to Splunk):&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;2022-06-08T13:15:14.793547861-0700 mdt=oak-MDT0002 id=9054081490 type=SATTR flags=0x44 uid=0 gid=0 target=[0x2f800028cf:0x944c:0x0]
2022-06-08T13:15:14.795309940-0700 mdt=oak-MDT0002 id=9054081491 type=SATTR flags=0x44 uid=0 gid=0 target=[0x2f800028cf:0x953d:0x0]
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
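&lt;p&gt;For anyone scripting against these records, a minimal sketch of splitting one of the logged lines into its fields (the helper name is made up; field names are taken from the records above):&lt;/p&gt;

```python
# Hypothetical helper (not part of any Lustre tooling): split one of the
# Splunk-logged changelog lines above into a dict of its key=value fields.
def parse_changelog_record(line):
    timestamp, *fields = line.split()
    rec = {"timestamp": timestamp}
    for field in fields:
        key, _, value = field.partition("=")
        rec[key] = value
    return rec

rec = parse_changelog_record(
    "2022-06-08T13:15:14.793547861-0700 mdt=oak-MDT0002 id=9054081490 "
    "type=SATTR flags=0x44 uid=0 gid=0 target=[0x2f800028cf:0x944c:0x0]"
)
```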
&lt;p&gt;It&apos;s really curious to see those coming from oak-MDT0002!?&lt;br/&gt;
We have also noticed these errors in the logs:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Oct 02 11:35:12 oak-md1-s1 kernel: LustreError: 59611:0:(mdt_open.c:1227:mdt_cross_open()) oak-MDT0002: [0x2f800028cf:0x944c:0x0] doesn&apos;t exist!: rc = -14
Oct 02 11:35:37 oak-md1-s1 kernel: LustreError: 59615:0:(mdt_open.c:1227:mdt_cross_open()) oak-MDT0002: [0x2f800028cf:0x944c:0x0] doesn&apos;t exist!: rc = -14
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Could Lustre be confused about which MDT is supposed to serve these FIDs because of corrupted metadata? Why on earth could oak-MDT0002 be involved here?&lt;/p&gt;

&lt;p&gt;Parent FID:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@oak-cli01 ~]# lfs path2fid /oak/stanford/groups/khavari/users/dfporter/before_2021_projects/genome/fastRepEnrich_hg38/fastRepEnrich/fastRE_Setup
[0x200033e88:0x114:0x0]
[root@oak-cli01 ~]# lfs getdirstripe /oak/stanford/groups/khavari/users/dfporter/before_2021_projects/genome/fastRepEnrich_hg38/fastRepEnrich/fastRE_Setup
lmv_stripe_count: 0 lmv_stripe_offset: 0 lmv_hash_type: none
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
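&lt;p&gt;A tiny sketch of parsing the bracketed FID string printed by &lt;tt&gt;lfs path2fid&lt;/tt&gt; into its sequence/oid/version parts (helper name made up):&lt;/p&gt;

```python
# Sketch: parse "[seq:oid:ver]" as printed by lfs path2fid into integers.
def parse_fid(fid_str):
    seq, oid, ver = (int(part, 16) for part in fid_str.strip("[]").split(":"))
    return seq, oid, ver

seq, oid, ver = parse_fid("[0x200033e88:0x114:0x0]")
```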

&lt;p&gt;We tried to run lfsck namespace, but it crashed our MDS, likely due to &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14105&quot; title=&quot;lfsck shouldn&amp;#39;t LBUG() on disk data&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14105&quot;&gt;&lt;del&gt;LU-14105&lt;/del&gt;&lt;/a&gt;, which is only fixed in 2.14:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;      KERNEL: /usr/lib/debug/lib/modules/3.10.0-1160.83.1.el7_lustre.pl1.x86_64/vmlinux
    DUMPFILE: vmcore  [PARTIAL DUMP]
        CPUS: 64
        DATE: Mon Oct  2 22:55:53 2023
      UPTIME: 48 days, 16:17:05
LOAD AVERAGE: 2.94, 3.39, 3.52
       TASKS: 3287
    NODENAME: oak-md1-s2
     RELEASE: 3.10.0-1160.83.1.el7_lustre.pl1.x86_64
     VERSION: #1 SMP Sun Feb 19 18:38:37 PST 2023
     MACHINE: x86_64  (3493 Mhz)
      MEMORY: 255.6 GB
       PANIC: &quot;Kernel panic - not syncing: LBUG&quot;
         PID: 24913
     COMMAND: &quot;lfsck_namespace&quot;
        TASK: ffff8e62979fa100  [THREAD_INFO: ffff8e5f41a48000]
         CPU: 8
       STATE: TASK_RUNNING (PANIC)

crash&amp;gt; bt
PID: 24913  TASK: ffff8e62979fa100  CPU: 8   COMMAND: &quot;lfsck_namespace&quot;
 #0 [ffff8e5f41a4baa8] machine_kexec at ffffffffaac69514
 #1 [ffff8e5f41a4bb08] __crash_kexec at ffffffffaad29d72
 #2 [ffff8e5f41a4bbd8] panic at ffffffffab3ab713
 #3 [ffff8e5f41a4bc58] lbug_with_loc at ffffffffc06538eb [libcfs]
 #4 [ffff8e5f41a4bc78] lfsck_namespace_assistant_handler_p1 at ffffffffc1793e68 [lfsck]
 #5 [ffff8e5f41a4bd80] lfsck_assistant_engine at ffffffffc177604e [lfsck]
 #6 [ffff8e5f41a4bec8] kthread at ffffffffaaccb511
 #7 [ffff8e5f41a4bf50] ret_from_fork_nospec_begin at ffffffffab3c51dd
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;According to Robinhood, these files&apos; stripe count is likely 1, so we&apos;re going to try to find their object IDs.&lt;/p&gt;
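&lt;p&gt;For reference, on ldiskfs an OST object usually lives under O/seq/dN/objid, where N = objid mod 32. A small sketch building that path (the 32-way fan-out is an assumption here; verify on your own OSTs):&lt;/p&gt;

```python
# Sketch: build the on-disk path of an OST object under ldiskfs.
# Assumed layout: O/{seq}/d{objid % 32}/{objid}; the 32-way directory
# fan-out is an assumption, check against a real OST before relying on it.
def ost_object_path(seq, objid, fanout=32):
    return f"O/{seq}/d{objid % fanout}/{objid}"
```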

&lt;p&gt;Do you have any idea on how to resolve this without running lfsck? How can we find/reattach the objects?&lt;/p&gt;

&lt;p&gt;Thanks!&lt;/p&gt;</description>
                <environment>2.12.8+patches, CentOS 7.9, ldiskfs</environment>
        <key id="78226">LU-17164</key>
            <summary>Old files not accessible anymore with lma incompat=2 and no lov</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="wc-triage">WC Triage</assignee>
                                    <reporter username="sthiell">Stephane Thiell</reporter>
                        <labels>
                    </labels>
                <created>Tue, 3 Oct 2023 18:07:39 +0000</created>
                <updated>Thu, 5 Oct 2023 00:47:34 +0000</updated>
                                            <version>Lustre 2.12.8</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>2</watches>
                                                                            <comments>
                            <comment id="388148" author="sthiell" created="Wed, 4 Oct 2023 18:43:16 +0000"  >&lt;p&gt;We still don&apos;t know what caused this in the first place. Perhaps it was due to an &lt;tt&gt;lfs migrate&lt;/tt&gt; that didn&apos;t end well, or it was introduced when we upgraded from Lustre 2.10 to 2.12. Any clue would be appreciated...&lt;br/&gt;
The good news for us is that we were able to restore the files thanks to the striping info stored in Robinhood&apos;s &lt;tt&gt;STRIPE_ITEMS&lt;/tt&gt; table: its &lt;tt&gt;details&lt;/tt&gt; column holds the object generation, sequence, and objid in hex, which can be decoded to locate the objects.&lt;/p&gt;</comment>
                            <comment id="388175" author="adilger" created="Thu, 5 Oct 2023 00:22:56 +0000"  >&lt;p&gt;Stephane, the inode is marked in the &lt;tt&gt;trusted.lma&lt;/tt&gt; xattr with &lt;tt&gt;incompat: 2&lt;/tt&gt; which is:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
&lt;span class=&quot;code-keyword&quot;&gt;enum&lt;/span&gt; lma_incompat {
        LMAI_AGENT              = 0x00000002, &lt;span class=&quot;code-comment&quot;&gt;/* agent inode */&lt;/span&gt;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;which means that this is a &quot;proxy&quot; inode created on the local MDT that is pointing at an inode with the given FID &lt;tt&gt;0x2f800028cf:0x944c:0x0&lt;/tt&gt; on the remote MDT, presumably MDT0002.  Inodes created on MDT0000 would have a sequence number like &lt;tt&gt;0x20000xxxx&lt;/tt&gt;.  Because the remote MDT0002 inode doesn&apos;t exist, it might be exposing the underlying agent inode, or possibly you are extracting this info from the underlying ldiskfs filesystem?&lt;/p&gt;
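&lt;p&gt;A minimal sketch of decoding such a &lt;tt&gt;trusted.lma&lt;/tt&gt; xattr by hand, e.g. from a getfattr hex dump (the layout below is assumed from struct lustre_mdt_attrs, little-endian; verify against your Lustre sources before relying on it):&lt;/p&gt;

```python
import struct

LMAI_AGENT = 0x00000002  # agent inode, per enum lma_incompat above

# Assumed on-disk layout of trusted.lma (struct lustre_mdt_attrs, little-endian):
#   __u32 lma_compat; __u32 lma_incompat;
#   __u64 f_seq; __u32 f_oid; __u32 f_ver   (the self FID)
def decode_lma(hexdump):
    compat, incompat, seq, oid, ver = struct.unpack("<IIQII", bytes.fromhex(hexdump))
    return {
        "compat": compat,
        "incompat": incompat,
        "fid": f"[{seq:#x}:{oid:#x}:{ver:#x}]",
        "is_agent": bool(incompat & LMAI_AGENT),
    }

# Hex bytes matching the debugfs output in this ticket:
lma = decode_lma("0000000002000000cf2800802f0000004c94000000000000")
```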

&lt;p&gt;You would need to look for &lt;tt&gt;0x2f800028cf:0x944c:0x0&lt;/tt&gt; in the &lt;tt&gt;REMOTE_PARENT_DIR&lt;/tt&gt; on MDT0002 to see if it is there or missing.  Running LFSCK would potentially be able to recreate the inode on MDT0002  if the OST objects still exist (they will have a backpointer to &lt;tt&gt;0x2f800028cf:0x944c:0x0&lt;/tt&gt;).  If the OST objects are missing, then you could delete this inode from the local filesystem (possibly via ldiskfs).&lt;/p&gt;</comment>
                            <comment id="388176" author="sthiell" created="Thu, 5 Oct 2023 00:47:34 +0000"  >&lt;p&gt;Hi Andreas!&lt;/p&gt;

&lt;p&gt;Ah! All the inodes in &lt;tt&gt;REMOTE_PARENT_DIR&lt;/tt&gt; on MDT0002 start with the sequence &lt;tt&gt;0x2f8000xxxx&lt;/tt&gt; but &lt;tt&gt;0x2f800028cf:0x944c:0x0&lt;/tt&gt; cannot be found. That also explains the &lt;tt&gt;mdt_cross_open()&lt;/tt&gt; errors we were seeing on MDT0002.&lt;/p&gt;

&lt;p&gt;It looks like this user had access to another directory tree on MDT0002. Do you think it is possible that a &lt;tt&gt;mv&lt;/tt&gt; done by a user at some point (possibly under Lustre 2.10 or 2.12) was somehow incomplete, perhaps after a server crash, and left this agent inode on MDT0000 with no target inode on MDT0002?&lt;/p&gt;

&lt;p&gt;I am glad to hear that LFSCK would likely help in that case. We&apos;d like to start using it, but only after we upgrade Oak to 2.15.&lt;br/&gt;
In any case, this is extremely helpful, thanks! Enjoy LAD, sorry I am missing it this year.&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i03xcf:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>