<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 03:00:53 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-13392] FID-in-LMA does not match the object self-fid after upgrade from 2.10 to 2.12</title>
                <link>https://jira.whamcloud.com/browse/LU-13392</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;After upgrading a filesystem from Lustre 2.10.8 to 2.12.4 (following the major release upgrade procedure from chapter 17.2 of the manual), lstat() would hang on some of the files. After disabling auto_scrub on all OSTs, lstat() instead returns with &lt;tt&gt;-1 EREMCHG (Remote address changed)&lt;/tt&gt;. This appears to be related to the following errors in the OSS syslogs:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;2020-03-26T10:50:21.222726+01:00 oss1 kernel: [249279.579945] LustreError: 32828:0:(osd_object.c:481:osd_check_lma()) aeromdo-OST0001: FID-in-LMA [0x100000000:0x0:0x0] does not match the object self-fid [0x100010000:0x0:0x0]
2020-03-26T10:50:21.222757+01:00 oss1 kernel: [249279.656311] LustreError: 32828:0:(osd_object.c:481:osd_check_lma()) Skipped 600 previous similar messages
2020-03-26T10:50:22.438285+01:00 oss1 kernel: [249280.818924] LustreError: 32828:0:(ofd_dev.c:1507:ofd_create_hdl()) aeromdo-OST0001: Can&apos;t find FID Sequence 0x0: rc = -78
2020-03-26T10:50:22.438306+01:00 oss1 kernel: [249280.872078] LustreError: 32828:0:(ofd_dev.c:1507:ofd_create_hdl()) Skipped 599 previous similar messages
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;tt&gt;lctl lfsck_start -A -o&lt;/tt&gt; did not resolve the issue; according to the OI_scrub info, 258 out of 4659445 failed to be repaired on OST0000, and 322 out of 4661773 on OST0001.&lt;/p&gt;

&lt;p&gt;The issue appears to affect old files (created around 2015) rather than recently modified ones.&lt;/p&gt;</description>
                <environment>Lustre on ZFS, CentOS 7.7, 3.10.0-1062.9.1.el7_lustre.x86_64</environment>
        <key id="58516">LU-13392</key>
            <summary>FID-in-LMA does not match the object self-fid after upgrade from 2.10 to 2.12</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="wc-triage">WC Triage</assignee>
                                    <reporter username="knut.franke">Knut Franke</reporter>
                        <labels>
                    </labels>
                <created>Thu, 26 Mar 2020 10:05:49 +0000</created>
                <updated>Wed, 3 Feb 2021 08:53:18 +0000</updated>
                                            <version>Lustre 2.12.4</version>
                                                        <due></due>
                            <votes>1</votes>
                                    <watches>2</watches>
                                                                            <comments>
                            <comment id="266144" author="knut.franke" created="Thu, 26 Mar 2020 10:14:45 +0000"  >&lt;p&gt;This might be related to&#160;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12278&quot; title=&quot;sanity-scrub crashes with &amp;#39;BUG: soft lockup - CPU#0 stuck for 23s! [lfsck:27242]&amp;#39;/ [OI_scrub:27272]&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-12278&quot;&gt;LU-12278&lt;/a&gt;, which also mentions the &quot;FID-in-LMA does not match the object self-fid&quot; error, although we experience neither the &quot;can&apos;t get bonus&quot; error nor the crashes reported there.&lt;/p&gt;</comment>
                            <comment id="266600" author="knut.franke" created="Wed, 1 Apr 2020 17:17:19 +0000"  >&lt;p&gt;Digging deeper, I&apos;ve identified two small text files (small enough to fit into one OST object); one affected by the issue, the other not. I&apos;ve tracked down the following difference in the on-disk data structures of the two:&lt;/p&gt;
&lt;h4&gt;&lt;a name=&quot;working%C2%A0&quot;&gt;&lt;/a&gt;working&#160;&lt;/h4&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;$ lfs getstripe --verbose tutorial/pre
tutorial/pre
lmm_magic:         0x0BD10BD0
lmm_seq:           0x300004280
lmm_object_id:     0x2edc
lmm_fid:           [0x300004280:0x2edc:0x0]
lmm_stripe_count:  2
lmm_stripe_size:   1048576
lmm_pattern:       1
lmm_layout_gen:    0
lmm_stripe_offset: 0
    obdidx       objid       objid       group
         0       707339712     0x2a2925c0                0
         1       707431336     0x2a2a8ba8                0

# ll_decode_filter_fid O/0/d$((707339712 % 32))/707339712
O/0/d0/707339712: warning: ffid size is unexpected (44 bytes), recompile?
O/0/d0/707339712: parent=[0x300004280:0x2edc:0x0] stripe=0
# ll_decode_filter_fid O/0/d$((707431336 % 32))/707431336
O/0/d8/707431336: warning: ffid size is unexpected (44 bytes), recompile?
O/0/d8/707431336: parent=[0x300004280:0x2edc:0x0] stripe=1
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;So everything works out correctly, even though ll_decode_filter_fid is apparently unhappy about the size of the trusted.fid attribute.&lt;/p&gt;
&lt;h4&gt;&lt;a name=&quot;inaccessible%28stat%28%29fails%29&quot;&gt;&lt;/a&gt;inaccessible (stat() fails)&lt;/h4&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;$ lfs getstripe --verbose watchdog.log
watchdog.log
lmm_magic:         0x0BD10BD0
lmm_seq:           0x300004280
lmm_object_id:     0x2eee
lmm_fid:           [0x300004280:0x2eee:0x0]
lmm_stripe_count:  2
lmm_stripe_size:   1048576
lmm_pattern:       1
lmm_layout_gen:    0
lmm_stripe_offset: 0
    obdidx       objid       objid       group
         0       498981674     0x1dbddb2a                0
         1       498980471     0x1dbdd677                0

# ll_decode_filter_fid O/0/d$((498981674 % 32))/498981674
O/0/d10/498981674: parent=[0x300004280:0x2eee:0x0] stripe=0 stripe_size=1048576 stripe_count=2 layout_version=0 range=0
# ll_decode_filter_fid O/0/d$((498980471 % 32))/498980471
O/0/d23/498980471: error reading fid: No data available
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Indeed, I could verify using zdb that object&#160;498980471 on OST 1 is missing the trusted.fid and trusted.version attributes (though trusted.lma is present).&lt;/p&gt;
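&lt;p&gt;For reference, the object paths passed to ll_decode_filter_fid above follow the pattern O/0/d(objid % 32)/objid. A minimal Python sketch of that mapping is given here; the function name and its default arguments are illustrative assumptions, and only the pattern visible in the commands above is taken from this report.&lt;/p&gt;

```python
def ost_object_path(objid: int, group: int = 0, subdirs: int = 32) -> str:
    """Relative path of an OST object below the OST root.

    Assumption: objects are hashed into `subdirs` directories by
    objid modulo `subdirs`, as in the shell commands of this comment.
    """
    return f"O/{group}/d{objid % subdirs}/{objid}"


# Object ids taken from the two "lfs getstripe" listings above.
for objid in (707339712, 707431336, 498981674, 498980471):
    print(ost_object_path(objid))
# O/0/d0/707339712
# O/0/d8/707431336
# O/0/d10/498981674
# O/0/d23/498980471
```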


&lt;p&gt;Given that everything was working with Lustre 2.10, this looks to me as if 2.12 no longer supports the on-disk format used by some/all of the older files (for the example above, the ZFS object was created in May 2016; I&apos;d have to do more research to find out what Lustre version we were using back then and whether a migration from an older ldiskfs installation was involved).&lt;/p&gt;

</comment>
                            <comment id="267834" author="knut.franke" created="Thu, 16 Apr 2020 16:37:23 +0000"  >&lt;p&gt;So after digging through the source code and the filesystem some more, the puzzle pieces are slowly coming together. I&apos;m still not sure whether or not the (empty) object on OST1 is supposed to have a trusted.fid EA, but apparently trusted.lma is inconsistent with something else:&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
oss0 # getfattr -n trusted.lma --only-values O/0/d$((498981674 % 32))/498981674 | xxd
0000000: 0000 0000 0000 0000 0000 0000 0100 0000  ................
0000010: 2adb bd1d 0000 0000                      *.......
oss1 # getfattr -n trusted.lma --only-values O/0/d$((498980471 % 32))/498980471 | xxd
0000000: 0000 0000 0000 0000 0000 0000 0100 0000  ................
0000010: 77d6 bd1d 0000 0000                      w.......
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;which (I think) translates into FID-in-LMA values of &lt;tt&gt;[0x100000000:0x1dbddb2a:0x0]&lt;/tt&gt; and &lt;tt&gt;[0x100000000:0x1dbdd677:0x0]&lt;/tt&gt;, respectively. After the failed attempt at accessing the file, searching the lctl debug_kernel output for these values yields the following on oss1:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
00080000:00020000:18.0:1587047810.568980:0:1951:0:(osd_object.c:481:osd_check_lma()) aeromdo-OST0001: FID-in-LMA [0x100000000:0x1dbdd677:0x0] does not match the object self-fid [0x100010000:0x1dbdd677:0x0] &lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;and no results on oss0. This suggests to me that the sequence number of the object self-fid is off on OST1, but unfortunately I have no idea how this is derived during lookup.&lt;/p&gt;
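&lt;p&gt;The translation of the xxd dumps above into FID-in-LMA values can be reproduced with a short Python sketch. It assumes that the first 24 bytes of trusted.lma hold two 32-bit flag words followed by the self-fid as a 64-bit sequence, a 32-bit object id and a 32-bit version, all little-endian. That layout is an assumption that happens to be consistent with the values quoted in this comment; the function names are made up for illustration.&lt;/p&gt;

```python
def decode_lma_self_fid(raw: bytes) -> tuple:
    """Return (seq, oid, ver) of the self-fid stored in a trusted.lma value.

    Assumed layout (little-endian): 4 bytes compat flags, 4 bytes incompat
    flags, 8 bytes sequence, 4 bytes object id, 4 bytes version.
    """
    header = raw[:24]
    if len(header) != 24:
        raise ValueError("trusted.lma value is shorter than 24 bytes")
    seq = int.from_bytes(header[8:16], "little")
    oid = int.from_bytes(header[16:20], "little")
    ver = int.from_bytes(header[20:24], "little")
    return seq, oid, ver


def format_fid(seq: int, oid: int, ver: int) -> str:
    """Render a FID in the bracketed form used in the Lustre logs."""
    return f"[{seq:#x}:{oid:#x}:{ver:#x}]"


# Hex bytes copied from the two xxd dumps above.
lma_oss0 = bytes.fromhex("0000 0000 0000 0000 0000 0000 0100 0000 2adb bd1d 0000 0000")
lma_oss1 = bytes.fromhex("0000 0000 0000 0000 0000 0000 0100 0000 77d6 bd1d 0000 0000")

print(format_fid(*decode_lma_self_fid(lma_oss0)))  # [0x100000000:0x1dbddb2a:0x0]
print(format_fid(*decode_lma_self_fid(lma_oss1)))  # [0x100000000:0x1dbdd677:0x0]
```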

</comment>
                            <comment id="268341" author="knut.franke" created="Thu, 23 Apr 2020 12:25:57 +0000"  >&lt;p&gt;After manually setting the FID-in-LMA on OST1 to &lt;tt&gt;[0x100010000:0x1dbdd677:0x0]&lt;/tt&gt;, stat on the file succeeds, but reports incorrect ownership (root.root) and timestamps (too recent), so clearly something else is amiss here.&lt;/p&gt;</comment>
                            <comment id="268361" author="knut.franke" created="Thu, 23 Apr 2020 15:54:59 +0000"  >&lt;p&gt;In&#160;lustre/osp/osp_internal.h, I found the following comment:&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;In 2.6+ ost_idx is packed into IDIF FID, while in 2.4 and 2.5 IDIF is always FID_SEQ_IDIF(0x100000000ULL), which does not include OST index in the seq.&lt;/p&gt;&lt;/blockquote&gt;
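&lt;p&gt;The difference described in that source comment can be illustrated with a small Python sketch. It assumes that the OST index is placed at bit 16 of the IDIF sequence, which matches the self-fid sequence 0x100010000 logged for OST0001; it ignores any other bits Lustre may fold into the sequence. The function name and the index_in_seq parameter are hypothetical.&lt;/p&gt;

```python
FID_SEQ_IDIF = 0x100000000  # base IDIF sequence named in the quoted comment


def idif_seq(ost_idx: int, index_in_seq: bool = True) -> int:
    """IDIF sequence for an object on the OST with index `ost_idx`.

    index_in_seq=False models the 2.4/2.5 behaviour (index not packed),
    index_in_seq=True the 2.6+ behaviour (index assumed to start at bit 16).
    """
    if not index_in_seq:
        return FID_SEQ_IDIF
    return FID_SEQ_IDIF | (ost_idx * 0x10000)


for ost_idx in (0, 1):
    old = idif_seq(ost_idx, index_in_seq=False)
    new = idif_seq(ost_idx)
    print(f"OST{ost_idx:04x}: on-disk {old:#x}, expected {new:#x}, match={old == new}")
# OST0000: on-disk 0x100000000, expected 0x100000000, match=True
# OST0001: on-disk 0x100000000, expected 0x100010000, match=False
```

&lt;p&gt;Under these assumptions only objects on OSTs with a non-zero index would show a mismatch, since packing index 0 leaves the sequence unchanged.&lt;/p&gt;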
&lt;p&gt;Looking at the inaccessible files (and the OSS logs), it seems that the entire issue can be traced to lookup failures of objects on OST 1 with FID-in-LMA sequence number 0x100000000 (i.e. written by Lustre 2.4/2.5, which is a reasonable assumption for the filesystem and files in question), where Lustre erroneously adds the OST index to the self-fid during comparison. If this is true, this error should occur for basically all files written by Lustre 2.4/2.5 (except if they have a stripe count of 1 and only reside on OST 0).&lt;/p&gt;</comment>
                            <comment id="268646" author="knut.franke" created="Mon, 27 Apr 2020 12:31:36 +0000"  >&lt;p&gt;During further testing with other affected files, I could not reproduce the issue with incorrect ownership/timestamp after manually updating the FID-in-LMA to include the OST index. I&apos;m assuming that issue was due to some other tampering of mine while debugging with that particular file.&lt;/p&gt;</comment>
                            <comment id="268852" author="knut.franke" created="Wed, 29 Apr 2020 09:21:36 +0000"  >&lt;p&gt;I&apos;ve updated all affected objects in the filesystem (script attached). So far everything looks fine, no more hangs or stat() failures or errors in the Lustre logs.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="61557">LU-14119</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="34784" name="update_25_objects" size="1335" author="knut.franke" created="Wed, 29 Apr 2020 09:18:52 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i00wdr:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>