<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 03:19:35 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-15586] ZFS VERIFY3(sa.sa_magic == SA_MAGIC) failed</title>
                <link>https://jira.whamcloud.com/browse/LU-15586</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;On one of our pre-production file systems, which is used for testing future functionality, we run Lustre 2.14 with ZFS 2.1.0, since we use dRAID on the OSS nodes. The same OS/ZFS/Lustre stack is used on the MDS nodes; however, in that case only a basic setup is used, with one vdev consisting of a single LUN from an all-flash array:&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
~# zpool status -Lv
  pool: mdt0-bkp
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        mdt0-bkp    ONLINE       0     0     0
          dm-2      ONLINE       0     0     0 &lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The system had been running in the pre-production environment for a few months without any issues; however, last week we observed a problem, which started with a single event on the MDS causing:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
kernel: [16066.783270] list_del corruption, ffff9ed0cc31e028-&amp;gt;next is LIST_POISON1 (dead000000000100)-
kernel: [16066.783679] kernel BUG at lib/list_debug.c:47!&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;After rebooting the MDS server and running the workload for some time, we observed:&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
kernel: VERIFY3(sa.sa_magic == SA_MAGIC) failed (8 == 3100762)
kernel: PANIC at zfs_quota.c:89:zpl_get_file_info() &lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The full trace looks like this:&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
Feb 16 21:51:53 ascratch-mds01 kernel: VERIFY3(sa.sa_magic == SA_MAGIC) failed (8 == 3100762)
Feb 16 21:51:53 ascratch-mds01 kernel: PANIC at zfs_quota.c:89:zpl_get_file_info()
Feb 16 21:51:53 ascratch-mds01 kernel: Showing stack for process 30151
Feb 16 21:51:53 ascratch-mds01 kernel: CPU: 8 PID: 30151 Comm: mdt00_096 Tainted: P          IOE    --------- -  - 4.18.0-348.2.1.el8_5.x86_64 #1
Feb 16 21:51:53 ascratch-mds01 kernel: Hardware name: Huawei 2288H V5/BC11SPSCB0, BIOS 7.99 03/11/2021
Feb 16 21:51:53 ascratch-mds01 kernel: Call Trace:
Feb 16 21:51:53 ascratch-mds01 kernel: dump_stack+0x5c/0x80
Feb 16 21:51:53 ascratch-mds01 kernel: spl_panic+0xd3/0xfb [spl]
Feb 16 21:51:53 ascratch-mds01 kernel: ? sg_init_table+0x11/0x30
Feb 16 21:51:54 ascratch-mds01 kernel: ? __sg_alloc_table+0x6e/0x170
Feb 16 21:51:54 ascratch-mds01 kernel: ? sg_alloc_table+0x1f/0x50
Feb 16 21:51:54 ascratch-mds01 kernel: ? sg_init_one+0x80/0x80
Feb 16 21:51:54 ascratch-mds01 kernel: ? _cond_resched+0x15/0x30
Feb 16 21:51:54 ascratch-mds01 kernel: ? _cond_resched+0x15/0x30
Feb 16 21:51:54 ascratch-mds01 kernel: ? mutex_lock+0xe/0x30
Feb 16 21:51:54 ascratch-mds01 kernel: ? spl_kmem_cache_alloc+0x5d/0x160 [spl]
Feb 16 21:51:54 ascratch-mds01 kernel: ? dbuf_rele_and_unlock+0x13d/0x6a0 [zfs]
Feb 16 21:51:54 ascratch-mds01 kernel: ? kmem_cache_alloc+0x12e/0x270
Feb 16 21:51:54 ascratch-mds01 kernel: ? __cv_init+0x3d/0x60 [spl]
Feb 16 21:51:54 ascratch-mds01 kernel: zpl_get_file_info+0x1ea/0x230 [zfs]
Feb 16 21:51:54 ascratch-mds01 kernel: dmu_objset_userquota_get_ids+0x1f8/0x480 [zfs]
Feb 16 21:51:54 ascratch-mds01 kernel: dnode_setdirty+0x2f/0xe0 [zfs]
Feb 16 21:51:54 ascratch-mds01 kernel: dnode_allocate+0x11d/0x180 [zfs]
Feb 16 21:51:54 ascratch-mds01 kernel: dmu_object_alloc_impl+0x32c/0x3c0 [zfs]
Feb 16 21:51:54 ascratch-mds01 kernel: dmu_object_alloc_dnsize+0x1c/0x30 [zfs]
Feb 16 21:51:54 ascratch-mds01 kernel: __osd_object_create+0x78/0x120 [osd_zfs]
Feb 16 21:51:54 ascratch-mds01 kernel: osd_mkreg+0x98/0x250 [osd_zfs]
Feb 16 21:51:54 ascratch-mds01 kernel: ? __osd_xattr_declare_set+0x190/0x260 [osd_zfs]
Feb 16 21:51:54 ascratch-mds01 kernel: osd_create+0x2c6/0xc90 [osd_zfs]
Feb 16 21:51:54 ascratch-mds01 kernel: ? __kmalloc_node+0x10e/0x2f0
Feb 16 21:51:54 ascratch-mds01 kernel: lod_sub_create+0x244/0x4a0 [lod]
Feb 16 21:51:54 ascratch-mds01 kernel: lod_create+0x4b/0x330 [lod] &lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Originally the issue appeared after running a few hundred ADF simulations in parallel. It became more frequent over time (at first it took hours to trigger, after a few crashes only minutes), and eventually we were able to trigger the problem with a simple untar of the Linux kernel sources on the Lustre filesystem, as sketched below.&lt;/p&gt;
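
&lt;p&gt;For reference, the untar reproducer was roughly of the following form; the client mount point, directory and kernel tarball name are placeholders, not the exact paths from our runs:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
# on a Lustre client; /lustre/scratch is a placeholder mount point
~# cd /lustre/scratch/testdir
~# tar xf linux-5.x.tar.xz &lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;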

&lt;p&gt;What is interesting is that even though such an untar would crash the system within seconds when run on Lustre, the same test run on the MDT dataset mounted as a regular ZFS filesystem (canmount=on) on the MDS server did not crash it; a sketch of that comparison follows.&lt;/p&gt;
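
&lt;p&gt;A minimal sketch of that kind of local-mount test is shown below; the dataset name and mount point are assumptions for illustration only, not the exact names we used:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
# dataset name and mount point are illustrative
~# zfs set canmount=on mdt0-bkp/mdt0
~# zfs set mountpoint=/mnt/mdt0-test mdt0-bkp/mdt0
~# zfs mount mdt0-bkp/mdt0
# the same untar test did not crash the node when run here
~# tar xf linux-5.x.tar.xz -C /mnt/mdt0-test &lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;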

&lt;p&gt;As a mitigation a zpool scrub was run, but it detected no errors. We then decided to restore the MDT with zfs send/recv to a fresh pool, created with dnodesize and xattr changed from their defaults (legacy, on) to dnodesize=auto and xattr=sa; see the sketch below. This solved the problem for a few hours, after which it came back.&lt;/p&gt;
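
&lt;p&gt;The scrub and the subsequent restore were along the following lines; the new pool name, device, dataset and snapshot names are illustrative only:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
# scrub of the original pool completed without reporting errors
~# zpool scrub mdt0-bkp
~# zpool status mdt0-bkp
# fresh pool with dnodesize/xattr changed from the defaults
~# zpool create -O dnodesize=auto -O xattr=sa mdt0-new /dev/mapper/newlun
# copy the MDT dataset over; names are placeholders
~# zfs snapshot mdt0-bkp/mdt0@restore
~# zfs send mdt0-bkp/mdt0@restore | zfs recv mdt0-new/mdt0 &lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;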

&lt;p&gt;We don&apos;t see any hardware issues on the system, either at the disk level or at the server level (no ECC errors etc.), so presumably this is a bug somewhere in the Lustre/ZFS stack.&lt;/p&gt;

&lt;p&gt;The only operation done at the Lustre level in the few days before the issue appeared was tagging some of the directories with project ids for project quota accounting; no quota enforcement was enabled. Once the issue appeared, all ids were cleared back to the original state (see the sketch below).&lt;/p&gt;
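
&lt;p&gt;For context, that tagging is normally done from a client with lfs project; the project id and directory below are placeholders, not the values we actually used:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
# set project id 1001 recursively and mark it inheritable
~# lfs project -p 1001 -s -r /lustre/scratch/projects/dir1
# list the project ids currently set
~# lfs project -r /lustre/scratch/projects/dir1
# clear the project id again (what we did after the issue appeared)
~# lfs project -C -r /lustre/scratch/projects/dir1 &lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;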

&lt;p&gt;We have opened two related bug reports on the OpenZFS GitHub:&lt;br/&gt;
&lt;a href=&quot;https://github.com/openzfs/zfs/issues/13143&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://github.com/openzfs/zfs/issues/13143&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/openzfs/zfs/issues/13144&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://github.com/openzfs/zfs/issues/13144&lt;/a&gt;&lt;/p&gt;

</description>
                <environment>CentOS 8.5, 4.18.0-348.2.1.el8_5.x86_64,&lt;br/&gt;
ZFS 2.1.0/2.1.3-staging&lt;br/&gt;
Lustre 2.14 (latest b2_14 build)</environment>
        <key id="68815">LU-15586</key>
            <summary>ZFS VERIFY3(sa.sa_magic == SA_MAGIC) failed</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="2" iconUrl="https://jira.whamcloud.com/images/icons/priorities/critical.svg">Critical</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="wc-triage">WC Triage</assignee>
                                    <reporter username="m.magrys">Marek Magrys</reporter>
                        <labels>
                            <label>ZFS</label>
                    </labels>
                <created>Wed, 23 Feb 2022 00:53:25 +0000</created>
                <updated>Wed, 23 Feb 2022 00:53:25 +0000</updated>
                                            <version>Lustre 2.14.0</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>4</watches>
                                                                                    <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i02j4f:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>