<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 03:02:07 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary, append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-13536] Lustre ZFS dnode kernel panic</title>
                <link>https://jira.whamcloud.com/browse/LU-13536</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Hi Folks,&lt;/p&gt;

&lt;p&gt;We recently upgraded our Lustre ZFS servers at SUT and have been experiencing issues with the ZFS filesystem crashing. Last week we upgraded from Lustre 2.10.5 (plus a dozen patches) &amp;amp; ZFS 0.7.9 to Lustre 2.12.4 &amp;amp; ZFS 0.7.13.&lt;/p&gt;

&lt;p&gt;Now if we import and mount our main zfs/lustre filesystem, resume Slurm jobs, and move on to starting the Slurm partitions, we&apos;ll hit a kernel panic on the MDS shortly after the partitions are up:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;May  8 20:12:37 warble2 kernel: VERIFY(dnode_add_ref(dn, (void *)(uintptr_t)tx-&amp;gt;tx_txg)) failed
May  8 20:12:37 warble2 kernel: PANIC at dnode.c:1635:dnode_setdirty()
May  8 20:12:37 warble2 kernel: Showing stack for process 45209
May  8 20:12:37 warble2 kernel: CPU: 7 PID: 45209 Comm: mdt01_123 Tainted: P           OE  ------------   3.10.0-1127.el7.x86_64 #1
May  8 20:12:37 warble2 kernel: Hardware name: Dell Inc. PowerEdge R740/0JM3W2, BIOS 2.5.4 01/13/2020
May  8 20:12:37 warble2 kernel: Call Trace:
May  8 20:12:37 warble2 kernel: [&amp;lt;ffffffff9077ff85&amp;gt;] dump_stack+0x19/0x1b
May  8 20:12:37 warble2 kernel: [&amp;lt;ffffffffc04d4f24&amp;gt;] spl_dumpstack+0x44/0x50 [spl]
May  8 20:12:37 warble2 kernel: [&amp;lt;ffffffffc04d4ff9&amp;gt;] spl_panic+0xc9/0x110 [spl]
May  8 20:12:37 warble2 kernel: [&amp;lt;ffffffff900c7780&amp;gt;] ? wake_up_atomic_t+0x30/0x30
May  8 20:12:37 warble2 kernel: [&amp;lt;ffffffffc0c21073&amp;gt;] ? dbuf_rele_and_unlock+0x283/0x4c0 [zfs]
May  8 20:12:37 warble2 kernel: [&amp;lt;ffffffffc04d0238&amp;gt;] ? spl_kmem_zalloc+0xd8/0x180 [spl]
May  8 20:12:37 warble2 kernel: [&amp;lt;ffffffff90784002&amp;gt;] ? mutex_lock+0x12/0x2f
May  8 20:12:37 warble2 kernel: [&amp;lt;ffffffffc0c31a2c&amp;gt;] ? dmu_objset_userquota_get_ids+0x23c/0x440 [zfs]
May  8 20:12:37 warble2 kernel: [&amp;lt;ffffffffc0c40f39&amp;gt;] dnode_setdirty+0xe9/0xf0 [zfs]
May  8 20:12:37 warble2 kernel: [&amp;lt;ffffffffc0c4120c&amp;gt;] dnode_allocate+0x18c/0x230 [zfs]
May  8 20:12:37 warble2 kernel: [&amp;lt;ffffffffc0c2dd2b&amp;gt;] dmu_object_alloc_dnsize+0x34b/0x3e0 [zfs]
May  8 20:12:37 warble2 kernel: [&amp;lt;ffffffffc1630032&amp;gt;] __osd_object_create+0x82/0x170 [osd_zfs]
May  8 20:12:37 warble2 kernel: [&amp;lt;ffffffffc163027b&amp;gt;] osd_mksym+0x6b/0x110 [osd_zfs]
May  8 20:12:37 warble2 kernel: [&amp;lt;ffffffff907850c2&amp;gt;] ? down_write+0x12/0x3d
May  8 20:12:37 warble2 kernel: [&amp;lt;ffffffffc162b966&amp;gt;] osd_create+0x316/0xaf0 [osd_zfs]
May  8 20:12:37 warble2 kernel: [&amp;lt;ffffffffc18ed9c5&amp;gt;] lod_sub_create+0x1f5/0x480 [lod]
May  8 20:12:37 warble2 kernel: [&amp;lt;ffffffffc18de179&amp;gt;] lod_create+0x69/0x340 [lod]
May  8 20:12:37 warble2 kernel: [&amp;lt;ffffffffc1622690&amp;gt;] ? osd_trans_create+0x410/0x410 [osd_zfs]
May  8 20:12:37 warble2 kernel: [&amp;lt;ffffffffc1958173&amp;gt;] mdd_create_object_internal+0xc3/0x300 [mdd]
May  8 20:12:37 warble2 kernel: [&amp;lt;ffffffffc194122b&amp;gt;] mdd_create_object+0x7b/0x820 [mdd]
May  8 20:12:37 warble2 kernel: [&amp;lt;ffffffffc194b7b8&amp;gt;] mdd_create+0xdd8/0x14a0 [mdd]
May  8 20:12:37 warble2 kernel: [&amp;lt;ffffffffc17d96d4&amp;gt;] mdt_create+0xb54/0x1090 [mdt]
May  8 20:12:37 warble2 kernel: [&amp;lt;ffffffffc119ae94&amp;gt;] ? lprocfs_stats_lock+0x24/0xd0 [obdclass]
May  8 20:12:37 warble2 kernel: [&amp;lt;ffffffffc17d9d7b&amp;gt;] mdt_reint_create+0x16b/0x360 [mdt]
May  8 20:12:37 warble2 kernel: [&amp;lt;ffffffffc17dc963&amp;gt;] mdt_reint_rec+0x83/0x210 [mdt]
May  8 20:12:37 warble2 kernel: [&amp;lt;ffffffffc17b9273&amp;gt;] mdt_reint_internal+0x6e3/0xaf0 [mdt]
May  8 20:12:37 warble2 kernel: [&amp;lt;ffffffffc17c46e7&amp;gt;] mdt_reint+0x67/0x140 [mdt]
May  8 20:12:37 warble2 kernel: [&amp;lt;ffffffffc14af64a&amp;gt;] tgt_request_handle+0xada/0x1570 [ptlrpc]
May  8 20:12:37 warble2 kernel: [&amp;lt;ffffffffc1488d91&amp;gt;] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
May  8 20:12:37 warble2 kernel: [&amp;lt;ffffffffc07dcbde&amp;gt;] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
May  8 20:12:37 warble2 kernel: [&amp;lt;ffffffffc145447b&amp;gt;] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
May  8 20:12:37 warble2 kernel: [&amp;lt;ffffffffc1451295&amp;gt;] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
May  8 20:12:37 warble2 kernel: [&amp;lt;ffffffff900d3dc3&amp;gt;] ? __wake_up+0x13/0x20
May  8 20:12:37 warble2 kernel: [&amp;lt;ffffffffc1457de4&amp;gt;] ptlrpc_main+0xb34/0x1470 [ptlrpc]
May  8 20:12:37 warble2 kernel: [&amp;lt;ffffffffc14572b0&amp;gt;] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
May  8 20:12:37 warble2 kernel: [&amp;lt;ffffffff900c6691&amp;gt;] kthread+0xd1/0xe0
May  8 20:12:37 warble2 kernel: [&amp;lt;ffffffff900c65c0&amp;gt;] ? insert_kthread_work+0x40/0x40
May  8 20:12:37 warble2 kernel: [&amp;lt;ffffffff90792d1d&amp;gt;] ret_from_fork_nospec_begin+0x7/0x21
May  8 20:12:37 warble2 kernel: [&amp;lt;ffffffff900c65c0&amp;gt;] ? insert_kthread_work+0x40/0x40&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;This issue came up once last week and twice tonight. We note there&apos;s a little bit of chatter over at &lt;a href=&quot;https://github.com/openzfs/zfs/issues/8705&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://github.com/openzfs/zfs/issues/8705&lt;/a&gt;, but no real feedback yet, and it&apos;s been open for some time now. Are there any recommendations, from the experience of the Lustre developers, on how we might mitigate this particular problem?&lt;/p&gt;

&lt;p&gt;Right now we&apos;re cloning our server image to include ZFS 0.8.3 to see if that will help.&lt;/p&gt;


&lt;p&gt;Cheers,&lt;/p&gt;

&lt;p&gt;Simon&lt;/p&gt;

</description>
                <environment>Dell R740. CentOS 7.8. Kernel: 3.10.0-1127.el7.x86_64, lustre-2.12.4-1.el7.x86_64, zfs-0.7.13-1.el7.x86_64, spl-0.7.13-1.el7.x86_64</environment>
        <key id="59109">LU-13536</key>
            <summary>Lustre ZFS dnode kernel panic</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="bzzz">Alex Zhuravlev</assignee>
                                    <reporter username="scadmin">SC Admin</reporter>
                        <labels>
                    </labels>
                <created>Fri, 8 May 2020 11:38:01 +0000</created>
                <updated>Tue, 20 Apr 2021 15:46:14 +0000</updated>
                            <resolved>Mon, 21 Sep 2020 11:04:44 +0000</resolved>
                                    <version>Lustre 2.12.4</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>9</watches>
                    <comments>
                            <comment id="269710" author="adilger" created="Fri, 8 May 2020 17:54:53 +0000"  >&lt;p&gt;Have you tried disabling the ZFS &lt;tt&gt;dnodesize=auto&lt;/tt&gt; property?  That may avoid the frequent crashes while this issue is investigated. &lt;/p&gt;</comment>
                            <comment id="269712" author="pjones" created="Fri, 8 May 2020 17:55:51 +0000"  >&lt;p&gt;Alex&lt;/p&gt;

&lt;p&gt;Can you please advise here?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="269745" author="scadmin" created="Sat, 9 May 2020 07:39:44 +0000"  >&lt;p&gt;Thanks Andreas,&lt;/p&gt;

&lt;p&gt;We have not changed the &lt;tt&gt;dnodesize&lt;/tt&gt; setting on our datasets. If we were to do so, how would we best determine the most suitable fixed value? It looks like the most common &lt;tt&gt;dnsize&lt;/tt&gt; values on our MDT are 512 &amp;amp; 1K, but that&apos;s just from skimming a small selection, not the entire FS.&lt;/p&gt;

&lt;p&gt;The change to ZFS 0.8.3 went OK. No issues with importing and mounting ZFS &amp;amp; Lustre. Starting the partition in Slurm this time did not cause a crash. That&apos;s not to say it was the cause; it just happened to be one loose theory based on a few occurrences.&lt;/p&gt;


&lt;p&gt;Cheers,&lt;/p&gt;

&lt;p&gt;Simon&lt;/p&gt;</comment>
                            <comment id="269752" author="adilger" created="Sat, 9 May 2020 11:22:47 +0000"  >&lt;p&gt;The original behavior (before dnodesize was an option) can be had with &lt;tt&gt;dnodesize=512&lt;/tt&gt;, and would be the de-facto safest option to use.  You could try using a static &lt;tt&gt;dnodesize=1024&lt;/tt&gt; to still give you better xattr performance, possibly avoiding whatever problem that &lt;tt&gt;dnodesize=auto&lt;/tt&gt; is causing, but it &lt;em&gt;may&lt;/em&gt; be that &lt;tt&gt;=1024&lt;/tt&gt; is itself the problem.&lt;/p&gt;

&lt;p&gt;It is definitely also possible that 0.8.3 has fixed issues in this area that were not backported to 0.7.13.&lt;/p&gt;</comment>
                            <comment id="269762" author="pjones" created="Sat, 9 May 2020 14:12:46 +0000"  >&lt;p&gt;I could be mistaken but wouldn&apos;t moving to ZFS 0.8.3 require also switching to RHEL 8.x servers? If so, I would caution that this is still in the early stages of support so would need some careful testing against your workloads before considering it ready for production. Work is active in this area but, at the time of writing, this is still somewhat of a WIP.&lt;/p&gt;</comment>
                            <comment id="269763" author="pjones" created="Sat, 9 May 2020 14:16:32 +0000"  >&lt;p&gt;&amp;gt; The change to ZFS 0.8.3 went OK. No issues with importing and mounting ZFS &amp;amp; Lustre&lt;/p&gt;

&lt;p&gt;Hmm, I was working backwards through my email, so I had not seen that when I posted the above. Clearly my memory is faulty on this occasion.&lt;/p&gt;</comment>
                            <comment id="279965" author="pjones" created="Fri, 18 Sep 2020 17:15:13 +0000"  >&lt;p&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=scadmin&quot; class=&quot;user-hover&quot; rel=&quot;scadmin&quot;&gt;scadmin&lt;/a&gt; just to check in this old ticket to confirm - is no news good news and the move to ZFS 0.8.x resolved this issue for you?&lt;/p&gt;</comment>
                            <comment id="280095" author="scadmin" created="Mon, 21 Sep 2020 03:43:58 +0000"  >&lt;p&gt;Thanks guys. &lt;/p&gt;

&lt;p&gt;Yes, it&apos;s possible the issue has now been resolved in the later ZFS release. That problem doesn&apos;t seem to have come back to bite us yet! Let&apos;s archive this case.&lt;/p&gt;

&lt;p&gt;Cheers,&lt;br/&gt;
simon&lt;/p&gt;</comment>
                            <comment id="280117" author="pjones" created="Mon, 21 Sep 2020 11:04:44 +0000"  >&lt;p&gt;OK - thanks!&lt;/p&gt;</comment>
                            <comment id="290357" author="degremoa" created="Tue, 26 Jan 2021 08:48:46 +0000"  >&lt;p&gt;For reference, we could not reproduce this crash after applying these two patches from zfs-0.8.0-rc3 on top of zfs-0.7.13:&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;78e213946 Fix dnode_hold() freeing dnode behavior&lt;/li&gt;
	&lt;li&gt;58769a4eb Don&#8217;t allow dnode allocation if dn_holds != 0&lt;/li&gt;
&lt;/ul&gt;
</comment>
                            <comment id="299297" author="kaizaad" created="Tue, 20 Apr 2021 14:50:29 +0000"  >&lt;p&gt;Thanks for finding those patches @degremoa&lt;/p&gt;

&lt;p&gt;@pjones We are hitting this issue with Lustre 2.12.6 and I think there have been a few other reports. Anecdotally, I think we hit this bug when we have &quot;badly behaving&quot; jobs that heavily stress the MDS.&lt;/p&gt;

&lt;p&gt;I installed the &quot;zfs-dkms-0.7.13.rpm&quot; from the Whamcloud download site along with the other required Lustre software. What do you think about Whamcloud applying the above two patches and re-rolling this RPM specifically for Lustre?&lt;/p&gt;

&lt;p&gt;thanks&lt;/p&gt;

&lt;p&gt;-k&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                                                <inwardlinks description="is related to">
                                                        </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                    <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i00zwf:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>