<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:40:11 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary, append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-11013] Data Corruption error on Lustre ZFS dRaid</title>
                <link>https://jira.whamcloud.com/browse/LU-11013</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Setting up a Lustre testbed at ANL with:&#160;&lt;/p&gt;
&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;4 OSSs, total 16 OSTs (8 JBODs, each with 60 HDDs)&lt;/li&gt;
	&lt;li&gt;Hybrid Lustre: MGT/MDT - mdraid - raid10 - ldiskfs, OST - zfs - dRaid&lt;/li&gt;
	&lt;li&gt;MGT - 2 SSDs, raid10 - ldiskfs&lt;/li&gt;
	&lt;li&gt;MDT0 - 12 SSDs, raid10 - ldiskfs&lt;/li&gt;
	&lt;li&gt;MDT1 - 10 SSDs, raid10 - ldiskfs&lt;/li&gt;
	&lt;li&gt;Each OST: 30 HDDs, zfs dRaid with 3*(8+1) + 1&lt;/li&gt;
&lt;/ul&gt;



&lt;ul&gt;
	&lt;li&gt;After filling the fs to about 99%, we hit a data corruption problem once we cleaned up the fs and ran zfs scrub. It was quite severe and ended up crashing the fs.&lt;/li&gt;
	&lt;li&gt;Rebuilt the Lustre dRaid fs and tested again in order to reproduce the problem.&lt;/li&gt;
	&lt;li&gt;On the first iteration of fill and clean-up, the fs held up. We only got &quot;One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.&quot; on two dRaid zpools, so we just cleared those errors.&lt;/li&gt;
	&lt;li&gt;On the 2nd iteration we were finally able to reproduce the error: after emptying the file system and running scrub, we got the same data corruption problem (&quot;One or more devices has experienced an error resulting in data corruption. Applications may be affected.&quot;).&lt;/li&gt;
	&lt;li&gt;After changing the zpool to raidz2 with 3*(8+2), we don&apos;t see this problem.&lt;/li&gt;
&lt;/ul&gt;
</description>
                <environment>RHEL-7.4, in-kernel ofed, mellanox FDR10, Lustre-2.10.3, dRaid-(pull-7078), dm-multipath</environment>
        <key id="52204">LU-11013</key>
            <summary>Data Corruption error on Lustre ZFS dRaid</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="5">Cannot Reproduce</resolution>
                                        <assignee username="isaac">Isaac Huang</assignee>
                                    <reporter username="kalfizah">Kurniawan Alfizah</reporter>
                        <labels>
                    </labels>
                <created>Thu, 10 May 2018 16:06:31 +0000</created>
                <updated>Tue, 11 May 2021 11:06:22 +0000</updated>
                            <resolved>Tue, 11 May 2021 11:06:22 +0000</resolved>
                                    <version>Lustre 2.10.3</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>5</watches>
                                                                            <comments>
                            <comment id="227639" author="adilger" created="Thu, 10 May 2018 17:40:18 +0000"  >&lt;p&gt;Have you tried this with native ZFS+dRAID to see if it hits the same corruption? That would isolate the problem to dRAID vs. an interaction between Lustre and dRAID.&#160;&lt;/p&gt;

&lt;p&gt;Note that you should make the native ZFS dataset the same way as Lustre, namely to enable &lt;tt&gt;recordsize=1024k&lt;/tt&gt;, &lt;tt&gt;dnodesize=auto&lt;/tt&gt;, &lt;tt&gt;multimount&lt;/tt&gt;. It might be best to format the OST with &lt;tt&gt;mkfs.lustre&lt;/tt&gt; as today, then set &lt;tt&gt;canmount=yes&lt;/tt&gt; and mount it locally for your testing. &lt;/p&gt;</comment>
                            <comment id="227853" author="isaac" created="Mon, 14 May 2018 22:18:50 +0000"  >&lt;p&gt;When was the dRAID code last refreshed? I pushed quite a few changes a couple of weeks ago - please make sure to run the latest code. Did you build zfs and spl with&#160;&lt;em&gt;--enable-debug&lt;/em&gt;&#160;and set the zfs module option&#160;draid_debug_lvl=5? Also, please see the link below for what debug information to gather for dRAID bugs:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/zfsonlinux/zfs/wiki/dRAID-HOWTO#troubleshooting&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://github.com/zfsonlinux/zfs/wiki/dRAID-HOWTO#troubleshooting&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="228556" author="kalfizah" created="Thu, 24 May 2018 16:27:15 +0000"  >&lt;p&gt;We&apos;re using this one:&#160;&lt;a href=&quot;https://github.com/zfsonlinux/zfs/pull/7078&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://github.com/zfsonlinux/zfs/pull/7078&lt;/a&gt;.&#160;I cloned it around early March &apos;18.&lt;/p&gt;

&lt;p&gt;Btw, following Andreas&apos; suggestion, I think I might be able to re-create the problem in our VM cluster. I created a VM with 30 virtual HDDs, filled them up to about 98%, and then removed the files; I got data corruption. Same thing with or without Lustre. But I don&apos;t see the problem with raidz2.&lt;/p&gt;

&lt;p&gt;On wolf-16, I created the dRaid with 30 HDDs and then filled them up; I even managed to crash ZFS itself. That one was a ZFS build from Isaac though, so it could be a different problem.&lt;/p&gt;</comment>
                            <comment id="301160" author="adilger" created="Tue, 11 May 2021 11:06:22 +0000"  >&lt;p&gt;This is presumably fixed in the ZFS 2.1 dRAID implementation upstream.&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzzx1j:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>