<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:39:16 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-10911] FLR2: Erasure coding </title>
                <link>https://jira.whamcloud.com/browse/LU-10911</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;h2&gt;&lt;a name=&quot;Overview&quot;&gt;&lt;/a&gt;Overview&lt;/h2&gt;

&lt;p&gt;Erasure coding provides a more space-efficient method for adding data redundancy than mirroring, at a somewhat higher computational cost. This would typically be used for adding redundancy for large and longer-lived files to minimize space overhead. For example, RAID-6 10+2 adds only 20% space overhead while allowing two OST failures, compared to mirroring which adds 100% overhead for single-failure redundancy or 200% overhead for double-failure redundancy. Erasure coding can add redundancy for an arbitrary number of drive failures (e.g. any 3 drives in a group of 16) with a fraction of the overhead.&lt;/p&gt;
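&lt;p&gt;The overhead figures above can be checked with a short calculation. This is only an illustrative sketch: the function names are hypothetical and are not part of any Lustre tool or API.&lt;/p&gt;

```python
from fractions import Fraction


def parity_overhead(data_stripes, parity_stripes):
    """Extra space used by parity, as a fraction of the data size."""
    return Fraction(parity_stripes, data_stripes)


def mirror_overhead(tolerated_failures):
    """Mirroring needs one additional full copy per tolerated failure."""
    return Fraction(tolerated_failures, 1)


# RAID-6 style 10+2 layout, tolerating two OST failures.
print(float(parity_overhead(10, 2)))  # 0.2
# Mirroring with the same double-failure tolerance.
print(float(mirror_overhead(2)))      # 2.0
```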

&lt;p&gt;It would be possible to implement delayed erasure coding on striped files in a similar manner to Phase 1 mirrored files, by storing the parity stripes in a separate component in the file, having a layout that indicates the erasure coding algorithm, number of data and parity stripes, stripe_size (should probably match file stripe size), etc. The encoding would be similar to RAID-4, with specific &quot;data&quot; stripes (the traditional Lustre RAID-0 file layout) in the primary component, and one or more &quot;parity&quot; stripes stored in a separate parity component, unlike RAID-5/6 that have the parity interleaved. For widely-striped files, there could be separate parity stripes for different sets of file stripes (e.g. 10x 12+3 for a 120-stripe file), so that data+parity would be able to use all of the OSTs in the filesystem without having double failures within a single parity group. For very large files, it would be possible to split the parity component into smaller extents to reduce the parity reconstruction overhead for sub-file overwrites. Erasure coding could also be added after-the-fact to existing RAID-0 striped files, after the initial file write, or when migrating a file from an active storage tier to an archive tier.&lt;/p&gt;
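&lt;p&gt;As a rough sketch of the parity-group idea for widely-striped files, the following assumes that data stripes are assigned to groups in contiguous index order and that each group owns a contiguous run of stripes in the parity component. Both the mapping and the function name are assumptions made for illustration, not the layout chosen by the implementation.&lt;/p&gt;

```python
def parity_groups(data_stripes, group_data, group_parity):
    """Split a striped file into independent data+parity groups.

    Returns a list of (data_stripe_indices, parity_stripe_indices) pairs.
    Assumes contiguous assignment, which is a hypothetical choice.
    """
    if data_stripes % group_data != 0:
        raise ValueError("data stripes must divide evenly into groups")
    groups = []
    for g in range(data_stripes // group_data):
        data = list(range(g * group_data, (g + 1) * group_data))
        parity = list(range(g * group_parity, (g + 1) * group_parity))
        groups.append((data, parity))
    return groups


# The "10x 12+3 for a 120-stripe file" example from the text.
groups = parity_groups(120, 12, 3)
print(len(groups))                      # number of parity groups
print(sum(len(p) for _, p in groups))   # total parity stripes
```

&lt;p&gt;With this grouping, each group tolerates up to three stripe failures independently of the others.&lt;/p&gt;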

&lt;p&gt;Reads from an erasure-coded file would normally use only the primary RAID-0 component (unless data verification on read was also desired), as with non-redundant files. If a stripe in the primary component for the file fails, the client would read the remaining data stripes and one or more stripes from the parity component and reconstruct the data from parity on the fly, and/or depend on the resync tool to reconstruct the failed stripe from parity.&lt;/p&gt;
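&lt;p&gt;The reconstruction step can be illustrated with the simplest possible code, a single XOR parity stripe as in RAID-4. This is a deliberately simplified stand-in: tolerating more than one failure requires a stronger code (for example Reed-Solomon), and nothing below reflects the actual Lustre client code.&lt;/p&gt;

```python
def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


# Three data stripes plus one XOR parity stripe (toy block contents).
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Simulate losing data stripe 1: rebuild it from the survivors and parity.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
assert rebuilt == data[1]
```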

&lt;p&gt;Writes to an erasure-coded file would mark the parity component stale over the extent of the data component that was modified, as with a regular mirrored file, and writes would continue on the primary RAID-0 striped file. The main difference from an FLR mirrored file is that the writes would always need to go to the primary data component, and the parity component would always be marked stale. It would not be possible to write to an erasure-coded file that has a failure in a primary stripe without first reconstructing it from parity.&lt;/p&gt;
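&lt;p&gt;A minimal sketch of the stale-marking rule, assuming the parity component is split into fixed-size extents and that a write marks every parity extent it overlaps as stale. The fixed extent size and the function name are hypothetical and chosen only to make the idea concrete.&lt;/p&gt;

```python
def stale_parity_extents(write_start, write_end, parity_extent_size):
    """Indices of parity extents overlapped by the write [write_start, write_end).

    Offsets are file byte offsets; a zero-length write marks nothing stale.
    """
    if write_end == write_start:
        return []
    first = write_start // parity_extent_size
    last = (write_end - 1) // parity_extent_size
    return list(range(first, last + 1))


GIB = 2 ** 30
# A small write at the start of the file touches only the first extent.
print(stale_parity_extents(0, 4096, GIB))
# A write straddling an extent boundary marks both neighbours stale.
print(stale_parity_extents(GIB - 1, GIB + 1, GIB))
```

&lt;p&gt;Smaller parity extents mean less parity has to be regenerated after a sub-file overwrite, at the cost of more components to track.&lt;/p&gt;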

&lt;h2&gt;&lt;a name=&quot;SpaceEfficientDataRedundancy&quot;&gt;&lt;/a&gt;Space Efficient Data Redundancy&lt;/h2&gt;

&lt;p&gt;Erasure coding will make it possible to add full redundancy to large files or whole filesystems without resorting to full mirroring. This will allow striped Lustre files to store redundancy in parity components that allow recovery from a specified number of OST failures (e.g. 3 OST failures per 12 stripes, or 4 OST failures per 24 stripes) in a manner similar to RAID-4 with fixed parity stripes.&lt;/p&gt;

&lt;h2&gt;&lt;a name=&quot;RequiredLustreFunctionality&quot;&gt;&lt;/a&gt;Required Lustre Functionality&lt;/h2&gt;
&lt;h3&gt;&lt;a name=&quot;ErasureCodedFileRead&quot;&gt;&lt;/a&gt;Erasure Coded File Read&lt;/h3&gt;

&lt;p&gt;The actual parity generation will be done with the &lt;tt&gt;lfs mirror resync&lt;/tt&gt; tool in userspace.  The Lustre client will do normal reads from the RAID-0 data component, unless there is an OST failure or other error reading from a data stripe.  Add support for data reconstruction from the data and parity components, leveraging existing functionality for reading mirrored files.&lt;/p&gt;

&lt;h3&gt;&lt;a name=&quot;ErasureCodedFileWrite&quot;&gt;&lt;/a&gt;Erasure Coded File Write&lt;/h3&gt;
&lt;p&gt;To avoid losing redundancy on erasure-coded files that are modified, the Mirrored File Writes functionality could be used during writes to such files. Changes would be merged into the erasure coded component after the file is closed, using the Phase 1 ChangeLog consumer, and then the mirror component can be dropped.&lt;/p&gt;

&lt;h2&gt;&lt;a name=&quot;ExternalComponents&quot;&gt;&lt;/a&gt;External Components&lt;/h2&gt;
&lt;h3&gt;&lt;a name=&quot;ErasureCodedResyncTool&quot;&gt;&lt;/a&gt;Erasure Coded Resync Tool&lt;/h3&gt;

&lt;p&gt;The &lt;tt&gt;lfs mirror resync&lt;/tt&gt; tool needs to be updated to generate the erasure code for the striped file, storing the parity in a separate component from the main RAID-0 striped file. There are CPU-optimized implementations of the erasure coding algorithms available, so the majority of the work would be integrating these optimized routines into the Lustre kernel modules and userspace tools, rather than actually developing the encoding algorithms. &lt;/p&gt;</description>
                <environment></environment>
        <key id="51807">LU-10911</key>
            <summary>FLR2: Erasure coding </summary>
                <type id="2" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11311&amp;avatarType=issuetype">New Feature</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="bobijam">Zhenyu Xu</assignee>
                                    <reporter username="adilger">Andreas Dilger</reporter>
                        <labels>
                            <label>FLR2</label>
                    </labels>
                <created>Fri, 13 Apr 2018 08:19:14 +0000</created>
                <updated>Wed, 28 Jun 2023 20:26:35 +0000</updated>
                                                                                <due></due>
                            <votes>0</votes>
                                    <watches>16</watches>
                                                                            <comments>
                            <comment id="263924" author="nrutman" created="Mon, 24 Feb 2020 21:38:49 +0000"  >&lt;p&gt;Is this still planned for 2.14? Any progress? This ticket doesn&apos;t seem to get updated; am I looking in the wrong place?&lt;/p&gt;</comment>
                            <comment id="263928" author="adilger" created="Mon, 24 Feb 2020 22:30:32 +0000"  >&lt;p&gt;The plan is still to get this into 2.14. There are patches in Gerrit that could probably be refreshed. As always, review of the patches would be welcome. &lt;/p&gt;</comment>
                            <comment id="263929" author="adilger" created="Mon, 24 Feb 2020 22:34:12 +0000"  >&lt;p&gt;The patches are in Gerrit under the sub-tasks linked above. &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12186&quot; title=&quot;EC: add necessary structure to adopt erasure coding layout&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-12186&quot;&gt;LU-12186&lt;/a&gt; thru &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12189&quot; title=&quot;EC: import isa-l library in Lustre build&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-12189&quot;&gt;&lt;del&gt;LU-12189&lt;/del&gt;&lt;/a&gt; and &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12668&quot; title=&quot;EC: resync parity components&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-12668&quot;&gt;LU-12668&lt;/a&gt; and &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12669&quot; title=&quot;EC: recover data from parity code&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-12669&quot;&gt;LU-12669&lt;/a&gt; (I think these last two are still finishing development).&lt;/p&gt;</comment>
                            <comment id="264439" author="shadow" created="Tue, 3 Mar 2020 05:04:52 +0000"  >&lt;p&gt;Andreas, can the correct Gerrit links be provided in the tickets?&lt;br/&gt;
&lt;a href=&quot;https://review.whamcloud.com/34678&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/34678&lt;/a&gt; isn&apos;t valid anymore.&lt;br/&gt;
The same goes for the others.&lt;/p&gt;</comment>
                            <comment id="266074" author="gerrit" created="Wed, 25 Mar 2020 10:33:26 +0000"  >&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;ignore this, patch pushed under wrong ticket #&amp;#93;&lt;/span&gt;&lt;/p&gt;</comment>
                            <comment id="266075" author="gerrit" created="Wed, 25 Mar 2020 10:33:26 +0000"  >&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;ignore this, patch pushed under wrong ticket #&amp;#93;&lt;/span&gt;&lt;/p&gt;</comment>
                            <comment id="268320" author="adilger" created="Thu, 23 Apr 2020 03:54:54 +0000"  >&lt;p&gt;Bobijam,&lt;br/&gt;
since we are very close to the end of the 2.14 feature landing window, I think it makes sense to submit the patches initially so that they are conditionally compiled under &lt;tt&gt;#ifdef ISAL_ENABLED&lt;/tt&gt;, so that they can be landed and tested not to cause any problems with the current master code (i.e. the code is mostly a no-op initially).  Then, patches can be landed to enable ISAL_ENABLED during the build, and tests should be conditional on this support (so there needs to be some way to detect it in userspace).&lt;/p&gt;

&lt;p&gt;That will ensure that the EC code is included as part of the 2.14 release, and gives us more time to improve the build system, fix EC bugs, etc.  We would want to have the &lt;tt&gt;#ifdef ISAL_ENABLED&lt;/tt&gt; checks for this code anyway, so that Lustre can still build if &lt;tt&gt;ISA-L&lt;/tt&gt; is not available/usable for some systems.  We shouldn&apos;t leave it disabled for a long time, because untested code is going to break quickly, but the 2.14 feature landing window is supposed to close on April 30 (already 2 weeks late), and I think there are still changes that need to be finished before this feature is ready.  Those can still be worked on after the feature is landed to master, before the 2.14 final release.&lt;/p&gt;</comment>
                            <comment id="268323" author="bobijam" created="Thu, 23 Apr 2020 04:29:53 +0000"  >&lt;p&gt;yes, great insight, ISAL_ENABLE could be used to protect pre-EC file behavior and smooth the transition.&lt;/p&gt;</comment>
                            <comment id="268326" author="shadow" created="Thu, 23 Apr 2020 06:10:40 +0000"  >&lt;p&gt;Andreas, I&apos;m confused. Are you OK with landing untested / buggy code?&lt;/p&gt;</comment>
                            <comment id="274880" author="shadow" created="Thu, 9 Jul 2020 14:51:08 +0000"  >&lt;p&gt;Can someone provide a better HLD than the one attached? This document only covers some userspace tools and some common structure changes. It doesn&apos;t describe anything about parity calculation, especially the case where a rewrite doesn&apos;t cover whole data stripes and old data needs to be read to calculate the parity. There are no failure scenarios in the document and no recovery handling, yet recovery looks very complex in this case. There is no description of how it plans to avoid rewriting parity with old data when two parity updates are in flight (a CR lock permits this). The lock protection for parity between nodes is poorly described, for the case where two nodes write in parallel to half of the data stripes.&lt;br/&gt;
There is no description of compatibility with old clients.&lt;/p&gt;

&lt;p&gt;Can the design document be updated to address these questions?&lt;/p&gt;</comment>
                            <comment id="295246" author="simmonsja" created="Wed, 17 Mar 2021 16:14:48 +0000"  >&lt;p&gt;Just an update. We have moved the flr branch to the latest master and have been running normal sanity tests. Currently we are fixing various bugs we are encountering.&lt;/p&gt;</comment>
                            <comment id="299524" author="simmonsja" created="Thu, 22 Apr 2021 20:01:54 +0000"  >&lt;p&gt;I just did a rebase to the latest master and I get a build error with the latest code due to the landing of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12142&quot; title=&quot;Hang in OSC on eviction - threads stuck in read() and ldlm_bl_NN&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-12142&quot;&gt;&lt;del&gt;LU-12142&lt;/del&gt;&lt;/a&gt;. For lov_io_lru_reserve() we use lov_foreach_io_layout(), and lov_io_fault_store() uses lov_io_layout_at(). Both functions have changed to handle both LCT_DATA and LCT_CODE types. The question is whether it is safe to just pass LCT_DATA in both cases, or whether we need to examine every component to see which LCT_* type we have.&lt;/p&gt;</comment>
                            <comment id="299542" author="bobijam" created="Fri, 23 Apr 2021 02:32:12 +0000"  >&lt;p&gt;I think it&apos;s ok to just pass LCT_DATA in both cases, parity code pages won&apos;t be cached after EC IO since they are ephemeral and later EC IO could use other parity components.&lt;/p&gt;</comment>
                            <comment id="300470" author="simmonsja" created="Tue, 4 May 2021 19:26:33 +0000"  >&lt;p&gt;In my testing I&apos;m seeing:&lt;/p&gt;

&lt;p&gt;kernel: Lustre: DEBUG MARKER: == sanity test 130g: FIEMAP (overstripe file) ================================================&lt;br/&gt;
======== 14:15:49 (1620152149)&lt;br/&gt;
kernel: Lustre: 42446:0:(osd_handler.c:1938:osd_trans_start()) lustre-MDT0000: credits 19393 &amp;gt; trans_max 9984&lt;br/&gt;
kernel: Lustre: 42446:0:(osd_handler.c:1867:osd_trans_dump_creds()) &#160;create: 300/1200/0, destroy: 1/4/0&lt;br/&gt;
kernel: Lustre: 42446:0:(osd_handler.c:1867:osd_trans_dump_creds()) Skipped 4001 previous similar messages&lt;br/&gt;
kernel: Lustre: 42446:0:(osd_handler.c:1874:osd_trans_dump_creds()) &#160;attr_set: 3/3/0, xattr_set: 304/148/0&lt;br/&gt;
kernel: Lustre: 42446:0:(osd_handler.c:1874:osd_trans_dump_creds()) Skipped 4001 previous similar messages&lt;br/&gt;
kernel: Lustre: 42446:0:(osd_handler.c:1884:osd_trans_dump_creds()) &#160;write: 1501/12910/0, punch: 0/0/0, quota 4/4/0&lt;br/&gt;
kernel: Lustre: 42446:0:(osd_handler.c:1884:osd_trans_dump_creds()) Skipped 4001 previous similar messages&lt;br/&gt;
kernel: Lustre: 42446:0:(osd_handler.c:1891:osd_trans_dump_creds()) &#160;insert: 301/5116/0, delete: 2/5/0&lt;br/&gt;
kernel: Lustre: 42446:0:(osd_handler.c:1891:osd_trans_dump_creds()) Skipped 4001 previous similar messages&lt;br/&gt;
kernel: Lustre: 42446:0:(osd_handler.c:1898:osd_trans_dump_creds()) Skipped 4001 previous similar messages&lt;br/&gt;
kernel: Pid: 42446, comm: mdt03_001 3.10.0-1160.15.2.el7.x86_64 #1 SMP Thu Jan 21 16:15:07 EST 2021&lt;br/&gt;
kernel: Call Trace:&lt;br/&gt;
kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;0&amp;gt;&amp;#93;&lt;/span&gt; libcfs_call_trace+0x90/0xf0 &lt;span class=&quot;error&quot;&gt;&amp;#91;libcfs&amp;#93;&lt;/span&gt;&lt;br/&gt;
kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;0&amp;gt;&amp;#93;&lt;/span&gt; osd_trans_start+0x4bb/0x4e0 &lt;span class=&quot;error&quot;&gt;&amp;#91;osd_ldiskfs&amp;#93;&lt;/span&gt;&lt;/p&gt;</comment>
                            <comment id="300473" author="adilger" created="Tue, 4 May 2021 19:51:35 +0000"  >&lt;blockquote&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt; kernel: Lustre: 42446:0:(osd_handler.c:1938:osd_trans_start()) lustre-MDT0000: credits 19393 &amp;gt; trans_max 9984
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;&lt;/blockquote&gt;

&lt;p&gt;That is probably introduced by patches from &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14134&quot; title=&quot;reduce credits for new writing potentially&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14134&quot;&gt;&lt;del&gt;LU-14134&lt;/del&gt;&lt;/a&gt;, possibly combined with large write RPCs.  It isn&apos;t really fatal, but annoying and should be fixed.&lt;/p&gt;

&lt;p&gt;There is a prototype patch in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14641&quot; title=&quot;per extents bytes allocation stats&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14641&quot;&gt;&lt;del&gt;LU-14641&lt;/del&gt;&lt;/a&gt; that would be useful to test if you can reproduce this easily.&lt;/p&gt;</comment>
                            <comment id="376788" author="simmonsja" created="Wed, 28 Jun 2023 20:11:57 +0000"  >&lt;p&gt;An outside party has contacted our group at ORNL, so we pushed the current prototype for early review with them. This project is at the beta code stage.&lt;/p&gt;</comment>
                            <comment id="376792" author="shadow" created="Wed, 28 Jun 2023 20:26:35 +0000"  >&lt;p&gt;James, can you add some comments about recovery with FLR2? How is it planned to determine which stripe is good and which is outdated and needs to be reconstructed?&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="56621">LU-12649</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="57881">LUDOC-463</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="76134">LU-16837</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="32421" name="Erasure Coding HDL.docx" size="58834" author="bobijam" created="Tue, 16 Apr 2019 08:49:37 +0000"/>
                    </attachments>
                <subtasks>
                            <subtask id="55410">LU-12186</subtask>
                            <subtask id="55411">LU-12187</subtask>
                            <subtask id="55412">LU-12188</subtask>
                            <subtask id="55421">LU-12189</subtask>
                            <subtask id="56671">LU-12668</subtask>
                            <subtask id="56672">LU-12669</subtask>
                    </subtasks>
                <customfields>
                                                                                                                                                                <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzzvt3:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>