<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 03:29:41 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-16750] optimize ldiskfs internal metadata allocation for hybrid storage LUNs</title>
                <link>https://jira.whamcloud.com/browse/LU-16750</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;With hybrid storage LUNs (combined HDD + SSD, or QLC+TLC flash) it is desirable to be able to separate ldiskfs metadata allocations (that need small random IOs) from data allocations (that are better suited for large sequential IOs) depending on the type of underlying storage.  With LVM it is possible to create an LV with SSD storage at the beginning of the LV, and HDD storage at the end of the LV.  Between 0.5% and 1% of the OST capacity would need to be high-IOPS storage in order to hold all of the internal ldiskfs metadata.&lt;/p&gt;

&lt;p&gt;This would improve performance for inode and other metadata access, such as &lt;tt&gt;ls -l&lt;/tt&gt;, &lt;tt&gt;(lfs) find&lt;/tt&gt;, &lt;tt&gt;e2fsck&lt;/tt&gt;, and in general file access latency, modification, truncate, unlink, transaction commit, etc.&lt;/p&gt;

&lt;p&gt;For &lt;tt&gt;mke2fs&lt;/tt&gt;, the following options look interesting for hybrid storage, so that all of the static ldiskfs metadata (group descriptors, block/inode bitmaps, inode tables, journal) is located at the start of the device in the (fast) flash region:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;mkfs.lustre --mgsname testfs --ost --index=0 --mkfsoptions=&quot;-O sparse_super2 -E num_backup_sb=2,packed_meta_blocks=1&quot; ... /dev/vgost0/lvost0

sparse_super2
      This feature indicates that there will only be at most two
      backup superblocks and block group descriptors.   The block
      groups used to store the backup superblock(s) and blockgroup
      descriptor(s) are stored in the superblock, but typically, one
      will be located at the beginning of block group #1, and one in
      the last block group in the file system.  This feature is essentially
      a more extreme version of sparse_super and is designed to
      allow a much larger percentage of the disk to have contiguous
      blocks available for data files.

num_backup_sb=&amp;lt;0,1,2&amp;gt;
      If the sparse_super2 file system feature is enabled
      this option controls whether there will be 0, 1, or
      2 backup superblocks created in the file system.

packed_meta_blocks=&amp;lt;0,1&amp;gt;
      Place  the allocation bitmaps and the inode table at
      the beginning of the disk.  This option requires
      that the flex_bg file system feature be enabled
      in order for it to have effect, and will also create
      the journal at the beginning of the file system.
      This option is useful for flash devices that use SLC
      flash at the beginning of the disk.  It also maximizes
      the range of contiguous data blocks, which can be
      useful for certain specialized use cases, such as
      supported Shingled Drives.
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Unfortunately, there is not (yet) any mechanism to force dynamic metadata (directory blocks, indirect/index blocks, xattr blocks) to be allocated in the fast region at the start of the device.  It makes sense for &lt;tt&gt;mke2fs&lt;/tt&gt; and/or &lt;tt&gt;tune2fs&lt;/tt&gt; to be able to mark &quot;fast&quot; groups in the group descriptor with a flag, like:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
#define EXT4_BG_IOPS     0x0010
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;(note that &lt;tt&gt;EXT4_BG_WAS_TRIMMED = 0x0008&lt;/tt&gt; is &lt;a href=&quot;https://patchwork.ozlabs.org/project/linux-ext4/patch/1592831677-13945-1-git-send-email-wangshilong1991@gmail.com/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;tentatively reserved&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;This could be set at format time (e.g. &quot;&lt;tt&gt;-E iops=0-1024G,4096-8192G&lt;/tt&gt;&quot; or similar, to indicate where the &quot;IOPS&quot; storage lives). Since it is a per-group flag, it could also be applied at 128MB granularity for more arbitrary separation of &quot;&lt;tt&gt;IOPS&lt;/tt&gt;&quot; vs. &quot;slow&quot; storage (e.g. adding &quot;&lt;tt&gt;IOPS&lt;/tt&gt;&quot; storage at the end of the device, or interleaving it in smaller or larger chunks if the filesystem is resized after creation).&lt;/p&gt;
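
&lt;p&gt;As a rough illustration of the 128MB granularity, the sketch below maps a range specification like the one above onto block group numbers. It assumes 4KB blocks (32768 blocks, i.e. 128MiB, per group), that a bare start value takes the unit suffix of the range end, and that only fully covered groups are marked. &lt;tt&gt;parse_size&lt;/tt&gt; and &lt;tt&gt;iops_groups&lt;/tt&gt; are hypothetical helper names; this is not a description of how &lt;tt&gt;mke2fs&lt;/tt&gt; parses the option.&lt;/p&gt;

```python
# Hypothetical helpers: map an "iops" range spec such as "0-1024G,4096-8192G"
# to block group numbers. The parsing rules are assumptions for illustration.

GROUP_BYTES = 128 * 2**20  # 128 MiB per group (4 KiB blocks, 32768 per group)
UNITS = {"K": 2**10, "M": 2**20, "G": 2**30, "T": 2**40}


def parse_size(text, default_unit):
    """Convert '1024G' (or a bare '4096' using default_unit) to bytes."""
    text = text.strip().upper()
    has_unit = text[-1] in UNITS
    unit = text[-1] if has_unit else default_unit
    digits = text[:-1] if has_unit else text
    return int(digits) * UNITS[unit]


def iops_groups(spec):
    """Return a list of inclusive (first_group, last_group) tuples."""
    result = []
    for part in spec.split(","):
        start_text, end_text = part.split("-")
        # Assumption: a bare start value inherits the unit of the range end.
        end_unit = end_text.strip().upper()[-1]
        if end_unit not in UNITS:
            raise ValueError("range end needs a unit suffix: " + part)
        start = parse_size(start_text, end_unit)
        end = parse_size(end_text, end_unit)
        if start >= end:
            raise ValueError("empty or reversed range: " + part)
        first = -(-start // GROUP_BYTES)  # round the start up to a group
        last = end // GROUP_BYTES - 1     # round the end down to a group
        if last >= first:                 # skip ranges smaller than one group
            result.append((first, last))
    return result


print(iops_groups("0-1024G,4096-8192G"))
# prints [(0, 8191), (32768, 65535)]
```

&lt;p&gt;With these assumptions, each 1024GB of IOPS storage corresponds to 8192 flagged groups.&lt;/p&gt;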

&lt;p&gt;The mballoc code could then use the &lt;tt&gt;IOPS&lt;/tt&gt; flag in the group descriptor to decide which groups to allocate dynamic filesystem metadata from, since such metadata benefits from high-IOPS storage.  Since the block allocator knows that this storage is IOPS-oriented, it can pack these (mostly single-block) allocations densely rather than trying to align large allocations.&lt;/p&gt;

&lt;p&gt;Keeping &lt;tt&gt;IOPS&lt;/tt&gt; allocations in separate block groups also keeps such allocations out of the non-&lt;tt&gt;IOPS&lt;/tt&gt; groups, leaving those groups better able to handle large streaming read/write operations.  This is similar to the benefit seen with DoM + HDD OSTs at the Lustre file level, but without the runtime/Lustre layout complexity.&lt;/p&gt;

&lt;p&gt;For the new mballoc list-based allocator (&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12970&quot; title=&quot;improve mballoc for huge filesystems&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-12970&quot;&gt;LU-12970&lt;/a&gt;) the presence of groups marked &lt;tt&gt;IOPS&lt;/tt&gt; would be best handled by creating a second size-array of &lt;tt&gt;list_heads&lt;/tt&gt; sorting the &lt;tt&gt;IOPS&lt;/tt&gt; groups by free blocks size.  Then, when doing a block allocation for a directory, or an indirect/index block, or an xattr block, mballoc can look into the &lt;tt&gt;IOPS&lt;/tt&gt; array instead of the regular array.  The fact that these metadata blocks are not close to the referencing inodes is mostly irrelevant, since they are on a different block device, and (by nature of the underlying storage) have no seek latency.&lt;/p&gt;
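
&lt;p&gt;The two-array idea can be modelled roughly as follows. This is only a sketch under stated assumptions: &lt;tt&gt;Group&lt;/tt&gt;, &lt;tt&gt;build_buckets&lt;/tt&gt; and &lt;tt&gt;pick_group&lt;/tt&gt; are hypothetical names that do not correspond to mballoc functions, the boolean &lt;tt&gt;iops&lt;/tt&gt; field stands in for the proposed group descriptor flag, and groups are bucketed by the power-of-two order of their free block count.&lt;/p&gt;

```python
# Hypothetical model of the two size-arrays: one for IOPS groups, one for
# regular groups. All names are illustrative, not mballoc code.
from dataclasses import dataclass

MAX_ORDER = 16  # orders 0..15; 2**15 = 32768 blocks per group (4 KiB blocks)


@dataclass
class Group:
    number: int
    free_blocks: int
    iops: bool  # stands in for the proposed EXT4_BG_IOPS flag


def order_of(nblocks):
    """Bucket index: floor(log2(nblocks)) for nblocks of at least 1."""
    return nblocks.bit_length() - 1


def build_buckets(groups, want_iops):
    """One list per order, holding only groups whose flag matches want_iops."""
    buckets = [[] for _ in range(MAX_ORDER)]
    for g in groups:
        if g.iops == want_iops and g.free_blocks >= 1:
            buckets[order_of(g.free_blocks)].append(g)
    return buckets


def pick_group(iops_buckets, data_buckets, nblocks, is_metadata):
    """Return the first group with enough free blocks from the matching array."""
    buckets = iops_buckets if is_metadata else data_buckets
    # Start at the smallest order that could hold the request, so small
    # metadata allocations land in the fullest groups that still fit.
    for order in range(order_of(nblocks), MAX_ORDER):
        for g in buckets[order]:
            if g.free_blocks >= nblocks:
                return g
    return None


groups = [
    Group(0, 100, True),
    Group(1, 3, True),
    Group(2, 30000, False),
    Group(3, 5, False),
]
iops_buckets = build_buckets(groups, True)
data_buckets = build_buckets(groups, False)

# A single directory block is taken from an IOPS group.
print(pick_group(iops_buckets, data_buckets, 1, True).number)
# A large data extent is taken from a regular group.
print(pick_group(iops_buckets, data_buckets, 1024, False).number)
```

&lt;p&gt;The sketch returns nothing when the chosen array has no suitable group; what the allocator should fall back to in that case is left open here.&lt;/p&gt;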

</description>
                <environment></environment>
        <key id="75639">LU-16750</key>
            <summary>optimize ldiskfs internal metadata allocation for hybrid storage LUNs</summary>
                <type id="4" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11310&amp;avatarType=issuetype">Improvement</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="bobijam">Zhenyu Xu</assignee>
                                    <reporter username="adilger">Andreas Dilger</reporter>
                        <labels>
                            <label>ldiskfs</label>
                    </labels>
                <created>Thu, 20 Apr 2023 01:41:06 +0000</created>
                <updated>Mon, 18 Sep 2023 06:26:26 +0000</updated>
                                                                                <due></due>
                            <votes>0</votes>
                                    <watches>7</watches>
                                                                            <comments>
                            <comment id="375204" author="adilger" created="Tue, 13 Jun 2023 09:40:48 +0000"  >&lt;p&gt;It is expected that the size of the IOPS storage is relatively small compared to the non-IOPS storage.  About 0.5% is enough to hold the static metadata (inode tables, bitmaps, etc.) plus enough extra space for dynamically allocated metadata (directory blocks, indirect/index blocks, xattr blocks unless there are many large xattrs).  As such, it makes sense to reserve the IOPS storage exclusively for metadata usage, and the non-IOPS storage should be preferred for data unless there is no free non-IOPS space.&lt;/p&gt;

&lt;p&gt;The IOPS storage should not normally be used for data, but it makes sense to have a tunable parameter (e.g. &lt;tt&gt;/sys/fs/ext4/sdX/iops_free_threshold&lt;/tt&gt; or similar) that controls at what percentage of free space the IOPS groups could be used for data allocations.  Normally this would be &lt;tt&gt;=0&lt;/tt&gt;, meaning the IOPS space should never be used for data, but it could be set to e.g. 1% or 5% (or whatever) free (e.g. when filesystem is above 99% or 95% full) if there is a lot of IOPS space and the administrator really wants to use it for data.&lt;/p&gt;</comment>
                            <comment id="378218" author="gerrit" created="Tue, 11 Jul 2023 05:44:46 +0000"  >&lt;p&gt;&quot;Zhenyu Xu &amp;lt;bobijam@hotmail.com&amp;gt;&quot; uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/c/fs/lustre-release/+/51625&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/c/fs/lustre-release/+/51625&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-16750&quot; title=&quot;optimize ldiskfs internal metadata allocation for hybrid storage LUNs&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-16750&quot;&gt;LU-16750&lt;/a&gt; ldiskfs: optimize metadata allocation for hybrid LUNs&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: c9e31dd0512bbb10bea5bf093ed607222b84f782&lt;/p&gt;</comment>
                            <comment id="379737" author="gerrit" created="Fri, 21 Jul 2023 19:14:50 +0000"  >&lt;p&gt;&quot;Zhenyu Xu &amp;lt;bobijam@hotmail.com&amp;gt;&quot; uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/c/tools/e2fsprogs/+/51735&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/c/tools/e2fsprogs/+/51735&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-16750&quot; title=&quot;optimize ldiskfs internal metadata allocation for hybrid storage LUNs&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-16750&quot;&gt;LU-16750&lt;/a&gt; mke2fs: add &quot;-E iops&quot; to set IOPS storage group&lt;br/&gt;
Project: tools/e2fsprogs&lt;br/&gt;
Branch: master-lustre&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 177299183e81be66b1c8ead4755357452e87f8a2&lt;/p&gt;</comment>
                            <comment id="381575" author="gerrit" created="Mon, 7 Aug 2023 14:09:43 +0000"  >&lt;p&gt;&quot;Andreas Dilger &amp;lt;adilger@whamcloud.com&amp;gt;&quot; merged in patch &lt;a href=&quot;https://review.whamcloud.com/c/tools/e2fsprogs/+/51735/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/c/tools/e2fsprogs/+/51735/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-16750&quot; title=&quot;optimize ldiskfs internal metadata allocation for hybrid storage LUNs&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-16750&quot;&gt;LU-16750&lt;/a&gt; mke2fs: add &quot;-E iops&quot; to set IOPS storage group&lt;br/&gt;
Project: tools/e2fsprogs&lt;br/&gt;
Branch: master-lustre&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 7ac1b50954cb02d2db18ce462b83ef4ba653b0dc&lt;/p&gt;</comment>
                            <comment id="383696" author="gerrit" created="Fri, 25 Aug 2023 09:11:16 +0000"  >&lt;p&gt;&quot;Zhenyu Xu &amp;lt;bobijam@hotmail.com&amp;gt;&quot; uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/c/tools/e2fsprogs/+/52091&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/c/tools/e2fsprogs/+/52091&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-16750&quot; title=&quot;optimize ldiskfs internal metadata allocation for hybrid storage LUNs&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-16750&quot;&gt;LU-16750&lt;/a&gt; tune2fs: add &quot;-E iops&quot; to set/clear IOPS storage group&lt;br/&gt;
Project: tools/e2fsprogs&lt;br/&gt;
Branch: master-lustre&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 3f5b37336ca396128352512878de87f65cd07193&lt;/p&gt;</comment>
                            <comment id="384452" author="gerrit" created="Thu, 31 Aug 2023 17:36:24 +0000"  >&lt;p&gt;&quot;Andreas Dilger &amp;lt;adilger@whamcloud.com&amp;gt;&quot; merged in patch &lt;a href=&quot;https://review.whamcloud.com/c/tools/e2fsprogs/+/52091/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/c/tools/e2fsprogs/+/52091/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-16750&quot; title=&quot;optimize ldiskfs internal metadata allocation for hybrid storage LUNs&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-16750&quot;&gt;LU-16750&lt;/a&gt; tune2fs: add &quot;-E iops&quot; to set/clear IOPS groups&lt;br/&gt;
Project: tools/e2fsprogs&lt;br/&gt;
Branch: master-lustre&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: a59ac3441448d61d66880e2e5329585191c98716&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10010">
                    <name>Duplicate</name>
                                                                <inwardlinks description="is duplicated by">
                                                        </inwardlinks>
                                    </issuelinktype>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="62900">LU-14438</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="64414">LU-14712</issuekey>
        </issuelink>
                            </outwardlinks>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="66012">LU-15002</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i03jbr:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                </customfields>
    </item>
</channel>
</rss>