<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 03:09:45 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-14438] backport ldiskfs mballoc patches</title>
                <link>https://jira.whamcloud.com/browse/LU-14438</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;There is an upstream patch series that is adding improved mballoc handling for efficiently finding suitable allocation groups in a filesystem. In particular, patch &lt;br/&gt;
&lt;a href=&quot;https://patchwork.ozlabs.org/project/linux-ext4/patch/20210209202857.4185846-5-harshadshirwadkar@gmail.com/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://patchwork.ozlabs.org/project/linux-ext4/patch/20210209202857.4185846-5-harshadshirwadkar@gmail.com/&lt;/a&gt; &quot;&lt;tt&gt;ext4: improve cr 0 / cr 1 group scanning&lt;/tt&gt;&quot; is the important part of the series. &lt;/p&gt;</description>
                <environment></environment>
        <key id="62900">LU-14438</key>
            <summary>backport ldiskfs mballoc patches</summary>
                <type id="4" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11310&amp;avatarType=issuetype">Improvement</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="ablagodarenko">Artem Blagodarenko</assignee>
                                    <reporter username="adilger">Andreas Dilger</reporter>
                        <labels>
                            <label>ldiskfs</label>
                    </labels>
                <created>Tue, 16 Feb 2021 17:16:37 +0000</created>
                <updated>Thu, 28 Sep 2023 02:53:15 +0000</updated>
                                                                                <due></due>
                            <votes>2</votes>
                                    <watches>18</watches>
                                                                            <comments>
                            <comment id="292114" author="adilger" created="Tue, 16 Feb 2021 22:51:53 +0000"  >&lt;p&gt;I&apos;ve attached v2 of the &lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/37510/37510_ext4-improve-cr-0-cr-1-group-scanning-v2.patch&quot; title=&quot;ext4-improve-cr-0-cr-1-group-scanning-v2.patch attached to LU-14438&quot;&gt;ext4-improve-cr-0-cr-1-group-scanning-v2.patch&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt; from the list (against current Linux master, not ported to any RHEL kernel yet).  While there is still work being done to improve this patch, I think it would be useful to see how much this improves performance for a large &lt;b&gt;fragmented&lt;/b&gt; filesystem and/or hurts performance for a large empty filesystem.  Having some performance feedback earlier would allow improving the patch before it is included in the upstream kernel, and if it shows good promise I think it is a better long-term solution than the current &lt;tt&gt;ext4-simple-blockalloc.patch&lt;/tt&gt; that we are carrying, since that patch just &lt;em&gt;reduces&lt;/em&gt; the number of times useless groups are scanned but doesn&apos;t avoid sequential scanning completely like this new patch does.&lt;/p&gt;</comment>
                            <comment id="292137" author="artem_blagodarenko" created="Wed, 17 Feb 2021 08:24:53 +0000"  >&lt;p&gt;I believe for testing purposes, and later, after successful&#160;&lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/37510/37510_ext4-improve-cr-0-cr-1-group-scanning-v2.patch&quot; title=&quot;ext4-improve-cr-0-cr-1-group-scanning-v2.patch attached to LU-14438&quot;&gt;ext4-improve-cr-0-cr-1-group-scanning-v2.patch&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt;&#160;testing,&#160;ext4-simple-blockalloc.patch should be dropped completely because it makes porting difficult.&#160;&lt;/p&gt;</comment>
                            <comment id="298222" author="gerrit" created="Thu, 8 Apr 2021 11:46:14 +0000"  >&lt;p&gt;Alex Zhuravlev (bzzz@whamcloud.com) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/43232&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/43232&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14438&quot; title=&quot;backport ldiskfs mballoc patches&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14438&quot;&gt;LU-14438&lt;/a&gt; ldiskfs: improvements to mballoc&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: b7e2d9466f2a45d3c9a687cf06155d4e75b020c9&lt;/p&gt;</comment>
                            <comment id="314190" author="adilger" created="Tue, 28 Sep 2021 17:45:56 +0000"  >&lt;p&gt;The new mballoc patch from the upstream kernel keeps an array with 2^n order of the largest range of free blocks in the group (between 2^0=1 free block and 2^16 = 32768 free blocks), and puts each group into the appropriate list after each alloc/free.  It uses round-robin selection for groups in the per-order list, so it is still possible to get into situations similar to &lt;tt&gt;mb_last_group&lt;/tt&gt; being very large, where the allocations are done at the end of the filesystem (lower bandwidth) even though there are many groups with free space available at the start of the filesystem (higher bandwidth).&lt;/p&gt;

&lt;p&gt;It would make sense to enhance the new allocator to have &lt;b&gt;two&lt;/b&gt; per-order lists for tracking the free blocks (on HDD OSTs at least, based on the &quot;rotational&quot; parameter of the block device) - one list for the groups in the first ~70% of the filesystem that have good performance and a second list for groups in the last ~30% of the filesystem that have lower performance.  Groups in the second list would only be used if there are no free groups of the right order in the first list.  That would bias allocations to the start of the device so that it avoids needless slowdowns when the filesystem is not full.  Since the amount of memory used for the per-order array itself is small (array of 17 pointers), and it is easy to decide which array a given group is put into based on the group number, this would not increase allocation overhead.  That would probably be much more efficient than trying to keep the groups within each order totally sorted.&lt;/p&gt;</comment>
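<!-- Editor's sketch (not part of the ticket): the two-list idea in the comment above, as a minimal Python model. All names (TwoListAllocator, FAST_FRACTION, pick_group) are invented for illustration; the real ext4 mballoc code is kernel C and differs in detail.

```python
# Hypothetical model of keeping two per-order group lists: one for the fast
# first ~70% of the device and one for the slow last ~30%. A group of the
# requested order is taken from the fast region first; the slow region is
# only a fallback, which biases allocations to the start of the device.

FAST_FRACTION = 0.7   # assumed split point; tunable in a real implementation
MAX_ORDER = 17        # orders 2^0 .. 2^16, matching the upstream array size

class TwoListAllocator:
    def __init__(self, ngroups):
        self.ngroups = ngroups
        self.split = int(ngroups * FAST_FRACTION)
        # one list of group numbers per largest-free-extent order, per region
        self.fast = [[] for _ in range(MAX_ORDER)]
        self.slow = [[] for _ in range(MAX_ORDER)]

    def update_group(self, group, largest_free_order):
        """Called after alloc/free: file the group under its current order."""
        lists = self.fast if group < self.split else self.slow
        lists[largest_free_order].append(group)

    def pick_group(self, order):
        """Prefer a fast-region group of sufficient order, then slow region."""
        for lists in (self.fast, self.slow):
            for o in range(order, MAX_ORDER):
                if lists[o]:
                    return lists[o].pop(0)   # FIFO gives round-robin reuse
        return None
```

Deciding which array a group belongs to is a single comparison against the split point, which is why the comment argues the overhead is negligible. -->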
                            <comment id="314811" author="dauchy" created="Wed, 6 Oct 2021 14:02:40 +0000"  >&lt;p&gt;Andreas,&lt;/p&gt;

&lt;p&gt;Regarding the comment in the code patch: &quot;&lt;em&gt;the groups may not get traversed linearly. That may result in subsequent allocations being not close to each other. And so, the underlying device may get filled up in a non-linear fashion.&lt;/em&gt;&quot;... rather than using a fixed MB_DEFAULT_LINEAR_LIMIT what do you think about using something like the following algorithm?  (I have been playing with external setting of mb_last_group to an &quot;optimal&quot; value, but incorporating the idea into ext4 would be much cleaner.)&lt;/p&gt;

&lt;p&gt;The general idea is to work &lt;b&gt;backwards&lt;/b&gt; through mb_groups info and use a &quot;decay&quot; algorithm to determine an &lt;b&gt;adjusted&lt;/b&gt; value for free block count for each group.  Then set mb_last_group based on the largest adjusted block count value. Attached is a simple script &lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/40818/40818_pick_mb_last_group.sh&quot; title=&quot;pick_mb_last_group.sh attached to LU-14438&quot;&gt;pick_mb_last_group.sh&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt; to demonstrate the approach.  In my limited testing, it does seem to pick a group number that not only has a large &quot;bfree&quot; value but is also followed by other groups that generally have large-ish bfree values as well.&lt;br/&gt;
Obviously more cleanup on the script would be needed to make it &quot;production ready&quot;, and some theory and testing applied to set a good $decay value, and there is also a question of how often to run the tool and change the value... but hopefully the script at least clarifies the approach.  Further enhancements could include a check of &quot;/sys/block/${dev}/queue/rotational&quot;; if a device is spinning rust, then adjust the weighted score further with a penalty for higher group numbers.&lt;/p&gt;

&lt;p&gt;Thanks,&lt;br/&gt;
Nathan&lt;/p&gt;</comment>
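<!-- Editor's sketch (not part of the ticket): a Python rendering of the decay idea described above. The attached pick_mb_last_group.sh is not reproduced here; the function name, the default decay value, and the exact scoring rule are assumptions for illustration only.

```python
# Working backwards through the per-group free-block counts (as read from
# mb_groups), each group's score is its own free count plus a decayed share
# of the scores of the groups that follow it. A group is therefore favoured
# when it is itself free AND is followed by other largely-free groups.

def pick_last_group(bfree, decay=0.5):
    """Return the group number with the best decay-adjusted free count."""
    score = 0.0
    best_group, best_score = 0, float("-inf")
    # iterate from the last group back to the first
    for group in range(len(bfree) - 1, -1, -1):
        score = bfree[group] + decay * score
        if score > best_score:
            best_score, best_group = score, group
    return best_group
```

The chosen group could then be written to mb_last_group so that allocations resume in a region with sustained free space, which is the externally-tuned behaviour the comment describes. -->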
                            <comment id="346587" author="adilger" created="Wed, 14 Sep 2022 04:21:27 +0000"  >&lt;p&gt;There are additional patches to fix the mballoc mb_optimized_scan=1 use case:&lt;br/&gt;
&lt;a href=&quot;https://patchwork.ozlabs.org/project/linux-ext4/list/?series=317391&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://patchwork.ozlabs.org/project/linux-ext4/list/?series=317391&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These fix a number of sub-optimal allocation decisions in the earlier patches.&lt;/p&gt;</comment>
                            <comment id="346592" author="adilger" created="Wed, 14 Sep 2022 06:05:28 +0000"  >&lt;p&gt;I&apos;ve filed &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-16155&quot; title=&quot;allow importing inode/block allocation maps to new ldisks filesystem&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-16155&quot;&gt;LU-16155&lt;/a&gt; to enhance debugfs to allow &quot;importing&quot; the block and inode allocation maps into a newly-formatted filesystem to simplify testing of this problem.  We could collect the debugfs information from real filesystems that are having allocation performance issues as needed in order to test changes to mballoc.&lt;/p&gt;</comment>
                            <comment id="376603" author="gerrit" created="Tue, 27 Jun 2023 10:44:58 +0000"  >&lt;p&gt;I&apos;ve tried to port some of the upstream mballoc patches in this ticket, but the result looks too big for a single patch.&lt;/p&gt;

&lt;p&gt;&quot;Zhenyu Xu &amp;lt;bobijam@hotmail.com&amp;gt;&quot; uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/c/fs/lustre-release/+/51472&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/c/fs/lustre-release/+/51472&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14438&quot; title=&quot;backport ldiskfs mballoc patches&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14438&quot;&gt;LU-14438&lt;/a&gt; ldiskfs: backport ldiskfs mballoc patches&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 2439579001a928714a640ec469a2d833ea5e8337&lt;/p&gt;</comment>
                            <comment id="380633" author="adilger" created="Sat, 29 Jul 2023 16:15:05 +0000"  >&lt;p&gt;There are cases where we may want to make empty-filesystem performance &lt;b&gt;worse&lt;/b&gt;, but performance at 90% full better. We could use the new mballoc array lists to spread out allocations across the disk more evenly. &lt;/p&gt;

&lt;p&gt;I had previously considered that we might split groups into two arrays (as we are doing with IOPS groups in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-16750&quot; title=&quot;optimize ldiskfs internal metadata allocation for hybrid storage LUNs&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-16750&quot;&gt;LU-16750&lt;/a&gt;) 80% at the start of the disk and 20% at the end of the disk (or 90%/10%), so that groups at the end of the filesystem are only used when the first groups are mostly full. However, this would mean that performance would suddenly drop once the filesystem hit 80% full.&lt;/p&gt;

&lt;p&gt;We could instead do things like split the groups into eg. 16 separate arrays by offset, and then have a clock that rotates allocations around the regions eg. every second, so that groups are not used start-to-end during allocation. We would still want &lt;em&gt;some&lt;/em&gt; locality in allocations, so we are not seeking wildly around the disk for files being written concurrently, but are always using the end of the disk some fraction of the time. This would hopefully even out the performance over the filesystem lifetime for uses that demand more consistent performance instead of &quot;best possible&quot;.&lt;/p&gt;

&lt;p&gt;We could even hint via &quot;&lt;tt&gt;lfs ladvise&lt;/tt&gt;&quot; and/or &quot;&lt;tt&gt;ionice&lt;/tt&gt;&quot; for a file or process to force all file allocations to the slow part of the disk for cases of archiving old files. I don&apos;t think it makes sense to allow &quot;improving&quot; allocations because everyone would want that and it would be no different than today. &lt;/p&gt;</comment>
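<!-- Editor's sketch (not part of the ticket): the rotating-region idea from the comment above, as a minimal Python model. The function name, region count of 16, and one-second period are taken from or assumed consistent with the comment; everything else is invented for illustration.

```python
# Split the groups into 16 equal regions by offset and advance a clock
# (e.g. once per second) so that new allocations begin their search in a
# different region each tick. Allocation still has locality within a tick,
# but over time usage is spread across the whole device rather than
# filling it start-to-end.

import time

NR_REGIONS = 16

def start_group(ngroups, now=None, period=1.0):
    """Group number where the allocator should begin scanning this tick."""
    tick = int((time.time() if now is None else now) / period)
    region = tick % NR_REGIONS
    return (ngroups * region) // NR_REGIONS
```

Within a tick all allocations share a starting region, preserving some locality; across ticks the start point rotates, so the slow end of the disk carries a fixed fraction of the load and performance stays more uniform over the filesystem's lifetime. -->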
                    </comments>
                <issuelinks>
                            <issuelinktype id="10010">
                    <name>Duplicate</name>
                                            <outwardlinks description="duplicates">
                                                        </outwardlinks>
                                                        </issuelinktype>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                                        </outwardlinks>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="37967">LU-8365</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="67459">LU-15319</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="78166">LU-17153</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="72375">LU-16162</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="75639">LU-16750</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="57389">LU-12970</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="72343">LU-16155</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="62254">LU-14305</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="37510" name="ext4-improve-cr-0-cr-1-group-scanning-v2.patch" size="17536" author="adilger" created="Tue, 16 Feb 2021 22:42:56 +0000"/>
                            <attachment id="40818" name="pick_mb_last_group.sh" size="1151" author="dauchy" created="Wed, 6 Oct 2021 14:01:56 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i01mqv:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                </customfields>
    </item>
</channel>
</rss>