<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:57:14 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-12970] improve mballoc for huge filesystems</title>
                <link>https://jira.whamcloud.com/browse/LU-12970</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;There are a number of reports demonstrating poor behaviour of mballoc on huge filesystems; in one report it was a 688TB filesystem with 5.3M groups.&lt;br/&gt;
mballoc tries to allocate large chunks of space; for small allocations it tries to preallocate and share large chunks. While this is good in terms of fragmentation and streaming IO, the allocation itself may need to scan many groups to find a good candidate.&lt;br/&gt;
mballoc maintains internal in-memory structures (the buddy cache) to speed up searching, but that cache is built from the regular on-disk bitmaps, meaning IO. If the cache is cold, populating it may take a lot of time.&lt;/p&gt;

&lt;p&gt;There are a few ideas for how to improve this:&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;skip more groups using less information when possible&lt;/li&gt;
	&lt;li&gt;stop scanning if too many groups have been scanned (loaded) and use the best one found so far&lt;/li&gt;
	&lt;li&gt;prefetch bitmaps (use the lazy init thread? prefetch while scanning)&lt;/li&gt;
&lt;/ul&gt;
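&lt;p&gt;The &quot;stop scanning and use the best found&quot; idea above can be sketched as a toy model (the function, group list, and limits here are invented for illustration, not the actual mballoc code):&lt;/p&gt;

```python
# Toy model of a bounded allocator scan: examine at most max_scan groups,
# return an exact fit immediately, otherwise fall back to the best candidate
# seen so far instead of scanning (and reading from disk) every group.
def pick_group(groups, want, max_scan):
    """groups: list of (group_id, largest_free_chunk); want: chunk size needed."""
    best = None
    scanned = 0
    for gid, chunk in groups:
        scanned += 1
        if chunk == want:                      # perfect fit: stop early
            return gid, scanned
        # track the smallest chunk that is still big enough (best fit)
        if chunk > want and (best is None or best[1] > chunk):
            best = (gid, chunk)
        if scanned >= max_scan:                # scanned too much: give up
            break
    return (best[0] if best else None), scanned
```

With a small max_scan the caller accepts a possibly worse group in exchange for bounded scanning, which is exactly the trade-off described above.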


&lt;p&gt;Another option for prefetching would be to skip non-initialized groups, but start an async read for the corresponding bitmap.&lt;br/&gt;
Also, when mballoc marks blocks used (an allocation has just been made) it could make sense to check/prefetch the subsequent group(s), which are a likely goal for the next allocation: while the caller is writing IO to the just-allocated blocks, the next group(s) will be prefetched and ready to use.&lt;/p&gt;
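&lt;p&gt;As a rough illustration of the &quot;prefetch the next group(s) while the caller writes&quot; idea (the cache, worker pool, and window size below are invented for the sketch, not ext4 internals):&lt;/p&gt;

```python
import concurrent.futures
import time

bitmap_cache = {}

def load_bitmap(group):
    """Stand-in for reading one group's block bitmap from disk."""
    time.sleep(0.01)                 # pretend this is IO latency
    bitmap_cache[group] = bytes(16)  # a dummy 16-byte bitmap
    return group

def allocate(group, pool, ngroups, window=2):
    """Pretend to allocate from 'group', then prefetch the next groups async."""
    for g in range(group + 1, min(group + 1 + window, ngroups)):
        if g not in bitmap_cache:
            pool.submit(load_bitmap, g)
    return group

with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
    allocate(0, pool, ngroups=8)
# leaving the 'with' block waits for the prefetches, so groups 1 and 2
# are now cached for the next allocation
```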
</description>
                <environment></environment>
        <key id="57389">LU-12970</key>
            <summary>improve mballoc for huge filesystems</summary>
                <type id="4" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11310&amp;avatarType=issuetype">Improvement</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="wc-triage">WC Triage</assignee>
                                    <reporter username="bzzz">Alex Zhuravlev</reporter>
                        <labels>
                            <label>ldiskfs</label>
                    </labels>
                <created>Thu, 14 Nov 2019 16:56:49 +0000</created>
                <updated>Wed, 7 Jun 2023 01:20:57 +0000</updated>
                                                                                <due></due>
                            <votes>0</votes>
                                    <watches>10</watches>
                                                                            <comments>
                            <comment id="258356" author="adilger" created="Fri, 15 Nov 2019 08:49:23 +0000"  >&lt;p&gt;I think that prefetching the block bitmaps in large chunks should be relatively easily implemented using the lazy_init thread. There is already a patch &lt;a href=&quot;https://review.whamcloud.com/32347&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/32347&lt;/a&gt; &quot;&lt;tt&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-10946&quot; title=&quot;add an interface to load ldiskfs block bitmaps&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-10946&quot;&gt;&lt;del&gt;LU-10946&lt;/del&gt;&lt;/a&gt; ldiskfs: add an interface to load ldiskfs block bitmaps&lt;/tt&gt;&quot;, but I suspect that this is happening only after the mount, which is too late to be useful, and it is loading the bitmaps one at a time, which causes a lot of extra overhead. Also, there was an objection upstream to the extra data structure used to reference the bitmaps.&lt;/p&gt;

&lt;p&gt;Instead, the block bitmap prefetch should be done a whole flex_bg at a time (256 blocks), asynchronously during mount and the buddy and group info calculated in the end_io completion handler. It would make sense to keep the same sysfs interface to allow pinning the bitmaps as 32347 to maintain compatibility. &lt;/p&gt;</comment>
                            <comment id="258357" author="adilger" created="Fri, 15 Nov 2019 08:56:35 +0000"  >&lt;p&gt;Reducing size expectations for allocations during mount, and/or limiting scanning should also help. I think for small writes, we should avoid trying to do group preallocation until after the bitmaps have been loaded. That can be handled entirely inside the ldiskfs code and avoids the need to understand what is happening at the Lustre level.&lt;/p&gt;

&lt;p&gt;The bitmap scanning code can also advance the allocation hints itself until it finds some groups that have suitable free space, instead of waiting for an incoming write to do this. &lt;/p&gt;</comment>
                            <comment id="258361" author="wshilong" created="Fri, 15 Nov 2019 11:03:05 +0000"  >&lt;p&gt;I cooked a new patch before to load block bitmaps async using workqueue:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://gerrit.datadirectnet.jp:8082/#/c/2965/2/ldiskfs/kernel_patches/patches/rhel7.6/ext4-loadbitmaps.patch&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://gerrit.datadirectnet.jp:8082/#/c/2965/2/ldiskfs/kernel_patches/patches/rhel7.6/ext4-loadbitmaps.patch&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And there was an interface to control how many blocks could be prefetched each time, but I haven&apos;t got any benchmark numbers for it yet.&lt;/p&gt;</comment>
                            <comment id="258362" author="wshilong" created="Fri, 15 Nov 2019 11:06:49 +0000"  >&lt;p&gt; &lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/33874/33874_ext4-loadbitmaps.patch&quot; title=&quot;ext4-loadbitmaps.patch attached to LU-12970&quot;&gt;ext4-loadbitmaps.patch&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt; &lt;/p&gt;</comment>
                            <comment id="258406" author="bzzz" created="Fri, 15 Nov 2019 18:01:13 +0000"  >&lt;p&gt;I&apos;ve got a script to prepare a fragmented filesystem using debugfs&apos;s setb and freeb commands, which basically takes a few seconds.&lt;br/&gt;
so now I can reproduce this issue easily: I see one-by-one bitmap loads of a few hundred non-empty groups initiated by a single-block allocation.&lt;br/&gt;
the next step is to add some instrumentation.&lt;/p&gt;</comment>
                            <comment id="258417" author="adilger" created="Fri, 15 Nov 2019 21:46:12 +0000"  >&lt;p&gt;Alex, I think that setting &quot;RAID stripe size&quot; (&lt;tt&gt;sbi-&amp;gt;s_stripe&lt;/tt&gt;) in the superblock may also contribute to the problem.  For large RAID systems this is typically 512 blocks (2MB), up to 2048 blocks (8MB) or more, in order to get allocations sized and aligned with the underlying RAID geometry.  That in itself is good for large writes, but for small writes at mount time it can be problematic.&lt;/p&gt;</comment>
                            <comment id="258418" author="adilger" created="Fri, 15 Nov 2019 22:04:31 +0000"  >&lt;p&gt;Shilong, could you please post your patch to WC Gerrit so that it can be reviewed.  Once the block bitmap is loaded, it makes sense to call &lt;tt&gt;mb_regenerate_buddy()&lt;/tt&gt; to create the buddy bitmap and &lt;tt&gt;ext4_group_info&lt;/tt&gt; as part of the &lt;tt&gt;ext4_end_bitmap_read()&lt;/tt&gt; callback rather than waiting in &lt;tt&gt;ext4_wait_block_bitmap()&lt;/tt&gt; for the bitmaps.  That allows submitting IO in batches and letting it complete asynchronously (keep an atomic counter of how many blocks need to be processed and submit more IO when it gets large enough), rather than doing a read, then waiting for all blocks to finish, then read/wait again, ...&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I&apos;ve got a script to prepare a fragmented filesystem using debugfs&apos;s setb and freeb commands which basically takes few seconds.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;Alex, it would be very useful to submit this upstream to e2fsprogs, since testing fragmented filesystems is always a problem.&lt;br/&gt;
It also makes sense for you to see if Shilong&apos;s current patch helps your test case, and then we can work on optimizing it further.&lt;/p&gt;</comment>
                            <comment id="258441" author="bzzz" created="Mon, 18 Nov 2019 08:07:23 +0000"  >&lt;p&gt;sure, will try to make the script useful for the outer world.&lt;/p&gt;</comment>
                            <comment id="258544" author="bzzz" created="Wed, 20 Nov 2019 15:03:01 +0000"  >&lt;p&gt;&lt;a href=&quot;https://review.whamcloud.com/#/c/36793/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/#/c/36793/&lt;/a&gt;: this patch limits scanning for a good group and adds basic prefetching.&lt;br/&gt;
currently it&apos;s more like an RFC, though I tested it manually.&lt;/p&gt;</comment>
                            <comment id="258563" author="adilger" created="Wed, 20 Nov 2019 18:52:04 +0000"  >&lt;p&gt;Alex, is this complementary with the &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12103&quot; title=&quot;Improve block allocation for large partitions&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-12103&quot;&gt;&lt;del&gt;LU-12103&lt;/del&gt;&lt;/a&gt; patch that is already landed?&lt;/p&gt;</comment>
                            <comment id="258566" author="bzzz" created="Wed, 20 Nov 2019 19:10:38 +0000"  >&lt;p&gt;I think it&apos;s a bit of a different approach. overall fullness doesn&apos;t mean we can&apos;t find good chunks, IMO.&lt;br/&gt;
say, a few files have been written very densely so that 1/2 of the groups are full, but the other 1/2 is nearly empty.&lt;br/&gt;
why should we change the algorithm?&lt;/p&gt;</comment>
                            <comment id="258587" author="adilger" created="Thu, 21 Nov 2019 06:17:14 +0000"  >&lt;p&gt;While it is possible to have the 1/2 full and 1/2 empty groups case you propose, I don&apos;t think that this is a likely condition.  Even so, in this case, wouldn&apos;t the allocator just find the first empty group and allocate linearly from there?&lt;/p&gt;</comment>
                            <comment id="258589" author="bzzz" created="Thu, 21 Nov 2019 06:52:52 +0000"  >&lt;p&gt;hmm, why do you think this is not likely? a few growing files would fill the filesystem group by group.&lt;br/&gt;
&quot;just find&quot;: this is exactly the issue. the allocator is supposed to be generic enough to work with small and big files, right?&lt;br/&gt;
thus we want to keep some locality: if file A has its last extent in group N, then we should try to write the next extent in the same group N or nearby, not just any empty group.&lt;br/&gt;
and then searching for the group is what is happening in DDN -923, but the groups weren&apos;t considered &quot;best&quot; and that got worse due to the cold cache.&lt;br/&gt;
so the approach I&apos;m trying is to limit the coverage of the search.&lt;br/&gt;
I think that coverage can be expressed as the number of groups to search in and/or the number of uninitialized groups causing IO.&lt;br/&gt;
on the first try we can search for exactly the requested chunk in N groups; if that fails, relax the requirement and search for the best in N*m groups, then just anything.&lt;/p&gt;</comment>
                            <comment id="258689" author="bzzz" created="Fri, 22 Nov 2019 14:29:52 +0000"  >&lt;p&gt;partly out of curiosity, I attached an old SATA 7200 RPM 500GB drive to my testing box:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
[root@rz /]# time cat /proc/fs/ext4/sda/mb_groups &amp;gt;/dev/null

real	0m24.081s
user	0m0.000s
sys	0m0.274s
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;this is 3726 groups, all initialized by mke2fs, so all of them had to be read during that cat.&lt;/p&gt;</comment>
                            <comment id="258691" author="bzzz" created="Fri, 22 Nov 2019 15:25:30 +0000"  >&lt;p&gt;with 32-groups-at-once prefetching, the same cat:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
real	0m14.150s
user	0m0.000s
sys	0m0.309s
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;with 64-groups-at-once prefetching:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
real	0m13.200s
user	0m0.000s
sys	0m0.277s
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;but this is a single spindle; any regular site would have multiple spindles, I guess, and a larger prefetch window would help even more.&lt;/p&gt;</comment>
                            <comment id="259367" author="bzzz" created="Fri, 6 Dec 2019 15:04:35 +0000"  >&lt;p&gt;given that in all the cases we do a forward scan, I think it would be relatively simple to add a few lists of groups to be scanned at each criterion.&lt;br/&gt;
each time a group gets/loses free blocks we would move it from one list to another at the cost of a few CPU cycles.&lt;br/&gt;
I&apos;m not sure what the crossing points for each list are yet (in terms of free blocks/fragments/etc), but the initial implementation&lt;br/&gt;
could start with empty/non-empty lists.&lt;/p&gt;</comment>
                            <comment id="259394" author="adilger" created="Fri, 6 Dec 2019 20:04:28 +0000"  >&lt;p&gt;I have thought in the past about something similar to what you describe.  However, it is difficult to know in advance what the size requirements are.&lt;/p&gt;

&lt;p&gt;One thought was whether it makes sense to have a higher-level buddy bitmap for groups that is generated at the default preallocation unit size (based on the &lt;tt&gt;s_mb_large_req&lt;/tt&gt; size) that allows quickly finding groups that have available 8MB or 16MB chunks, up to the maximum possible allocation size (probably 64MB is enough).  At 8-64MB chunks this would mean 15MB of bitmap for a 512TiB filesystem (could use &lt;tt&gt;kvmalloc()&lt;/tt&gt;).  This would be essentially a filesystem-wide replacement for the &lt;tt&gt;bb_counters&lt;/tt&gt; array that is tracked on a per-group basis, so would likely reduce overall memory usage, and would essentially replace &quot;group scanning&quot; with &quot;bitmap scanning&quot;.  It could be optimized to save the first set bit to avoid repeatedly scanning the blocks at the beginning of the filesystem, assuming they would be preferentially allocated.&lt;/p&gt;

&lt;p&gt;This could also be implemented as an array of linked lists (at power-of-two granularity up to 64MB), with groups being put in the list with their largest aligned free chunk (separate lists for unaligned chunks?).  Allocations would first walk the list for the smallest chunk that they need, then move up to lists with progressively larger chunks if no groups are available at the smaller size.  Once the allocation is done, the group may be demoted to a lower list if the allocation results in a smaller chunk being available.  To add a &lt;tt&gt;list_head&lt;/tt&gt; to each of 4M groups in a 512TiB filesystem would consume 64MB of memory, but it would be split across all of the &lt;tt&gt;ext4_group_info&lt;/tt&gt; allocations.&lt;/p&gt;
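&lt;p&gt;The array-of-lists approach can be sketched roughly as follows (the bucket granularity and names are illustrative only, not ext4 code):&lt;/p&gt;

```python
MAX_ORDER = 6  # buckets for a largest free chunk of 1, 2, 4, ... 64 "MB"

def order_of(chunk):
    """Index of the largest power-of-two bucket not exceeding chunk (chunk is at least 1)."""
    order = 0
    while chunk >= 2 ** (order + 1) and MAX_ORDER > order:
        order += 1
    return order

class GroupLists:
    """Groups bucketed by their largest aligned free chunk."""
    def __init__(self):
        self.lists = [[] for _ in range(MAX_ORDER + 1)]

    def add(self, group, largest_free):
        self.lists[order_of(largest_free)].append(group)

    def find(self, need):
        """Walk from the smallest sufficient bucket up to larger ones."""
        start = order_of(need)
        if need > 2 ** start:   # this bucket may hold chunks smaller than need
            start += 1
        for order in range(start, MAX_ORDER + 1):
            if self.lists[order]:
                return self.lists[order].pop()
        return None             # no group with a big enough chunk
```

A real implementation would also move a group to a lower bucket after an allocation shrinks its largest chunk, as described above.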

&lt;p&gt;Note that using a bigalloc size of e.g. 32KB would reduce the number of groups by a factor of 8 (e.g. 4M -&amp;gt; 512K) so we should also consider fixing the issues with bigalloc so that it is usable.&lt;/p&gt;</comment>
                            <comment id="301137" author="adilger" created="Tue, 11 May 2021 06:16:39 +0000"  >&lt;p&gt;Link to backport of upstream mballoc patches in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14438&quot; title=&quot;backport ldiskfs mballoc patches&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14438&quot;&gt;LU-14438&lt;/a&gt;, which may be enough to resolve this issue.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="62900">LU-14438</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="55236">LU-12103</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="51954">LU-10946</issuekey>
        </issuelink>
                            </outwardlinks>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="37967">LU-8365</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="67459">LU-15319</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="57401">LU-12976</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="72343">LU-16155</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="57424">LU-12988</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="75370">LU-16691</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="33874" name="ext4-loadbitmaps.patch" size="7322" author="wshilong" created="Fri, 15 Nov 2019 11:06:46 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i00phj:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                </customfields>
    </item>
</channel>
</rss>