<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:02:50 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-9] Optimize weighted QOS Round-Robin allocator</title>
                <link>https://jira.whamcloud.com/browse/LU-9</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;New bug for old bugzilla  &lt;a href=&quot;https://bugzilla.lustre.org/show_bug.cgi?id=18547&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;b=18547&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Quoting from the old ticket, since bugzilla.lustre.org is no longer available:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We want the OSTs to be evenly filled for as much of their lives as possible, to be as evenly loaded (disk and network io) as possible, for optimum filesystem performance.  If the weighting were subtle (e.g. round-robin, but skipping more-full OSTs 1/n of the time, where n is proportional to the imbalance) then it could essentially be in use all of the time.  If we implemented QOS as a part of the round-robin allocator that is always active, just by skipping OSTs periodically (rather than randomly) it would provide much more uniform loading and would prevent the OSTs from getting far out of balance in the first place.&lt;/p&gt;

&lt;p&gt;Say the OSTs are 100MB in size (1% = 1MB) and OST10-19 have 9x as much free space as OST0-9, then the RR QOS selection would skip OST0-9 about 8/9 of the time, but never skip OST10-19.  If each file created was 1MB in size then the first 11 files would use 1MB on each of the 10 emptier OSTs, and 1MB on one of the fuller OSTs.  Repeat 10x and 10MB is used uniformly on each of the emptier OSTs, but only 1MB on each of the fuller OSTs.&lt;/p&gt;

&lt;p&gt;This is what we tried to do with the random QOS weighting, but just as with the original LOV allocator (which randomly picked OSTs instead of round-robin) the chance of having an imbalance (e.g. average 2 objects/OST but some had 0 and some had 4) was too high.&lt;/p&gt;


&lt;p&gt;One strong candidate is to use an &quot;accumulated error&quot; mechanism, similar to Bresenham&apos;s line drawing algorithm or error diffusion dithering, where each of the OSTs is present in the RR list (in the optimized OST/OSS order it normally is), and each time it is skipped from the RR selection (because the OST available space + accumulated error &amp;lt; threshold/average space) the OST free space is added to an &quot;accumulated error&quot; counter for that OST.  When the OST (available space + accumulated error) is large enough the OST will be picked by normal RR selection and then its bonus is reset to 0, so it will again be too low to select for a number of iterations over the list.&lt;/p&gt;

&lt;p&gt;The MDS can estimate the average file size from the OST &lt;tt&gt;statfs&lt;/tt&gt; data, so we might increment the bonus by the average file size?  One thing we might want to do is initialize the bonus with some random amount (proportional to the free space?) at MDS startup so that some of the OSTs with less free space become available at different times instead of all at once.&lt;/p&gt;

&lt;p&gt;It seems we should also include some fraction (maybe 1/2, tunable via &lt;tt&gt;qos_prio_free&lt;/tt&gt;) of the bonus value into the average so that less-used OSTs (with higher bonus) are more likely to be used instead of all being chosen equally.  The problem with having equal weights (i.e. not taking the bonus into account for &lt;tt&gt;$average_free&lt;/tt&gt;) is that e.g. the OST immediately following the new one in the list will be picked disproportionately often.&lt;/p&gt;

&lt;p&gt;In the above example, if the file size is 1 then we skip the full OSTs 9 times each, increasing their bonus by 9, and fill the new OST by approximately 9, giving it a negative bonus (penalty) of 9.  Penalty can be reset at next statfs for this OST, but bonus is not reset.  Then &lt;tt&gt;$average_free = (10 * 9 + 91) / 10 = 18&lt;/tt&gt; (rounded down from 18.1) and all of the OSTs are candidates for selection (modulo some random initial bonus so that more-full OSTs are not all selected at one time).&lt;/p&gt;

&lt;p&gt;For N-striped files, one idea I had was to subtract the just-used OST&apos;s space from the current file&apos;s &quot;average&quot; so that the remaining OSTs become proportionally more attractive. Continuing in the above example, but with a 2-stripe file, the current 2-stripe file&apos;s &lt;tt&gt;$average_free&lt;/tt&gt; is now 10, and any of the OSTs can be selected.&lt;/p&gt;

&lt;p&gt;It is important that the algorithm keep the filesystem total available space updated in an incremental manner as objects are allocated, and only recompute the OST weight in the &lt;tt&gt;statfs&lt;/tt&gt; interpreter callback every few seconds.  It definitely shouldn&apos;t have to iterate over 1000 OSTs to compute the weighting each time, or it will add too much overhead on the MDS.&lt;/p&gt;

&lt;p&gt;If one of the full OSTs had been previously selected then its bonus = -1, and the rest have bonus = 1, so the average is 10.9.  The just used OST only has weight 9 and will not be selected this round.  One OST will be picked and its weight set to -1, and the remainder get bonus = 2, so average = 11.7 for next round, and only OSTs with weight 12 can be selected for the second stripe.&lt;/p&gt;

&lt;p&gt;For 1-stripe files (using 1/2 weighting of the bonus in the average) we would need to allocate 14 files on the free OST before allocating a file on one of the other OSTs.&lt;/p&gt;

&lt;p&gt;We should have a simulator that can just plug in &lt;tt&gt;lov_qos.c&lt;/tt&gt; so that we can both analyze the current code and validate any changes made.  In other projects this has been done by adding unit-test code at the bottom of this file, wrapped in &quot;&lt;tt&gt;#ifdef UNIT_TEST&lt;/tt&gt;&quot; so it is not in normal compiles, but creates a test executable when compiled with &lt;tt&gt;-DUNIT_TEST&lt;/tt&gt;.&lt;/p&gt;&lt;/blockquote&gt;

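&lt;p&gt;The accumulated-error idea quoted above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the lov_qos.c implementation: the function name rr_with_error is hypothetical, the threshold is taken to be the largest OST free space (the ticket suggests threshold/average space), free space is held constant rather than refreshed from statfs, and the remainder is carried over on selection instead of resetting the bonus to 0.&lt;/p&gt;

```python
def rr_with_error(free, rounds):
    """Walk the OST list `rounds` times; return per-OST selection counts.

    Hypothetical helper for illustration only (not Lustre code).
    """
    threshold = max(free)        # assumption: threshold is the largest free space
    credit = [0] * len(free)     # per-OST accumulated error
    picks = [0] * len(free)      # how many objects each OST received
    for _ in range(rounds):
        for ost, space in enumerate(free):
            credit[ost] += space             # every visit adds the OST free space
            if credit[ost] >= threshold:
                credit[ost] -= threshold     # carry the remainder, Bresenham style
                picks[ost] += 1
            # otherwise the OST is skipped on this pass and keeps its credit
    return picks


# OST0-9 have 10 units free, OST10-19 have 90 units free (9x as much).
free = [10] * 10 + [90] * 10
picks = rr_with_error(free, rounds=9)
print(picks)
```

&lt;p&gt;Under these assumptions, nine passes over the list select each fuller OST once and each emptier OST nine times, so the fuller OSTs are skipped periodically rather than randomly.&lt;/p&gt;
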
&lt;blockquote&gt;
&lt;p&gt;We would dynamically skip entries while walking the OST index array, based on the current &quot;weight&quot; of each particular OST.  The weight is potentially made up of many inputs.  Currently the weight is the OST available space, plus/minus any bonus/penalty for recent allocations/skips.  &lt;/p&gt;

&lt;p&gt;There is a per-OST penalty and a per-OSS penalty, which are very rough estimates for the amount of space that will be consumed on the OST and the remaining IO bandwidth of the OSS.  This also needs to be flexible enough to take other weighting factors into account, such as network bandwidth/latency (e.g. based on multiple NIDs and/or LNet stats), storage bandwidth (e.g. based on old/new HDDs or impact due to RAID rebuilding, OST-side stats in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7880&quot; title=&quot;add performance statistics to obd_statfs&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7880&quot;&gt;LU-7880&lt;/a&gt;), administrator tuning, etc.&lt;/p&gt;&lt;/blockquote&gt;</description>
                <environment></environment>
        <key id="10082">LU-9</key>
            <summary>Optimize weighted QOS Round-Robin allocator</summary>
                <type id="4" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11310&amp;avatarType=issuetype">Improvement</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="wc-triage">WC Triage</assignee>
                                    <reporter username="rread">Robert Read</reporter>
                        <labels>
                            <label>llnl</label>
                    </labels>
                <created>Fri, 22 Oct 2010 16:31:58 +0000</created>
                <updated>Sat, 21 Jan 2023 02:29:01 +0000</updated>
                                                                                <due></due>
                            <votes>0</votes>
                                    <watches>10</watches>
                                                                            <comments>
                            <comment id="10090" author="dferber" created="Wed, 27 Oct 2010 06:41:17 +0000"  >&lt;p&gt;Asked Di to take a look at this, to estimate the effort and skill set it would take to close, or to recommend someone else to do this initial look.&lt;/p&gt;</comment>
                            <comment id="10122" author="dferber" created="Sun, 31 Oct 2010 17:31:26 +0000"  >&lt;p&gt;I spoke with Di on this.  He glanced at the bug and said it does not seem to be a small one (improving the lov qos algorithm), given that the qos code is very subtle, complicated, and also very important for performance.&lt;/p&gt;

&lt;p&gt;This bug is already assigned to James at ORNL. The Whamcloud engineer we assign it to needs to be familiar with the lov qos code. We have someone starting next week who will be a good choice here, and bobijam is another candidate, or fanyong.&lt;/p&gt;

&lt;p&gt;The bug probably needs 3-4 months at least, including the improved qos (implementing all the ideas from Andreas and Nathan), the qos simulator (plugging in lov_qos.c, which probably depends on another bug mentioned in comment #3) and the test case. But this is a very rough estimate only.&lt;/p&gt;</comment>
                            <comment id="10163" author="dferber" created="Tue, 9 Nov 2010 12:57:39 +0000"  >&lt;p&gt;Bug will not be assigned at this point, as we are focusing on bugs and not enhancements right now. &lt;/p&gt;</comment>
                            <comment id="77964" author="adilger" created="Wed, 26 Feb 2014 22:48:57 +0000"  >&lt;p&gt;Initial framework patch &lt;a href=&quot;http://review.whamcloud.com/3529&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/3529&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="82721" author="spimpale" created="Tue, 29 Apr 2014 07:03:41 +0000"  >&lt;p&gt;I am working on rebasing this to current master.&lt;/p&gt;</comment>
                            <comment id="83089" author="spimpale" created="Fri, 2 May 2014 17:54:39 +0000"  >&lt;p&gt;I found that it was almost impossible for me to rebase the old patch above to the current master (countless merge errors, then compile errors).&lt;br/&gt;
Instead, I implemented the same algorithm on top of the current master: &lt;a href=&quot;http://review.whamcloud.com/#/c/10199/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/10199/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I request the old patch be abandoned in favor of this one.&lt;/p&gt;</comment>
                            <comment id="87237" author="ihara" created="Sun, 22 Jun 2014 05:47:30 +0000"  >&lt;p&gt;OK, I did a quick test of the rebased patch against current master: &lt;a href=&quot;http://review.whamcloud.com/#/c/10199/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/10199/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are 4 OSSes and 40 OSTs, and 240 files were created with IOR (FFP) from 16 clients. In the best case, that is 6 file objects allocated per OST. However, with the patch, performance was even worse than without it: there is no fair RR, and allocation is more biased toward specific OSTs.&lt;br/&gt;
By the way, I set &quot;lctl set_param lov.*.qos_threshold_rr=100&quot; on the MDS in both the patched and unpatched cases.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="35343">LU-7880</issuekey>
        </issuelink>
                            </outwardlinks>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="12848">LU-977</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="58389">LU-13363</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="74139">LU-16501</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="46102">LU-9506</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="67147">LU-15216</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="47571">LU-9809</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="52249">LU-11023</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="38291">LU-8417</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="57613">LU-13066</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="15215" name="LU-9.xlsx" size="39229" author="ihara" created="Sun, 22 Jun 2014 05:47:29 +0000"/>
                            <attachment id="43610" name="b=18547.html" size="108503" author="adilger" created="Fri, 13 May 2022 02:14:30 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                    <customfield id="customfield_10020" key="com.atlassian.jira.plugin.system.customfieldtypes:float">
                        <customfieldname>Bugzilla ID</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>18547.0</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzw3m7:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>10673</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                </customfields>
    </item>
</channel>
</rss>