<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 03:31:15 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-16939] ldlm: not expand the lock extent when the shared file is under lock contention</title>
                <link>https://jira.whamcloud.com/browse/LU-16939</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;We already have the lock contention detection patch: &lt;a href=&quot;https://review.whamcloud.com/35287/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/35287/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With lock contention detection, the lock server marks the lock resource (a shared OST object) as contended for a time period (2 seconds by default), and a client requesting an extent lock is informed of the contention.&lt;/p&gt;

&lt;p&gt;Under lock contention, the lock server will not expand the lock extent, avoiding unnecessary lock conflict callbacks.&lt;br/&gt;
The client could also switch from buffered I/O to direct I/O if the I/O size is large enough.&lt;/p&gt;</description>
                <environment></environment>
        <key id="76822">LU-16939</key>
            <summary>ldlm: not expand the lock extent when the shared file is under lock contention</summary>
                <type id="4" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11310&amp;avatarType=issuetype">Improvement</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="qian_wc">Qian Yingjin</assignee>
                                    <reporter username="qian_wc">Qian Yingjin</reporter>
                        <labels>
                    </labels>
                <created>Tue, 4 Jul 2023 09:42:14 +0000</created>
                <updated>Mon, 22 Jan 2024 16:29:55 +0000</updated>
                                                                                <due></due>
                            <votes>1</votes>
                                    <watches>3</watches>
                                                                            <comments>
                            <comment id="377359" author="paf0186" created="Tue, 4 Jul 2023 14:55:51 +0000"  >&lt;p&gt;I&apos;m glad you found &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12550&quot; title=&quot;automatic lockahead&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-12550&quot;&gt;LU-12550&lt;/a&gt;, that&apos;s important.&lt;/p&gt;

&lt;p&gt;The switch from buffered to direct is good too - I was planning to do that eventually using &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-12550&quot; title=&quot;automatic lockahead&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-12550&quot;&gt;LU-12550&lt;/a&gt;.&#160; We do need unaligned DIO first.&#160; I have some more thoughts on this specific ticket which I&apos;ll put here in a moment.&lt;/p&gt;</comment>
                            <comment id="377362" author="paf0186" created="Tue, 4 Jul 2023 15:11:49 +0000"  >&lt;p&gt;So, I think we have to be careful here.&#160; When I first developed lockahead, I tested turning off lock expansion when doing multi-client shared file.&#160; So exactly what you&apos;re suggesting here.&lt;/p&gt;

&lt;p&gt;I found it &lt;b&gt;hurt&lt;/b&gt; performance in the shared file case.&lt;/p&gt;

&lt;p&gt;The reason is a little challenging to explain (and the graphs I made for this are lost long ago, heh).&lt;/p&gt;

&lt;p&gt;Normally, when (for example) two clients are writing a single OST object with one process each, we imagine the process is something like this.&lt;/p&gt;

&lt;p&gt;Client 1 writes [0, 1 MiB).&#160; Requests lock, call it lock &apos;A&apos;.&#160; Server expands lock A, 0 - infinity.&lt;br/&gt;
Client 2 writes [1,2 MiB).&#160; Requests lock B.&#160; Server cancels lock A.&#160; Server expands lock B, 0 - infinity.&#160; (Or maybe it&apos;s 1 MiB to infinity?&#160; Doesn&apos;t matter)&lt;br/&gt;
Client 1 writes [2, 3 MiB).&#160; Requests lock C...&#160; etc, etc.&lt;/p&gt;

&lt;p&gt;Here is what actually happens.&lt;br/&gt;
Client 1 writes [0, 1 MiB).&#160; Acquires lock A, 0 - infinity.&lt;/p&gt;

&lt;p&gt;Client 2 tries to write [1, 2 MiB), sends request to server, etc.&lt;br/&gt;
Client 1 writes [2, 3 MiB) using lock A.&lt;/p&gt;

&lt;p&gt;Client 1 writes [4, 5 MiB) using lock A...&lt;/p&gt;

&lt;p&gt;Client 1 writes [6, 7 MiB) using lock A...&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;...&amp;#93;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;Server calls back lock A, but only after client 1 has done some number of writes.&#160; (In my testing, I think it was like 4 1 MiB writes on average?)&lt;/p&gt;

&lt;p&gt;So client 2 is waiting for client 1, but client 1 gets a bunch of work done before the lock is called back.&lt;/p&gt;

&lt;p&gt;This is very important, &lt;b&gt;because&lt;/b&gt; if you turn off lock expansion, now the client write process looks like this.&lt;/p&gt;

&lt;p&gt;Start write&lt;/p&gt;

&lt;p&gt;Request lock&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Lock request goes to server&lt;/b&gt;&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Get lock from server&lt;/b&gt;&lt;/p&gt;

&lt;p&gt;Do write...&lt;/p&gt;

&lt;p&gt;We have added an extra round trip to the server for every write.&lt;/p&gt;

&lt;p&gt;So what I found in my testing was that until contention was very high (many clients to a single object), the overhead of doing a lock request for &lt;b&gt;every&lt;/b&gt; write was worse than waiting for the other client.&#160; Note if there is &amp;gt; 1 process writing to each OST object on the same client, they also benefit from sharing the expanded lock.&#160; So disabling lock expansion on contention often made performance &lt;b&gt;worse,&lt;/b&gt; because the overhead of asking for a lock for every write was worse than the serialization of waiting for the other client(s).&lt;/p&gt;

&lt;p&gt;This is just something to be aware of and test for.&#160; My testing was on disk, flash may change things.&#160; I think disabling expansion will help under heavy contention, but only under heavy contention (many clients to one object).&#160; So we should do some careful testing of write performance with disabling expansion on contention, with both a small and a large number of clients &lt;b&gt;and&lt;/b&gt;&#160;processes per object.&lt;/p&gt;</comment>
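Patrick's trade-off above can be put in rough numbers. The sketch below is a back-of-envelope model with made-up timings (write_ms, rtt_ms, and the 4-writes-per-lock figure are illustrative, not measured):

```python
# Back-of-envelope model of lock-expansion overhead. All numbers are
# hypothetical; "writes_per_lock" stands in for how many writes a client
# amortizes over one expanded lock before it is called back.

def total_write_ms(n_writes, writes_per_lock, write_ms=2.0, rtt_ms=0.5):
    """Total time: the writes themselves plus one round trip per lock request."""
    lock_requests = -(-n_writes // writes_per_lock)  # ceiling division
    return n_writes * write_ms + lock_requests * rtt_ms

# With expansion, one lock covers roughly 4 writes; without it, every
# write pays a full server round trip.
expanded = total_write_ms(1000, 4)    # 2125.0 ms
unexpanded = total_write_ms(1000, 1)  # 2500.0 ms
```

Whether the extra round trips or the serialization dominates depends on the real rtt and on how many writes each client sneaks in before its lock is called back, which is exactly what the testing above measured.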
                            <comment id="377975" author="adilger" created="Fri, 7 Jul 2023 21:08:13 +0000"  >&lt;p&gt;There is an OST tunable &lt;tt&gt;ldlm.namespaces.*.contended_locks&lt;/tt&gt; that controls how many lockers are needed on a resource before it considers the file &quot;contended&quot;.  IIRC, once there were more than 4 contending clients the lock would not grow downward, and after &lt;tt&gt;contended_locks=32&lt;/tt&gt; contending clients the lock would also not grow upward.  Even under some contention, if the client is doing e.g. 16MB writes in a 1GB section per process it still makes sense to expand the DLM lock up to the start of the next extent (i.e. 1GB boundary) so that it doesn&apos;t have to get a separate lock for each write.&lt;/p&gt;

&lt;p&gt;I think the right solution is to use the DLM contention information to do lockless (sync+direct) writes instead of getting any DLM locks at all.  At that point the many clients writing to the file should be saturating the OSS so it will have enough outstanding RPCs to fill its pipeline and the clients are blocked on sending so the sync writes will not really hurt performance.  That also allows the OST to merge the writes during submission instead of doing read-modify-write.&lt;/p&gt;</comment>
                            <comment id="378115" author="qian_wc" created="Mon, 10 Jul 2023 09:56:20 +0000"  >&lt;p&gt;Patrick, Andreas,&lt;br/&gt;
Thanks for your helpful comments!&lt;/p&gt;

&lt;p&gt;I am just worried that the lock contention accounting in the current implementation may not be correct.&lt;br/&gt;
We clear the contended locks only when the contention information on the resource is older than the time window (2s):&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
&lt;span class=&quot;code-comment&quot;&gt;/* If our contention info is older than the time window, clear it */&lt;/span&gt;
	&lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (now &amp;gt; res-&amp;gt;lr_contention_age + ns-&amp;gt;ns_contention_time)
		res-&amp;gt;lr_contended_locks = 0;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;For example:&lt;br/&gt;
time window - contention count - accounting:&lt;br/&gt;
In [0s, 2s) - contention count: 32 - @lr_contended_locks = 32&lt;br/&gt;
In [2s, 4s) - contention count: 1 - @lr_contended_locks = 33&lt;br/&gt;
In [4s, 6s) - contention count: 1 - @lr_contended_locks = 34&lt;br/&gt;
In [6s, 8s) - contention count: 1 - @lr_contended_locks = 35&lt;br/&gt;
...&lt;/p&gt;

&lt;p&gt;The accounted contended locks do not decay even when lock contention is low.&lt;/p&gt;
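The non-decaying behaviour can be reproduced with a toy model. This is a simplified Python restatement of the C clearing logic quoted above (names and structure are illustrative, not the Lustre code):

```python
# Illustrative model (hypothetical, simplified from the C snippet above) of why
# lr_contended_locks never decays under a steady trickle of conflicts: the
# counter is only cleared when the resource sees no contention for a full
# window, so one conflict every window keeps the whole history alive.

CONTENTION_TIME = 2  # seconds, the default window

def account(events):
    """events: list of (timestamp, conflicts) pairs, in time order."""
    contended_locks = 0
    age = None
    for now, conflicts in events:
        if age is not None and now > age + CONTENTION_TIME:
            contended_locks = 0   # cleared only after a quiet window
        contended_locks += conflicts
        age = now
    return contended_locks

# 32 conflicts at t=0, then 1 conflict every 2s: the count climbs to 35
# instead of decaying toward the true low rate.
```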

&lt;p&gt;So I am thinking about improving the accounting for lock contention based on time windows (similar to heat):&lt;br/&gt;
obd_heat_decay();&lt;br/&gt;
obd_heat_add();&lt;br/&gt;
obd_heat_get();&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
/*
 * The file heat is calculated &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; every time interval period I. The access
 * frequency during each period is counted. The file heat is only recalculated
 * at the end of a time period.  And a percentage of the former file heat is
 * lost when recalculated. The recursion formula to calculate the heat of the
 * file f is as follows:
 *
 * Hi+1(f) = (1-P)*Hi(f)+ P*Ci
 *
 * Where Hi is the heat value in the period between time points i*I and
 * (i+1)*I; Ci is the access count in the period; the symbol P refers to the
 * weight of Ci. The larger the value of P is, the more influence Ci
 * has on the file heat.
 */
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;What&apos;s your opinion?&lt;/p&gt;

&lt;p&gt;Thanks,&lt;br/&gt;
Qian&lt;/p&gt;</comment>
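The heat recursion quoted above, Hi+1(f) = (1-P)*Hi(f) + P*Ci, can be sketched in a few lines (hypothetical helper, not the Lustre obd_heat_* implementation):

```python
# Sketch of the heat-style decaying average: a fraction "weight" of the new
# period count replaces the same fraction of the old heat. Names are
# illustrative only.

def heat_update(prev_heat, period_count, weight):
    """H[i+1] = (1 - weight) * H[i] + weight * C[i]."""
    return (1.0 - weight) * prev_heat + weight * period_count

# A constant access count converges to that count instead of growing forever:
heat = 0.0
for _ in range(50):
    heat = heat_update(heat, 32, 0.3)   # approaches 32.0
```

Unlike the current @lr_contended_locks accounting, a quiet period drives the value toward zero rather than freezing it.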
                            <comment id="378234" author="qian_wc" created="Tue, 11 Jul 2023 09:41:11 +0000"  >&lt;p&gt;I found two URLs that can be used for the lock contention rate calculation based on a fixed-time sliding window:&lt;br/&gt;
&lt;a href=&quot;https://medium.com/@avocadi/rate-limiter-sliding-window-counter-7ec08dbe21d6&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://medium.com/@avocadi/rate-limiter-sliding-window-counter-7ec08dbe21d6&lt;/a&gt;&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
previous_window_request_count = 8
current_window_request_count = 5
previous_window_count_weight = 0.47
Hence, 8 * 0.47 + 5 = 8.76
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href=&quot;https://docs.konghq.com/gateway/latest/kong-plugins/rate-limiting/algorithms/rate-limiting/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://docs.konghq.com/gateway/latest/kong-plugins/rate-limiting/algorithms/rate-limiting/&lt;/a&gt;&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
current window rate: 10
previous window rate: 40
window size: 60
window position: 30 (seconds past the start of the current window)
weight = .5 (60 second window size - 30 seconds past the window start)

rate = &lt;span class=&quot;code-quote&quot;&gt;&apos;current rate&apos;&lt;/span&gt; + &lt;span class=&quot;code-quote&quot;&gt;&apos;previous weight&apos;&lt;/span&gt; * &lt;span class=&quot;code-quote&quot;&gt;&apos;weight&apos;&lt;/span&gt;
     = 10             + 40                * (&lt;span class=&quot;code-quote&quot;&gt;&apos;window size&apos;&lt;/span&gt; - &lt;span class=&quot;code-quote&quot;&gt;&apos;window position&apos;&lt;/span&gt;) / &lt;span class=&quot;code-quote&quot;&gt;&apos;window size&apos;&lt;/span&gt;
     = 10             + 40                * (60 - 30) / 60
     = 10             + 40                * .5
     = 30

the formula used to define the weighting percentage is as follows:

weight = (window_size - (time() % window_size)) / window_size
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
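The Kong-style calculation quoted above can be sketched as follows (the function name is illustrative):

```python
# Sliding-window rate estimate as in the worked example above: the previous
# window counts for less as the current window fills up. Illustrative only.

def sliding_window_rate(current, previous, window_size, now):
    pos = now % window_size                       # seconds into current window
    weight = (window_size - pos) / window_size    # 1.0 at start, falls to 0.0
    return current + previous * weight

# The worked example above: 10 + 40 * (60 - 30) / 60 = 30.0
```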
                            <comment id="378283" author="adilger" created="Tue, 11 Jul 2023 14:54:37 +0000"  >&lt;p&gt;I think a simple decaying average is probably OK, and the contention should go to zero if there are no new locks within 2-3 decay intervals.&lt;/p&gt;</comment>
                            <comment id="378288" author="qian_wc" created="Tue, 11 Jul 2023 15:13:59 +0000"  >&lt;p&gt;Hi Andreas,&lt;/p&gt;

&lt;p&gt;Could you please give the detailed formula or algorithm for the decay calculation?&lt;/p&gt;


&lt;p&gt;The above gives two decay algorithms based on fixed time-based sliding windows.&lt;/p&gt;

&lt;p&gt;Two time windows are tracked:&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;previous_count&lt;/li&gt;
	&lt;li&gt;current_count&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;A. Fixed decay weight for the previous_count, e.g. previous_count_decay_weight = 0.5:&lt;br/&gt;
rate = current_count + previous_count * previous_count_decay_weight;&lt;/p&gt;

&lt;p&gt;B. Dynamic decay weight for the previous_count based on the current time position in the current window:&lt;br/&gt;
rate = current_count + previous_count * (window_size - (time() % window_size)) / window_size;&lt;/p&gt;

&lt;p&gt;Thanks,&lt;br/&gt;
Qian&lt;/p&gt;

</comment>
                            <comment id="379228" author="adilger" created="Wed, 19 Jul 2023 00:11:54 +0000"  >&lt;blockquote&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;previous_count_decay_weight = 0.5
rate = current_count + previous_count * previous_count_decay_weight;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;&lt;/blockquote&gt;
&lt;p&gt;In this formula, if current_count = previous_count, then  &quot;&lt;tt&gt;rate = 1.5 x current_count&lt;/tt&gt;&quot; which is bad. &lt;/p&gt;

&lt;p&gt;I think one of the important properties of a decaying average is that it equals the original value if the input is constant.  For example:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;decay_weight = 0.3
new_value = current_value * (1 - decay_weight) + previous_value * decay_weight
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;That means the weights sum to 1.0 no matter what the weight is, and if &lt;tt&gt;previous_value = current_value&lt;/tt&gt;, then &lt;tt&gt;new_value&lt;/tt&gt; will stay the same. &lt;/p&gt;

&lt;p&gt;In the above formulas &lt;/p&gt;</comment>
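The property Andreas describes is easy to check (names are illustrative): the additive form overshoots a constant input by the previous weight, while a form whose weights sum to 1.0 preserves it.

```python
# Compare the additive form ("current + 0.5 * previous") with a proper
# decaying average whose weights sum to 1.0. Illustrative names only.

def rate_additive(current, previous, prev_weight=0.5):
    return current + previous * prev_weight

def rate_weighted(current, previous, decay_weight=0.3):
    return current * (1 - decay_weight) + previous * decay_weight

print(rate_additive(10, 10))   # 15.0, i.e. 1.5x a constant input
print(rate_weighted(10, 10))   # 10.0, the constant input is preserved
```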
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="56411">LU-12550</issuekey>
        </issuelink>
                            </outwardlinks>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="75221">LU-16669</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i03pj3:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>