<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 03:14:22 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-14976] Changing tbf policy induces high CPU load</title>
                <link>https://jira.whamcloud.com/browse/LU-14976</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;&lt;b&gt;Reproducer:&lt;/b&gt;&lt;/p&gt;
&lt;ol&gt;
	&lt;li&gt;Activate &quot;tbf gid&quot; policy:&lt;br/&gt;
 lctl set_param mds.MDS.mdt.nrs_policies=&quot;tbf gid&quot;&lt;/li&gt;
	&lt;li&gt;Register a rule for a group (with a small rate value):&lt;br/&gt;
 lctl set_param mds.MDS.mdt.nrs_tbf_rule=&quot;start eaujames gid={1000} rate=10&quot;&lt;/li&gt;
	&lt;li&gt;Start metadata operations with the limited gid on the MDT (multithreaded file creations/deletions)&lt;/li&gt;
	&lt;li&gt;While a message is queued inside the policy, change the policy to tbf:&lt;br/&gt;
 lctl set_param mds.MDS.mdt.nrs_policies=&quot;tbf&quot;&lt;/li&gt;
	&lt;li&gt;Stop the metadata operations. Lustre consumes 100% CPU on the CPU partition where the message is queued:&lt;br/&gt;
 On our production filesystem, all CPTs were impacted on MDT0001 (&amp;gt;100 RPCs in queue, load ~300) and one CPT was impacted on MDT0000 (1 RPC in queue, load ~90).&lt;/li&gt;
&lt;/ol&gt;


&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
mds.MDS.mdt.nrs_policies=
regular_requests:
  - name: fifo
    state: started
    fallback: yes
    queued: 0
    active: 0  
  
  - name: crrn
    state: stopped
    fallback: no
    queued: 0
    active: 0
  
  - name: tbf
    state: started
    fallback: no
    queued: 1
    active: 0
  
  - name: delay
    state: stopped
    fallback: no
    queued: 0
    active: 0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;When we try to change the policy to fifo, the tbf policy stays stuck in the &quot;stopping&quot; state:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
mds.MDS.mdt.nrs_policies=
regular_requests:
  - name: fifo
    state: started
    fallback: yes 
    queued: 0    
    active: 0   

  - name: crrn
    state: stopped
    fallback: no
    queued: 0    
    active: 0  

  - name: tbf 
    state: stopping
    fallback: no
    queued: 1    
    active: 0   
  
  - name: delay
    state: stopped
    fallback: no
    queued: 0
    active: 0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&lt;b&gt;Analysis:&lt;/b&gt;&lt;/p&gt;

&lt;p&gt;It seems that when we change the tbf policy (&quot;tbf gid&quot; -&amp;gt; &quot;tbf&quot;), RPCs already queued inside &quot;tbf gid&quot; become inaccessible to the ptlrpc threads.&lt;/p&gt;

&lt;p&gt;ptlrpc_wait_event wakes up when an RPC is available to dequeue. In this case, however, the ptlrpc thread is unable to dequeue the request, so it wakes up continuously (causing the CPU load).&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
00000100:00000001:1.0:1630509978.890060:0:4749:0:(service.c:2029:ptlrpc_server_request_get()) &lt;span class=&quot;code-object&quot;&gt;Process&lt;/span&gt; leaving (rc=0 : 0 : 0)
00000100:00000001:0.0:1630509978.890060:0:5580:0:(service.c:2008:ptlrpc_server_request_get()) &lt;span class=&quot;code-object&quot;&gt;Process&lt;/span&gt; entered
00000100:00000001:2.0:1630509978.890061:0:5653:0:(service.c:2029:ptlrpc_server_request_get()) &lt;span class=&quot;code-object&quot;&gt;Process&lt;/span&gt; leaving (rc=0 : 0 : 0)
00000100:00000001:2.0:1630509978.890061:0:5653:0:(service.c:2248:ptlrpc_server_handle_request()) &lt;span class=&quot;code-object&quot;&gt;Process&lt;/span&gt; leaving (rc=0 : 0 : 0)
00000100:00000001:1.0:1630509978.890061:0:4749:0:(service.c:2248:ptlrpc_server_handle_request()) &lt;span class=&quot;code-object&quot;&gt;Process&lt;/span&gt; leaving (rc=0 : 0 : 0)
00000100:00000001:0.0:1630509978.890061:0:5580:0:(service.c:2029:ptlrpc_server_request_get()) &lt;span class=&quot;code-object&quot;&gt;Process&lt;/span&gt; leaving (rc=0 : 0 : 0)
00000100:00000001:0.0:1630509978.890061:0:5580:0:(service.c:2248:ptlrpc_server_handle_request()) &lt;span class=&quot;code-object&quot;&gt;Process&lt;/span&gt; leaving (rc=0 : 0 : 0)
00000100:00000001:1.0:1630509978.890062:0:4749:0:(service.c:2244:ptlrpc_server_handle_request()) &lt;span class=&quot;code-object&quot;&gt;Process&lt;/span&gt; entered
00000100:00000001:1.0:1630509978.890062:0:4749:0:(service.c:2008:ptlrpc_server_request_get()) &lt;span class=&quot;code-object&quot;&gt;Process&lt;/span&gt; entered
00000100:00000001:2.0:1630509978.890063:0:5653:0:(service.c:2244:ptlrpc_server_handle_request()) &lt;span class=&quot;code-object&quot;&gt;Process&lt;/span&gt; entered
00000100:00000001:2.0:1630509978.890063:0:5653:0:(service.c:2008:ptlrpc_server_request_get()) &lt;span class=&quot;code-object&quot;&gt;Process&lt;/span&gt; entered
00000100:00000001:1.0:1630509978.890063:0:4749:0:(service.c:2029:ptlrpc_server_request_get()) &lt;span class=&quot;code-object&quot;&gt;Process&lt;/span&gt; leaving (rc=0 : 0 : 0)
00000100:00000001:0.0:1630509978.890063:0:5580:0:(service.c:2244:ptlrpc_server_handle_request()) &lt;span class=&quot;code-object&quot;&gt;Process&lt;/span&gt; entered
00000100:00000001:2.0:1630509978.890064:0:5653:0:(service.c:2029:ptlrpc_server_request_get()) &lt;span class=&quot;code-object&quot;&gt;Process&lt;/span&gt; leaving (rc=0 : 0 : 0)
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
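&lt;p&gt;A toy Python model of the analysis above, not Lustre code: the class ToyService, its per-flavour queues and the function count_empty_wakeups are hypothetical names. It assumes that the wake-up condition counts every queued request while the fetch path only sees the queue of the active TBF flavour. Under that assumption a service loop never sleeps and never gets a request.&lt;/p&gt;

```python
from collections import deque


class ToyService:
    """Toy model of one CPU partition; hypothetical, not Lustre code."""

    def __init__(self):
        # One queue per TBF flavour; only the active one is searched.
        self.queues = {"tbf gid": deque(), "tbf": deque()}
        self.active = "tbf gid"
        self.queued = 0  # service-wide count of pending requests

    def enqueue(self, rpc):
        self.queues[self.active].append(rpc)
        self.queued += 1

    def request_pending(self):
        # Assumed wake-up condition: something is queued somewhere.
        return self.queued != 0

    def request_get(self):
        # Assumed fetch path: only the active flavour's queue is visible.
        queue = self.queues[self.active]
        if not queue:
            return None
        self.queued -= 1
        return queue.popleft()


def count_empty_wakeups(svc, max_wakeups):
    """Run a bounded service loop and count wake-ups that found no request."""
    empty = 0
    for _ in range(max_wakeups):
        if not svc.request_pending():
            break  # a real thread would go back to sleep here
        if svc.request_get() is None:
            empty += 1
    return empty


svc = ToyService()
svc.enqueue("rpc-1")
svc.active = "tbf"  # switch flavour while rpc-1 is still queued
print(count_empty_wakeups(svc, 1000))
```

&lt;p&gt;With one request queued before the switch, each of the 1000 iterations is an empty wake-up, so the script prints 1000. Without the switch, the request is dequeued on the first iteration and the loop stops.&lt;/p&gt;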
&lt;p&gt;On my VM, ptlrpc_server_handle_request() is called at a frequency of 300 kHz for a single MDT thread (doing nothing).&lt;/p&gt;</description>
                <environment>CentOS 7 VMs on Lustre 2.14</environment>
        <key id="65896">LU-14976</key>
            <summary>Changing tbf policy induces high CPU load</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="eaujames">Etienne Aujames</assignee>
                                    <reporter username="eaujames">Etienne Aujames</reporter>
                        <labels>
                    </labels>
                <created>Wed, 1 Sep 2021 16:52:06 +0000</created>
                <updated>Wed, 24 May 2023 12:04:16 +0000</updated>
                            <resolved>Sat, 22 Apr 2023 18:31:14 +0000</resolved>
                                                    <fixVersion>Lustre 2.16.0</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>6</watches>
                                                                            <comments>
                            <comment id="311899" author="adilger" created="Thu, 2 Sep 2021 05:28:20 +0000"  >&lt;p&gt;My guess is that the RPCs are only connected to the old NRS type, and then &quot;fetching&quot; RPCs to process with a new NRS type returns nothing.&#160; What needs to happen in the&#160;&lt;b&gt;very rare&lt;/b&gt; case that the NRS type is changed at runtime is either:&lt;/p&gt;
&lt;ol&gt;
	&lt;li&gt;check the old NRS type to fetch any previous RPCs before fetching RPCs from the new NRS type&lt;/li&gt;
	&lt;li&gt;move all RPCs from the old NRS type and add them to the new NRS type&lt;/li&gt;
&lt;/ol&gt;


&lt;p&gt;My preference would be #2, because this only adds overhead on the rare case when the NRS type is changed, rather than adding overhead for fetching &lt;b&gt;every&lt;/b&gt; RPC from the queue.&#160; However, looking at &lt;tt&gt;ptlrpc_server_request_get-&amp;gt;ptlrpc_nrs_req_get_nolock0()&lt;/tt&gt; it would appear that #1 is &lt;em&gt;supposed&lt;/em&gt; to be handling this case properly:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
        /**
         * Always &lt;span class=&quot;code-keyword&quot;&gt;try&lt;/span&gt; to drain requests from all NRS polices even &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; they are
         * inactive, because the user can change policy status at runtime.
         */
        list_for_each_entry(policy, &amp;amp;nrs-&amp;gt;nrs_policy_queued, pol_list_queued) {
               nrq = nrs_request_get(policy, peek, force);
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;but that doesn&apos;t seem to be working properly (only &lt;tt&gt;nrs_tbf_req_get()&lt;/tt&gt; appears in the flame graph).  It may be that the &quot;&lt;tt&gt;tbf gid&lt;/tt&gt;&quot; queue internal to the TBF policy itself is not making those RPCs available?&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=eaujames&quot; class=&quot;user-hover&quot; rel=&quot;eaujames&quot;&gt;eaujames&lt;/a&gt;, to step back a minute, what is the reason to change the NRS policy type while the system is in use?  Is this just something you hit during benchmarking?  The NRS policy type should basically never change during the lifetime of a system.&lt;/p&gt;</comment>
                            <comment id="311901" author="gerrit" created="Thu, 2 Sep 2021 06:00:43 +0000"  >&lt;p&gt;&quot;Andreas Dilger &amp;lt;adilger@whamcloud.com&amp;gt;&quot; uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/44817&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/44817&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14976&quot; title=&quot;Changing tbf policy induces high CPU load&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14976&quot;&gt;&lt;del&gt;LU-14976&lt;/del&gt;&lt;/a&gt; ptlrpc: align function names with param names&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: d2504288bf6b798666f3c44a1bb685e455ea5fa0&lt;/p&gt;</comment>
                            <comment id="311902" author="adilger" created="Thu, 2 Sep 2021 06:02:23 +0000"  >&lt;p&gt;My above patch does not make any attempt to fix this problem, just cleans up code for the &quot;&lt;tt&gt;nrs_policies&lt;/tt&gt;&quot; parameter, and other parameters in this file, so that this code is easier to find.&lt;/p&gt;</comment>
                            <comment id="311904" author="eaujames" created="Thu, 2 Sep 2021 06:50:26 +0000"  >&lt;p&gt;This issue occurred on a filesystem in production.&lt;/p&gt;

&lt;p&gt;Here is the context:&lt;br/&gt;
A user was filling the changelog at 18k opens/s (changelog usage jumped from 30% to 70% in one night), so the admin wanted to limit this user to avoid an MDT crash.&lt;br/&gt;
The active NRS policy was &quot;tbf gid&quot;; the admin changed the tbf policy to &quot;tbf&quot; to limit the user by uid.&lt;/p&gt;</comment>
                            <comment id="316621" author="gerrit" created="Wed, 27 Oct 2021 00:35:16 +0000"  >&lt;p&gt;&quot;Oleg Drokin &amp;lt;green@whamcloud.com&amp;gt;&quot; merged in patch &lt;a href=&quot;https://review.whamcloud.com/44817/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/44817/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14976&quot; title=&quot;Changing tbf policy induces high CPU load&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14976&quot;&gt;&lt;del&gt;LU-14976&lt;/del&gt;&lt;/a&gt; ptlrpc: align function names with param names&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 7fe49f1e7cf0586da0f389188325014a8a13b849&lt;/p&gt;</comment>
                            <comment id="346339" author="gerrit" created="Mon, 12 Sep 2022 11:40:47 +0000"  >&lt;p&gt;&quot;Etienne AUJAMES &amp;lt;eaujames@ddn.com&amp;gt;&quot; uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/48523&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/48523&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14976&quot; title=&quot;Changing tbf policy induces high CPU load&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14976&quot;&gt;&lt;del&gt;LU-14976&lt;/del&gt;&lt;/a&gt; nrs: change nrs policies at run time&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 37530fe5fc53a80519e4334a3a295e690f03afbc&lt;/p&gt;</comment>
                            <comment id="370227" author="gerrit" created="Sat, 22 Apr 2023 17:27:53 +0000"  >&lt;p&gt;&quot;Oleg Drokin &amp;lt;green@whamcloud.com&amp;gt;&quot; merged in patch &lt;a href=&quot;https://review.whamcloud.com/c/fs/lustre-release/+/48523/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/c/fs/lustre-release/+/48523/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14976&quot; title=&quot;Changing tbf policy induces high CPU load&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14976&quot;&gt;&lt;del&gt;LU-14976&lt;/del&gt;&lt;/a&gt; nrs: change nrs policies at run time&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: c098c09564a125dd44ffe0c135cd1cb6359229e7&lt;/p&gt;</comment>
                            <comment id="370267" author="pjones" created="Sat, 22 Apr 2023 18:31:14 +0000"  >&lt;p&gt;Landed for 2.16&lt;/p&gt;</comment>
                            <comment id="373290" author="gerrit" created="Wed, 24 May 2023 10:35:27 +0000"  >&lt;p&gt;&quot;Etienne AUJAMES &amp;lt;eaujames@ddn.com&amp;gt;&quot; uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/c/fs/lustre-release/+/51118&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/c/fs/lustre-release/+/51118&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14976&quot; title=&quot;Changing tbf policy induces high CPU load&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14976&quot;&gt;&lt;del&gt;LU-14976&lt;/del&gt;&lt;/a&gt; nrs: change nrs policies at run time&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_15&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 8292d1a744b996a43acb8d1f34210d8f9b6c7581&lt;/p&gt;</comment>
                            <comment id="373292" author="gerrit" created="Wed, 24 May 2023 10:59:55 +0000"  >&lt;p&gt;&quot;Etienne AUJAMES &amp;lt;eaujames@ddn.com&amp;gt;&quot; uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/c/fs/lustre-release/+/51119&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/c/fs/lustre-release/+/51119&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14976&quot; title=&quot;Changing tbf policy induces high CPU load&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14976&quot;&gt;&lt;del&gt;LU-14976&lt;/del&gt;&lt;/a&gt; nrs: change nrs policies at run time&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_12&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: bdb237e26e903d2eb9d7fb1697965c7234a431f5&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="47843">LU-9885</issuekey>
        </issuelink>
                            </outwardlinks>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="76199">LU-16846</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="40395" name="change_tbf_policy_dk.log" size="4543702" author="eaujames" created="Wed, 1 Sep 2021 17:15:21 +0000"/>
                            <attachment id="40393" name="tbf_cpu_load_after.svg" size="42021" author="eaujames" created="Wed, 1 Sep 2021 17:15:10 +0000"/>
                            <attachment id="40394" name="tbf_cpu_load_after_dk.log" size="7009684" author="eaujames" created="Wed, 1 Sep 2021 17:15:13 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i023av:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>