<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 03:35:57 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-17501] ptlrpcd to avoid client cores that are very busy</title>
                <link>https://jira.whamcloud.com/browse/LU-17501</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Some (non Lustre) filesystems consume 100% of the CPU cycles on one or more cores busy-waiting by polling for event completion and scheduled at a high priority (not sure if &quot;real time&quot; or not).  They also apparently configure the CPU scheduler to deny scheduling of other processes on some CPU cores.&lt;/p&gt;

&lt;p&gt;The &lt;tt&gt;ptlrpcd&lt;/tt&gt; threads handle RPC sending and receiving.  They are normally distributed evenly across cores and bound to their NUMA domain to minimize cross-CPU memory traffic when a well-distributed application workload (e.g. a multi-threaded computational job) is running on the system and allocates and dirties data pages evenly across all of the NUMA domains.  In cases where the number of cores is larger than the number of active application threads, it is advantageous for &lt;tt&gt;ptlrpcd&lt;/tt&gt; threads on other CPU cores to take over the RPC processing, offloading CPU-intensive tasks like checksums, compression, and encryption to cores that are otherwise under-utilized.&lt;/p&gt;

&lt;p&gt;At no time do &lt;tt&gt;ptlrpcd&lt;/tt&gt; (or other Lustre service) threads exclusively use or busy-wait on CPU cores, nor do they prevent application threads from using those cores when they are not actively processing requests on behalf of the application.&lt;/p&gt;

&lt;p&gt;However, if &lt;tt&gt;ptlrpcd&lt;/tt&gt; threads are started on a core in a NUMA domain where threads cannot be scheduled for lengthy periods of time, they can stall when they try to process RPCs.  This causes intermittently laggy RPC handling whenever those threads are processing a time-sensitive RPC.&lt;/p&gt;

&lt;p&gt;To work around this issue, we used &lt;tt&gt;lscpu&lt;/tt&gt; to determine the NUMA configuration of the CPUs installed and then created a CPT configuration that avoided scheduling the &lt;tt&gt;ptlrpcd&lt;/tt&gt; threads on cores that had been taken over by the other filesystem:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# lscpu | grep NUMA
NUMA:
 NUMA node(s):     2
 NUMA node0 CPU(s):   0-63,128-191
 NUMA node1 CPU(s):   64-127,192-255
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;In the &lt;tt&gt;/etc/modprobe.d/lustre.conf&lt;/tt&gt; file, the following lines were added to restrict the Lustre CPU Partition Table to the last 8 cores (of 64) in each of the four CPU ranges shown above (two NUMA nodes, each with two sibling thread ranges), avoiding the other filesystem that was keeping the first two cores of each NUMA node busy:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;options libcfs cpu_npartitions=4
options libcfs cpu_pattern=&quot;0[56-63] 1[120-127] 2[184-191] 3[248-255]&quot;
options ptlrpcd max_ptlrpcds=64
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;That allows those threads to run on 32 different cores in total, with a maximum of 16 &lt;tt&gt;ptlrpcd&lt;/tt&gt; threads running across the 8 cores in each CPU partition.&lt;/p&gt;
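For illustration, the workaround above (read the per-node CPU lists from lscpu, keep the last 8 CPUs of each range, and number the partitions in CPU order) can be sketched as a small shell helper.  This is not part of Lustre; the function name gen_cpu_pattern and the input format (one NUMA node CPU list per line, as printed by lscpu) are assumptions made for the sketch:

```shell
# Sketch: build a libcfs cpu_pattern string that keeps only the last 8 CPUs
# of each CPU range, one partition per range.  Input: one comma-separated
# CPU list per NUMA node, e.g. "0-63,128-191" (as reported by lscpu).
gen_cpu_pattern() {
    # split the lists into one range per line, ordered by starting CPU
    tr ',' '\n' | sort -t- -k1,1n | {
        part=0
        pat=""
        while read -r r; do
            hi=${r#*-}                       # last CPU of this range
            pat="$pat ${part}[$((hi - 7))-${hi}]"
            part=$((part + 1))
        done
        echo "${pat# }"
    }
}

# Example with the two node CPU lists from the lscpu output above:
printf '0-63,128-191\n64-127,192-255\n' | gen_cpu_pattern
# → 0[56-63] 1[120-127] 2[184-191] 3[248-255]
```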

&lt;p&gt;However, this is only a workaround, as specifying &lt;tt&gt;cpu_pattern&lt;/tt&gt; and &lt;tt&gt;cpu_npartitions&lt;/tt&gt; is relatively complex and CPU-specific, and the values likely need to differ between systems within the same cluster.  It would be better to have a more flexible mechanism to avoid this issue.&lt;/p&gt;

&lt;p&gt;One option is to add an exclude pattern option to libcfs which &lt;b&gt;avoids&lt;/b&gt; the specified cores when configuring the CPT map.  For example, the following would exclude cores 0-1 and 128-129:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;options libcfs cpu_pattern=&quot;X[0-1] X[128-129]&quot;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
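To make the proposed (not yet implemented) semantics concrete, here is a rough shell model of the behavior.  This is not Lustre code: the &lt;tt&gt;X[lo-hi]&lt;/tt&gt; syntax is the proposal above, and remaining_cpus is a hypothetical name used only for this sketch.

```shell
# Model of the proposed "X[lo-hi]" exclude syntax: given the total number of
# CPUs and an exclude pattern, print the CPUs that would remain available
# for ptlrpcd threads.  Purely illustrative, not an implementation.
remaining_cpus() {
    local ncpus=$1 pattern=$2 excluded="" tok c
    set -f    # keep the shell from glob-expanding the [..] tokens
    for tok in $pattern; do
        case $tok in
        X\[*-*\])
            # strip "X[" and "]" to get the lo-hi range
            local body=${tok#X\[}
            body=${body%\]}
            for c in $(seq "${body%-*}" "${body#*-}"); do
                excluded="$excluded $c "
            done
            ;;
        esac
    done
    set +f
    local cpu out=""
    for cpu in $(seq 0 $((ncpus - 1))); do
        case $excluded in
        *" $cpu "*) ;;                # this CPU is excluded
        *) out="$out $cpu" ;;
        esac
    done
    echo "${out# }"
}

# e.g. with 8 CPUs, excluding cores 0-1:
remaining_cpus 8 'X[0-1]'
# → 2 3 4 5 6 7
```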

&lt;p&gt;That provides a relatively simple (and mostly universal) way to avoid e.g. core0 and core1 on all machines, without having to know the full NUMA configuration details of each one.  To exclude cores on each NUMA node, a syntax like the following could be used:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;options libcfs cpu_pattern=&quot;N X[0-1]&quot;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;which would mean &quot;exclude all of the cores in NUMA node0 and node1&quot;, aligned with the existing &quot;&lt;tt&gt;N 0[0-1]&lt;/tt&gt;&quot; definition, which means &quot;include all of the cores in NUMA node0 and node1 into CPT0&quot;.&lt;/p&gt;

&lt;p&gt;To exclude specific &lt;b&gt;cores&lt;/b&gt; in each NUMA node, an option like the following could be used:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;options libcfs cpu_pattern=&quot;N C[0-7]&quot;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;to exclude the first eight cores on each NUMA domain.  The meaning of &quot;&lt;tt&gt;X&lt;/tt&gt;&quot; and &quot;&lt;tt&gt;C&lt;/tt&gt;&quot; would be identical if &quot;&lt;tt&gt;N&lt;/tt&gt;&quot; is not specified.  It may also make sense to allow &quot;&lt;tt&gt;N C[-2]&lt;/tt&gt;&quot; to exclude the &lt;b&gt;last&lt;/b&gt; two cores on each NUMA node, in case that is needed at some point.&lt;/p&gt;

&lt;p&gt;Having an exclude list for cores would also be an easy way to reserve CPU cores for userspace threads running on server nodes (e.g. HA (Corosync/Pacemaker), monitoring, logging, sshd, etc.).&lt;/p&gt;


&lt;p&gt;A further improvement would be to dynamically detect when the CPU scheduler has been configured to avoid scheduling processes on a particular core, and/or to detect when &lt;tt&gt;ptlrpcd&lt;/tt&gt; cannot be scheduled on a core and stop using that core entirely (probably with a console message to that effect), similar to CPU hot-unplug.  Dynamic exclusion/load detection is more complex to implement, but it would avoid the need to statically configure nodes at all, and would work around the breakage introduced by other filesystems.&lt;/p&gt;</description>
                <environment></environment>
        <key id="80631">LU-17501</key>
            <summary>ptlrpcd to avoid client cores that are very busy</summary>
                <type id="4" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11310&amp;avatarType=issuetype">Improvement</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="wc-triage">WC Triage</assignee>
                                    <reporter username="adilger">Andreas Dilger</reporter>
                        <labels>
                            <label>medium</label>
                    </labels>
                <created>Sat, 3 Feb 2024 21:37:18 +0000</created>
                <updated>Mon, 5 Feb 2024 11:07:24 +0000</updated>
                                                                                <due></due>
                            <votes>0</votes>
                                    <watches>5</watches>
                                                                                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                                        </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i04a8n:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>