<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:09:53 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-7553] Lustre cpu_npartitions default value breaks memory allocation on clients</title>
                <link>https://jira.whamcloud.com/browse/LU-7553</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;I brought this up in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5050&quot; title=&quot;cpu partitioning oddities&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5050&quot;&gt;&lt;del&gt;LU-5050&lt;/del&gt;&lt;/a&gt;, but failed to get traction.  This is a specific example of how Lustre&apos;s default cpu_partition_table and related memory node code is broken by default.&lt;/p&gt;

&lt;p&gt;We have Power7 nodes that appear to have 48 cpus under Linux (12 physical cores, 4 way SMT).  There is only a &lt;em&gt;single&lt;/em&gt; memory zone on this machine:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Node 0, zone      DMA   4840   3290   3289   1676    749    325    114    105     69     10      4      1   3664 
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;For no good reason at all, Lustre decides to lay out the cpu_partition_table like this:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;0       : 0 1 2 3 4 5 
1       : 6 7 8 9 10 11 
2       : 12 13 14 15 16 17 
3       : 18 19 20 21 22 23 
4       : 24 25 26 27 28 29 
5       : 30 31 32 33 34 35 
6       : 36 37 38 39 40 41 
7       : 42 43 44 45 46 47 
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This table has no basis in reality.  Not only that, the code seems to assume two memory zones, again for no clear reason that I can see.  The memory zone selection doesn&apos;t seem to be visible anywhere, so I needed to add debugging code to figure out what was going on.  Take a look at this:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00000100:00100000:24.0:1450144617.022705:5648:4651:0:(service.c:2891:ptlrpc_start_thread()) ldlm_cbd[0] started 0 min 2 max 2
00000100:00100000:24.0:1450144617.022707:5648:4651:0:(service.c:2950:ptlrpc_start_thread()) starting thread &apos;ldlm_cb00_000&apos;
00000400:00100000:29.0:1450144617.022761:1296:4718:0:(linux-cpu.c:631:cfs_cpt_bind()) cpumask=63 nodemask=1
00000100:00100000:24.0:1450144617.022809:5648:4651:0:(service.c:2891:ptlrpc_start_thread()) ldlm_cbd[0] started 1 min 2 max 2
00000100:00100000:24.0:1450144617.022811:5648:4651:0:(service.c:2950:ptlrpc_start_thread()) starting thread &apos;ldlm_cb00_001&apos;
00000400:00100000:33.0F:1450144617.022906:1296:4720:0:(linux-cpu.c:631:cfs_cpt_bind()) cpumask=63 nodemask=1
00000100:00100000:24.0:1450144617.022930:5648:4651:0:(service.c:2891:ptlrpc_start_thread()) ldlm_cbd[1] started 0 min 2 max 2
00000100:00100000:24.0:1450144617.022932:5648:4651:0:(service.c:2950:ptlrpc_start_thread()) starting thread &apos;ldlm_cb01_000&apos;
00000400:00100000:29.0:1450144617.022973:1296:4721:0:(linux-cpu.c:631:cfs_cpt_bind()) cpumask=4032 nodemask=1
00000100:00100000:24.0:1450144617.023029:5648:4651:0:(service.c:2891:ptlrpc_start_thread()) ldlm_cbd[1] started 1 min 2 max 2
00000100:00100000:24.0:1450144617.023031:5648:4651:0:(service.c:2950:ptlrpc_start_thread()) starting thread &apos;ldlm_cb01_001&apos;
00000400:00100000:29.0:1450144617.023071:1296:4722:0:(linux-cpu.c:631:cfs_cpt_bind()) cpumask=4032 nodemask=1
00000100:00100000:24.0:1450144617.023087:5648:4651:0:(service.c:2891:ptlrpc_start_thread()) ldlm_cbd[2] started 0 min 2 max 2
00000100:00100000:24.0:1450144617.023089:5648:4651:0:(service.c:2950:ptlrpc_start_thread()) starting thread &apos;ldlm_cb02_000&apos;
00000400:00100000:29.0:1450144617.023127:1296:4723:0:(linux-cpu.c:631:cfs_cpt_bind()) cpumask=258048 nodemask=1
00000100:00100000:24.0:1450144617.023165:5648:4651:0:(service.c:2891:ptlrpc_start_thread()) ldlm_cbd[2] started 1 min 2 max 2
00000100:00100000:24.0:1450144617.023167:5648:4651:0:(service.c:2950:ptlrpc_start_thread()) starting thread &apos;ldlm_cb02_001&apos;
00000400:00100000:29.0:1450144617.023203:1296:4724:0:(linux-cpu.c:631:cfs_cpt_bind()) cpumask=258048 nodemask=1
00000100:00100000:24.0:1450144617.023218:5648:4651:0:(service.c:2891:ptlrpc_start_thread()) ldlm_cbd[3] started 0 min 2 max 2
00000100:00100000:24.0:1450144617.023219:5648:4651:0:(service.c:2950:ptlrpc_start_thread()) starting thread &apos;ldlm_cb03_000&apos;
00000400:00100000:29.0:1450144617.023257:1296:4725:0:(linux-cpu.c:631:cfs_cpt_bind()) cpumask=16515072 nodemask=1
00000100:00100000:24.0:1450144617.023296:5648:4651:0:(service.c:2891:ptlrpc_start_thread()) ldlm_cbd[3] started 1 min 2 max 2
00000100:00100000:24.0:1450144617.023299:5648:4651:0:(service.c:2950:ptlrpc_start_thread()) starting thread &apos;ldlm_cb03_001&apos;
00000400:00100000:29.0:1450144617.023335:1296:4726:0:(linux-cpu.c:631:cfs_cpt_bind()) cpumask=16515072 nodemask=1
00000100:00100000:24.0:1450144617.023351:5648:4651:0:(service.c:2891:ptlrpc_start_thread()) ldlm_cbd[4] started 0 min 2 max 2
00000100:00100000:24.0:1450144617.023353:5648:4651:0:(service.c:2950:ptlrpc_start_thread()) starting thread &apos;ldlm_cb04_000&apos;
00000400:00100000:29.0:1450144617.023388:1296:4727:0:(linux-cpu.c:631:cfs_cpt_bind()) cpumask=1056964608 nodemask=2
00000100:00100000:24.0:1450144617.023416:5648:4651:0:(service.c:2891:ptlrpc_start_thread()) ldlm_cbd[4] started 1 min 2 max 2
00000100:00100000:24.0:1450144617.023418:5648:4651:0:(service.c:2950:ptlrpc_start_thread()) starting thread &apos;ldlm_cb04_001&apos;
00000400:00100000:29.0:1450144617.023453:1296:4728:0:(linux-cpu.c:631:cfs_cpt_bind()) cpumask=1056964608 nodemask=2
00000100:00100000:24.0:1450144617.023464:5648:4651:0:(service.c:2891:ptlrpc_start_thread()) ldlm_cbd[5] started 0 min 2 max 2
00000100:00100000:24.0:1450144617.023466:5648:4651:0:(service.c:2950:ptlrpc_start_thread()) starting thread &apos;ldlm_cb05_000&apos;
00000400:00100000:29.0:1450144617.023503:1296:4729:0:(linux-cpu.c:631:cfs_cpt_bind()) cpumask=67645734912 nodemask=2
00000100:00100000:24.0:1450144617.023537:5648:4651:0:(service.c:2891:ptlrpc_start_thread()) ldlm_cbd[5] started 1 min 2 max 2
00000100:00100000:24.0:1450144617.023540:5648:4651:0:(service.c:2950:ptlrpc_start_thread()) starting thread &apos;ldlm_cb05_001&apos;
00000400:00100000:29.0:1450144617.023576:1296:4730:0:(linux-cpu.c:631:cfs_cpt_bind()) cpumask=67645734912 nodemask=2
00000100:00100000:24.0:1450144617.023594:5648:4651:0:(service.c:2891:ptlrpc_start_thread()) ldlm_cbd[6] started 0 min 2 max 2
00000100:00100000:24.0:1450144617.023596:5648:4651:0:(service.c:2950:ptlrpc_start_thread()) starting thread &apos;ldlm_cb06_000&apos;
00000400:00100000:29.0:1450144617.023635:1296:4731:0:(linux-cpu.c:631:cfs_cpt_bind()) cpumask=4329327034368 nodemask=2
00000100:00100000:24.0:1450144617.023670:5648:4651:0:(service.c:2891:ptlrpc_start_thread()) ldlm_cbd[6] started 1 min 2 max 2
00000100:00100000:24.0:1450144617.023673:5648:4651:0:(service.c:2950:ptlrpc_start_thread()) starting thread &apos;ldlm_cb06_001&apos;
00000400:00100000:29.0:1450144617.023709:1296:4732:0:(linux-cpu.c:631:cfs_cpt_bind()) cpumask=4329327034368 nodemask=2
00000100:00100000:24.0:1450144617.023724:5648:4651:0:(service.c:2891:ptlrpc_start_thread()) ldlm_cbd[7] started 0 min 2 max 2
00000100:00100000:24.0:1450144617.023726:5648:4651:0:(service.c:2950:ptlrpc_start_thread()) starting thread &apos;ldlm_cb07_000&apos;
00000400:00100000:29.0:1450144617.023766:1296:4733:0:(linux-cpu.c:631:cfs_cpt_bind()) cpumask=277076930199552 nodemask=2
00000100:00100000:24.0:1450144617.023806:5648:4651:0:(service.c:2891:ptlrpc_start_thread()) ldlm_cbd[7] started 1 min 2 max 2
00000100:00100000:24.0:1450144617.023808:5648:4651:0:(service.c:2950:ptlrpc_start_thread()) starting thread &apos;ldlm_cb07_001&apos;
00000400:00100000:29.0:1450144617.023843:1296:4734:0:(linux-cpu.c:631:cfs_cpt_bind()) cpumask=277076930199552 nodemask=2
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
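&lt;p&gt;For what it&apos;s worth, the cpumask values in the trace are exactly what you get from a naive contiguous split of the 48 CPUs into 8 partitions of 6, with no reference to the real topology.  The helper below is my own illustration of that, not Lustre code (written with multiplication rather than shifts):&lt;/p&gt;

```python
# Sketch: reproduce the cpumask values printed by cfs_cpt_bind() above,
# assuming each partition simply takes the next 6 consecutive CPU numbers.
def cpt_mask(part, cpus_per_part=6):
    """Bitmask of the CPUs in partition 'part' under a naive contiguous split."""
    block = 2 ** cpus_per_part - 1              # 0b111111, six CPUs
    return block * 2 ** (cpus_per_part * part)  # move the block into position

masks = [cpt_mask(i) for i in range(8)]
print(masks)
# Matches the trace: 63, 4032, 258048, 16515072, 1056964608, ...
```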

&lt;p&gt;kmalloc()s are failing on the threads that have nodemask=2.  You can&apos;t see the failed memory allocations in the above trace only because I commented out the call to set_mems_allowed() in cfs_cpt_bind().&lt;/p&gt;

&lt;p&gt;So now we know that the default cpu_partition_table layout code is broken in:&lt;/p&gt;

&lt;ul&gt;
	&lt;li&gt;Robin Humble&apos;s example of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5050&quot; title=&quot;cpu partitioning oddities&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5050&quot;&gt;&lt;del&gt;LU-5050&lt;/del&gt;&lt;/a&gt;&lt;/li&gt;
	&lt;li&gt;This broken behavior on a Power7 node&lt;/li&gt;
	&lt;li&gt;My information that the default layout algorithm didn&apos;t match the actual hardware on &lt;em&gt;any&lt;/em&gt; LLNL system at the time in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5050&quot; title=&quot;cpu partitioning oddities&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5050&quot;&gt;&lt;del&gt;LU-5050&lt;/del&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;I think we now have overwhelming evidence that we should set cpu_npartitions to 1 by default in Lustre until the cpu_partition_table code can actually make sane decisions on its own.&lt;/p&gt;

&lt;p&gt;Lustre &lt;em&gt;must&lt;/em&gt; have sane defaults.  A default value that makes things fast only on the tiny subset of systems where the table happens to match the hardware does not justify turning this on everywhere.  That small, unlikely benefit does not outweigh the many ways in which the current default outright breaks things.&lt;/p&gt;

&lt;p&gt;cpu_npartitions=1 would totally work for &lt;em&gt;everyone&lt;/em&gt; by default.&lt;/p&gt;
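&lt;p&gt;For anyone else bitten by this, the workaround we are running with is to force a single partition through the libcfs module option.  A sketch (the conf file name is just my choice):&lt;/p&gt;

```shell
# /etc/modprobe.d/lustre.conf
# Force a single CPU partition so Lustre skips its broken
# cpu_partition_table / memory node layout entirely.
options libcfs cpu_npartitions=1
```

&lt;p&gt;The libcfs module has to be reloaded (or the node rebooted) before the option takes effect.&lt;/p&gt;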

&lt;p&gt;Let&apos;s please restore a sane default, already!&lt;/p&gt;</description>
                <environment></environment>
        <key id="33696">LU-7553</key>
            <summary>Lustre cpu_npartitions default value breaks memory allocation on clients</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="2" iconUrl="https://jira.whamcloud.com/images/icons/priorities/critical.svg">Critical</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="3">Duplicate</resolution>
                                        <assignee username="wc-triage">WC Triage</assignee>
                                    <reporter username="morrone">Christopher Morrone</reporter>
                        <labels>
                            <label>llnl</label>
                    </labels>
                <created>Tue, 15 Dec 2015 02:40:19 +0000</created>
                <updated>Sat, 21 Jul 2018 04:58:54 +0000</updated>
                            <resolved>Tue, 15 Dec 2015 18:50:33 +0000</resolved>
                                                                        <due></due>
                            <votes>0</votes>
                                    <watches>2</watches>
                                                                            <comments>
                            <comment id="136396" author="adilger" created="Tue, 15 Dec 2015 18:50:33 +0000"  >&lt;p&gt;Closing this as a duplicate of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5050&quot; title=&quot;cpu partitioning oddities&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5050&quot;&gt;&lt;del&gt;LU-5050&lt;/del&gt;&lt;/a&gt;, will address the comments there. &lt;/p&gt;</comment>
                            <comment id="136673" author="adilger" created="Thu, 17 Dec 2015 06:10:42 +0000"  >&lt;p&gt;Chris, I was looking at this ticket again to see how we can fix the memory allocation binding, but am confused about something.  If there is only a single memory zone on this system, the set_mems_allowed() call shouldn&apos;t make any difference, because all of the allocations would come from the same zone no matter which CPU they are made on?&lt;/p&gt;</comment>
                            <comment id="136752" author="morrone" created="Thu, 17 Dec 2015 19:55:24 +0000"  >&lt;p&gt;The set_mems_allowed() call is all tied in with the cpu partition table code, so setting cpu_npartitions=1 and disabling all of that has production operations back online.  Yes, cpu binding and memory node binding are not necessarily related, but the code has them fairly tangled together.&lt;/p&gt;

&lt;p&gt;Since the code can&apos;t figure out what sockets, cores, and SMT threads really exist and map them correctly, I would not be terribly surprised if Lustre is messing up the binding to memory nodes and assigning half of the processes to a node that doesn&apos;t really exist.  Granted, there is some speculation there, so take it with a grain of salt.&lt;/p&gt;

&lt;p&gt;But I do know this much:&lt;/p&gt;

&lt;p&gt;/proc/buddyinfo shows one memory node.  When Lustre binds a process to the second (presumably non-existent) node, that process goes on to fail very simple, small kmalloc() calls, despite there being nearly 60GB of free memory; buddyinfo verifies that there are plenty of order-0 blocks free (and plenty in all of the other orders as well).  The processes that were bound to the first memory node (presumably the real one) did not exhibit memory allocation problems.&lt;/p&gt;
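&lt;p&gt;(For reference, an easy way to see the binding from userspace; substitute the pid of one of the ldlm threads of interest:)&lt;/p&gt;

```shell
# Show which memory nodes a task is allowed to allocate from; on this
# single-node machine Mems_allowed_list should read "0" for every thread.
pid=$$    # hypothetical placeholder; use a ptlrpc/ldlm thread pid here
grep Mems_allowed /proc/$pid/status
```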

&lt;p&gt;Like I said, you can&apos;t see the failure in the Lustre log snippet that I provided because I had already commented out set_mems_allowed().  But in all of the earlier runs where set_mems_allowed() was active, the first process that used nodemask 2 always hit a kmalloc() failure, and Lustre completely aborted the setup at that point.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="24686">LU-5050</issuekey>
        </issuelink>
                            </outwardlinks>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="38161">LU-8395</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="52781">LU-11163</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzxvuv:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>