<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:25:53 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-2519] cfs_cpu_init() Failed to create ptable with npartitions 0</title>
                <link>https://jira.whamcloud.com/browse/LU-2519</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;I recently built lustre-2.3.0 client on sles11sp2. When I tried to load the libcfs module, it failed:&lt;/p&gt;
&lt;ol&gt;
	&lt;li&gt;modprobe libcfs&lt;br/&gt;
FATAL: Error inserting libcfs (/lib/modules/3.0.42-0.7.3.20121219-nasuv/updates/kernel/net/lustre/libcfs.ko): Operation not permitted&lt;/li&gt;
&lt;/ol&gt;


&lt;p&gt;The /var/log/messages said:&lt;br/&gt;
Dec 21 10:19:05 service331 kernel: [59144.393322] LNetError: 14632:0:(linux-cpu.c:881:cfs_cpt_table_create()) Failed to setup CPU-partition-table with 2 CPU-partitions, online HW nodes: 8, HW cpus: 8.&lt;br/&gt;
Dec 21 10:19:05 service331 kernel: [59144.436812] LNetError: 14632:0:(linux-cpu.c:1093:cfs_cpu_init()) Failed to create ptable with npartitions 0&lt;/p&gt;


&lt;p&gt;The sles11sp1 version of the lustre client 2.3.0 worked fine for me. I have scrapped the systems, so a comparison is not available at this point.&lt;/p&gt;

&lt;p&gt;There must be an easy answer for this problem, but my search for an answer came up empty. Please help! My testing of lustre-client 2.3.0 on sles11sp2 is stalled. Shouldn&apos;t a default value just work?&lt;/p&gt;

&lt;p&gt;I thought I tested 2.3.0 on sles11sp2 before, but I was wrong. It was 2.3.0 on sles11sp1 that I tested.&lt;/p&gt;
</description>
                <environment>sles11sp2 x86_64</environment>
        <key id="17015">LU-2519</key>
            <summary>cfs_cpu_init() Failed to create ptable with npartitions 0</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="3">Duplicate</resolution>
                                        <assignee username="liang">Liang Zhen</assignee>
                                    <reporter username="jaylan">Jay Lan</reporter>
                        <labels>
                    </labels>
                <created>Fri, 21 Dec 2012 13:42:19 +0000</created>
                <updated>Mon, 18 Nov 2013 20:32:14 +0000</updated>
                            <resolved>Mon, 18 Nov 2013 20:32:14 +0000</resolved>
                                    <version>Lustre 2.3.0</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>4</watches>
                                                                            <comments>
                            <comment id="49562" author="pjones" created="Fri, 21 Dec 2012 14:10:00 +0000"  >&lt;p&gt;Jay&lt;/p&gt;

&lt;p&gt;You have marked this ticket as Sev 1, which is reserved for production sites that are out of service. My understanding is that you are experimenting on a test system and this issue does not affect production systems.&lt;/p&gt;

&lt;p&gt;Is this correct?&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="49563" author="pjones" created="Fri, 21 Dec 2012 14:29:13 +0000"  >&lt;p&gt;Bob will help with this&lt;/p&gt;</comment>
                            <comment id="49564" author="jaylan" created="Fri, 21 Dec 2012 14:48:24 +0000"  >&lt;p&gt;Since there is no other message in the /var/log/messages, the error can be narrowed down to the for_each_online_node() loop in cfs_cpt_table_create(0) of libcfs/libcfs/linux/linux-cpu.c.&lt;/p&gt;</comment>
                            <comment id="49566" author="bogl" created="Fri, 21 Dec 2012 15:05:04 +0000"  >&lt;p&gt;Just as a workaround until we have a good solution, there is a modparam for libcfs to turn off cpu partitioning.  cpu_npartitions=1.&lt;/p&gt;

&lt;p&gt;see section 25.4 of Lustre Operations manual.&lt;/p&gt;</comment>
                            <comment id="49567" author="jaylan" created="Fri, 21 Dec 2012 16:16:05 +0000"  >&lt;p&gt;Still failed.&lt;/p&gt;

&lt;p&gt;Dec 21 12:34:47 service331 kernel: [  758.708600] LNetError: 8058:0:(linux-cpu.c:881:cfs_cpt_table_create()) Failed to setup CPU-partition-table with 1 CPU-partitions, online HW nodes: 8, HW cpus: 8.&lt;br/&gt;
Dec 21 12:34:47 service331 kernel: [  758.751826] LNetError: 8058:0:(linux-cpu.c:1093:cfs_cpu_init()) Failed to create ptable with npartitions 1&lt;/p&gt;</comment>
                            <comment id="49568" author="bogl" created="Fri, 21 Dec 2012 16:32:28 +0000"  >&lt;p&gt;I have been trying to reproduce your failure, but can&apos;t.  I&apos;ve tried 4 and 8 cpus, both with &amp;amp; without specifying cpu_npartitions and it all works for me.  However I only have VMs to work with, not real hardware.  Is there anything special about your HW platform?&lt;/p&gt;</comment>
                            <comment id="49569" author="bogl" created="Fri, 21 Dec 2012 16:35:55 +0000"  >&lt;p&gt;What&apos;s the kernel version in your sles11 sp2?  I update mine frequently with latest updates, my current version is 3.0.51-0.7.9.   Don&apos;t know if that would make a difference, just trying to guess how your environment might be different from mine.&lt;/p&gt;</comment>
                            <comment id="49570" author="jaylan" created="Fri, 21 Dec 2012 16:43:24 +0000"  >&lt;p&gt;There is nothing special about my HW platform afaik.&lt;br/&gt;
The kernel is 3.0.42-0.7.3.&lt;/p&gt;</comment>
                            <comment id="49571" author="jaylan" created="Fri, 21 Dec 2012 17:01:55 +0000"  >&lt;p&gt;It failed here:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;libcfs/libcfs/linux/linux-cpu.c
static struct cfs_cpt_table *
cfs_cpt_table_create(int ncpt)
{
...
    for_each_online_node(i) {
        cfs_node_to_cpumask(i, mask);
        CWARN(&quot;for_each_online_node: i=%d\n&quot;, i);

        while (!cpus_empty(*mask)) {
            struct cfs_cpu_partition *part;
            int    n;

            CWARN(&quot;!cpus_empty: cpt=%d\n&quot;, cpt);
            if (cpt &gt;= ncpt) {
                CERROR(&quot;cpt %d &gt;= ncpt %d\n&quot;,
                        cpt, ncpt);
                goto failed;
            }
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;It failed on the second cpu (i=1), first while-loop (cpt=1).&lt;/p&gt;

&lt;p&gt;Since cpt is not reset on each iteration of the for loop, and I have 8 cpus, cpt will clearly reach 7 by the 8th cpu, so the if statement is guaranteed to fail.&lt;/p&gt;

&lt;p&gt;Should cpt be reset to 0 at the beginning of the for-loop?&lt;/p&gt;</comment>
                            <comment id="49572" author="bogl" created="Fri, 21 Dec 2012 17:06:02 +0000"  >&lt;p&gt;I put some extra debug in the success path of cfs_cpt_table_create() and I see:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;13085:0:(linux-cpu.c:877:cfs_cpt_table_create()) Setup CPU-partition-table with 2 CPU-partitions, online HW nodes: 1, HW cpus: 8.
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Note that even with multiple cpus I have only 1 HW node.  This is probably why it works for me.&lt;/p&gt;</comment>
                            <comment id="49573" author="jaylan" created="Fri, 21 Dec 2012 17:06:50 +0000"  >&lt;p&gt;No, cpt would only increment conditionally:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;            if (num == cpus_weight(*part-&gt;cpt_cpumask))
                cpt++;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;But in my case, even with cpu_npartitions=1, cpt was incremented to 1 and caused the logic to fail.&lt;/p&gt;</comment>
                            <comment id="49574" author="jaylan" created="Fri, 21 Dec 2012 17:16:17 +0000"  >&lt;p&gt;Hmm, i think my system should be just 1 node. &lt;/p&gt;</comment>
                            <comment id="49575" author="bogl" created="Fri, 21 Dec 2012 17:21:03 +0000"  >&lt;p&gt;yes, that is what I would expect, but your reported error msg says &quot;online HW nodes: 8&quot;&lt;/p&gt;</comment>
                            <comment id="49577" author="jaylan" created="Fri, 21 Dec 2012 18:37:39 +0000"  >&lt;p&gt;I hacked cfs_cpt_table_create() to assume single node for now. I just confirmed that lustre-2.3.0 libcfs was loaded fine on a similar system running sles11sp1.&lt;/p&gt;

&lt;p&gt;I will check whether sles11sp1 returns num_online_nodes 1 or 8. If it returns 8, we will need more debugging on lustre; if it returns 1, it will be an issue for the hardware vendor/kernel. Thanks~&lt;/p&gt;</comment>
                            <comment id="49578" author="bogl" created="Fri, 21 Dec 2012 18:55:28 +0000"  >&lt;p&gt;You said your sles11sp1 was on a similar system, not the same system.  I was wondering if there might be some BIOS or other firmware-level setting on your platform that could deceive the OS about the number of HW nodes.  Could you check for different settings on the platforms you are using? If the settings vary, it might be those, not the distro version, that makes the difference.&lt;/p&gt;

&lt;p&gt;Trying to cover all the bases here.  It would be a lot easier if I could reproduce this myself, but no luck so far.&lt;/p&gt;</comment>
                            <comment id="49582" author="jaylan" created="Fri, 21 Dec 2012 20:34:51 +0000"  >&lt;p&gt;Well, it is a similar system today, but it was the same system two weeks ago. I ran acc-sm testing on both clients running the lustre-2.3 client on sles11sp1 two weeks ago.&lt;/p&gt;

&lt;p&gt;I have acc-sm testing running now on both 2.3 clients, one running the sles11sp1 kernel and the other the sles11sp2 kernel. The 2.3 client running on sles11sp2 has cfs_cpt_table_create() hacked. I do not wish to interrupt the testing now :)&lt;/p&gt;

&lt;p&gt;The hardware vendor has a week-long furlough next week, so I will not get any response back from them next week. But I will try to determine if sles11sp1 responds num_online_nodes() with 1 or 8 next week.&lt;/p&gt;</comment>
                            <comment id="49586" author="liang" created="Fri, 21 Dec 2012 22:02:23 +0000"  >&lt;p&gt;Hi Bob, sorry, I think I should take over this bug; there must be something wrong in my code. I will look into it.&lt;br/&gt;
I will create a debug patch for this.&lt;/p&gt;</comment>
                            <comment id="49689" author="jaylan" created="Wed, 26 Dec 2012 18:38:59 +0000"  >&lt;p&gt;What makes you think there is something wrong in your code, Liang Zhen?&lt;/p&gt;

&lt;p&gt;I have confirmed that the num_online_nodes() macro in sles11sp1 (2.6.32.54-0.3.1) returned 1 in my 8-cpu test system, and returned 8 in sles11sp2 (3.0.42-0.7.3).&lt;/p&gt;

&lt;p&gt;I hacked libcfs to always assume 1 node so I could continue my testing; however, it will be a real problem when I move my testing to a big SMP system, unless lustre can find a way to do cfs_cpu_init without calling num_online_nodes().&lt;/p&gt;</comment>
                            <comment id="49701" author="liang" created="Wed, 26 Dec 2012 23:59:58 +0000"  >&lt;p&gt;What&apos;s the CPU topology of your system? I&apos;m wondering why both num_online_cpus() and num_online_nodes() return 8; does it mean your system has 8 CPU sockets and each socket has a single core?&lt;br/&gt;
I think there is a way to use the &quot;cpu_pattern&quot; parameter of libcfs to work around this, but I need to know how many NUMA nodes, CPU sockets, and CPU cores are in your system.&lt;/p&gt;</comment>
                            <comment id="49732" author="jaylan" created="Thu, 27 Dec 2012 16:54:42 +0000"  >&lt;p&gt;Attached is /proc/cpuinfo from the two systems. S331 runs sles11sp2 and s332 runs sles11sp1. They look almost the same to me.&lt;/p&gt;</comment>
                            <comment id="49738" author="liang" created="Thu, 27 Dec 2012 23:37:21 +0000"  >&lt;p&gt;So your system has 2 CPU sockets, each socket has 4 cores, but num_online_nodes() returns 8 on sp2 for an unknown reason, is this correct? Could you run this under sp2 to see how many online nodes there are:&lt;br/&gt;
ls /sys/devices/system/node/&lt;br/&gt;
and run this for each node:&lt;br/&gt;
cat /sys/devices/system/node/node0/cpulist&lt;br/&gt;
cat /sys/devices/system/node/node1/cpulist&lt;br/&gt;
...&lt;/p&gt;

&lt;p&gt;I think one way to make it work is put this line in /etc/modprobe.d/lustre.conf:&lt;br/&gt;
options libcfs cpu_pattern=&quot;0[0,2,4,6] 1[1,3,5,7]&quot;&lt;/p&gt;</comment>
                            <comment id="49757" author="jaylan" created="Fri, 28 Dec 2012 13:49:46 +0000"  >&lt;p&gt;Liang&apos;s workaround worked for me.&lt;/p&gt;

&lt;p&gt;I filed a bug report to the hardware vendor but do not expect to get response until next week. Let&apos;s keep this LU open until we have a better understanding on the problem on this hardware platform.&lt;/p&gt;</comment>
                            <comment id="49892" author="jaylan" created="Thu, 3 Jan 2013 13:35:00 +0000"  >&lt;p&gt;I think we can close this ticket. It appears to be a problem with certain hardware platforms; all newer hardware platforms seem to work correctly. Fortunately, the troubled hardware we use in production is not used as a lustre client (except for the systems I use in testing).&lt;/p&gt;

&lt;p&gt;I am happy to use the workaround provided by Liang in my test rack.&lt;/p&gt;</comment>
                            <comment id="49893" author="pjones" created="Thu, 3 Jan 2013 13:36:54 +0000"  >&lt;p&gt;ok thanks Jay&lt;/p&gt;</comment>
                            <comment id="69545" author="jaylan" created="Tue, 22 Oct 2013 17:38:55 +0000"  >&lt;p&gt;Could you reopen this ticket?&lt;/p&gt;

&lt;p&gt;This problem happened again, and this time I had a clear picture of what went wrong. It appears libcfs cannot handle the fake NUMA situation, i.e., adding &quot;numa=fake=&amp;lt;n&amp;gt;&quot; to the boot line.&lt;/p&gt;

&lt;p&gt;When a system is booted with fake NUMA, syslog shows an &quot;Operation not permitted&quot; error from libcfs, and none of the lustre modules can load.&lt;/p&gt;

&lt;p&gt;The workaround suggested by Liang Zhen worked this time as well. We are able to get lustre mounted on those fake-NUMA sles11sp2 systems.&lt;/p&gt;</comment>
                            <comment id="69618" author="liang" created="Wed, 23 Oct 2013 08:25:54 +0000"  >&lt;p&gt;There is another ticket (&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3992&quot; title=&quot;Fix NUMA emulated mode&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3992&quot;&gt;&lt;del&gt;LU-3992&lt;/del&gt;&lt;/a&gt;) which reported the same issue, and patch link is:&lt;br/&gt;
&lt;a href=&quot;http://review.whamcloud.com/#/c/7724/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/7724/&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="70088" author="jaylan" created="Tue, 29 Oct 2013 00:42:01 +0000"  >&lt;p&gt;The patch failed in Maloo testing though...&lt;/p&gt;
</comment>
                            <comment id="71817" author="jaylan" created="Mon, 18 Nov 2013 19:23:40 +0000"  >&lt;p&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3992&quot; title=&quot;Fix NUMA emulated mode&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3992&quot;&gt;&lt;del&gt;LU-3992&lt;/del&gt;&lt;/a&gt; marked closed and I have cherry-picked into nas-2.4.0-1 and nas-2.4.1. Please close this ticket. Thanks!&lt;/p&gt;</comment>
                            <comment id="71825" author="pjones" created="Mon, 18 Nov 2013 20:32:14 +0000"  >&lt;p&gt;ok - thanks Jay!&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                            <attachment id="12117" name="cpuinfo.s331" size="6160" author="jaylan" created="Thu, 27 Dec 2012 16:54:42 +0000"/>
                            <attachment id="12118" name="cpuinfo.s332" size="6045" author="jaylan" created="Thu, 27 Dec 2012 16:54:42 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzve5z:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>5934</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>