<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:38:44 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-3997] Excessive slab usage causes large mem &amp; core count clients to hang</title>
                <link>https://jira.whamcloud.com/browse/LU-3997</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Client version: 2.3.0, SLES kernel 3.0.13_0.27_default&lt;br/&gt;
Server version: 2.1.2&lt;/p&gt;

&lt;p&gt;We are running into an issue at SCBI that appears to be similar to &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3771&quot; title=&quot;stuck 56G of SUnreclaim memory&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3771&quot;&gt;&lt;del&gt;LU-3771&lt;/del&gt;&lt;/a&gt;. Under certain workloads, the slab memory usage gets to the point where it causes the kernel to hang. Apparently it is not a problem under the SLUB allocator, but Novell is not prepared to support a SLES kernel with the SLUB allocator enabled.&lt;/p&gt;

&lt;p&gt;Are there any tunables that can help shrink the slab usage? &lt;/p&gt;

&lt;p&gt;HP Summary:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The main factor triggering the problem is reading from /proc/slabinfo. SLAB does this while holding l3-&amp;gt;list_lock, and when a slab is huge this leads to big delays so that other subsystems are impacted; if the NMI watchdog is enabled, this leads to soft/hard lockups and panics.&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;Novell analysis:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Have you actually tried to put the pcc governor out of the way? I can still see many&lt;br/&gt;
cpus looping on the same pcc internal lock:&lt;br/&gt;
crash&amp;gt; struct spinlock_t ffffffffa047c610&lt;br/&gt;
struct spinlock_t {&lt;br/&gt;
{&lt;br/&gt;
rlock = {&lt;br/&gt;
raw_lock = {&lt;br/&gt;
slock = 862335805&lt;br/&gt;
}&lt;br/&gt;
}&lt;br/&gt;
}&lt;br/&gt;
}&lt;br/&gt;
crash&amp;gt; p /x 862335805&lt;br/&gt;
$1 = 0x3366333d&lt;/p&gt;

&lt;p&gt;crash&amp;gt; p 0x3366-0x333d&lt;br/&gt;
$2 = 41&lt;/p&gt;
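&lt;p&gt;The ticket arithmetic above can be replayed mechanically. A minimal sketch, assuming the x86 ticket-spinlock layout of this kernel generation (16-bit next ticket in the high half of slock, 16-bit owner ticket in the low half):&lt;/p&gt;

```python
# Decode the slock word quoted above. Assumed layout: high 16 bits hold
# the next ticket to hand out, low 16 bits the ticket currently served.
def decode_ticket(slock):
    nxt = slock >> 16          # next ticket that will be handed out
    owner = slock % 0x10000    # ticket of the current lock holder
    waiters = nxt - owner - 1  # CPUs queued behind the holder
    return nxt, owner, waiters

nxt, owner, waiters = decode_ticket(862335805)   # slock from the dump
print(hex(nxt), hex(owner), waiters)             # 0x3366 0x333d 40
```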

&lt;p&gt;So there are 40 CPUs waiting for the lock. This sounds really insane! Who is&lt;br/&gt;
holding the lock?&lt;br/&gt;
PID: 79454 TASK: ffff882fd6a224c0 CPU: 0 COMMAND: &quot;kworker/0:1&quot;&lt;br/&gt;
#0 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff88407f807eb0&amp;#93;&lt;/span&gt; crash_nmi_callback at ffffffff8101eaef&lt;br/&gt;
#1 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff88407f807ec0&amp;#93;&lt;/span&gt; notifier_call_chain at ffffffff81445617&lt;br/&gt;
#2 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff88407f807ef0&amp;#93;&lt;/span&gt; notify_die at ffffffff814456ad&lt;br/&gt;
#3 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff88407f807f20&amp;#93;&lt;/span&gt; default_do_nmi at ffffffff814429d7&lt;br/&gt;
#4 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff88407f807f40&amp;#93;&lt;/span&gt; do_nmi at ffffffff81442c08&lt;br/&gt;
#5 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff88407f807f50&amp;#93;&lt;/span&gt; nmi at ffffffff81442320&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;exception RIP: _raw_spin_lock+24&amp;#93;&lt;/span&gt;&lt;br/&gt;
RIP: ffffffff81441998 RSP: ffff883f02147d28 RFLAGS: 00000293&lt;br/&gt;
RAX: 000000000000333d RBX: ffff8b3e85c4e680 RCX: 0000000000000028&lt;br/&gt;
RDX: 0000000000003335 RSI: 0000000000249f00 RDI: ffffffffa047c610&lt;br/&gt;
RBP: ffff88407f80eb80 R8: 0000000000000020 R9: 0000000000000000&lt;br/&gt;
R10: 0000000000000064 R11: ffffffffa047a4e0 R12: 0000000000249f00&lt;br/&gt;
R13: 0000000000004fd4 R14: 00000000000000a0 R15: 0000000000000000&lt;br/&gt;
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018&lt;br/&gt;
&amp;#8212; &amp;lt;NMI exception stack&amp;gt; &amp;#8212;&lt;br/&gt;
#6 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff883f02147d28&amp;#93;&lt;/span&gt; _raw_spin_lock at ffffffff81441998&lt;br/&gt;
#7 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff883f02147d28&amp;#93;&lt;/span&gt; pcc_cpufreq_target at ffffffffa047a4fe &lt;span class=&quot;error&quot;&gt;&amp;#91;pcc_cpufreq&amp;#93;&lt;/span&gt;&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;...&amp;#93;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;OK, this one requested ticket 0x333d but it still sees a very old spinlock state.&lt;br/&gt;
And what is more interesting is that it just refetched the global state:&lt;br/&gt;
0xffffffff81441995 &amp;lt;_raw_spin_lock+21&amp;gt;: movzwl (%rdi),%edx&lt;br/&gt;
0xffffffff81441998 &amp;lt;_raw_spin_lock+24&amp;gt;: jmp 0xffffffff8144198f&lt;br/&gt;
&amp;lt;_raw_spin_lock+15&amp;gt;&lt;/p&gt;

&lt;p&gt;The lock is not IRQ safe so an interrupt might have triggered after movzwl and before jmp. OK, let&apos;s pretend that this is not a problem, although I wouldn&apos;t be happy about a CPU governor that scales this badly on such a machine.&lt;/p&gt;

&lt;p&gt;The lockup has been detected:&lt;br/&gt;
crash&amp;gt; dmesg | grep -i lockup&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;385474.330482&amp;#93;&lt;/span&gt; BUG: soft lockup - CPU#0 stuck for 22s! &lt;span class=&quot;error&quot;&gt;&amp;#91;sort:130201&amp;#93;&lt;/span&gt;&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;507912.743427&amp;#93;&lt;/span&gt; Kernel panic - not syncing: Watchdog detected hard LOCKUP on&lt;br/&gt;
cpu 44&lt;/p&gt;

&lt;p&gt;The first one (soft lockup) was obviously recoverable. The second is more&lt;br/&gt;
interesting:&lt;br/&gt;
PID: 100927 TASK: ffff8b3e857c2580 CPU: 44 COMMAND: &quot;collectl&quot;&lt;br/&gt;
#0 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff8a3fff907b20&amp;#93;&lt;/span&gt; machine_kexec at ffffffff810265ce&lt;br/&gt;
#1 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff8a3fff907b70&amp;#93;&lt;/span&gt; crash_kexec at ffffffff810a3b5a&lt;br/&gt;
#2 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff8a3fff907c40&amp;#93;&lt;/span&gt; panic at ffffffff8143eadf&lt;br/&gt;
#3 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff8a3fff907cc0&amp;#93;&lt;/span&gt; watchdog_overflow_callback at ffffffff810be194&lt;br/&gt;
#4 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff8a3fff907cd0&amp;#93;&lt;/span&gt; __perf_event_overflow at ffffffff810e9aba&lt;br/&gt;
#5 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff8a3fff907d70&amp;#93;&lt;/span&gt; intel_pmu_handle_irq at ffffffff810159d9&lt;br/&gt;
#6 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff8a3fff907eb0&amp;#93;&lt;/span&gt; perf_event_nmi_handler at ffffffff814433b1&lt;br/&gt;
#7 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff8a3fff907ec0&amp;#93;&lt;/span&gt; notifier_call_chain at ffffffff81445617&lt;br/&gt;
#8 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff8a3fff907ef0&amp;#93;&lt;/span&gt; notify_die at ffffffff814456ad&lt;br/&gt;
#9 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff8a3fff907f20&amp;#93;&lt;/span&gt; default_do_nmi at ffffffff814429d7&lt;br/&gt;
#10 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff8a3fff907f40&amp;#93;&lt;/span&gt; do_nmi at ffffffff81442c08&lt;br/&gt;
#11 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff8a3fff907f50&amp;#93;&lt;/span&gt; nmi at ffffffff81442320&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;exception RIP: s_show+211&amp;#93;&lt;/span&gt;&lt;br/&gt;
RIP: ffffffff8113a4c3 RSP: ffff8b3e70d2fde8 RFLAGS: 00000046&lt;br/&gt;
RAX: ffff89367c870000 RBX: 0000000000000000 RCX: 0000000000000025&lt;br/&gt;
RDX: 0000000000000025 RSI: ffff893fff42e150 RDI: ffff893fff42e180&lt;br/&gt;
RBP: ffff893fff42e140 R8: 0000000000000400 R9: ffffffff81be18a0&lt;br/&gt;
R10: 0000ffff00066c0a R11: 0000000000000000 R12: 0000000004ec9217&lt;br/&gt;
R13: 00000000002270bc R14: 0000000000000000 R15: 0000000000000002&lt;br/&gt;
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018&lt;br/&gt;
&amp;#8212; &amp;lt;NMI exception stack&amp;gt; &amp;#8212;&lt;br/&gt;
#12 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff8b3e70d2fde8&amp;#93;&lt;/span&gt; s_show at ffffffff8113a4c3&lt;br/&gt;
#13 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff8b3e70d2fe60&amp;#93;&lt;/span&gt; seq_read at ffffffff81171991&lt;br/&gt;
#14 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff8b3e70d2fed0&amp;#93;&lt;/span&gt; proc_reg_read at ffffffff811ad847&lt;br/&gt;
#15 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff8b3e70d2ff10&amp;#93;&lt;/span&gt; vfs_read at ffffffff81151687&lt;br/&gt;
#16 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff8b3e70d2ff40&amp;#93;&lt;/span&gt; sys_read at ffffffff811517f3&lt;br/&gt;
#17 &lt;span class=&quot;error&quot;&gt;&amp;#91;ffff8b3e70d2ff80&amp;#93;&lt;/span&gt; system_call_fastpath at ffffffff81449692&lt;br/&gt;
RIP: 00007d570dc8e750 RSP: 00007fff43f31af0 RFLAGS: 00010206&lt;br/&gt;
RAX: 0000000000000000 RBX: ffffffff81449692 RCX: 0000000000004618&lt;br/&gt;
RDX: 0000000000001000 RSI: 0000000001bc7ee8 RDI: 0000000000000007&lt;br/&gt;
RBP: 000000000078e010 R8: 0000000000000000 R9: 0000000000000000&lt;br/&gt;
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000001bc7ee8&lt;br/&gt;
R13: 0000000000000007 R14: 0000000000000000 R15: 000000000000f001&lt;br/&gt;
ORIG_RAX: 0000000000000000 CS: 0033 SS: 002b&lt;/p&gt;

&lt;p&gt;This is the /proc/slabinfo interface. rbp contains the kmem_list3.&lt;br/&gt;
crash&amp;gt; struct kmem_list3 ffff893fff42e140&lt;br/&gt;
struct kmem_list3 {&lt;br/&gt;
slabs_partial = {&lt;br/&gt;
next = 0xffff893fff42e140,&lt;br/&gt;
prev = 0xffff893fff42e140&lt;br/&gt;
},&lt;br/&gt;
slabs_full = {&lt;br/&gt;
next = 0xffff892a92857000,&lt;br/&gt;
prev = 0xffff893e86cb0000&lt;br/&gt;
},&lt;br/&gt;
slabs_free = {&lt;br/&gt;
next = 0xffff893fff42e160,&lt;br/&gt;
prev = 0xffff893fff42e160&lt;br/&gt;
},&lt;br/&gt;
free_objects = 0,&lt;br/&gt;
free_limit = 637,&lt;br/&gt;
colour_next = 0,&lt;br/&gt;
list_lock = {&lt;br/&gt;
{&lt;br/&gt;
rlock = {&lt;br/&gt;
raw_lock = {&lt;br/&gt;
slock = 2008643510&lt;br/&gt;
}&lt;br/&gt;
}&lt;br/&gt;
}&lt;br/&gt;
},&lt;br/&gt;
shared = 0xffff893fff42f000,&lt;br/&gt;
alien = 0xffff893fff41d640,&lt;br/&gt;
next_reap = 4422089447,&lt;br/&gt;
free_touched = 1&lt;br/&gt;
}&lt;/p&gt;

&lt;p&gt;There are no free or partially filled slabs, so we have only full_slabs, and&lt;br/&gt;
quite a lot of them:&lt;br/&gt;
crash&amp;gt; list -s slab 0xffff892a92857000 | grep &quot;^ffff&quot; &amp;gt; full_slabs&lt;br/&gt;
[wait for a loooooooooooooooooooooooong time until you lose your patience and&lt;br/&gt;
ctrl+c]&lt;br/&gt;
$ wc -l full_slabs&lt;br/&gt;
55898 full_slabs&lt;/p&gt;

&lt;p&gt;So yes, this is indeed dangerous if some subsystem allocates too many objects.&lt;br/&gt;
Especially when:&lt;br/&gt;
ll /proc/slabinfo&lt;br/&gt;
-rw-r--r-- 1 root root 0 Sep 19 11:50 /proc/slabinfo&lt;/p&gt;

&lt;p&gt;So anybody might read it and interfere. Upstream is no better in that aspect as&lt;br/&gt;
get_slabinfo does the same thing. SLUB would be better as it uses atomic&lt;br/&gt;
counters for the same purposes.&lt;/p&gt;

&lt;p&gt;We can silence the watchdog and stuff touch_nmi_watchdog into the loops, but that only papers over the real issue. The right thing to do would be to have&lt;br/&gt;
something similar to SLUB and collect statistics per kmem_cache l3. I am not&lt;br/&gt;
sure whether this is doable considering kABI restrictions and potential&lt;br/&gt;
performance regressions. I would have to dive into this more but unfortunately I am leaving for a long vacation. Let&apos;s CC Mel here. Also, I haven&apos;t seen this as a problem with our regular kernel because nothing seems to be allocating so many kmalloc objects, so it is questionable how much of a problem this really is for the supported kernel configurations.&lt;/p&gt;

&lt;p&gt;Whether using so many objects is healthy is another question for which I do not have a good answer. SLAB tries to batch operations internally so it should scale with the number of objects quite well, but I am not familiar enough with all the internals to tell that with 100% certainty.&lt;/p&gt;&lt;/blockquote&gt;</description>
                <environment></environment>
        <key id="21094">LU-3997</key>
            <summary>Excessive slab usage causes large mem &amp; core count clients to hang</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="6">Not a Bug</resolution>
                                        <assignee username="bfaccini">Bruno Faccini</assignee>
                                    <reporter username="kitwestneat">Kit Westneat</reporter>
                        <labels>
                    </labels>
                <created>Mon, 23 Sep 2013 20:07:04 +0000</created>
                <updated>Thu, 24 Oct 2013 15:17:04 +0000</updated>
                            <resolved>Thu, 24 Oct 2013 15:17:04 +0000</resolved>
                                    <version>Lustre 2.3.0</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>3</watches>
                                                                            <comments>
                            <comment id="67342" author="bfaccini" created="Tue, 24 Sep 2013 13:16:55 +0000"  >&lt;p&gt;Hello Kit,&lt;br/&gt;
Even if I easily understand that the /proc/slabinfo walk-thru by the &quot;collectl&quot; thread can take ages due to huge Slabs, I don&apos;t see where it can interfere with the pcc_lock spin-lock used by other threads here.&lt;/p&gt;

&lt;p&gt;Could it be possible for you to provide the crash-dump (along with vmlinux/kernel-debuginfo&lt;span class=&quot;error&quot;&gt;&amp;#91;-common&amp;#93;&lt;/span&gt; RPMs, and the lustre-&lt;span class=&quot;error&quot;&gt;&amp;#91;modules,debuginfo&amp;#93;&lt;/span&gt; RPMs) or at least attach the log/dmesg, foreach bt, kmem -s, crash sub-commands output ??&lt;/p&gt;</comment>
                            <comment id="67396" author="kitwestneat" created="Tue, 24 Sep 2013 17:19:16 +0000"  >&lt;p&gt;Hi Bruno, &lt;/p&gt;

&lt;p&gt;Where should we upload the core to?&lt;/p&gt;

&lt;p&gt;Thanks,&lt;br/&gt;
Kit &lt;/p&gt;</comment>
                            <comment id="67466" author="bfaccini" created="Tue, 24 Sep 2013 21:32:25 +0000"  >&lt;p&gt;Just sent you upload instructions by email.&lt;br/&gt;
Thanks for your help.&lt;/p&gt;</comment>
                            <comment id="67779" author="kitwestneat" created="Thu, 26 Sep 2013 22:50:01 +0000"  >&lt;p&gt;Hi Bruno,&lt;/p&gt;

&lt;p&gt;They should be uploaded, let me know if you need anything else.&lt;/p&gt;

&lt;p&gt;Thanks.&lt;/p&gt;</comment>
                            <comment id="67866" author="kitwestneat" created="Fri, 27 Sep 2013 19:15:12 +0000"  >&lt;p&gt;I&apos;ve just run into a similar issue at NREL while doing robinhood testing, where the slab unreclaim goes to 90% of memory. I ran &apos;echo 3 &amp;gt; /proc/sys/vm/drop_caches&apos;, but it is hanging and the bash process is at 100% CPU, as is the kswapd process. Lustre is mounted read-only.&lt;/p&gt;

&lt;p&gt;Here is slabtop sorted by cache size:&lt;br/&gt;
2080183 2080182  99%    8.00K 2080183        1  16641464K size-8192&lt;br/&gt;
6185372 6185242  99%    1.00K 1546343        4   6185372K size-1024&lt;br/&gt;
2062676 2062631  99%    2.00K 1031338        2   4125352K size-2048&lt;br/&gt;
2064096 2063826  99%    0.50K 258012        8   1032048K size-512&lt;br/&gt;
3054480 305107   9%    0.12K 101816       30    407264K size-128&lt;br/&gt;
4803485 4754884  98%    0.06K  81415       59    325660K size-64 &lt;br/&gt;
4096288 2116344  51%    0.03K  36574      112    146296K size-32   &lt;/p&gt;
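&lt;p&gt;As a sanity check on the slabtop rows above, the CACHE SIZE column is just the object count times the object size; a quick sketch using the two largest rows:&lt;/p&gt;

```python
# Verify the slabtop CACHE SIZE column from OBJS x OBJ SIZE, using the
# size-8192 and size-1024 rows quoted above.
rows = {            # name: (objects, object size in KiB)
    "size-8192": (2080183, 8.0),
    "size-1024": (6185372, 1.0),
}
for name, (objs, kib) in rows.items():
    print(name, int(objs * kib), "K")  # matches the CACHE SIZE column
```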

&lt;p&gt;You can see that the 8k slab cache is using 16G. The other slabs are also using a lot. &lt;/p&gt;

&lt;p&gt;memused=28486658060&lt;br/&gt;
memused_max=28535237092&lt;/p&gt;

&lt;p&gt;The kernel is 2.6.32-279.el6.x86_64, client is 2.1.6. Any more information we can get? &lt;/p&gt;
</comment>
                            <comment id="67870" author="kitwestneat" created="Fri, 27 Sep 2013 19:32:33 +0000"  >&lt;p&gt;The NREL bug is due to &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-2613&quot; title=&quot;opening and closing file can generate &amp;#39;unreclaimable slab&amp;#39; space&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-2613&quot;&gt;&lt;del&gt;LU-2613&lt;/del&gt;&lt;/a&gt;. Reading in a file on the filesystem caused it to unblock. &lt;/p&gt;
</comment>
<comment id="68129" author="bfaccini" created="Wed, 2 Oct 2013 09:07:31 +0000"  >&lt;p&gt;Kit, I am sorry, but the crash tool complains &quot;no debugging data available&quot; against the &quot;localscratch/dump/2013-08-20-14:05/vmlinux-3.0.13-0.27-default&quot; kernel you provided ...&lt;br/&gt;
Do you know where to find and/or can you also upload the corresponding kernel-debuginfo* RPMs ??&lt;br/&gt;
Thanks again and in advance for your help!&lt;/p&gt;</comment>
                            <comment id="68234" author="kitwestneat" created="Thu, 3 Oct 2013 14:38:26 +0000"  >&lt;p&gt;Hi Bruno,&lt;/p&gt;

&lt;p&gt;The customer has uploaded the debuginfo rpm.&lt;/p&gt;

&lt;p&gt;Thanks,&lt;br/&gt;
Kit&lt;/p&gt;</comment>
                            <comment id="69121" author="kitwestneat" created="Wed, 16 Oct 2013 14:50:24 +0000"  >&lt;p&gt;Hi Bruno,&lt;/p&gt;

&lt;p&gt;Were you able to get the debuginfo?&lt;/p&gt;

&lt;p&gt;Thanks,&lt;br/&gt;
Kit&lt;/p&gt;</comment>
                            <comment id="69391" author="bfaccini" created="Mon, 21 Oct 2013 12:48:09 +0000"  >&lt;p&gt;Yes I got it and I am working on the crash-dump now that crash tool is happy.&lt;/p&gt;</comment>
<comment id="69505" author="bfaccini" created="Tue, 22 Oct 2013 10:12:22 +0000"  >&lt;p&gt;Humm, the crash output you already attached is not from the uploaded crash-dump, but fortunately it shows the same situation!! &lt;/p&gt;

&lt;p&gt;Concerning the pcc_lock contention, about 28 of the node&apos;s 80 cores have threads spinning on it, while it is likely to be owned by this thread:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;PID: 5758   TASK: ffff883369b12480  CPU: 0   COMMAND: &quot;kworker/0:1&quot;
 #0 [ffff88407f807eb0] crash_nmi_callback at ffffffff8101eaef
 #1 [ffff88407f807ec0] notifier_call_chain at ffffffff81445617
 #2 [ffff88407f807ef0] notify_die at ffffffff814456ad
 #3 [ffff88407f807f20] default_do_nmi at ffffffff814429d7
 #4 [ffff88407f807f40] do_nmi at ffffffff81442c08
 #5 [ffff88407f807f50] nmi at ffffffff81442320
    [exception RIP: _raw_spin_lock+21]
    RIP: ffffffff81441995  RSP: ffff8834d4d7dd28  RFLAGS: 00000283
    RAX: 000000000000c430  RBX: ffff8b3e8532d7c0  RCX: 0000000000000028
    RDX: 000000000000c42e  RSI: 0000000000249f00  RDI: ffffffffa02b7610
    RBP: ffff88407f80eb80   R8: 0000000000000020   R9: 0000000000000000
    R10: 0000000000000064  R11: ffffffffa02b54e0  R12: 0000000000249f00
    R13: 0000000000005ef0  R14: 00000000000000a0  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- &amp;lt;NMI exception stack&amp;gt; ---
 #6 [ffff8834d4d7dd28] _raw_spin_lock at ffffffff81441995
 #7 [ffff8834d4d7dd28] pcc_cpufreq_target at ffffffffa02b54fe [pcc_cpufreq]
 #8 [ffff8834d4d7dd78] dbs_check_cpu at ffffffff8135feb3
 #9 [ffff8834d4d7ddf8] do_dbs_timer at ffffffff813601c8
#10 [ffff8834d4d7de28] process_one_work at ffffffff810747bc
#11 [ffff8834d4d7de78] worker_thread at ffffffff8107734a
#12 [ffff8834d4d7dee8] kthread at ffffffff8107b676
#13 [ffff8834d4d7df48] kernel_thread_helper at ffffffff8144a7c4
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;again, it is running on CPU/Core 0 and also shows an old spin-lock/ticket value which should only be a side-effect of the NMI handling.&lt;br/&gt;
But finally, I don&apos;t think this pcc_lock contention is an issue since it only occurs on Idle cores (ie, where the only thread waiting in RU-nning state is the Idle-loop), during power/frequency recalibration.&lt;/p&gt;

&lt;p&gt;Now I will investigate the Soft-lockup+NMI/watchdog issue caused by collectl computing the Slabs consumption.&lt;/p&gt;
</comment>
                            <comment id="69597" author="bfaccini" created="Tue, 22 Oct 2013 23:16:00 +0000"  >&lt;p&gt;Top Slabs consumers in crash-dump provided are :&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;CACHE            NAME                 OBJSIZE  ALLOCATED     TOTAL  SLABS  SSIZE
ffff883ef7a61080 cl_page_kmem             192  110784616  117426440 5871322     4k
ffff883ef5f011c0 osc_page_kmem            216   55392308  59188842 3288269     4k
ffff883ef5ee17c0 vvp_page_kmem             80   55392308  59744976 1244687     4k
ffff883ef5e913c0 lov_page_kmem             48   55392308  59936492 778396     4k
ffff883ef5a01540 lovsub_page_kmem          40   55392308  60008840 652270     4k
ffff88407f690980 radix_tree_node          560    2812010   3063060 437580     4k
ffff883ef59d16c0 lustre_inode_cache      1152    1077902   1078294 154042     8k
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The others are of a much lower order of magnitude.&lt;br/&gt;
The KMem footprint is more than 50GB, but that is only 2.5% of the 2048GB of memory. The corresponding number of objects (360446944) can be the problem here, causing the collectl thread to spend too long in the s_show() routine loops, and thus trigger the hard-lockup ??&lt;/p&gt;</comment>
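The totals in the comment above can be cross-checked against the crash table: the object count is the sum of the TOTAL column, and the footprint is SLABS times SSIZE per cache. A quick sketch (numbers copied from the table):

```python
# Cross-check the quoted totals from the crash slab-consumer table:
# object count = sum of TOTAL, footprint = SLABS x SSIZE per cache.
caches = [  # (name, TOTAL objects, SLABS, SSIZE in KiB)
    ("cl_page_kmem",       117426440, 5871322, 4),
    ("osc_page_kmem",       59188842, 3288269, 4),
    ("vvp_page_kmem",       59744976, 1244687, 4),
    ("lov_page_kmem",       59936492,  778396, 4),
    ("lovsub_page_kmem",    60008840,  652270, 4),
    ("radix_tree_node",      3063060,  437580, 4),
    ("lustre_inode_cache",   1078294,  154042, 8),
]
objects = sum(total for _, total, _, _ in caches)
gb = sum(slabs * ssize for _, _, slabs, ssize in caches) * 1024 / 1e9
print(objects, round(gb, 1))  # 360446944 objects, about 51.5 GB
```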
<comment id="69697" author="bfaccini" created="Wed, 23 Oct 2013 22:59:03 +0000"  >&lt;p&gt;And no surprise: the current kmem_cache being walked thru, as part of the &quot;/proc/slabinfo&quot; access at the time of the problem, is cl_page_kmem, with its huge number of Slabs/Objects.&lt;/p&gt;

&lt;p&gt;There are 80 cores divided into 8 NUMA nodes, and it is Node #4&apos;s kmem_list3 that is being processed. It is made of 782369 slabs_full, 190921 slabs_partial and no slabs_free, parsed in this order by the s_show() routine with IRQs disabled (ie, causing no HPET timer updates in-between). And the current Slab being used at the time of the crash is one of the partial ones (the 173600th out of 190921), so it seems the watchdog simply did not allow the parsing of Node-4&apos;s cl_page_kmem consumption to complete !!&lt;/p&gt;

&lt;p&gt;So, according to the concerned Slabs and their current usage, this does not look like &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-2613&quot; title=&quot;opening and closing file can generate &amp;#39;unreclaimable slab&amp;#39; space&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-2613&quot;&gt;&lt;del&gt;LU-2613&lt;/del&gt;&lt;/a&gt; nor &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4053&quot; title=&quot;client leaking objects/locks during IO&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4053&quot;&gt;&lt;del&gt;LU-4053&lt;/del&gt;&lt;/a&gt; scenarios, but seems to be only the consequence of a huge-memory Lustre Client page-cache memory foot-print.&lt;/p&gt;

&lt;p&gt;And yes, SLUBs are definitely a future option when supported by distro providers.&lt;/p&gt;

&lt;p&gt;And sure, disabling the HPET/NMI-watchdog could be an &quot;ugly&quot; work-around, but another possible one could be to regularly drain the Lustre page-cache (using &quot;lctl set_param ldlm_namespaces.*.lru_size=clear&quot; and/or &quot;echo 3 &amp;gt; /proc/sys/vm/drop_caches&quot;) and/or reduce the lustre-page-cache size (max_cached_mb) in order to reduce the number of *_page_kmem objects kept in Slabs. Last, simply avoiding /proc/slabinfo usage is also one !!&lt;/p&gt;

&lt;p&gt;What else can be done about this???&lt;/p&gt;</comment>
                            <comment id="69750" author="kitwestneat" created="Thu, 24 Oct 2013 14:37:29 +0000"  >&lt;p&gt;Hi Bruno,&lt;/p&gt;

&lt;p&gt;The customer has decided to disable collectl on the client and this seems to have cleared up the issue. Thank you for your investigation into the issue. I think we can close the ticket.&lt;/p&gt;

&lt;p&gt;Thanks,&lt;br/&gt;
Kit&lt;/p&gt;</comment>
                            <comment id="69757" author="bfaccini" created="Thu, 24 Oct 2013 14:58:43 +0000"  >&lt;p&gt;Thanks for the update Kit. Do you agree if I close it with the &quot;Not a Bug&quot; reason ??&lt;/p&gt;</comment>
                            <comment id="69759" author="kitwestneat" created="Thu, 24 Oct 2013 15:14:09 +0000"  >&lt;p&gt;Sure&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="21245">LU-4053</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzw3pj:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>10695</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>