[LU-3997] Excessive slab usage causes large mem & core count clients to hang - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Not a Bug
Priority: Minor
Fix Version/s: None
Affects Version/s: Lustre 2.3.0
Labels:
None

Severity:
3
Rank (Obsolete):
10695

Description

Client version: 2.3.0, SLES kernel 3.0.13_0.27_default
Server version: 2.1.2

We are running into an issue at SCBI that appears to be similar to LU-3771. Under certain workloads, slab memory usage grows to the point where the kernel hangs. Apparently this is not a problem under the SLUB allocator, but Novell is not prepared to support a SLES kernel with the SLUB allocator enabled.

Are there any tunables that can help shrink the slab usage?

HP Summary:

The main factor triggering the problem is reading from /proc/slabinfo. SLAB does this while holding l3->list_lock, and when a slab cache is huge this leads to long delays that impact other subsystems. If the NMI watchdog is enabled, this leads to soft/hard lockups and panics.

Novell analysis:

Have you actually tried to put the pcc governor out of the way? I can still see many
CPUs looping on the same pcc internal lock:
crash> struct spinlock_t ffffffffa047c610
struct spinlock_t {
{
rlock = {
raw_lock = {
slock = 862335805
}
}
}
}
crash> p /x 862335805
$1 = 0x3366333d

crash> p 0x3366-0x333d
$2 = 41
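The subtraction in this crash session can be reproduced outside of crash. The following is a minimal sketch, assuming the 16-bit ticket layout that the subtraction itself implies (low half is the ticket currently being served, high half is the next ticket to hand out); `decode_ticket_lock` is a hypothetical helper name, not a crash or kernel function.

```python
def decode_ticket_lock(slock: int) -> dict:
    """Split a 32-bit ticket-spinlock word into its two 16-bit halves.

    Assumed layout: low 16 bits = ticket now being served (head),
    high 16 bits = next ticket to hand out (tail).
    """
    head = slock & 0xFFFF
    tail = (slock >> 16) & 0xFFFF
    # Tickets handed out but not yet released; the mask handles 16-bit wrap.
    outstanding = (tail - head) & 0xFFFF
    return {"head": head, "tail": tail, "outstanding": outstanding}


pcc_lock = decode_ticket_lock(862335805)
print(hex(pcc_lock["head"]), hex(pcc_lock["tail"]), pcc_lock["outstanding"])
# prints: 0x333d 0x3366 41
```

If one of the 41 outstanding tickets belongs to the CPU that currently owns (or is next to own) the lock, the remaining tickets belong to waiters.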

So there are 40 CPUs waiting for the lock. This sounds really insane! Who is
holding the lock?
PID: 79454 TASK: ffff882fd6a224c0 CPU: 0 COMMAND: "kworker/0:1"
#0 [ffff88407f807eb0] crash_nmi_callback at ffffffff8101eaef
#1 [ffff88407f807ec0] notifier_call_chain at ffffffff81445617
#2 [ffff88407f807ef0] notify_die at ffffffff814456ad
#3 [ffff88407f807f20] default_do_nmi at ffffffff814429d7
#4 [ffff88407f807f40] do_nmi at ffffffff81442c08
#5 [ffff88407f807f50] nmi at ffffffff81442320
[exception RIP: _raw_spin_lock+24]
RIP: ffffffff81441998 RSP: ffff883f02147d28 RFLAGS: 00000293
RAX: 000000000000333d RBX: ffff8b3e85c4e680 RCX: 0000000000000028
RDX: 0000000000003335 RSI: 0000000000249f00 RDI: ffffffffa047c610
RBP: ffff88407f80eb80 R8: 0000000000000020 R9: 0000000000000000
R10: 0000000000000064 R11: ffffffffa047a4e0 R12: 0000000000249f00
R13: 0000000000004fd4 R14: 00000000000000a0 R15: 0000000000000000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#6 [ffff883f02147d28] _raw_spin_lock at ffffffff81441998
#7 [ffff883f02147d28] pcc_cpufreq_target at ffffffffa047a4fe [pcc_cpufreq]
[...]

OK, this one requested ticket 0x333d but it still sees a very old spinlock state.
And what is more interesting is that it has just refetched the global state:
0xffffffff81441995 <_raw_spin_lock+21>: movzwl (%rdi),%edx
0xffffffff81441998 <_raw_spin_lock+24>: jmp 0xffffffff8144198f <_raw_spin_lock+15>

The lock is not IRQ safe, so an interrupt might have triggered after the movzwl and before the jmp. OK, let's pretend that this is not a problem, although I wouldn't be happy about a CPU governor that scales this badly on such a machine.

The lockup has been detected:
crash> dmesg | grep -i lockup
[385474.330482] BUG: soft lockup - CPU#0 stuck for 22s! [sort:130201]
[507912.743427] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 44

The first one (soft lockup) was obviously recoverable. The second is more
interesting:
PID: 100927 TASK: ffff8b3e857c2580 CPU: 44 COMMAND: "collectl"
#0 [ffff8a3fff907b20] machine_kexec at ffffffff810265ce
#1 [ffff8a3fff907b70] crash_kexec at ffffffff810a3b5a
#2 [ffff8a3fff907c40] panic at ffffffff8143eadf
#3 [ffff8a3fff907cc0] watchdog_overflow_callback at ffffffff810be194
#4 [ffff8a3fff907cd0] __perf_event_overflow at ffffffff810e9aba
#5 [ffff8a3fff907d70] intel_pmu_handle_irq at ffffffff810159d9
#6 [ffff8a3fff907eb0] perf_event_nmi_handler at ffffffff814433b1
#7 [ffff8a3fff907ec0] notifier_call_chain at ffffffff81445617
#8 [ffff8a3fff907ef0] notify_die at ffffffff814456ad
#9 [ffff8a3fff907f20] default_do_nmi at ffffffff814429d7
#10 [ffff8a3fff907f40] do_nmi at ffffffff81442c08
#11 [ffff8a3fff907f50] nmi at ffffffff81442320
[exception RIP: s_show+211]
RIP: ffffffff8113a4c3 RSP: ffff8b3e70d2fde8 RFLAGS: 00000046
RAX: ffff89367c870000 RBX: 0000000000000000 RCX: 0000000000000025
RDX: 0000000000000025 RSI: ffff893fff42e150 RDI: ffff893fff42e180
RBP: ffff893fff42e140 R8: 0000000000000400 R9: ffffffff81be18a0
R10: 0000ffff00066c0a R11: 0000000000000000 R12: 0000000004ec9217
R13: 00000000002270bc R14: 0000000000000000 R15: 0000000000000002
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#12 [ffff8b3e70d2fde8] s_show at ffffffff8113a4c3
#13 [ffff8b3e70d2fe60] seq_read at ffffffff81171991
#14 [ffff8b3e70d2fed0] proc_reg_read at ffffffff811ad847
#15 [ffff8b3e70d2ff10] vfs_read at ffffffff81151687
#16 [ffff8b3e70d2ff40] sys_read at ffffffff811517f3
#17 [ffff8b3e70d2ff80] system_call_fastpath at ffffffff81449692
RIP: 00007d570dc8e750 RSP: 00007fff43f31af0 RFLAGS: 00010206
RAX: 0000000000000000 RBX: ffffffff81449692 RCX: 0000000000004618
RDX: 0000000000001000 RSI: 0000000001bc7ee8 RDI: 0000000000000007
RBP: 000000000078e010 R8: 0000000000000000 R9: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000001bc7ee8
R13: 0000000000000007 R14: 0000000000000000 R15: 000000000000f001
ORIG_RAX: 0000000000000000 CS: 0033 SS: 002b

This is the /proc/slabinfo interface. rbp contains the kmem_list3:
crash> struct kmem_list3 ffff893fff42e140
struct kmem_list3 {
slabs_partial = {
next = 0xffff893fff42e140,
prev = 0xffff893fff42e140
},
slabs_full = {
next = 0xffff892a92857000,
prev = 0xffff893e86cb0000
},
slabs_free = {
next = 0xffff893fff42e160,
prev = 0xffff893fff42e160
},
free_objects = 0,
free_limit = 637,
colour_next = 0,
list_lock = {
{
rlock = {
raw_lock = {
slock = 2008643510
}
}
}
},
shared = 0xffff893fff42f000,
alien = 0xffff893fff41d640,
next_reap = 4422089447,
free_touched = 1
}
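The dump above already shows which of the three lists are empty, because an empty circular kernel list has a head whose `next` pointer points back at the head itself. The following is a minimal sketch of that check, assuming that the three `struct list_head` members are the first fields of `kmem_list3` in the order printed and that each one is 16 bytes on x86_64; `list_is_empty` is a hypothetical helper name.

```python
LIST_HEAD_SIZE = 16  # assumption: two 8-byte pointers per struct list_head on x86_64


def list_is_empty(head_addr: int, next_ptr: int) -> bool:
    """A circular kernel list is empty when head->next points back at the head."""
    return next_ptr == head_addr


base = 0xffff893fff42e140  # address given to `struct kmem_list3` in the dump

# (member name, assumed offset from base, `next` value copied from the dump)
members = [
    ("slabs_partial", 0 * LIST_HEAD_SIZE, 0xffff893fff42e140),
    ("slabs_full",    1 * LIST_HEAD_SIZE, 0xffff892a92857000),
    ("slabs_free",    2 * LIST_HEAD_SIZE, 0xffff893fff42e160),
]

empty = {name: list_is_empty(base + off, nxt) for name, off, nxt in members}
print(empty)
# prints: {'slabs_partial': True, 'slabs_full': False, 'slabs_free': True}
```

Under these assumptions only slabs_full has entries, which agrees with free_objects = 0 in the dump.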

There are neither free nor partially filled slabs, so we have only full slabs
(slabs_full), and quite a lot of them:
crash> list -s slab 0xffff892a92857000 | grep "^ffff" > full_slabs
[wait for a loooooooooooooooooooooooong time until you lose your patience and
hit ctrl+c]
$ wc -l full_slabs
55898 full_slabs

So yes, this is indeed dangerous if some subsystem allocates too many objects.
Especially when:
ll /proc/slabinfo
-rw-r--r-- 1 root root 0 Sep 19 11:50 /proc/slabinfo

So anybody might read and interfere. Upstream is no better in that aspect as
get_slabinfo does the same thing. SLUB would be better as it is using atomic
counters for the same purposes.
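The difference between the two accounting styles can be illustrated with a toy model. This is only a sketch in user-space Python, not kernel code: the class names are hypothetical, a `threading.Lock` stands in for both the list lock and the atomic increment, and one object per slab is an arbitrary assumption.

```python
import threading


class WalkedStats:
    """Toy model: the count is derived by walking every slab while holding the lock."""

    def __init__(self):
        self.list_lock = threading.Lock()
        self.slabs_full = []  # one entry per slab: number of objects in use

    def add_slab(self, objects: int) -> None:
        with self.list_lock:
            self.slabs_full.append(objects)

    def active_objects(self) -> int:
        with self.list_lock:  # held for the whole walk: O(number of slabs)
            return sum(self.slabs_full)


class CounterStats:
    """Toy model: a running counter is maintained on the allocation path."""

    def __init__(self):
        self.lock = threading.Lock()  # stand-in for an atomic increment
        self.total = 0

    def add_slab(self, objects: int) -> None:
        with self.lock:
            self.total += objects

    def active_objects(self) -> int:
        return self.total  # O(1) read, no list walk


walked, counted = WalkedStats(), CounterStats()
for _ in range(55898):  # number of full slabs listed before the ctrl+c
    walked.add_slab(1)   # hypothetical: one object per slab
    counted.add_slab(1)

print(walked.active_objects() == counted.active_objects())  # prints: True
```

Both report the same number, but the reader of the first model holds the lock for a time proportional to the number of slabs, while the second only reads one value.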

We can silence the watchdog and stuff touch_nmi_watchdog into the loops, but that only papers over the real issue. The right thing to do would be to have
something similar to SLUB and collect statistics per kmem_cachel3. I am not
sure whether this is doable considering kABI restrictions and potential
performance regressions. I would have to dive into this more, but unfortunately I am leaving for a long vacation. Let's CC Mel here. Also, I haven't seen this as a problem with our regular kernel because nothing seems to be allocating so many kmalloc objects, so it is questionable how much of a problem this really is for the supported kernel configurations.

Whether using so many objects is healthy is another question, for which I do not have a good answer. SLAB tries to batch operations internally, so it should scale with the number of objects quite well, but I am not familiar enough with all the internals to say that with 100% certainty.

Attachments

Issue Links

is related to

LU-4053 client leaking objects/locks during IO

Resolved

Activity

Kit Westneat (Inactive) added a comment - 16/Oct/13 2:50 PM

Hi Bruno,

Were you able to get the debuginfo?

Thanks,
Kit

Kit Westneat (Inactive) added a comment - 03/Oct/13 2:38 PM

Hi Bruno,

The customer has uploaded the debuginfo rpm.

Thanks,
Kit

Bruno Faccini (Inactive) added a comment - 02/Oct/13 9:07 AM

Kit, I am sorry but the crash tool complains "no debugging data available" against the "localscratch/dump/2013-08-20-14:05/vmlinux-3.0.13-0.27-default" kernel you provided ...
Do you know where to find, and/or can you also upload, the corresponding kernel-debuginfo* RPMs ??
Thanks again and in advance for your help!

Kit Westneat (Inactive) added a comment - 27/Sep/13 7:32 PM

The NREL bug is due to LU-2613. Reading in a file on the filesystem caused it to unblock.

Kit Westneat (Inactive) added a comment - 27/Sep/13 7:15 PM

I've just run into a similar issue at NREL while doing robinhood testing, where the slab unreclaim goes to 90% of memory. I ran 'echo 3 > /proc/sys/vm/drop_caches', but it is hanging and the bash process is at 100% CPU, as is the kswapd process. Lustre is mounted read-only.

Here is slabtop sorted by cache size:
   OBJS  ACTIVE  USE OBJ SIZE   SLABS OBJ/SLAB CACHE SIZE NAME
2080183 2080182  99%    8.00K 2080183        1  16641464K size-8192
6185372 6185242  99%    1.00K 1546343        4   6185372K size-1024
2062676 2062631  99%    2.00K 1031338        2   4125352K size-2048
2064096 2063826  99%    0.50K  258012        8   1032048K size-512
3054480  305107   9%    0.12K  101816       30    407264K size-128
4803485 4754884  98%    0.06K   81415       59    325660K size-64
4096288 2116344  51%    0.03K   36574      112    146296K size-32

You can see that the 8k slab cache is using 16G. The other slabs are also using a lot.
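The totals behind that observation can be checked with a short script. This is a minimal sketch that embeds the seven rows quoted above and assumes the usual slabtop column order (OBJS, ACTIVE, USE, OBJ SIZE, SLABS, OBJ/SLAB, CACHE SIZE, NAME); `parse_row` is a hypothetical helper name.

```python
SLABTOP_ROWS = """\
2080183 2080182 99% 8.00K 2080183 1 16641464K size-8192
6185372 6185242 99% 1.00K 1546343 4 6185372K size-1024
2062676 2062631 99% 2.00K 1031338 2 4125352K size-2048
2064096 2063826 99% 0.50K 258012 8 1032048K size-512
3054480 305107 9% 0.12K 101816 30 407264K size-128
4803485 4754884 98% 0.06K 81415 59 325660K size-64
4096288 2116344 51% 0.03K 36574 112 146296K size-32
"""


def parse_row(line: str) -> dict:
    """Parse one slabtop row, assuming the standard eight-column layout."""
    objs, active, _use, _obj_size, slabs, _per_slab, cache, name = line.split()
    return {
        "name": name,
        "objs": int(objs),
        "active": int(active),
        "slabs": int(slabs),
        "cache_kib": int(cache.rstrip("K")),  # CACHE SIZE is reported in KiB
    }


rows = [parse_row(line) for line in SLABTOP_ROWS.splitlines()]
total_kib = sum(r["cache_kib"] for r in rows)
largest = max(rows, key=lambda r: r["cache_kib"])

print(f"{largest['name']}: {largest['cache_kib'] / 1024**2:.1f} GiB")
print(f"listed caches total: {total_kib / 1024**2:.1f} GiB")
```

Only the caches quoted in the comment are counted, so the total is a lower bound on overall slab usage.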

memused=28486658060
memused_max=28535237092

The kernel is 2.6.32-279.el6.x86_64, client is 2.1.6. Any more information we can get?

Kit Westneat (Inactive) added a comment - 26/Sep/13 10:50 PM

Hi Bruno,

They should be uploaded, let me know if you need anything else.

Thanks.

Bruno Faccini (Inactive) added a comment - 24/Sep/13 9:32 PM

Just sent you upload instructions by email.
Thanks for your help.

Kit Westneat (Inactive) added a comment - 24/Sep/13 5:19 PM

Hi Bruno,

Where should we upload the core to?

Thanks,
Kit

Bruno Faccini (Inactive) added a comment - 24/Sep/13 1:16 PM

Hello Kit,
Even if I easily understand that the /proc/slabinfo walk-through by the "collectl" thread can take ages due to huge slabs, I don't see how it can interfere with the pcc_lock spin-lock used by other threads here.

Would it be possible for you to provide the crash dump (along with the vmlinux/kernel-debuginfo[-common] RPMs and the lustre-[modules,debuginfo] RPMs), or at least attach the output of the log/dmesg, foreach bt, and kmem -s crash sub-commands ??

People

Assignee: Bruno Faccini (Inactive)

Reporter: Kit Westneat (Inactive)

Votes: 0

Watchers: 3

Dates

Created: 23/Sep/13 8:07 PM

Updated: 24/Oct/13 3:17 PM

Resolved: 24/Oct/13 3:17 PM