Details
Bug
Resolution: Not a Bug
Minor
None
Lustre 2.3.0
None
3
10695
Description
Client version: 2.3.0, SLES kernel 3.0.13_0.27_default
Server version: 2.1.2
We are running into an issue at SCBI that appears to be similar to LU-3771. Under certain workloads, the slab memory usage gets to the point where it causes the kernel to hang. Apparently it is not a problem under the SLUB allocator, but Novell is not prepared to support a SLES kernel with the SLUB allocator enabled.
Are there any tunables that can help shrink the slab usage?
HP Summary:
The main factor triggering the problem is reading from /proc/slabinfo. SLAB does this while holding l3->list_lock, and when a slab cache is huge this leads to delays big enough that other subsystems are impacted; if the NMI watchdog is enabled, it leads to soft/hard lockups and panics.
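For context, here is a simplified paraphrase (not the verbatim source) of what the SLAB slabinfo path does on every read; in this kernel generation s_show() in mm/slab.c recomputes the counters by walking all three slab lists of every node under that node's list_lock with interrupts disabled:

/* Simplified sketch of the SLAB /proc/slabinfo walk; the types and fields
 * (kmem_list3, slabs_full/partial/free) are the kernel's, the function
 * name is not. */
static int slabinfo_walk_sketch(struct seq_file *m, struct kmem_cache *cachep)
{
    unsigned long active_objs = 0, active_slabs = 0, free_slabs = 0;
    struct kmem_list3 *l3;
    struct slab *slabp;
    int node;

    for_each_online_node(node) {
        l3 = cachep->nodelists[node];
        if (!l3)
            continue;

        spin_lock_irq(&l3->list_lock);      /* IRQs off for the whole walk */

        list_for_each_entry(slabp, &l3->slabs_full, list) {
            active_objs += cachep->num;     /* every object in use */
            active_slabs++;
        }
        list_for_each_entry(slabp, &l3->slabs_partial, list) {
            active_objs += slabp->inuse;
            active_slabs++;
        }
        list_for_each_entry(slabp, &l3->slabs_free, list)
            free_slabs++;

        spin_unlock_irq(&l3->list_lock);
    }
    /* ... seq_printf() of the totals elided ... */
    return 0;
}

With hundreds of thousands of slabs on a node (see the counts further down), every read of /proc/slabinfo turns into a very long list walk with interrupts disabled, which is exactly the window the NMI watchdog ends up flagging.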
Novell analysis:
Have you actually tried to take the pcc governor out of the way? I can still see many CPUs looping on the same pcc-internal lock:
crash> struct spinlock_t ffffffffa047c610
struct spinlock_t {
{
rlock = {
raw_lock = {
slock = 862335805
}
}
}
}
crash> p /x 862335805
$1 = 0x3366333d
crash> p 0x3366-0x333d
$2 = 41

So there are 40 CPUs waiting for the lock. This sounds really insane! Who is holding the lock?
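For reference, the slock value can be decoded with a few lines of user-space C, assuming the usual x86 ticket-lock layout of this kernel generation (low 16 bits = ticket currently being served, high 16 bits = next ticket to hand out):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t slock = 862335805;          /* raw value from the dump         */
    uint16_t next  = slock >> 16;        /* next ticket to hand out: 0x3366 */
    uint16_t owner = slock & 0xffff;     /* ticket being served:     0x333d */

    /* 41 tickets outstanding: the one being served plus 40 CPUs queued */
    printf("next=0x%x owner=0x%x outstanding=%u\n",
           next, owner, (unsigned)(uint16_t)(next - owner));
    return 0;
}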
PID: 79454 TASK: ffff882fd6a224c0 CPU: 0 COMMAND: "kworker/0:1"
#0 [ffff88407f807eb0] crash_nmi_callback at ffffffff8101eaef
#1 [ffff88407f807ec0] notifier_call_chain at ffffffff81445617
#2 [ffff88407f807ef0] notify_die at ffffffff814456ad
#3 [ffff88407f807f20] default_do_nmi at ffffffff814429d7
#4 [ffff88407f807f40] do_nmi at ffffffff81442c08
#5 [ffff88407f807f50] nmi at ffffffff81442320
[exception RIP: _raw_spin_lock+24]
RIP: ffffffff81441998 RSP: ffff883f02147d28 RFLAGS: 00000293
RAX: 000000000000333d RBX: ffff8b3e85c4e680 RCX: 0000000000000028
RDX: 0000000000003335 RSI: 0000000000249f00 RDI: ffffffffa047c610
RBP: ffff88407f80eb80 R8: 0000000000000020 R9: 0000000000000000
R10: 0000000000000064 R11: ffffffffa047a4e0 R12: 0000000000249f00
R13: 0000000000004fd4 R14: 00000000000000a0 R15: 0000000000000000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#6 [ffff883f02147d28] _raw_spin_lock at ffffffff81441998
#7 [ffff883f02147d28] pcc_cpufreq_target at ffffffffa047a4fe [pcc_cpufreq]
[...]

OK, this one requested ticket 0x333d but it still sees very old spinlock state.
And what is more interesting is that it has just refetched the global state:

0xffffffff81441995 <_raw_spin_lock+21>: movzwl (%rdi),%edx
0xffffffff81441998 <_raw_spin_lock+24>: jmp 0xffffffff8144198f <_raw_spin_lock+15>

The lock is not IRQ safe, so an interrupt might have triggered after the movzwl and before the jmp. OK, let's pretend that this is not a problem, although I wouldn't be happy about a CPU governor that scales this badly on such a machine.
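As a reminder of the mechanism being spun on, here is a small user-space model of an x86-style ticket lock; the kernel's real _raw_spin_lock is hand-written assembly, so this only illustrates where the re-read of the lock word (the movzwl above) sits in the loop:

#include <stdatomic.h>
#include <stdint.h>

struct ticket_lock { _Atomic uint32_t slock; };

void ticket_lock_acquire(struct ticket_lock *lk)
{
    /* take a ticket: atomically bump the high 16 bits ("next") */
    uint32_t old = atomic_fetch_add(&lk->slock, 0x10000);
    uint16_t my_ticket = (uint16_t)(old >> 16);

    /* spin, re-reading the low 16 bits ("owner") each iteration;
     * this re-read is what the movzwl at _raw_spin_lock+21 does */
    while ((uint16_t)atomic_load(&lk->slock) != my_ticket)
        ;   /* the kernel uses cpu_relax()/pause here */
}

void ticket_lock_release(struct ticket_lock *lk)
{
    atomic_fetch_add(&lk->slock, 1);     /* serve the next ticket */
}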
The lockup has been detected:
crash> dmesg | grep -i lockup
[385474.330482] BUG: soft lockup - CPU#0 stuck for 22s! [sort:130201]
[507912.743427] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 44

The first one (soft lockup) was obviously recoverable. The second is more interesting:
PID: 100927 TASK: ffff8b3e857c2580 CPU: 44 COMMAND: "collectl"
#0 [ffff8a3fff907b20] machine_kexec at ffffffff810265ce
#1 [ffff8a3fff907b70] crash_kexec at ffffffff810a3b5a
#2 [ffff8a3fff907c40] panic at ffffffff8143eadf
#3 [ffff8a3fff907cc0] watchdog_overflow_callback at ffffffff810be194
#4 [ffff8a3fff907cd0] __perf_event_overflow at ffffffff810e9aba
#5 [ffff8a3fff907d70] intel_pmu_handle_irq at ffffffff810159d9
#6 [ffff8a3fff907eb0] perf_event_nmi_handler at ffffffff814433b1
#7 [ffff8a3fff907ec0] notifier_call_chain at ffffffff81445617
#8 [ffff8a3fff907ef0] notify_die at ffffffff814456ad
#9 [ffff8a3fff907f20] default_do_nmi at ffffffff814429d7
#10 [ffff8a3fff907f40] do_nmi at ffffffff81442c08
#11 [ffff8a3fff907f50] nmi at ffffffff81442320
[exception RIP: s_show+211]
RIP: ffffffff8113a4c3 RSP: ffff8b3e70d2fde8 RFLAGS: 00000046
RAX: ffff89367c870000 RBX: 0000000000000000 RCX: 0000000000000025
RDX: 0000000000000025 RSI: ffff893fff42e150 RDI: ffff893fff42e180
RBP: ffff893fff42e140 R8: 0000000000000400 R9: ffffffff81be18a0
R10: 0000ffff00066c0a R11: 0000000000000000 R12: 0000000004ec9217
R13: 00000000002270bc R14: 0000000000000000 R15: 0000000000000002
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#12 [ffff8b3e70d2fde8] s_show at ffffffff8113a4c3
#13 [ffff8b3e70d2fe60] seq_read at ffffffff81171991
#14 [ffff8b3e70d2fed0] proc_reg_read at ffffffff811ad847
#15 [ffff8b3e70d2ff10] vfs_read at ffffffff81151687
#16 [ffff8b3e70d2ff40] sys_read at ffffffff811517f3
#17 [ffff8b3e70d2ff80] system_call_fastpath at ffffffff81449692
RIP: 00007d570dc8e750 RSP: 00007fff43f31af0 RFLAGS: 00010206
RAX: 0000000000000000 RBX: ffffffff81449692 RCX: 0000000000004618
RDX: 0000000000001000 RSI: 0000000001bc7ee8 RDI: 0000000000000007
RBP: 000000000078e010 R8: 0000000000000000 R9: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000001bc7ee8
R13: 0000000000000007 R14: 0000000000000000 R15: 000000000000f001
ORIG_RAX: 0000000000000000 CS: 0033 SS: 002b

This is the /proc/slabinfo interface. rbp contains the kmem_list3.
crash> struct kmem_list3 ffff893fff42e140
struct kmem_list3 {
slabs_partial = {
next = 0xffff893fff42e140,
prev = 0xffff893fff42e140
},
slabs_full = {
next = 0xffff892a92857000,
prev = 0xffff893e86cb0000
},
slabs_free = {
next = 0xffff893fff42e160,
prev = 0xffff893fff42e160
},
free_objects = 0,
free_limit = 637,
colour_next = 0,
list_lock = {
{
rlock = {
raw_lock = {
slock = 2008643510
}
}
}
},
shared = 0xffff893fff42f000,
alien = 0xffff893fff41d640,
next_reap = 4422089447,
free_touched = 1
}

There are no free nor partially filled slabs, so we have only full slabs, and quite a lot of them:
crash> list -s slab 0xffff892a92857000 | grep "^ffff" > full_slabs
[wait for a loooooooooooooooooooooooong time until you lose your patience and hit ctrl+c]
$ wc -l full_slabs
55898 full_slabs

So yes, this is indeed dangerous if some subsystem allocates too many objects. Especially when:

ll /proc/slabinfo
-rw-r--r-- 1 root root 0 Sep 19 11:50 /proc/slabinfo

So anybody might read and interfere. Upstream is no better in that respect, as get_slabinfo does the same thing. SLUB would be better, as it uses atomic counters for the same purposes.

We can silence the watchdog and stuff touch_nmi_watchdog into the loops, but that only papers over the real issue. The right thing to do would be to have something similar to SLUB and collect statistics per kmem_cache l3. I am not sure whether this is doable considering kABI restrictions and potential performance regressions. I would have to dive into this more, but unfortunately I am leaving for a long vacation. Let's CC Mel here. Also, I haven't seen this as a problem with our regular kernel, because nothing seems to be allocating so many kmalloc objects, so it is questionable how much of a problem this really is for the supported kernel configurations.

Whether using so many objects is healthy is another question, for which I do not have a good answer. SLAB tries to batch operations internally, so it should scale quite well with the number of objects, but I am not familiar enough with all the internals to say that with 100% certainty.
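A very rough sketch of the per-node-counter direction suggested above, loosely modelled on what SLUB keeps in its per-node structure; the names are illustrative, not the kernel's, and keeping such counters exact on every slab creation/destruction is where the kABI and performance concerns would come in:

/* Illustrative only: per-node counters maintained as slabs come and go,
 * so a slabinfo read becomes a few atomic loads per node instead of a
 * full list walk under l3->list_lock with IRQs disabled. */
struct node_slab_stats {
    atomic_long_t nr_slabs;              /* slabs currently on this node  */
    atomic_long_t nr_objs;               /* objects backed by those slabs */
};

static inline void stats_slab_added(struct node_slab_stats *s,
                                    unsigned int objs_per_slab)
{
    atomic_long_inc(&s->nr_slabs);
    atomic_long_add(objs_per_slab, &s->nr_objs);
}

static inline void stats_slab_removed(struct node_slab_stats *s,
                                      unsigned int objs_per_slab)
{
    atomic_long_dec(&s->nr_slabs);
    atomic_long_sub(objs_per_slab, &s->nr_objs);
}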
Attachments
Issue Links
- is related to: LU-4053 client leaking objects/locks during IO (Resolved)
And no surprise: the kmem_cache being walked through as part of the "/proc/slabinfo" access at the time of the problem is cl_page_kmem, with its huge number of slabs/objects.
There are 80 cores divided into 8 NUMA nodes, and it is Node #4's kmem_list3 that is being processed. It contains 782369 slabs_full and 190921 slabs_partial (and no slabs_free), to be parsed in that order by the s_show() routine with IRQs disabled (i.e. with no HPET timer updates in between). The slab being processed at the time of the crash is one of the partial ones (the 173600th out of 190921), so it seems the watchdog simply did not allow the parsing of Node 4's cl_page_kmem consumption to complete!
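To put a number on that walk (counting list entries only, since per-entry timing is unknown): 782369 slabs_full + 190921 slabs_partial = 973290 list entries to traverse with IRQs disabled for that single node, and the NMI fired 173600 entries into the partial list, i.e. after roughly 782369 + 173600 = 955969 of them.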
So, according to the slabs concerned and their current usage, this does not look like the LU-2613 or LU-4053 scenarios, but seems to be only the consequence of a huge-memory Lustre client's page-cache memory footprint. And yes, SLUB is definitely a future option once supported by the distro providers.
And sure, disabling the HPET/NMI watchdog could be an "ugly" workaround, but another possible one could be to regularly drain the Lustre page cache (using "lctl set_param ldlm_namespaces.*.lru_size=clear" and/or "echo 3 > /proc/sys/vm/drop_caches") and/or to reduce the Lustre page-cache size (max_cached_mb) in order to reduce the number of *_page_kmem objects kept in slabs. Last, simply avoiding /proc/slabinfo usage is also an option!
What else can be done about this???