[LU-3997] Excessive slab usage causes large mem & core count clients to hang Created: 23/Sep/13 Updated: 24/Oct/13 Resolved: 24/Oct/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.3.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Kit Westneat (Inactive) | Assignee: | Bruno Faccini (Inactive) |
| Resolution: | Not a Bug | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 10695 |
| Description |
|
Client version: 2.3.0, SLES kernel 3.0.13_0.27_default

We are running into an issue at SCBI that appears to be similar to

Are there any tunables that can help shrink the slab usage?

HP Summary:
Novell analysis:
|
| Comments |
| Comment by Bruno Faccini (Inactive) [ 24/Sep/13 ] |
|
Hello Kit, would it be possible for you to provide the crash-dump (along with the vmlinux/kernel-debuginfo[-common] RPMs and the lustre-[modules,debuginfo] RPMs), or at least attach the output of the log/dmesg, foreach bt, and kmem -s crash sub-commands?
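For reference, a minimal sketch of how those sub-commands could be captured, assuming the dump file is named vmcore and the matching debug vmlinux is installed (the file names and paths here are illustrative, not the actual ones from your node):

# Open the dump against the matching debug kernel image (paths are assumptions):
crash /usr/lib/debug/boot/vmlinux-3.0.13-0.27-default.debug vmcore

# Inside crash, redirect each requested sub-command to a file:
crash> log > dmesg.txt
crash> foreach bt > foreach_bt.txt
crash> kmem -s > kmem_s.txt

Each output file can then be attached here alongside the dump. |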
| Comment by Kit Westneat (Inactive) [ 24/Sep/13 ] |
|
Hi Bruno, Where should we upload the core to? Thanks, |
| Comment by Bruno Faccini (Inactive) [ 24/Sep/13 ] |
|
Just sent you upload instructions by email. |
| Comment by Kit Westneat (Inactive) [ 26/Sep/13 ] |
|
Hi Bruno, They should be uploaded, let me know if you need anything else. Thanks. |
| Comment by Kit Westneat (Inactive) [ 27/Sep/13 ] |
|
I've just run into a similar issue at NREL while doing robinhood testing, where unreclaimable slab usage grows to 90% of memory. I ran 'echo 3 > /proc/sys/vm/drop_caches', but it is hanging and the bash process is at 100% CPU, as is the kswapd process. Lustre is mounted read-only. Here is slabtop sorted by cache size:

You can see that the 8k slab cache is using 16G. The other slabs are also using a lot.

memused=28486658060

The kernel is 2.6.32-279.el6.x86_64, the client is 2.1.6. Is there any more information we can get?
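In case it helps narrow this down, a quick way to capture the same picture on an affected node (standard commands; the Lustre parameter name assumes a 2.x client):

# One-shot slab snapshot sorted by cache size:
slabtop -o -s c | head -30
# Reclaimable vs. unreclaimable slab split:
grep -i slab /proc/meminfo
# Current Lustre client page-cache limit:
lctl get_param llite.*.max_cached_mb

Output from these should be easy to attach here if useful. |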
| Comment by Kit Westneat (Inactive) [ 27/Sep/13 ] |
|
The NREL bug is due to |
| Comment by Bruno Faccini (Inactive) [ 02/Oct/13 ] |
|
Kit, I am sorry but the crash tool complains "no debugging data available" against the "localscratch/dump/2013-08-20-14:05/vmlinux-3.0.13-0.27-default" kernel you provided ... |
| Comment by Kit Westneat (Inactive) [ 03/Oct/13 ] |
|
Hi Bruno, The customer has uploaded the debuginfo rpm. Thanks, |
| Comment by Kit Westneat (Inactive) [ 16/Oct/13 ] |
|
Hi Bruno, Were you able to get the debuginfo? Thanks, |
| Comment by Bruno Faccini (Inactive) [ 21/Oct/13 ] |
|
Yes, I got it, and I am working on the crash-dump now that the crash tool is happy. |
| Comment by Bruno Faccini (Inactive) [ 22/Oct/13 ] |
|
Hmm, the crash output you already attached is not from the uploaded crash-dump, but fortunately it shows the same situation! Concerning the pcc_lock contention, there are about 28 threads (out of the node's 80 cores) spinning on it, while it is likely to be owned by this thread:

PID: 5758 TASK: ffff883369b12480 CPU: 0 COMMAND: "kworker/0:1"
#0 [ffff88407f807eb0] crash_nmi_callback at ffffffff8101eaef
#1 [ffff88407f807ec0] notifier_call_chain at ffffffff81445617
#2 [ffff88407f807ef0] notify_die at ffffffff814456ad
#3 [ffff88407f807f20] default_do_nmi at ffffffff814429d7
#4 [ffff88407f807f40] do_nmi at ffffffff81442c08
#5 [ffff88407f807f50] nmi at ffffffff81442320
[exception RIP: _raw_spin_lock+21]
RIP: ffffffff81441995 RSP: ffff8834d4d7dd28 RFLAGS: 00000283
RAX: 000000000000c430 RBX: ffff8b3e8532d7c0 RCX: 0000000000000028
RDX: 000000000000c42e RSI: 0000000000249f00 RDI: ffffffffa02b7610
RBP: ffff88407f80eb80 R8: 0000000000000020 R9: 0000000000000000
R10: 0000000000000064 R11: ffffffffa02b54e0 R12: 0000000000249f00
R13: 0000000000005ef0 R14: 00000000000000a0 R15: 0000000000000000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#6 [ffff8834d4d7dd28] _raw_spin_lock at ffffffff81441995
#7 [ffff8834d4d7dd28] pcc_cpufreq_target at ffffffffa02b54fe [pcc_cpufreq]
#8 [ffff8834d4d7dd78] dbs_check_cpu at ffffffff8135feb3
#9 [ffff8834d4d7ddf8] do_dbs_timer at ffffffff813601c8
#10 [ffff8834d4d7de28] process_one_work at ffffffff810747bc
#11 [ffff8834d4d7de78] worker_thread at ffffffff8107734a
#12 [ffff8834d4d7dee8] kthread at ffffffff8107b676
#13 [ffff8834d4d7df48] kernel_thread_helper at ffffffff8144a7c4
Again, it is running on CPU/core 0 and also shows a stale spin-lock/ticket value, which should only be a side-effect of the NMI handling. Now I will investigate the soft-lockup+NMI/watchdog issue caused by collectl computing the slab consumption.
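As a side note, a quick way to confirm the cpufreq driver/governor combination behind this call chain and to count the spinners in the dump (a sketch; the expected sysfs values are assumptions):

# On the live node, check which driver/governor drives do_dbs_timer:
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# In crash, count the tasks sitting in pcc_cpufreq_target:
crash> foreach bt | grep -c pcc_cpufreq_target

Switching the governor away from ondemand (or blacklisting pcc_cpufreq) might avoid this particular contention, but that is separate from the slabinfo walk discussed below. |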
| Comment by Bruno Faccini (Inactive) [ 22/Oct/13 ] |
|
Top slab consumers in the crash-dump provided are:

CACHE            NAME                 OBJSIZE  ALLOCATED      TOTAL    SLABS  SSIZE
ffff883ef7a61080 cl_page_kmem             192  110784616  117426440  5871322     4k
ffff883ef5f011c0 osc_page_kmem            216   55392308   59188842  3288269     4k
ffff883ef5ee17c0 vvp_page_kmem             80   55392308   59744976  1244687     4k
ffff883ef5e913c0 lov_page_kmem             48   55392308   59936492   778396     4k
ffff883ef5a01540 lovsub_page_kmem          40   55392308   60008840   652270     4k
ffff88407f690980 radix_tree_node          560    2812010    3063060   437580     4k
ffff883ef59d16c0 lustre_inode_cache      1152    1077902    1078294   154042     8k

The others are of a much lower order of magnitude.
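To put those counts in perspective, the page-related caches alone account for roughly 45 GiB of slab memory (figures taken straight from the table above):

# cl_page_kmem: 5871322 slabs x 4 KiB each -> ~22 GiB
echo $(( 5871322 * 4 / 1024 / 1024 ))
# the four other *_page_kmem caches combined -> another ~22 GiB
echo $(( (3288269 + 1244687 + 778396 + 652270) * 4 / 1024 / 1024 ))

So on a large-memory client, most of the unreclaimable slab growth is attributable to these per-page Lustre objects. |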
| Comment by Bruno Faccini (Inactive) [ 23/Oct/13 ] |
|
It is no surprise that the kmem_cache being walked through at the time of the problem, as part of the "/proc/slabinfo" access, is cl_page_kmem with its huge number of slabs/objects. There are 80 cores divided into 8 NUMA nodes, and it is the Node #4 kmem_list3 that is being processed. It is made of 782369 slabs_full, 190921 slabs_partial, and no slabs_free, parsed in that order by the s_show() routine with IRQs disabled (i.e., no HPET timer updates in between). The slab being processed at the time of the crash is one of the partial ones (the 173600th out of 190921), so it seems the watchdog simply did not allow the parsing of the Node-4 cl_page_kmem consumption to complete.

So, according to the slabs concerned and their current usage, this does not look like

And yes, SLUB is definitely a future option once supported by the distro providers. Disabling the HPET/NMI-watchdog could be an "ugly" work-around, but another possible one could be to regularly drain the Lustre page-cache (using "lctl set_param ldlm.namespaces.*.lru_size=clear" and/or "echo 3 > /proc/sys/vm/drop_caches") and/or to reduce the Lustre page-cache size (max_cached_mb), in order to reduce the number of *_page_kmem objects kept in slabs. Lastly, simply avoiding /proc/slabinfo usage (which is how collectl computes slab consumption) is also an option. What else can be done about this?
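A minimal sketch of the drain work-around described above, assuming it is run periodically (for example from cron) on the client; the max_cached_mb value is an arbitrary example, not a recommendation:

#!/bin/bash
# Shrink the Lustre client caches so the *_page_kmem slabs never grow
# large enough to stall a /proc/slabinfo walk run with IRQs disabled.
lctl set_param ldlm.namespaces.*.lru_size=clear   # drop LDLM locks and the pages they pin
echo 3 > /proc/sys/vm/drop_caches                 # drop clean page-cache, dentries and inodes
lctl set_param llite.*.max_cached_mb=2048         # optional: cap the Lustre page-cache (example value)

Whether the drain interval is acceptable depends on the workload, since re-reading dropped data costs RPCs to the servers. |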
| Comment by Kit Westneat (Inactive) [ 24/Oct/13 ] |
|
Hi Bruno, The customer has decided to disable collectl on the client and this seems to have cleared up the issue. Thank you for your investigation into the issue. I think we can close the ticket. Thanks, |
| Comment by Bruno Faccini (Inactive) [ 24/Oct/13 ] |
|
Thanks for the update, Kit. Do you agree to closing it with the "Not a Bug" resolution? |
| Comment by Kit Westneat (Inactive) [ 24/Oct/13 ] |
|
Sure |