[LU-4264] Excessive slab usage on 1.8.9 server Created: 18/Nov/13 Updated: 01/Sep/17 Resolved: 01/Sep/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 1.8.9 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Oz Rentas | Assignee: | Oleg Drokin |
| Resolution: | Fixed | Votes: | 1 |
| Labels: | None | ||
| Attachments: |
|
| Severity: | 1 |
| Rank (Obsolete): | 11714 |
| Description |
|
NOAA has been having a problem with OOM events on their OSSes causing failovers. Looking at collectl output from right before the crash, it appears that all of the memory is being consumed by the size-256 slab. Is there a way to determine what those objects are and to reduce the amount of memory they are taking? The vmcore is available if necessary. |
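A quick way to see which slab cache is consuming the memory on a live OSS is to sort /proc/slabinfo by active objects (a minimal sketch; slabtop from procps gives the same view interactively):

```
# List the busiest slab caches, largest first.
# /proc/slabinfo columns: name, active_objs, num_objs, objsize, ...
awk 'NR > 2 { printf "%-24s %10d objs %6d bytes/obj\n", $1, $2, $4 }' /proc/slabinfo \
    | sort -k2 -rn | head

# And the size-256 cache on its own:
grep '^size-256 ' /proc/slabinfo
```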
| Comments |
| Comment by Oleg Drokin [ 18/Nov/13 ] |
|
You can do "echo +malloc >/proc/sys/lnet/debug" (no quotes) on the affected servers, and then as the number keeps growing, you can do lctl dk >/tmp/somewhere You can also enable in-kernel memory leak tracer if you suspect you have a genuine memory leak (it's a kernel config option in debug options). |
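A consolidated sketch of that workflow (the dump path and the 256 MB buffer size are just examples):

```
# Turn on allocation tracing in the Lustre debug log on the affected server.
echo +malloc > /proc/sys/lnet/debug

# Optionally enlarge the debug buffer so allocation records are not dropped.
echo 256 > /proc/sys/lnet/debug_mb

# ...wait while the size-256 slab keeps growing...

# Dump the accumulated debug log for analysis.
lctl dk > /tmp/dk1
```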
| Comment by Kit Westneat (Inactive) [ 18/Nov/13 ] |
|
The log from the crash and the collectl output from the time of the OOM. |
| Comment by Kit Westneat (Inactive) [ 18/Nov/13 ] |
|
I found an OSS whose slab usage appears to be growing much larger than on the other OSSes; oss-2-18 is typical.
So there is a couple of orders of magnitude difference, and it keeps increasing. Slab size-256 allocations, sampled within a minute or two of each other:

[root@lfs-oss-2-15 ~]# cat /proc/slabinfo | grep e-256
[root@lfs-oss-2-15 ~]# cat /proc/slabinfo | grep e-256

I enabled +malloc debugging and grepped the log for slab-alloc. I also checked the number of locks to see if it was very large, but it is comparable to the other OSSes. |
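A small sketch for sampling the size-256 cache at a fixed interval so the growth rate is visible (interval and sample count are arbitrary):

```
# Print a timestamped size-256 line every 60 seconds, ten times.
for i in $(seq 1 10); do
    printf '%s  ' "$(date '+%H:%M:%S')"
    grep '^size-256 ' /proc/slabinfo
    sleep 60
done
```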
| Comment by Kit Westneat (Inactive) [ 18/Nov/13 ] |
|
Any ideas? The OSSes keep crashing. |
| Comment by Kit Westneat (Inactive) [ 18/Nov/13 ] |
|
Attached the malloc dk log |
| Comment by Kit Westneat (Inactive) [ 18/Nov/13 ] |
|
It appears to be a memory leak of some kind - the servers having issues had uptimes of 155 days, while the other OSSes were more recently rebooted. I wasn't able to determine where the leak was. I looked at some of the objects in the slab of a vmcore we got, and some were definitely Lustre related, but I couldn't narrow it down to anything. I also tried unloading modules and stopping processes to try to get the memory back, but it didn't budge. Any tips for debugging this kind of problem? |
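For reference, this is roughly how the size-256 objects can be inspected from the vmcore with the crash utility (a sketch; the vmlinux path assumes the kernel-debuginfo package is installed, and <addr> is a placeholder for an object address taken from the slab listing):

```
crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux vmcore

crash> kmem -s size-256     # usage summary for the size-256 cache
crash> kmem -S size-256     # list the slabs and their object addresses (very long)
crash> rd <addr> 32         # dump 32 machine words of one object to look for recognizable contents
crash> kmem <addr>          # confirm which cache/page a given address belongs to
```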
| Comment by Jinshan Xiong (Inactive) [ 18/Nov/13 ] |
|
I didn't find leak information in the dk1 log file you posted. How long does each server survive after a restart? The log file collected Lustre's memory allocations for 80 seconds, so if the leaking allocation were from Lustre, it should have been hit thousands of times. The size-256 slab is created by the Linux kernel to serve generic kmalloc, and Lustre creates its own slab cache for DLM locks, so it is not a leak of DLM locks for sure.

I used the following command to filter the log down to allocations in the size-256 range and then ran leak_finder on the result to look for leaking blocks (you can find leak_finder in the lustre-tests rpm):

egrep 'kmalloced|kfreed' dk1 | awk '{if ($4 > 128 && $4 <= 256) print}' > temp_log

The output was:

...
freed 144 bytes at ffff810a83297980 called desc (client.c:ptlrpc_free_bulk:188)
freed 144 bytes at ffff8107f9ca16c0 called desc (client.c:ptlrpc_free_bulk:188)
freed 144 bytes at ffff810825d80980 called desc (client.c:ptlrpc_free_bulk:188)
freed 144 bytes at ffff81090a21d1c0 called desc (client.c:ptlrpc_free_bulk:188)
malloced 176 bytes at ffff810accc66880 called blwi (ldlm_lockd.c:ldlm_bl_to_thread:1672)
*** Leak: 176 bytes allocated at ffff810accc66880 (ldlm_lockd.c:ldlm_bl_to_thread:1672, debug file line 15320)
maximum used: 4912, amount leaked: 176

The last entry is not a true memory leak; the corresponding free simply had not been logged yet when the collection stopped.

Is this node different from the others in any way (kernel, drivers, or recent updates)? This is probably not a Lustre problem. Anyway, let's drill down a little bit and try ftrace to see if we can find it. Please follow the instructions here: http://elinux.org/Kernel_dynamic_memory_analysis; read the Debugfs section and then go to the Dynamic section.

You can trace kmalloc and kfree events as follows:

echo "kmem:kmalloc_node kmem:kfree kmem:kmalloc" > /sys/kernel/debug/tracing/set_event

Then enable the trace with:

echo "1" > /sys/kernel/debug/tracing/tracing_on

After a while, once you believe you have seen the memory reduction, stop the trace and dump the memory allocation information with:

echo "0" > /sys/kernel/debug/tracing/tracing_on

Also dump the symbol table of the running kernel with:

cat /proc/kallsyms > kallsyms.txt

Then we can do further analysis. |
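Putting those steps together (a sketch; it assumes debugfs is mounted at /sys/kernel/debug, the running kernel has the kmem tracepoints, and the output paths are arbitrary):

```
# Mount debugfs if it is not already mounted.
mount -t debugfs none /sys/kernel/debug 2>/dev/null

# Select the allocation/free tracepoints and start tracing.
echo "kmem:kmalloc_node kmem:kfree kmem:kmalloc" > /sys/kernel/debug/tracing/set_event
echo 1 > /sys/kernel/debug/tracing/tracing_on

# ...wait until the slab usage has visibly grown...

# Stop tracing, then save the collected events and the symbol table for analysis.
echo 0 > /sys/kernel/debug/tracing/tracing_on
cat /sys/kernel/debug/tracing/trace > /tmp/kmem-trace.txt
cat /proc/kallsyms > /tmp/kallsyms.txt
```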
| Comment by Kit Westneat (Inactive) [ 18/Nov/13 ] |
|
Unfortunately, it looks like RHEL5 doesn't have trace support compiled in, so we would have to reboot. The memory leak seems to be occurring on all the servers; we still have one system that has been up for 155 days, and it has very high memory usage in the size-256 slab. It is the backup MDS, so we have not rebooted it yet, in case there is information we can still get from it. It seems that the servers go about 150 days before starting to have problems. I thought that these servers had already been rebooted earlier today, but that was not the case. The file system is mostly stable now, so the severity can be reduced. NOAA is very anxious about the memory leaks, however, so we still need to figure out where they are coming from. |
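A quick way to confirm whether a kernel has the tracing support compiled in before committing to a reboot (a sketch; config option names differ between kernel versions):

```
# Look for ftrace/tracepoint support in the running kernel's config.
grep -E 'CONFIG_FTRACE|CONFIG_TRACEPOINTS' /boot/config-$(uname -r)

# If the kmem event directory exists, the tracepoints are usable.
ls /sys/kernel/debug/tracing/events/kmem 2>/dev/null || echo "kmem tracepoints not available"
```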
| Comment by Oleg Drokin [ 19/Nov/13 ] |
|
Well, given that you cannot identify a culprit in the Lustre log (or is it a really, really slow leak?), that it takes 150 days to manifest, and assuming you don't want to recompile your kernel in a way that would allow tracking leaks, I guess the only realistic options left to you are these. Either schedule some quiet time and reboot the remaining OSSes with 150+ days of uptime, and keep rebooting them every 100 days or so until you can gather some extra debug or upgrade those nodes to a newer version (so that reboots are controlled rather than random); or just do nothing, and the servers will fail when they fail and recover all by themselves. |
| Comment by Kit Westneat (Inactive) [ 19/Nov/13 ] |
|
Ok, that makes sense. Is the vmcore useful at all in identifying what is occupying the memory? |
| Comment by Jinshan Xiong (Inactive) [ 19/Nov/13 ] |
|
No, the vmcore won't help in this case. |
| Comment by Kit Westneat (Inactive) [ 10/Jan/14 ] |
|
I think I found the memory leak; it looks like a Mellanox patch was incorrectly backported. I created a Red Hat ticket here: I'll attach the original 2.6.18-308.11.1 version and the broken 2.6.18-348.1.1 version. Can you tell me if my analysis makes sense? |
| Comment by Kit Westneat (Inactive) [ 10/Jan/14 ] |
|
Here's a link to the kernel.org git version: ib_link_query_port is the function with the memory leak. |
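For anyone wanting to reproduce the comparison, a rough sketch (the unpacked source-tree paths are illustrative; in mainline, ib_link_query_port lives in drivers/infiniband/hw/mlx4/main.c, though the layout in the RHEL backport may differ):

```
# Compare the mlx4 IB driver between the working and broken kernel source trees.
diff -u \
    linux-2.6.18-308.11.1/drivers/infiniband/hw/mlx4/main.c \
    linux-2.6.18-348.1.1/drivers/infiniband/hw/mlx4/main.c \
    | less
```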
| Comment by Jean-Philippe Dionne [ 17/Jan/14 ] |
|
I have a similar problem here: the same kernel, and the same slow slab usage increase over time. I do not have access to the Bugzilla link. By looking at the main.c differences, I can't pinpoint the leak. Can you provide a patch or a link to the commit that introduced the problem? |
| Comment by Kit Westneat (Inactive) [ 17/Jan/14 ] |
|
Oh weird, I wonder why it's private. I'll attach the patch and try to get it into Gerrit. |
| Comment by Kit Westneat (Inactive) [ 21/Jan/14 ] |
| Comment by Oz Rentas [ 01/Sep/17 ] |
|
This is resolved. Please close. |
| Comment by Peter Jones [ 01/Sep/17 ] |
|
ok thanks |