[LU-12594] client wedged trying to free memory Created: 26/Jul/19 Updated: 06/Jan/21 Resolved: 06/Jan/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.6 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Mahmoud Hanafi | Assignee: | Andreas Dilger |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
| Severity: | 2 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
We are seeing many clients get wedged while trying to free memory.
PID: 3470 TASK: ffff8817cfa48cc0 CPU: 0 COMMAND: "agetty"
#0 [ffff8817dec03a80] machine_kexec at ffffffff81059c5c
#1 [ffff8817dec03ad0] __crash_kexec at ffffffff81119dea
#2 [ffff8817dec03b90] crash_kexec at ffffffff81119ebc
#3 [ffff8817dec03ba0] kdb_kdump_check at ffffffff81142566
#4 [ffff8817dec03ba8] kdb_main_loop at ffffffff81142792
#5 [ffff8817dec03be0] kdb_stub at ffffffff811455b8
#6 [ffff8817dec03c18] kgdb_cpu_enter at ffffffff8113b5fa
#7 [ffff8817dec03ce8] __kgdb_notify at ffffffff8105f4cc
#8 [ffff8817dec03d00] kgdb_ll_trap at ffffffff8105f598
#9 [ffff8817dec03d28] do_int3 at ffffffff81017d0e
#10 [ffff8817dec03d40] int3 at ffffffff816210a8
#11 [ffff8817dec03dc8] kgdb_breakpoint at ffffffff8113ad70
#12 [ffff8817dec03df0] __handle_sysrq at ffffffff81423db2
#13 [ffff8817dec03e18] serial8250_rx_chars at ffffffff8143c2bc
#14 [ffff8817dec03e48] serial8250_handle_irq at ffffffff8143d295
#15 [ffff8817dec03e78] serial8250_default_handle_irq at ffffffff8143d364
#16 [ffff8817dec03e90] serial8250_interrupt at ffffffff814382fd
#17 [ffff8817dec03ed0] __handle_irq_event_percpu at ffffffff810dc6bc
#18 [ffff8817dec03f18] handle_irq_event_percpu at ffffffff810dc840
#19 [ffff8817dec03f38] handle_irq_event at ffffffff810dc8a6
#20 [ffff8817dec03f58] handle_edge_irq at ffffffff810dfb5e
#21 [ffff8817dec03f78] handle_irq at ffffffff81019a9c
#22 [ffff8817dec03f80] do_IRQ at ffffffff81621f78
--- <IRQ stack> ---
#23 [ffff8817ae7378b0] ret_from_intr at ffffffff8161e9c7
[exception RIP: unknown or invalid address]
RIP: 0000000000040000 RSP: ffffffffa12fe0c0 RFLAGS: 00f00001
RAX: ffff8817ae737b50 RBX: ffff8817de0435c0 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8817dec1afc0
RBP: ffffffff82367440 R8: 0000000000000080 R9: 0000000000000000
R10: ffff88183ffd5d80 R11: 0000000000000000 R12: ffffffff811a2758
R13: cccccccccccccccd R14: ffff8817ae737950 R15: ffff8817ae737948
ORIG_RAX: 0000000000000000 CS: ffff8817df5dafc0 SS: ffffffffffffffcc
bt: WARNING: possibly bogus exception frame
#24 [ffff8817ae737958] native_queued_spin_lock_slowpath at ffffffff810cd10f
#25 [ffff8817ae737988] queued_spin_lock_slowpath at ffffffff811946d2
#26 [ffff8817ae737990] osc_cache_shrink_count at ffffffffa12c6b61 [osc]
#27 [ffff8817ae737998] shrink_slab at ffffffff811a7b12
#28 [ffff8817ae737a68] shrink_zone at ffffffff811ac2be
#29 [ffff8817ae737ad0] do_try_to_free_pages at ffffffff811ac63d
#30 [ffff8817ae737b48] try_to_free_pages at ffffffff811ac9ea
#31 [ffff8817ae737bb8] __alloc_pages_nodemask at ffffffff8119e9cb
#32 [ffff8817ae737ca8] alloc_pages_current at ffffffff811e72cf
#33 [ffff8817ae737ce0] filemap_fault at ffffffff81198bf0
#34 [ffff8817ae737d40] __do_fault at ffffffff811c1ad7
#35 [ffff8817ae737da0] handle_pte_fault at ffffffff811c535e
#36 [ffff8817ae737e78] handle_mm_fault at ffffffff811c79ca
#37 [ffff8817ae737ec0] __do_page_fault at ffffffff81068df7
#38 [ffff8817ae737f28] do_page_fault at ffffffff810690cb
#39 [ffff8817ae737f50] page_fault at ffffffff81621342
RIP: 00007ffff7b1b953 RSP: 00007fffffffb1f8 RFLAGS: 00010246
RAX: 0000000000000001 RBX: 00007fffffffb210 RCX: 00007ffff7b1b953
RDX: 0000000000000000 RSI: 00007fffffffb260 RDI: 0000000000000005
RBP: 00007fffffffb360 R8: 0000000000000000 R9: 0000000000610680
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000004 R14: 00007fffffffb260 R15: 0000000000000d8e
ORIG_RAX: ffffffffffffffff CS: 0033 SS: 002b
|
| Comments |
| Comment by Peter Jones [ 27/Jul/19 ] |
|
Is this related to the |
| Comment by Andreas Dilger [ 29/Jul/19 ] |
|
Are the clients actually running out of memory? You can use "sysrq-M" to dump the memory state from the serial console, or "echo M > /proc/sysrq-trigger" if you are still able to run commands. Usually when the node runs out of memory it will automatically dump a bunch of additional information to the console about the memory state (equivalent to sysrq-M). The alternative is that the threads are stuck in memory reclaim, but the node is not actually out of memory. At that point it is necessary to determine what is consuming the memory. Often "/proc/slabinfo" (or slabtop) will be very useful. |
| Comment by Mahmoud Hanafi [ 29/Jul/19 ] |
|
The clients are running out of (or low on) free memory due to the user's code. Here is the memory info:
sysrq: SysRq : Show Memory
Mem-Info:
active_anon:48134602 inactive_anon:6649 isolated_anon:0
 active_file:858 inactive_file:849 isolated_file:0
 unevictable:12 dirty:0 writeback:0 unstable:0
 slab_reclaimable:25927 slab_unreclaimable:148970
 mapped:237298 shmem:47686 pagetables:96920 bounce:
Node 0 DMA free:15884kB min:20kB low:32kB high:44kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15996kB managed:15884kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB
lowmem_reserve[]: 0 1077 94771 94771 94771
Node 0 DMA32 free:376264kB min:1492kB low:2616kB high:3740kB active_anon:742848kB inactive_anon:12kB active_file:4kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:1732656kB managed:1127488kB mlocked:0kB dirty:0kB writeback:0kB m
lowmem_reserve[]: 0 0 93693 93693 93693
Node 0 Normal free:123664kB min:127272kB low:223212kB high:319152kB active_anon:94329320kB inactive_anon:16280kB active_file:2952kB inactive_file:3020kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:97517568kB managed:95942452kB mlocked:0kB dirty
lowmem_reserve[]: 0 0 0 0 0
Node 1 Normal free:130872kB min:131416kB low:230480kB high:329544kB active_anon:97466240kB inactive_anon:10304kB active_file:476kB inactive_file:424kB unevictable:48kB isolated(anon):0kB isolated(file):0kB present:100663296kB managed:99065216kB mlocked:48kB dirt
lowmem_reserve[]: 0 0 0 0 0
Node 0 DMA: 1*4kB (U) 1*8kB (U) 0*16kB 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15884kB
Node 0 DMA32: 245*4kB (UE) 169*8kB (UME) 113*16kB (UME) 120*32kB (UME) 78*64kB (UME) 27*128kB (UE) 8*256kB (UE) 5*512kB (UE) 1*1024kB (U) 3*2048kB (UME) 85*4096kB (UM) = 376364kB
Node 0 Normal: 6224*4kB (UME) 358*8kB (UME) 174*16kB (UME) 14*32kB (ME) 4*64kB (ME) 1*128kB (U) 2*256kB (ME) 1*512kB (U) 2*1024kB (UE) 0*2048kB 22*4096kB (M) = 124560kB
Node 1 Normal: 5113*4kB (UME) 594*8kB (UME) 242*16kB (UME) 34*32kB (UME) 10*64kB (UE) 2*128kB (UM) 1*256kB (M) 3*512kB (UME) 2*1024kB (ME) 1*2048kB (M) 23*4096kB (M) = 131156kB
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
47929 total pagecache pages
0 pages in swap cache
Swap cache stats: add 0, delete 0, find 0/0
Free swap = 0kB
Total swap = 0kB
49982379 pages RAM
0 pages HighMem/MovableOnly
944619 pages reserved
0 pages hwpoisoned |
| Comment by Andreas Dilger [ 29/Jul/19 ] |
|
It looks like there is no swap on these nodes, which is typical for compute nodes, so when memory is very low the only other choice for the kernel is to kill user processes using the Out-Of-Memory (OOM) killer. That would normally happen rather than blocking kernel allocations, so I guess that you have disabled the OOM-killer? It looks from the meminfo that all of the node memory (almost 190GB) is consumed by the application, since there is very little memory allocated to the file cache and very little in the slab cache; basically they have been whittled down to the very minimum and still there is nothing left to allocate. |
| Comment by Mahmoud Hanafi [ 29/Jul/19 ] |
|
No, the OOM-killer is not disabled. Normally the node will OOM/crash and reboot, but with this new behaviour it just hangs, with no OOM or crash, and requires a manual reset. |
| Comment by Andreas Dilger [ 02/Aug/19 ] |
|
Looking at the backtraces, it seems that critical system processes systemd, kthreadd, kswapd and many lesser processes are all stuck in ldlm_namespace_put() trying to get the ns_lock, which is a bit odd because this is a spinlock that shouldn't be held for very long, and besides the many threads stuck in this function there is no thread I can find that is holding that lock.
PID: 1 TASK: ffff8801a9694040 CPU: 4 COMMAND: "systemd"
#0 [ffff8801a969b928] schedule at ffffffff816197cd
#1 [ffff8801a969b940] ldlm_namespace_put at ffffffffa0e9a8ca [ptlrpc]
#2 [ffff8801a969b958] ldlm_pools_count at ffffffffa0eb3a20 [ptlrpc]
#3 [ffff8801a969b998] shrink_slab at ffffffff811a7b12
#4 [ffff8801a969ba68] shrink_zone at ffffffff811ac2be
#5 [ffff8801a969bad0] do_try_to_free_pages at ffffffff811ac63d
#6 [ffff8801a969bb48] try_to_free_pages at ffffffff811ac9ea
#7 [ffff8801a969bbb8] __alloc_pages_nodemask at ffffffff8119e9cb
#8 [ffff8801a969bca8] alloc_pages_current at ffffffff811e72cf
#9 [ffff8801a969bce0] filemap_fault at ffffffff81198bf0
#10 [ffff8801a969bd40] __do_fault at ffffffff811c1ad7
#11 [ffff8801a969bda0] handle_pte_fault at ffffffff811c535e
#12 [ffff8801a969be78] handle_mm_fault at ffffffff811c79ca
#13 [ffff8801a969bec0] __do_page_fault at ffffffff81068df7
#14 [ffff8801a969bf28] do_page_fault at ffffffff810690cb
#15 [ffff8801a969bf50] page_fault at ffffffff81621342
PID: 2 TASK: ffff8801a9700080 CPU: 76 COMMAND: "kthreadd"
#0 [ffff8801a97079c8] schedule at ffffffff816197cd
#1 [ffff8801a97079e0] ldlm_namespace_put at ffffffffa0e9a8ca [ptlrpc]
#2 [ffff8801a97079f8] ldlm_pools_count at ffffffffa0eb3a20 [ptlrpc]
#3 [ffff8801a9707a38] shrink_slab at ffffffff811a7b12
#4 [ffff8801a9707b08] shrink_zone at ffffffff811ac2be
#5 [ffff8801a9707b70] do_try_to_free_pages at ffffffff811ac63d
#6 [ffff8801a9707be8] try_to_free_pages at ffffffff811ac9ea
#7 [ffff8801a9707c58] __alloc_pages_nodemask at ffffffff8119e9cb
#8 [ffff8801a9707d48] copy_process at ffffffff8107ff57
#9 [ffff8801a9707e18] _do_fork at ffffffff81081ca6
#10 [ffff8801a9707e80] kernel_thread at ffffffff81081f55
#11 [ffff8801a9707e88] kthreadd at ffffffff810a1a81
#12 [ffff8801a9707f50] ret_from_fork at ffffffff8161e1f5
PID: 504 TASK: ffff8817dc374a00 CPU: 60 COMMAND: "kswapd1"
#0 [ffff8817dc37bc40] schedule at ffffffff816197cd
#1 [ffff8817dc37bc58] ldlm_namespace_put at ffffffffa0e9a8ca [ptlrpc]
#2 [ffff8817dc37bc70] ldlm_pools_count at ffffffffa0eb3a20 [ptlrpc]
#3 [ffff8817dc37bcb0] shrink_slab at ffffffff811a7b12
#4 [ffff8817dc37bd80] shrink_zone at ffffffff811ac2be
#5 [ffff8817dc37bde8] kswapd at ffffffff811ad28e
#6 [ffff8817dc37bed0] kthread at ffffffff810a0e39
#7 [ffff8817dc37bf50] ret_from_fork at ffffffff8161e1f5
What is also extremely strange is that the threads blocked on ldlm_namespace_put() are calling schedule(), which shouldn't happen for spinlocks, unless there is something strange going on (e.g. some change to the kernel that turns spinlocks into sleeping calls)? The only other thing I can think of is that the spinlock itself is corrupted. The thread in the original description and a few others are blocked on another spinlock, osc_shrink_lock, in osc_cache_shrink_count(), which is also a bit strange since that function is very simple and osc_shrink_lock is not held for very long. It might be that this stack is just a coincidence due to the crashdump IRQ, if osc_cache_shrink_count() is being called in a loop because there is no free memory.
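To make this failure mode concrete, here is a minimal sketch of the shrinker "count" callback pattern involved. It only approximates the shape of osc_cache_shrink_count(); the example_-prefixed names are stand-ins, not the actual Lustre code:
#include <linux/spinlock.h>
#include <linux/list.h>
#include <linux/atomic.h>
#include <linux/shrinker.h>

/* Stand-ins for osc_shrink_lock and osc_shrink_list (illustrative only). */
static DEFINE_SPINLOCK(example_shrink_lock);
static LIST_HEAD(example_shrink_list);

struct example_client {
	struct list_head ec_shrink_list;  /* linkage on example_shrink_list */
	atomic_long_t    ec_lru_in_list;  /* pages this client has cached */
};

/*
 * Shrinker "count" callback: shrink_slab() calls this from every thread
 * that enters direct reclaim, so it must be cheap and must never sleep.
 * If the spinlock taken here is held too long (or corrupted), every
 * thread trying to free memory on the node piles up at this point,
 * which matches the backtraces above.
 */
static unsigned long example_cache_shrink_count(struct shrinker *sk,
						struct shrink_control *sc)
{
	struct example_client *cli;
	unsigned long cached = 0;

	spin_lock(&example_shrink_lock);
	list_for_each_entry(cli, &example_shrink_list, ec_shrink_list)
		cached += atomic_long_read(&cli->ec_lru_in_list);
	spin_unlock(&example_shrink_lock);

	return cached;
}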
One area of potential interest is that several ptlrpcd threads, which are in charge of sending RPC requests (e.g. writing out dirty data), are blocked trying to allocate memory. Some of them are trying to allocate a jobid cache entry (about 128 bytes), but others appear to be calling something else:
PID: 3627 TASK: ffff8817ab4c4280 CPU: 67 COMMAND: "ptlrpcd_01_13"
#0 [ffff8817ab4cb600] schedule at ffffffff816197cd
#1 [ffff8817ab4cb618] ldlm_namespace_put at ffffffffa0e9a8ca [ptlrpc]
#2 [ffff8817ab4cb630] ldlm_pools_count at ffffffffa0eb3a20 [ptlrpc]
#3 [ffff8817ab4cb670] shrink_slab at ffffffff811a7b12
#4 [ffff8817ab4cb740] shrink_zone at ffffffff811ac2be
#5 [ffff8817ab4cb7a8] do_try_to_free_pages at ffffffff811ac63d
#6 [ffff8817ab4cb820] try_to_free_pages at ffffffff811ac9ea
#7 [ffff8817ab4cb890] __alloc_pages_nodemask at ffffffff8119e9cb
#8 [ffff8817ab4cb980] kmem_getpages at ffffffff811ee20d
#9 [ffff8817ab4cb998] fallback_alloc at ffffffff811efa1b
#10 [ffff8817ab4cba00] kmem_cache_alloc_trace at ffffffff811f05ec
#11 [ffff8817ab4cba98] null_alloc_reqbuf at ffffffffa0f0130c [ptlrpc]
#12 [ffff8817ab4cbb20] lustre_get_jobid at ffffffffa0c8d709 [obdclass]
#13 [ffff8817ab4cbb78] ptlrpcd_add_req at ffffffffa0eea156 [ptlrpc]
#14 [ffff8817ab4cbbe8] ptlrpc_connect_import at ffffffffa0ee6c35 [ptlrpc]
#15 [ffff8817ab4cbc90] ptlrpc_request_handle_notconn at ffffffffa0ec3308 [ptlrpc]
#16 [ffff8817ab4cbca8] after_reply at ffffffffa0ebeae0 [ptlrpc]
#17 [ffff8817ab4cbcf0] ptlrpc_check_set at ffffffffa0ec0243 [ptlrpc]
#18 [ffff8817ab4cbd80] ptlrpcd_check at ffffffffa0eea77a [ptlrpc]
#19 [ffff8817ab4cbdd0] ptlrpcd at ffffffffa0eeab98 [ptlrpc]
PID: 3631 TASK: ffff8817ab4e4380 CPU: 72 COMMAND: "ptlrpcd_01_17"
#0 [ffff8817ab4eb600] schedule at ffffffff816197cd
#1 [ffff8817ab4eb618] ldlm_namespace_put at ffffffffa0e9a8ca [ptlrpc]
#2 [ffff8817ab4eb630] ldlm_pools_count at ffffffffa0eb3a20 [ptlrpc]
#3 [ffff8817ab4eb670] shrink_slab at ffffffff811a7b12
#4 [ffff8817ab4eb740] shrink_zone at ffffffff811ac2be
#5 [ffff8817ab4eb7a8] do_try_to_free_pages at ffffffff811ac63d
#6 [ffff8817ab4eb820] try_to_free_pages at ffffffff811ac9ea
#7 [ffff8817ab4eb890] __alloc_pages_nodemask at ffffffff8119e9cb
#8 [ffff8817ab4eb980] kmem_getpages at ffffffff811ee20d
#9 [ffff8817ab4eb998] fallback_alloc at ffffffff811efa1b
#10 [ffff8817ab4eba00] kmem_cache_alloc_trace at ffffffff811f05ec
#11 [ffff8817ab4ebaa0] __kmalloc at ffffffff811f1c4f
#12 [ffff8817ab4ebb78] ptlrpcd_add_req at ffffffffa0eea156 [ptlrpc]
#13 [ffff8817ab4ebbe8] ptlrpc_connect_import at ffffffffa0ee6c35 [ptlrpc]
#14 [ffff8817ab4ebc90] ptlrpc_request_handle_notconn at ffffffffa0ec3308 [ptlrpc]
#15 [ffff8817ab4ebca8] after_reply at ffffffffa0ebeae0 [ptlrpc]
#16 [ffff8817ab4ebcf0] ptlrpc_check_set at ffffffffa0ec0243 [ptlrpc]
#17 [ffff8817ab4ebd80] ptlrpcd_check at ffffffffa0eea77a [ptlrpc]
#18 [ffff8817ab4ebdd0] ptlrpcd at ffffffffa0eeab98 [ptlrpc]
It may be that this is just collateral damage from being stuck in ldlm_namespace_put() like the other threads, but we should also avoid doing allocations in the write path, especially ones that block important service threads. Since the jobid information is not critical, the allocation in cfs_get_environ() and jobid_get_from_cache() (if you have the " If this is a regular occurrence, a few things you could try to isolate the problem:
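As a rough illustration of the point about allocations on the write path (a sketch under assumptions, not the actual Lustre code or a proposed patch; the structure name and flag choices are illustrative), a non-critical allocation like the jobid cache entry could be made failure-tolerant so a ptlrpcd thread is never pulled into direct reclaim:
#include <linux/slab.h>
#include <linux/gfp.h>
#include <linux/list.h>

/* Illustrative stand-in for the ~128-byte jobid cache entry. */
struct example_jobid_entry {
	struct list_head je_list;
	char             je_jobid[32];
};

/*
 * GFP_NOFS avoids re-entering filesystem reclaim from a thread the
 * filesystem needs for forward progress, and __GFP_NORETRY together with
 * __GFP_NOWARN lets the allocation fail quickly and quietly under memory
 * pressure.  The caller simply skips caching the jobid when NULL is
 * returned, since the jobid information is not critical.
 */
static struct example_jobid_entry *example_jobid_entry_alloc(void)
{
	return kmalloc(sizeof(struct example_jobid_entry),
		       GFP_NOFS | __GFP_NORETRY | __GFP_NOWARN);
}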
You may also want to check if the kswapd, systemd, and other threads are blocked before the node runs out of memory, as the cause and effect might be reversed. |
| Comment by Lukasz Flis [ 21/Aug/19 ] |
|
mhanafi - I think we had a similar issue. In our case we got rid of the hangs by raising vm.min_free_kbytes and vm.admin_reserve_kbytes:
vm.admin_reserve_kbytes = 65536
vm.min_free_kbytes = 270336
Unfortunately, we sometimes see kswapd0 consuming around 90% of CPU when free memory on NUMA node 0 is near 215MB. This happens when we have 6 single-core jobs on one server, each assigned to a different core on NUMA node 0, with each process consuming around 6.5GB of RAM (the processes have no memory limits in terms of accessible NUMA nodes). This is what the perf report from kswapd0 looks like:
+ 100.00% 0.00% kswapd0 [kernel.kallsyms] [k] kthread
- 99.98% 0.02% kswapd0 [kernel.kallsyms] [k] kswapd
- 99.96% kswapd
- 99.77% balance_pgdat
- 85.01% shrink_slab
- 55.07% osc_cache_shrink
54.63% osc_cache_shrink_count
+ 9.54% ldlm_pools_cli_shrink
8.52% prune_super
1.20% shrinker2_shrink
0.57% lu_cache_shrink
+ 9.63% shrink_zone
1.50% __x86_indirect_thunk_rax
1.42% prune_super
The problem manifests itself regardless of the lru_size setting (dynamic vs. 10000).
Lustre version: 2.10.8 |
| Comment by Andreas Dilger [ 26/Aug/19 ] |
|
lflis, the NUMA allocation problem may potentially be resolved by patch https://review.whamcloud.com/32848 " |
| Comment by Lukasz Flis [ 26/Aug/19 ] |
|
Andreas Dilger, in our case we have fully balanced memory, but the 2-socket system is configured in cluster-on-die mode, which means the most heavily used NUMA node 0 gets exhausted quickly (the 64GB of socket memory is divided across two virtual NUMA nodes). We will give the patch a try and report back whether the problem with the looping kswapd0 is gone. I did some source digging and I wonder if we could use the same approach as __OBD_SLAB_ALLOC_VERBOSE to handle OOM conditions for SLAB allocations?
|
| Comment by Peter Jones [ 26/Aug/19 ] |
|
lflis please use your official support channels to get help - comments attached to other people's tickets are prone to being missed and may not get a response. |
| Comment by Lukasz Flis [ 26/Aug/19 ] |
|
Peter Jones, my involvement here was to point out to Mahmoud Hanafi that increasing min_free_kbytes can solve the problem of OOM-killer-related system crashes, and to note that Lustre performance may then suffer under low NUMA-memory conditions. The patch suggested by Andreas seems to be a very good remedy for the low-NUMA-memory problem, as confirmed by our initial testing. I will continue testing in a more production-like environment, but I think it is worth considering for b2_10. I didn't expect fast response times here, to be honest, and accepted the risk of being missed.
|
| Comment by Mahmoud Hanafi [ 06/Jan/21 ] |
|
We have not seen this in 2.12. Please close. |