Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Minor
    • None
    • Affects Version/s: Lustre 2.14.0, Lustre 2.15.0
    • Severity: 3

    Description

      This issue was created by maloo for jianyu <yujian@whamcloud.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/1e6f3bc6-e5ef-11e9-b62b-52540065bddc

      test_bonnie failed with oom on OSS:

      [16526.881544] Lustre: DEBUG MARKER: == sanity-benchmark test bonnie: bonnie++ ============================================================ 14:37:57 (1570027077)
      [16528.099983] Lustre: DEBUG MARKER: /usr/sbin/lctl mark min OST has 10511360kB available, using 3438712kB file size
      [16528.357585] Lustre: DEBUG MARKER: min OST has 10511360kB available, using 3438712kB file size
      [16567.214746] irqbalance invoked oom-killer: gfp_mask=0x200da, order=0, oom_score_adj=0
      [16567.215741] irqbalance cpuset=/ mems_allowed=0
      [16567.216221] CPU: 1 PID: 1179 Comm: irqbalance Kdump: loaded Tainted: P           OE  ------------   3.10.0-957.27.2.el7_lustre.x86_64 #1
      [16567.217451] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [16567.218033] Call Trace:
      [16567.218336]  [<ffffffffad565147>] dump_stack+0x19/0x1b
      [16567.218874]  [<ffffffffad55fb6a>] dump_header+0x90/0x229
      [16567.219422]  [<ffffffffad572b1f>] ? notifier_call_chain+0x4f/0x70
      [16567.220055]  [<ffffffffacec91c8>] ? __blocking_notifier_call_chain+0x58/0x70
      [16567.220779]  [<ffffffffacfbbaae>] check_panic_on_oom+0x2e/0x60
      [16567.221379]  [<ffffffffacfbbecb>] out_of_memory+0x23b/0x4f0
      [16567.221938]  [<ffffffffad56066e>] __alloc_pages_slowpath+0x5d6/0x724
      [16567.222585]  [<ffffffffacfc2524>] __alloc_pages_nodemask+0x404/0x420
      [16567.223225]  [<ffffffffad0128c5>] alloc_pages_vma+0xb5/0x200
      [16567.223840]  [<ffffffffad000b15>] __read_swap_cache_async+0x115/0x190
      [16567.224491]  [<ffffffffad000bb6>] read_swap_cache_async+0x26/0x60
      [16567.225104]  [<ffffffffad000c9c>] swapin_readahead+0xac/0x110
      [16567.225690]  [<ffffffffacfead92>] handle_pte_fault+0x812/0xd10
      [16567.226280]  [<fffffffface2a621>] ? __switch_to+0x151/0x580
      [16567.226858]  [<ffffffffacfed3ad>] handle_mm_fault+0x39d/0x9b0
      [16567.227444]  [<ffffffffacec6efd>] ? hrtimer_start_range_ns+0x1ed/0x3c0
      [16567.228100]  [<ffffffffad572603>] __do_page_fault+0x203/0x4f0
      [16567.228685]  [<ffffffffad5729d6>] trace_do_page_fault+0x56/0x150
      [16567.229287]  [<ffffffffad571f62>] do_async_page_fault+0x22/0xf0
      [16567.229890]  [<ffffffffad56e798>] async_page_fault+0x28/0x30
      

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      ost-pools test_23b - trevis-21vm3 crashed during ost-pools test_23b

          Activity

            [LU-12830] RHEL8.3 and ZFS: oom on OSS
            adilger Andreas Dilger added a comment (edited) -

            Crashes have been stopped by increasing the OST VM RAM from 2GB to 3GB (last autotest failure 2021-07-31), but I don't think the problem is really fixed. That said, it is off our radar for now, until someone is interested enough to fix it.

            bzzz Alex Zhuravlev added a comment - https://testing.whamcloud.com/test_sets/642b2113-975a-4311-9388-1e58e808a577
            hornc Chris Horn added a comment - +1 https://testing.whamcloud.com/test_sets/f54f45ff-11d2-4d46-9063-376613c4a738
            adilger Andreas Dilger added a comment (edited) - +12 failures in the past week, all ZFS
            bzzz Alex Zhuravlev added a comment - https://testing.whamcloud.com/test_sets/26779a73-fb4b-44d4-b7f4-0892faf505af

            adilger Andreas Dilger added a comment - +4 failures in the past week, all of ZFS.
            adilger Andreas Dilger added a comment - +7 failures in the past week:
            https://testing.whamcloud.com/test_sets/7015c24d-a8f5-4dfa-8032-6693a12fc05f
            https://testing.whamcloud.com/test_sets/3ec6bac8-df96-4a4c-ae97-e93d1546194c
            https://testing.whamcloud.com/test_sets/3624fe39-083b-4427-a4c0-f801afc13b54
            https://testing.whamcloud.com/test_sets/72ecc79f-cc65-48b5-ba39-208fc6d0b9ac
            https://testing.whamcloud.com/test_sets/02596af3-0144-4af4-888f-8ea89a444bbf
            https://testing.whamcloud.com/test_sets/d8aa05e2-eb9a-4911-a770-bbe5f2d6c613
            https://testing.whamcloud.com/test_sets/1a7783dc-9b13-407b-980d-2dce13877b19

            adilger Andreas Dilger added a comment - What is strange is that some slabs seem to be missing from the dump? I don't see ldlm_locks or ldlm_resources, but I thought we were running with slub_nomerge these days? It still isn't clear where all the memory has gone...

            adilger Andreas Dilger added a comment -

            Looking at all of the recent failures, the OOM is ZFS-only and only happens on the OSS. The two RHEL8.1 ldiskfs failures Jian referenced above were hit on the client.

            Looking at these OOM cases, all of them are due to the OOM killer hitting a userspace process (lctl, in the case of obdfilter-survey, where it is driving the load):

            [ 5089.324526] CPU: 0 PID: 1175 Comm: tuned 
            [20249.901128] CPU: 0 PID: 866 Comm: sssd_be
            [31802.020686] CPU: 1 PID: 1074159 Comm: lctl 
            [129875.609998] CPU: 1 PID: 1501864 Comm: lctl 
            [77898.375312] CPU: 0 PID: 1194 Comm: tuned 
            [20897.423508] CPU: 0 PID: 1459 Comm: tuned 
            [37225.033993] CPU: 0 PID: 2 Comm: kthreadd
            [21544.431269] CPU: 1 PID: 1177 Comm: tuned
            [ 4838.509291] CPU: 0 PID: 804 Comm: NetworkManager
            [114309.521078] CPU: 1 PID: 795 Comm: NetworkManager
            [  914.434369] CPU: 1 PID: 1178 Comm: tuned
            [22615.640870] CPU: 1 PID: 889 Comm: sssd_nss
            

            According to the meminfo dump at the time, it doesn't look like there is a lot of slab in use, only 20563 pages ~= 82MB, and not much in terms of page cache at all. All of the Lustre-internal allocations would be recorded as part of the slab cache, so it doesn't seem like this is a memory leak. Pagecache pages should probably appear in active_file: or inactive_file:, unless ZFS is doing something strange with the ARC buffers.
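
            As a quick sanity check on those figures (my own arithmetic, assuming the usual 4KB page size), the slab page counts from the dump below convert to megabytes as follows:

            PAGE_KB = 4                       # assumed page size in kB
            slab_reclaimable = 6140           # pages, from the dump below
            slab_unreclaimable = 20563        # pages, from the dump below

            print(f"reclaimable:   {slab_reclaimable * PAGE_KB / 1024:.0f} MB")    # ~24 MB
            print(f"unreclaimable: {slab_unreclaimable * PAGE_KB / 1024:.0f} MB")  # ~80 MB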

            [22615.680857] Mem-Info:
            [22615.681353] active_anon:0 inactive_anon:0 isolated_anon:0
             active_file:15 inactive_file:19 isolated_file:0
             unevictable:0 dirty:0 writeback:0 unstable:0
             slab_reclaimable:6140 slab_unreclaimable:20563
             mapped:1 shmem:0 pagetables:1575 bounce:0
             free:12775 free_pcp:15 free_cma:0
            [22615.686600] Node 0 active_anon:0kB inactive_anon:0kB active_file:60kB inactive_file:76kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:4kB dirty:0kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? yes
            [22615.690863] Node 0 DMA free:6816kB min:416kB low:520kB high:624kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15360kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
            [22615.695119] lowmem_reserve[]: 0 1605 1605 1605 1605
            [22615.696004] Node 0 DMA32 free:44284kB min:44636kB low:55792kB high:66948kB active_anon:0kB inactive_anon:0kB active_file:60kB inactive_file:76kB unevictable:0kB writepending:0kB present:2080744kB managed:1678888kB mlocked:0kB kernel_stack:11920kB pagetables:6300kB bounce:0kB free_pcp:60kB local_pcp:0kB free_cma:0kB
            [22615.700603] lowmem_reserve[]: 0 0 0 0 0
            [22615.701307] Node 0 DMA: 0*4kB 0*8kB 0*16kB 1*32kB (U) 0*64kB 1*128kB (U) 2*256kB (U) 2*512kB (U) 1*1024kB (U) 2*2048kB (UM) 0*4096kB = 6816kB
            [22615.703467] Node 0 DMA32: 1237*4kB (UME) 827*8kB (UME) 361*16kB (UME) 198*32kB (UME) 72*64kB (UME) 11*128kB (UM) 7*256kB (UM) 19*512kB (U) 3*1024kB (U) 0*2048kB 0*4096kB = 44284kB
            [22615.706168] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
            [22615.707684] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
            [22615.709160] 34 total pagecache pages
            [22615.709848] 0 pages in swap cache
            [22615.710472] Swap cache stats: add 47896, delete 47895, find 250384/250507
            [22615.711670] Free swap  = 2685692kB
            [22615.712302] Total swap = 2860028kB
            [22615.712949] 524184 pages RAM
            [22615.713507] 0 pages HighMem/MovableOnly
            [22615.714206] 100622 pages reserved
            [22615.714834] 0 pages hwpoisoned
            

            I'm not sure what the "100622 pages reserved" (about 400MB of the 2GB) is reserved for. Crashdumps? It looks like this 400MB reserve is the difference between "present:2080744kB" and "managed:1678888kB" in the DMA32 zone. I don't think that is the specific problem here, since I also see about 100k pages reserved on my RHEL7.9 VM, but it is a fair chunk of memory to hold back.

            What IS strange is that the DMA32 zone shows 2GB present and 1.6GB managed, yet that memory doesn't show up as "used" in any of the fields, and only about 45MB is free.
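
            As a rough tally (my own arithmetic, not from the ticket), summing the accounted fields from the Mem-Info output above against the managed memory shows how large that gap is, again assuming 4KB pages:

            # All values in kB, taken from the Mem-Info dump above.
            PAGE_KB = 4

            managed = 15360 + 1678888        # DMA + DMA32 "managed"
            accounted = {
                "free":               6816 + 44284,
                "active_file":        60,
                "inactive_file":      76,
                "kernel_stack":       11920,
                "pagetables":         6300,
                "slab_reclaimable":   6140 * PAGE_KB,
                "slab_unreclaimable": 20563 * PAGE_KB,
            }

            total = sum(accounted.values())
            print(f"managed:     {managed / 1024:.0f} MB")            # ~1.6 GB
            print(f"accounted:   {total / 1024:.0f} MB")              # ~172 MB
            print(f"unaccounted: {(managed - total) / 1024:.0f} MB")  # ~1.5 GB

            So roughly 1.5GB of the managed memory is not visible in any of these counters at all, which fits the suspicion above that ZFS may be holding it in ARC buffers outside the slab and pagecache accounting.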

            There are several systemd-related processes consuming a significant amount of memory (tuned, polkitd, automount), but I'm not sure whether they are the main offenders, what the exact units in this part of the dump are (pages or KB), or whether it lists all running processes or only those with significant memory usage (see the note on units after the table):

            [22615.839020] [ pid ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
            [22615.840493] [  544]     0   544    28120        1   274432     1298             0 systemd-journal
            [22615.842023] [  579]     0   579    27233        1   225280      539         -1000 systemd-udevd
            [22615.843530] [  742]    32   742    16785        0   167936      191             0 rpcbind
            [22615.844941] [  750]     0   750    19242        0   147456      168         -1000 auditd
            [22615.846372] [  794]     0   794    53697        0   434176      508             0 sssd
            [22615.847757] [  800]    81   800    14140        0   163840      237          -900 dbus-daemon
            [22615.849229] [  805]     0   805    97895        0   385024      687             0 NetworkManager
            [22615.850754] [  813]   998   813   407041        0   331776     1290             0 polkitd
            [22615.852173] [  815]     0   815    31224        0   139264      165             0 irqbalance
            [22615.853634] [  827]   994   827    32228        0   147456      122             0 chronyd
            [22615.855053] [  841]   993   841    58461        1   212992      219             0 rngd
            [22615.856444] [  877]     0   877    55424        0   442368      752             0 sssd_be
            [22615.857858] [  889]     0   889    56279        0   462848      469             0 sssd_nss
            [22615.859299] [  901]     0   901   106578        0   430080     3803             0 tuned
            [22615.860696] [  906]     0   906    23072        1   200704      232         -1000 sshd
            [22615.862075] [  907]     0   907    34682        0   286720      368             0 cupsd
            [22615.863478] [  910]   997   910    30251        0   147456      161             0 munged
            [22615.864879] [  931]     0   931    14994        1   147456      148             0 xinetd
            [22615.866271] [  953]     0   953    75597        0   303104      561             0 rsyslogd
            [22615.867707] [  968]     0   968    25434        0   176128      176             0 gssproxy
            [22615.869133] [  988]     0   988    24316        1   212992      300             0 systemd-logind
            [22615.870653] [  998]    29   998    17931        0   176128     6297             0 rpc.statd
            [22615.872093] [ 1133]     0  1133    30765        1   180224      251             0 master
            [22615.873519] [ 1140]    89  1140    37597        0   237568      272             0 qmgr
            [22615.874896] [ 1212]     0  1212   173534        0   335872      500             0 automount
            [22615.876351] [ 1216]     0  1216    10994        0   114688       53             0 atd
            [22615.877707] [ 1217]     0  1217     9232        1   110592      220             0 crond
            [22615.879098] [ 1241]     0  1241     4050        0    69632       37             0 agetty
            [22615.880508] [ 1242]     0  1242     3408        0    65536       32             0 agetty
            [22615.881906] [1250901]    89 1250901    37584        0   237568      271             0 pickup
            [22615.883448] Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled
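
            For what it's worth on the units question: in the kernel's OOM task dump the total_vm, rss, and swapents columns are page counts, pgtables_bytes is in bytes, and the table lists every OOM-eligible (userspace) task, not only the large ones. A quick conversion of the biggest entries above, assuming 4KB pages:

            PAGE_KB = 4

            # name: (total_vm, rss, swapents) in pages, from the dump above
            tasks = {
                "polkitd":   (407041, 0, 1290),
                "tuned":     (106578, 0, 3803),
                "automount": (173534, 0, 500),
                "rpc.statd": ( 17931, 0, 6297),
            }

            for name, (total_vm, rss, swapents) in tasks.items():
                print(f"{name:10s} vsz={total_vm * PAGE_KB // 1024:4d}MB "
                      f"rss={rss * PAGE_KB}kB swapped={swapents * PAGE_KB // 1024}MB")

            Resident memory is essentially zero for every task, and well under 100MB is swapped out across all of them, which again points away from userspace as the main consumer.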
            

            One possibility (just a guess) is that there is something in the RHEL8 systemd/cgroup configuration that prevents memory pressure from userspace processes from reclaiming pages used for cached objects in the kernel, in a way that wasn't the case in RHEL7.x. That still doesn't explain where the memory is currently being used.
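
            If anyone wants to poke at the cgroup theory, something like the sketch below would show what the memory controller thinks each slice/service is using. This is only a sketch; it assumes the node is on the default RHEL8 cgroup v1 layout, with the memory controller mounted at /sys/fs/cgroup/memory:

            #!/usr/bin/env python3
            # Walk the v1 memory controller hierarchy and print usage/limit per cgroup.
            import os

            ROOT = "/sys/fs/cgroup/memory"

            def read_kb(path):
                try:
                    with open(path) as f:
                        return int(f.read()) // 1024
                except (OSError, ValueError):
                    return None

            for dirpath, dirnames, filenames in os.walk(ROOT):
                usage = read_kb(os.path.join(dirpath, "memory.usage_in_bytes"))
                limit = read_kb(os.path.join(dirpath, "memory.limit_in_bytes"))
                if usage:
                    # An effectively unlimited cgroup shows up as a huge limit value.
                    print(f"{dirpath[len(ROOT):] or '/'} usage={usage}kB limit={limit}kB")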

            On my RHEL7 VM the sysrq-m dump shows that each zone has a number of pages used for different reasons, while the RHEL8 dump shows almost nothing:

            Node 0 DMA32 free:1897928kB min:59336kB low:74168kB high:89004kB active_anon:87160kB inactive_anon:75148kB active_file:74556kB inactive_file:473568kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3653568kB managed:3407344kB mlocked:0kB dirty:25324kB writeback:0kB mapped:21492kB shmem:91932kB slab_reclaimable:30712kB slab_unreclaimable:66032kB kernel_stack:3712kB pagetables:4132kB unstable:0kB bounce:0kB free_pcp:948kB local_pcp:692kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
            Node 0 Normal free:217984kB min:7968kB low:9960kB high:11952kB active_anon:12984kB inactive_anon:11080kB active_file:13596kB inactive_file:59896kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:524288kB managed:457212kB mlocked:0kB dirty:1744kB writeback:0kB mapped:2524kB shmem:13212kB slab_reclaimable:9616kB slab_unreclaimable:12632kB kernel_stack:1152kB pagetables:424kB unstable:0kB bounce:0kB free_pcp:956kB local_pcp:700kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
            
            yujian Jian Yu added a comment -

            Another option is to extend the memory on the OSS, which is 2GB by default.

            Hi Minh, the OOM issue affects many tests. Can we increase the memory size on the VM nodes?


            People

              Assignee: wc-triage WC Triage
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 10
