[LU-12830] RHEL8.3 and ZFS: oom on OSS Created: 04/Oct/19  Updated: 22/Jun/22  Resolved: 22/Jun/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.14.0, Lustre 2.15.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: WC Triage
Resolution: Cannot Reproduce Votes: 0
Labels: ZFS

Issue Links:
Duplicate
is duplicated by LU-14643 ost-pools test_23b: 'trevis-57vm8 cra... Resolved
is duplicated by LU-14812 ost-pools test_23b OOM crash Resolved
Related
is related to LU-12864 sanity-benchmark test_iozone crashes ... Open
is related to LU-10687 sanity-benchmark test iozone hangs wi... Resolved
is related to LU-14456 ost-pools test_23b: ll_ost_io00_007 i... Resolved
is related to LU-12336 Update ZFS Version to 0.8.2 Resolved
is related to LU-13178 Update ZFS Version to 0.8.3 Resolved
is related to LU-13162 parallel-scale test_statahead: mdsrat... Resolved
is related to LU-13594 register OOM callback in Lustre Reopened
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for jianyu <yujian@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/1e6f3bc6-e5ef-11e9-b62b-52540065bddc

test_bonnie failed with oom on OSS:

[16526.881544] Lustre: DEBUG MARKER: == sanity-benchmark test bonnie: bonnie++ ============================================================ 14:37:57 (1570027077)
[16528.099983] Lustre: DEBUG MARKER: /usr/sbin/lctl mark min OST has 10511360kB available, using 3438712kB file size
[16528.357585] Lustre: DEBUG MARKER: min OST has 10511360kB available, using 3438712kB file size
[16567.214746] irqbalance invoked oom-killer: gfp_mask=0x200da, order=0, oom_score_adj=0
[16567.215741] irqbalance cpuset=/ mems_allowed=0
[16567.216221] CPU: 1 PID: 1179 Comm: irqbalance Kdump: loaded Tainted: P           OE  ------------   3.10.0-957.27.2.el7_lustre.x86_64 #1
[16567.217451] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[16567.218033] Call Trace:
[16567.218336]  [<ffffffffad565147>] dump_stack+0x19/0x1b
[16567.218874]  [<ffffffffad55fb6a>] dump_header+0x90/0x229
[16567.219422]  [<ffffffffad572b1f>] ? notifier_call_chain+0x4f/0x70
[16567.220055]  [<ffffffffacec91c8>] ? __blocking_notifier_call_chain+0x58/0x70
[16567.220779]  [<ffffffffacfbbaae>] check_panic_on_oom+0x2e/0x60
[16567.221379]  [<ffffffffacfbbecb>] out_of_memory+0x23b/0x4f0
[16567.221938]  [<ffffffffad56066e>] __alloc_pages_slowpath+0x5d6/0x724
[16567.222585]  [<ffffffffacfc2524>] __alloc_pages_nodemask+0x404/0x420
[16567.223225]  [<ffffffffad0128c5>] alloc_pages_vma+0xb5/0x200
[16567.223840]  [<ffffffffad000b15>] __read_swap_cache_async+0x115/0x190
[16567.224491]  [<ffffffffad000bb6>] read_swap_cache_async+0x26/0x60
[16567.225104]  [<ffffffffad000c9c>] swapin_readahead+0xac/0x110
[16567.225690]  [<ffffffffacfead92>] handle_pte_fault+0x812/0xd10
[16567.226280]  [<fffffffface2a621>] ? __switch_to+0x151/0x580
[16567.226858]  [<ffffffffacfed3ad>] handle_mm_fault+0x39d/0x9b0
[16567.227444]  [<ffffffffacec6efd>] ? hrtimer_start_range_ns+0x1ed/0x3c0
[16567.228100]  [<ffffffffad572603>] __do_page_fault+0x203/0x4f0
[16567.228685]  [<ffffffffad5729d6>] trace_do_page_fault+0x56/0x150
[16567.229287]  [<ffffffffad571f62>] do_async_page_fault+0x22/0xf0
[16567.229890]  [<ffffffffad56e798>] async_page_fault+0x28/0x30

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
ost-pools test_23b - trevis-21vm3 crashed during ost-pools test_23b



 Comments   
Comment by Jian Yu [ 04/Oct/19 ]

This failure is specific to Lustre test sessions with ZFS 0.8.1 and 0.8.2. It didn't occur in test sessions with ZFS 0.7.13.

Comment by Jian Yu [ 04/Oct/19 ]

obdfilter-survey test 1a also failed with oom on OSS:
https://testing.whamcloud.com/test_sets/585db5f6-e5ef-11e9-b62b-52540065bddc
https://testing.whamcloud.com/test_sets/a00637aa-e632-11e9-a197-52540065bddc
https://testing.whamcloud.com/test_sets/722833d0-e603-11e9-be86-52540065bddc

Comment by Jian Yu [ 04/Oct/19 ]

parallel-scale test compilebench hit the same issue:
https://testing.whamcloud.com/test_sets/6e82be44-e603-11e9-be86-52540065bddc

Comment by Jian Yu [ 30/Jan/20 ]

The oom failure also occurred with ZFS 0.8.3:
https://testing.whamcloud.com/test_sets/b2ac0f98-43a1-11ea-bffa-52540065bddc
https://testing.whamcloud.com/test_sets/b2ab5292-43a1-11ea-86b2-52540065bddc

Comment by Alex Zhuravlev [ 10/Feb/20 ]

According to testing in autotest (with https://review.whamcloud.com/#/c/37384/ ), all ZFS 0.8.* releases demonstrate this behavior.
This doesn't look like a memory leak, but rather like increased memory requirements on the ZFS side.
I tried with fewer OSTs per node, and everything was fine with up to 4 OSTs.
Another option is to increase the memory on the OSS, which is 2GB by default.
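
To get a feel for how much of that 2GB the ZFS ARC itself is holding on the OSS while the benchmark runs, a small diagnostic like the following could be used. This is only a sketch; it assumes the standard ZFS-on-Linux kstat file /proc/spl/kstat/zfs/arcstats is available (it is on 0.7.x/0.8.x) and simply reports the current ARC size against its target and limit:

#!/usr/bin/env python3
# Sketch: report ZFS ARC usage on an OSS by reading the standard
# ZFS-on-Linux kstat file. Run on the server while the test is running.

ARCSTATS = "/proc/spl/kstat/zfs/arcstats"

def read_arcstats(path=ARCSTATS):
    stats = {}
    with open(path) as f:
        for line in f:
            parts = line.split()
            # data lines look like "name type value"; skip the header lines
            if len(parts) == 3 and parts[1].isdigit():
                stats[parts[0]] = int(parts[2])
    return stats

if __name__ == "__main__":
    s = read_arcstats()
    mib = 1024 * 1024
    print(f"ARC size:   {s['size'] / mib:7.0f} MiB")
    print(f"ARC target: {s['c'] / mib:7.0f} MiB")
    print(f"ARC limit:  {s['c_max'] / mib:7.0f} MiB")

Sampling this periodically during bonnie++ would show whether the ARC stays within its configured c_max while the OOM builds up, or whether the pressure is coming from somewhere else.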

Comment by Jian Yu [ 14/Feb/20 ]

The same failure also occurred with ldiskfs on RHEL 8.1:
https://testing.whamcloud.com/test_sets/d6331fe6-4ef1-11ea-a90e-52540065bddc
https://testing.whamcloud.com/test_sets/d70b1166-4907-11ea-aeb7-52540065bddc
Client-side OOM was moved to LU-13162.

Comment by Jian Yu [ 14/Feb/20 ]

Replying to Alex's suggestion above: "another option is to extend memory on OSS which is 2GB by default."

Hi Minh, the OOM issue affects many tests. Can we increase the memory size on the VM nodes?

Comment by Andreas Dilger [ 27/Jan/21 ]

Looking at all of the recent failures, the OOM happens only with ZFS, and only on the OSS. The two RHEL8.1 ldiskfs failures Jian referenced above were hit on the client.

Looking at these OOM cases, all of them are OOM kills of userspace processes (lctl in the case of obdfilter-survey, where it is driving the load):

[ 5089.324526] CPU: 0 PID: 1175 Comm: tuned 
[20249.901128] CPU: 0 PID: 866 Comm: sssd_be
[31802.020686] CPU: 1 PID: 1074159 Comm: lctl 
[129875.609998] CPU: 1 PID: 1501864 Comm: lctl 
[77898.375312] CPU: 0 PID: 1194 Comm: tuned 
[20897.423508] CPU: 0 PID: 1459 Comm: tuned 
[37225.033993] CPU: 0 PID: 2 Comm: kthreadd
[21544.431269] CPU: 1 PID: 1177 Comm: tuned
[ 4838.509291] CPU: 0 PID: 804 Comm: NetworkManager
[114309.521078] CPU: 1 PID: 795 Comm: NetworkManager
[  914.434369] CPU: 1 PID: 1178 Comm: tuned
[22615.640870] CPU: 1 PID: 889 Comm: sssd_nss

According to the meminfo dump at the time, it doesn't look like there is a lot of slab in use, only 20563 pages ~= 82MB, and not much in terms of page cache at all (see the arithmetic check after the dump below). All of the Lustre-internal allocations would be recorded as part of the slab cache, so it doesn't seem like this is a memory leak. Pagecache pages should probably appear in active_file: or inactive_file:, unless ZFS is doing something strange with the ARC buffers.

[22615.680857] Mem-Info:
[22615.681353] active_anon:0 inactive_anon:0 isolated_anon:0
 active_file:15 inactive_file:19 isolated_file:0
 unevictable:0 dirty:0 writeback:0 unstable:0
 slab_reclaimable:6140 slab_unreclaimable:20563
 mapped:1 shmem:0 pagetables:1575 bounce:0
 free:12775 free_pcp:15 free_cma:0
[22615.686600] Node 0 active_anon:0kB inactive_anon:0kB active_file:60kB inactive_file:76kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:4kB dirty:0kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? yes
[22615.690863] Node 0 DMA free:6816kB min:416kB low:520kB high:624kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15360kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[22615.695119] lowmem_reserve[]: 0 1605 1605 1605 1605
[22615.696004] Node 0 DMA32 free:44284kB min:44636kB low:55792kB high:66948kB active_anon:0kB inactive_anon:0kB active_file:60kB inactive_file:76kB unevictable:0kB writepending:0kB present:2080744kB managed:1678888kB mlocked:0kB kernel_stack:11920kB pagetables:6300kB bounce:0kB free_pcp:60kB local_pcp:0kB free_cma:0kB
[22615.700603] lowmem_reserve[]: 0 0 0 0 0
[22615.701307] Node 0 DMA: 0*4kB 0*8kB 0*16kB 1*32kB (U) 0*64kB 1*128kB (U) 2*256kB (U) 2*512kB (U) 1*1024kB (U) 2*2048kB (UM) 0*4096kB = 6816kB
[22615.703467] Node 0 DMA32: 1237*4kB (UME) 827*8kB (UME) 361*16kB (UME) 198*32kB (UME) 72*64kB (UME) 11*128kB (UM) 7*256kB (UM) 19*512kB (U) 3*1024kB (U) 0*2048kB 0*4096kB = 44284kB
[22615.706168] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[22615.707684] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[22615.709160] 34 total pagecache pages
[22615.709848] 0 pages in swap cache
[22615.710472] Swap cache stats: add 47896, delete 47895, find 250384/250507
[22615.711670] Free swap  = 2685692kB
[22615.712302] Total swap = 2860028kB
[22615.712949] 524184 pages RAM
[22615.713507] 0 pages HighMem/MovableOnly
[22615.714206] 100622 pages reserved
[22615.714834] 0 pages hwpoisoned

I'm not sure what the "100622 pages reserved" (about 400MB of the 2GB) is reserved for (crashdumps, perhaps?). It looks like this 400MB reserve is the difference between "present:2080744kB" and "managed:1678888kB" in the DMA32 zone. I don't think that is the specific problem here, since I also see about 100k pages reserved on my RHEL7.9 VM, but that is a fair chunk of memory to hold back.
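
As a quick sanity check on those numbers (pure arithmetic on the values quoted from the dump above, assuming 4kB pages):

PAGE_KB = 4  # x86_64 page size, in kB

# page counts copied from the Mem-Info dump above
print("slab_unreclaimable:", 20563 * PAGE_KB, "kB")   # 82252 kB, the ~82MB noted above
print("slab_reclaimable:  ",  6140 * PAGE_KB, "kB")   # 24560 kB
print("free:              ", 12775 * PAGE_KB, "kB")   # 51100 kB

# "pages reserved" vs. the present-managed gap, DMA + DMA32 zones
gap_kb = (15992 - 15360) + (2080744 - 1678888)
print("present - managed: ", gap_kb, "kB =", gap_kb // PAGE_KB, "pages")  # 402488 kB = 100622 pages

So the 100622 reserved pages are exactly the present-managed gap summed over both zones, and slab plus free accounts for only ~160MB of the ~1.7GB managed.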

What IS strange is that the DMA32 zone shows 2GB present and 1.6GB managed, but that memory doesn't show up as "used" in any of the accounted fields, and only about 45MB is free.

There are several systemd-related processes consuming a significant amount of memory (tuned, polkitd, automount), but I'm not sure whether they are the main offenders, what units this part of the dump uses (pages or KB), or whether it lists all running processes or only those with significant memory usage (see the conversion sketch after the dump):

[22615.839020] [ pid ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[22615.840493] [  544]     0   544    28120        1   274432     1298             0 systemd-journal
[22615.842023] [  579]     0   579    27233        1   225280      539         -1000 systemd-udevd
[22615.843530] [  742]    32   742    16785        0   167936      191             0 rpcbind
[22615.844941] [  750]     0   750    19242        0   147456      168         -1000 auditd
[22615.846372] [  794]     0   794    53697        0   434176      508             0 sssd
[22615.847757] [  800]    81   800    14140        0   163840      237          -900 dbus-daemon
[22615.849229] [  805]     0   805    97895        0   385024      687             0 NetworkManager
[22615.850754] [  813]   998   813   407041        0   331776     1290             0 polkitd
[22615.852173] [  815]     0   815    31224        0   139264      165             0 irqbalance
[22615.853634] [  827]   994   827    32228        0   147456      122             0 chronyd
[22615.855053] [  841]   993   841    58461        1   212992      219             0 rngd
[22615.856444] [  877]     0   877    55424        0   442368      752             0 sssd_be
[22615.857858] [  889]     0   889    56279        0   462848      469             0 sssd_nss
[22615.859299] [  901]     0   901   106578        0   430080     3803             0 tuned
[22615.860696] [  906]     0   906    23072        1   200704      232         -1000 sshd
[22615.862075] [  907]     0   907    34682        0   286720      368             0 cupsd
[22615.863478] [  910]   997   910    30251        0   147456      161             0 munged
[22615.864879] [  931]     0   931    14994        1   147456      148             0 xinetd
[22615.866271] [  953]     0   953    75597        0   303104      561             0 rsyslogd
[22615.867707] [  968]     0   968    25434        0   176128      176             0 gssproxy
[22615.869133] [  988]     0   988    24316        1   212992      300             0 systemd-logind
[22615.870653] [  998]    29   998    17931        0   176128     6297             0 rpc.statd
[22615.872093] [ 1133]     0  1133    30765        1   180224      251             0 master
[22615.873519] [ 1140]    89  1140    37597        0   237568      272             0 qmgr
[22615.874896] [ 1212]     0  1212   173534        0   335872      500             0 automount
[22615.876351] [ 1216]     0  1216    10994        0   114688       53             0 atd
[22615.877707] [ 1217]     0  1217     9232        1   110592      220             0 crond
[22615.879098] [ 1241]     0  1241     4050        0    69632       37             0 agetty
[22615.880508] [ 1242]     0  1242     3408        0    65536       32             0 agetty
[22615.881906] [1250901]    89 1250901    37584        0   237568      271             0 pickup
[22615.883448] Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled
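
For scale, total_vm and rss in the kernel's OOM task dump are counted in pages (pgtables_bytes, as the name says, is in bytes). A throwaway conversion of a few rows above, just to make the footprints comparable:

PAGE_KB = 4

# (pid, name, total_vm_pages, rss_pages) copied from rows in the dump above
rows = [
    (901,  "tuned",     106578, 3803),
    (813,  "polkitd",   407041, 1290),
    (1212, "automount", 173534,  500),
    (998,  "rpc.statd",  17931, 6297),
]

for pid, name, total_vm, rss in rows:
    print(f"{name:<10} pid {pid:<7} vsz {total_vm * PAGE_KB / 1024:7.1f} MB"
          f"  rss {rss * PAGE_KB / 1024:6.1f} MB")

The resident sets at the moment of the panic are tiny (tuned ~15MB, everything else under 25MB), which supports the point above that the missing memory is not attributable to these userspace processes.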

One possibility (just a guess) is that something in the RHEL8 systemd/cgroup configuration prevents memory pressure from userspace processes from releasing pages used for cached objects in the kernel, in a way that wasn't the case in RHEL7.x. That still doesn't explain where the memory is actually being used.

On my RHEL7 VM the sysrq-m dump shows that each zone has a number of pages used for different reasons, while the RHEL8 dump shows almost nothing:

Node 0 DMA32 free:1897928kB min:59336kB low:74168kB high:89004kB active_anon:87160kB inactive_anon:75148kB active_file:74556kB inactive_file:473568kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3653568kB managed:3407344kB mlocked:0kB dirty:25324kB writeback:0kB mapped:21492kB shmem:91932kB slab_reclaimable:30712kB slab_unreclaimable:66032kB kernel_stack:3712kB pagetables:4132kB unstable:0kB bounce:0kB free_pcp:948kB local_pcp:692kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Node 0 Normal free:217984kB min:7968kB low:9960kB high:11952kB active_anon:12984kB inactive_anon:11080kB active_file:13596kB inactive_file:59896kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:524288kB managed:457212kB mlocked:0kB dirty:1744kB writeback:0kB mapped:2524kB shmem:13212kB slab_reclaimable:9616kB slab_unreclaimable:12632kB kernel_stack:1152kB pagetables:424kB unstable:0kB bounce:0kB free_pcp:956kB local_pcp:700kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
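
For anyone trying to reproduce the comparison, the zone dump above is just the standard sysrq-m report, which can be captured on demand on either a RHEL7 or RHEL8 node (requires root and sysrq enabled); a minimal sketch:

import subprocess

# Trigger a sysrq-m "show memory" dump, then read it back from the kernel log.
with open("/proc/sysrq-trigger", "w") as f:
    f.write("m")

dmesg = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
print("\n".join(dmesg.splitlines()[-60:]))  # the per-zone Mem-Info is at the tail
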
Comment by Andreas Dilger [ 23/Feb/21 ]

What is strange is that some slabs seem to be missing from the dump. I don't see ldlm_locks or ldlm_resources, but I thought we were running with slub_nomerge these days? It still isn't clear where all the memory has gone...
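
A quick way to check which caches actually exist on a live OSS (rather than in the crash dump) is to read /proc/slabinfo directly; a rough sketch, assuming the standard slabinfo 2.1 format and root access:

# Summarize /proc/slabinfo: rough per-cache memory use (objects * objsize),
# plus a check for the ldlm caches that appear to be missing from the dump.
caches = []
with open("/proc/slabinfo") as f:
    for line in f:
        if line.startswith(("slabinfo", "# name")):
            continue  # version line and column header
        fields = line.split()
        name, num_objs, objsize = fields[0], int(fields[2]), int(fields[3])
        caches.append((num_objs * objsize / 1024, name))  # approx kB

for kb, name in sorted(caches, reverse=True)[:15]:
    print(f"{name:<30} {kb:10.0f} kB")

ldlm = [name for _, name in caches if name.startswith("ldlm")]
print("ldlm caches:", ldlm if ldlm else "none visible (possibly merged by SLUB)")

If the ldlm_* caches don't show up there either, that would suggest they are being merged despite slub_nomerge (or that the option isn't actually on the OSS kernel command line).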

Comment by Andreas Dilger [ 30/Mar/21 ]

+7 failures in the past week:
https://testing.whamcloud.com/test_sets/7015c24d-a8f5-4dfa-8032-6693a12fc05f
https://testing.whamcloud.com/test_sets/3ec6bac8-df96-4a4c-ae97-e93d1546194c
https://testing.whamcloud.com/test_sets/3624fe39-083b-4427-a4c0-f801afc13b54
https://testing.whamcloud.com/test_sets/72ecc79f-cc65-48b5-ba39-208fc6d0b9ac
https://testing.whamcloud.com/test_sets/02596af3-0144-4af4-888f-8ea89a444bbf
https://testing.whamcloud.com/test_sets/d8aa05e2-eb9a-4911-a770-bbe5f2d6c613
https://testing.whamcloud.com/test_sets/1a7783dc-9b13-407b-980d-2dce13877b19

Comment by Andreas Dilger [ 19/Apr/21 ]

+4 failures in the past week, all on ZFS.

Comment by Alex Zhuravlev [ 29/Apr/21 ]

https://testing.whamcloud.com/test_sets/26779a73-fb4b-44d4-b7f4-0892faf505af

Comment by Andreas Dilger [ 13/Jul/21 ]

+12 failures in the past week, all ZFS

Comment by Chris Horn [ 20/Jul/21 ]

+1 https://testing.whamcloud.com/test_sets/f54f45ff-11d2-4d46-9063-376613c4a738

Comment by Alex Zhuravlev [ 27/Jul/21 ]

https://testing.whamcloud.com/test_sets/642b2113-975a-4311-9388-1e58e808a577

Comment by Andreas Dilger [ 10/Aug/21 ]

The crashes have stopped since the OST VM RAM was increased from 2GB to 3GB (last autotest failure 2021-07-31), but I don't think the underlying problem is really fixed. That said, it is off our radar for now, until someone is more interested in fixing it.
