[LU-12830] RHEL8.3 and ZFS: oom on OSS Created: 04/Oct/19 Updated: 22/Jun/22 Resolved: 22/Jun/22 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.14.0, Lustre 2.15.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | WC Triage |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | ZFS |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for jianyu <yujian@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/1e6f3bc6-e5ef-11e9-b62b-52540065bddc

test_bonnie failed with oom on OSS:

[16526.881544] Lustre: DEBUG MARKER: == sanity-benchmark test bonnie: bonnie++ ============================================================ 14:37:57 (1570027077)
[16528.099983] Lustre: DEBUG MARKER: /usr/sbin/lctl mark min OST has 10511360kB available, using 3438712kB file size
[16528.357585] Lustre: DEBUG MARKER: min OST has 10511360kB available, using 3438712kB file size
[16567.214746] irqbalance invoked oom-killer: gfp_mask=0x200da, order=0, oom_score_adj=0
[16567.215741] irqbalance cpuset=/ mems_allowed=0
[16567.216221] CPU: 1 PID: 1179 Comm: irqbalance Kdump: loaded Tainted: P OE ------------ 3.10.0-957.27.2.el7_lustre.x86_64 #1
[16567.217451] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[16567.218033] Call Trace:
[16567.218336] [<ffffffffad565147>] dump_stack+0x19/0x1b
[16567.218874] [<ffffffffad55fb6a>] dump_header+0x90/0x229
[16567.219422] [<ffffffffad572b1f>] ? notifier_call_chain+0x4f/0x70
[16567.220055] [<ffffffffacec91c8>] ? __blocking_notifier_call_chain+0x58/0x70
[16567.220779] [<ffffffffacfbbaae>] check_panic_on_oom+0x2e/0x60
[16567.221379] [<ffffffffacfbbecb>] out_of_memory+0x23b/0x4f0
[16567.221938] [<ffffffffad56066e>] __alloc_pages_slowpath+0x5d6/0x724
[16567.222585] [<ffffffffacfc2524>] __alloc_pages_nodemask+0x404/0x420
[16567.223225] [<ffffffffad0128c5>] alloc_pages_vma+0xb5/0x200
[16567.223840] [<ffffffffad000b15>] __read_swap_cache_async+0x115/0x190
[16567.224491] [<ffffffffad000bb6>] read_swap_cache_async+0x26/0x60
[16567.225104] [<ffffffffad000c9c>] swapin_readahead+0xac/0x110
[16567.225690] [<ffffffffacfead92>] handle_pte_fault+0x812/0xd10
[16567.226280] [<fffffffface2a621>] ? __switch_to+0x151/0x580
[16567.226858] [<ffffffffacfed3ad>] handle_mm_fault+0x39d/0x9b0
[16567.227444] [<ffffffffacec6efd>] ? hrtimer_start_range_ns+0x1ed/0x3c0
[16567.228100] [<ffffffffad572603>] __do_page_fault+0x203/0x4f0
[16567.228685] [<ffffffffad5729d6>] trace_do_page_fault+0x56/0x150
[16567.229287] [<ffffffffad571f62>] do_async_page_fault+0x22/0xf0
[16567.229890] [<ffffffffad56e798>] async_page_fault+0x28/0x30

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV |
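The call trace goes through check_panic_on_oom(), so the node panics outright instead of just killing the offending process. As a quick triage step (not part of the original report), a minimal sketch like the one below could be run on the OSS before the benchmark to confirm the OOM-panic policy and the starting memory headroom; it only reads standard /proc files, and the 10% "low headroom" threshold is an illustrative assumption, not a value used by autotest.

```python
#!/usr/bin/env python3
"""Sketch: report OOM-panic policy and memory headroom on an OSS node.

Assumes a Linux node with /proc mounted; the 10% threshold below is an
arbitrary illustrative value, not anything autotest actually uses.
"""

def read_meminfo():
    """Return /proc/meminfo as a dict of {field: kB}."""
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            info[key] = int(value.strip().split()[0])  # values are reported in kB
    return info

def main():
    with open("/proc/sys/vm/panic_on_oom") as f:
        panic_on_oom = int(f.read().strip())

    mem = read_meminfo()
    total = mem["MemTotal"]
    avail = mem.get("MemAvailable", mem["MemFree"])

    print(f"vm.panic_on_oom = {panic_on_oom}"
          f" ({'node will panic on OOM' if panic_on_oom else 'OOM killer only'})")
    print(f"MemTotal:     {total} kB")
    print(f"MemAvailable: {avail} kB ({100.0 * avail / total:.1f}%)")

    if avail < total * 0.10:  # illustrative threshold only
        print("WARNING: less than 10% of RAM available before the test starts")

if __name__ == "__main__":
    main()
```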
| Comments |
| Comment by Jian Yu [ 04/Oct/19 ] |
|
This failure is specific to Lustre test sessions with ZFS 0.8.1 and 0.8.2. It didn't occur in test sessions with ZFS 0.7.13. |
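Since the failures track the ZFS version, one thing worth checking (my suggestion, not something taken from the test logs) is how large the ARC is allowed to grow relative to the 2GB of VM RAM on these nodes. A hedged sketch, assuming the standard /proc/spl/kstat/zfs/arcstats kstat file exported by ZFS on Linux and its usual "size"/"c"/"c_max" field names:

```python
#!/usr/bin/env python3
"""Sketch: compare the ZFS ARC size and limit against total RAM on an OSS.

Assumes ZFS on Linux exposes /proc/spl/kstat/zfs/arcstats; the field
names used here are the conventional kstat names.
"""

ARCSTATS = "/proc/spl/kstat/zfs/arcstats"

def read_arcstats():
    """Return arcstats as {name: int}, skipping the two kstat header lines."""
    stats = {}
    with open(ARCSTATS) as f:
        for line in f.readlines()[2:]:
            parts = line.split()
            if len(parts) == 3:
                name, _type, data = parts
                stats[name] = int(data)
    return stats

def mem_total_bytes():
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemTotal:"):
                return int(line.split()[1]) * 1024
    raise RuntimeError("MemTotal not found")

if __name__ == "__main__":
    arc = read_arcstats()
    ram = mem_total_bytes()
    for field in ("size", "c", "c_max"):
        value = arc.get(field, 0)
        print(f"arc {field:5s}: {value / 2**20:8.1f} MiB "
              f"({100.0 * value / ram:5.1f}% of RAM)")
```

Comparing the output between 0.7.13 and 0.8.x sessions would show whether the default ARC sizing changed between the releases.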
| Comment by Jian Yu [ 04/Oct/19 ] |
|
obdfilter-survey test 1a also failed with oom on OSS: |
| Comment by Jian Yu [ 04/Oct/19 ] |
|
parallel-scale test compilebench hit the same issue: |
| Comment by Jian Yu [ 30/Jan/20 ] |
|
The oom failure also occurred with ZFS 0.8.3: |
| Comment by Alex Zhuravlev [ 10/Feb/20 ] |
|
According to testing in autotest (with https://review.whamcloud.com/#/c/37384/ ), all 0.8.* versions demonstrate this behavior. |
| Comment by Jian Yu [ 14/Feb/20 ] |
|
|
| Comment by Jian Yu [ 14/Feb/20 ] |
Hi Minh, the OOM issue affects many tests. Can we increase the memory size on the VM nodes? |
| Comment by Andreas Dilger [ 27/Jan/21 ] |
|
Looking at all of the recent failures, the OOM happens with ZFS only, and only on the OSS. The two RHEL8.1 ldiskfs failures Jian referenced above were hit on the client. Looking at these OOM cases, all of them are due to an OOM triggered by a userspace process (lctl in the case of obdfilter-survey, where it is driving the load):

[ 5089.324526] CPU: 0 PID: 1175 Comm: tuned
[20249.901128] CPU: 0 PID: 866 Comm: sssd_be
[31802.020686] CPU: 1 PID: 1074159 Comm: lctl
[129875.609998] CPU: 1 PID: 1501864 Comm: lctl
[77898.375312] CPU: 0 PID: 1194 Comm: tuned
[20897.423508] CPU: 0 PID: 1459 Comm: tuned
[37225.033993] CPU: 0 PID: 2 Comm: kthreadd
[21544.431269] CPU: 1 PID: 1177 Comm: tuned
[ 4838.509291] CPU: 0 PID: 804 Comm: NetworkManager
[114309.521078] CPU: 1 PID: 795 Comm: NetworkManager
[ 914.434369] CPU: 1 PID: 1178 Comm: tuned
[22615.640870] CPU: 1 PID: 889 Comm: sssd_nss

According to the meminfo dump at the time, it doesn't look like there is a lot of slab in use, only 20563 pages ~= 82MB, and not much in terms of page cache at all. All of the Lustre-internal allocations would be recorded as part of the slab cache, so it doesn't seem like this is a memory leak. Pagecache pages should probably appear in active_file: or inactive_file:, unless ZFS is doing something strange with the ARC buffers.

[22615.680857] Mem-Info:
[22615.681353] active_anon:0 inactive_anon:0 isolated_anon:0 active_file:15 inactive_file:19 isolated_file:0 unevictable:0 dirty:0 writeback:0 unstable:0 slab_reclaimable:6140 slab_unreclaimable:20563 mapped:1 shmem:0 pagetables:1575 bounce:0 free:12775 free_pcp:15 free_cma:0
[22615.686600] Node 0 active_anon:0kB inactive_anon:0kB active_file:60kB inactive_file:76kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:4kB dirty:0kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? yes
[22615.690863] Node 0 DMA free:6816kB min:416kB low:520kB high:624kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15360kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[22615.695119] lowmem_reserve[]: 0 1605 1605 1605 1605
[22615.696004] Node 0 DMA32 free:44284kB min:44636kB low:55792kB high:66948kB active_anon:0kB inactive_anon:0kB active_file:60kB inactive_file:76kB unevictable:0kB writepending:0kB present:2080744kB managed:1678888kB mlocked:0kB kernel_stack:11920kB pagetables:6300kB bounce:0kB free_pcp:60kB local_pcp:0kB free_cma:0kB
[22615.700603] lowmem_reserve[]: 0 0 0 0 0
[22615.701307] Node 0 DMA: 0*4kB 0*8kB 0*16kB 1*32kB (U) 0*64kB 1*128kB (U) 2*256kB (U) 2*512kB (U) 1*1024kB (U) 2*2048kB (UM) 0*4096kB = 6816kB
[22615.703467] Node 0 DMA32: 1237*4kB (UME) 827*8kB (UME) 361*16kB (UME) 198*32kB (UME) 72*64kB (UME) 11*128kB (UM) 7*256kB (UM) 19*512kB (U) 3*1024kB (U) 0*2048kB 0*4096kB = 44284kB
[22615.706168] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[22615.707684] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[22615.709160] 34 total pagecache pages
[22615.709848] 0 pages in swap cache
[22615.710472] Swap cache stats: add 47896, delete 47895, find 250384/250507
[22615.711670] Free swap = 2685692kB
[22615.712302] Total swap = 2860028kB
[22615.712949] 524184 pages RAM
[22615.713507] 0 pages HighMem/MovableOnly
[22615.714206] 100622 pages reserved
[22615.714834] 0 pages hwpoisoned

I'm not sure what the "100622 pages reserved" (about 400MB of the 2GB) is reserved for? Crashdumps? It looks like this 400MB reserve is the difference between "present:2080744kB" and "managed:1678888kB" in the DMA32 zone. I don't think that is the specific problem here, since I also see about 100k pages reserved on my RHEL7.9 VM, but that is a fair chunk of memory to hold back. What IS strange is that the DMA32 zone shows 2GB present and 1.6GB managed, but that memory doesn't show up as "used" in any of the fields, with only about 45MB of free space.

There are several systemd-related processes that are consuming a significant amount of memory (tuned, polkitd, automount), but I'm not sure whether they are the main offenders, nor what the exact units of this part of the dump are (pages or KB), or whether it lists all running processes or only those with significant memory usage:

[22615.839020] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[22615.840493] [ 544] 0 544 28120 1 274432 1298 0 systemd-journal
[22615.842023] [ 579] 0 579 27233 1 225280 539 -1000 systemd-udevd
[22615.843530] [ 742] 32 742 16785 0 167936 191 0 rpcbind
[22615.844941] [ 750] 0 750 19242 0 147456 168 -1000 auditd
[22615.846372] [ 794] 0 794 53697 0 434176 508 0 sssd
[22615.847757] [ 800] 81 800 14140 0 163840 237 -900 dbus-daemon
[22615.849229] [ 805] 0 805 97895 0 385024 687 0 NetworkManager
[22615.850754] [ 813] 998 813 407041 0 331776 1290 0 polkitd
[22615.852173] [ 815] 0 815 31224 0 139264 165 0 irqbalance
[22615.853634] [ 827] 994 827 32228 0 147456 122 0 chronyd
[22615.855053] [ 841] 993 841 58461 1 212992 219 0 rngd
[22615.856444] [ 877] 0 877 55424 0 442368 752 0 sssd_be
[22615.857858] [ 889] 0 889 56279 0 462848 469 0 sssd_nss
[22615.859299] [ 901] 0 901 106578 0 430080 3803 0 tuned
[22615.860696] [ 906] 0 906 23072 1 200704 232 -1000 sshd
[22615.862075] [ 907] 0 907 34682 0 286720 368 0 cupsd
[22615.863478] [ 910] 997 910 30251 0 147456 161 0 munged
[22615.864879] [ 931] 0 931 14994 1 147456 148 0 xinetd
[22615.866271] [ 953] 0 953 75597 0 303104 561 0 rsyslogd
[22615.867707] [ 968] 0 968 25434 0 176128 176 0 gssproxy
[22615.869133] [ 988] 0 988 24316 1 212992 300 0 systemd-logind
[22615.870653] [ 998] 29 998 17931 0 176128 6297 0 rpc.statd
[22615.872093] [ 1133] 0 1133 30765 1 180224 251 0 master
[22615.873519] [ 1140] 89 1140 37597 0 237568 272 0 qmgr
[22615.874896] [ 1212] 0 1212 173534 0 335872 500 0 automount
[22615.876351] [ 1216] 0 1216 10994 0 114688 53 0 atd
[22615.877707] [ 1217] 0 1217 9232 1 110592 220 0 crond
[22615.879098] [ 1241] 0 1241 4050 0 69632 37 0 agetty
[22615.880508] [ 1242] 0 1242 3408 0 65536 32 0 agetty
[22615.881906] [1250901] 89 1250901 37584 0 237568 271 0 pickup
[22615.883448] Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled

One possibility (just a guess) is that there is something in the RHEL8 systemd/cgroup configuration that prevents memory pressure from userspace processes from releasing pages used for cached objects in the kernel, which wasn't the case in RHEL7.x? That still doesn't explain where the memory is currently being used. On my RHEL7 VM the sysrq-m dump shows that each zone has a number of pages used for different reasons, while the RHEL8 dump shows almost nothing:

Node 0 DMA32 free:1897928kB min:59336kB low:74168kB high:89004kB active_anon:87160kB inactive_anon:75148kB active_file:74556kB inactive_file:473568kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3653568kB managed:3407344kB mlocked:0kB dirty:25324kB writeback:0kB mapped:21492kB shmem:91932kB slab_reclaimable:30712kB slab_unreclaimable:66032kB kernel_stack:3712kB pagetables:4132kB unstable:0kB bounce:0kB free_pcp:948kB local_pcp:692kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Node 0 Normal free:217984kB min:7968kB low:9960kB high:11952kB active_anon:12984kB inactive_anon:11080kB active_file:13596kB inactive_file:59896kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:524288kB managed:457212kB mlocked:0kB dirty:1744kB writeback:0kB mapped:2524kB shmem:13212kB slab_reclaimable:9616kB slab_unreclaimable:12632kB kernel_stack:1152kB pagetables:424kB unstable:0kB bounce:0kB free_pcp:956kB local_pcp:700kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no |
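To make the "memory doesn't show up as used" observation concrete, here is a back-of-the-envelope sketch that adds up the accounted fields of the DMA32 zone from the dump above. The numbers are copied from that dump (per-zone fields are in kB, the slab counters come from the global Mem-Info line and are in 4KB pages); treating the all-zero anon counters and a few small fields as negligible is my own simplification.

```python
#!/usr/bin/env python3
"""Sketch: rough accounting of the DMA32 zone from the OOM dump above.

Per-zone values are in kB; the slab counters are taken from the global
Mem-Info line, which reports 4KB pages.
"""

KB_PER_PAGE = 4

managed_kb = 1678888          # "managed:" in the DMA32 zone line

accounted_kb = {
    "free":               44284,
    "active_file":           60,
    "inactive_file":         76,
    "kernel_stack":       11920,
    "pagetables":          6300,
    "slab_reclaimable":    6140 * KB_PER_PAGE,   # global counter, in pages
    "slab_unreclaimable": 20563 * KB_PER_PAGE,   # global counter, in pages
}

total_accounted = sum(accounted_kb.values())
unaccounted = managed_kb - total_accounted

for name, kb in accounted_kb.items():
    print(f"{name:20s} {kb:>10d} kB")
print(f"{'accounted total':20s} {total_accounted:>10d} kB")
print(f"{'managed':20s} {managed_kb:>10d} kB")
print(f"{'unaccounted':20s} {unaccounted:>10d} kB (~{unaccounted / 2**20:.1f} GiB)")
```

The accounted fields add up to roughly 170MB, leaving about 1.4GB of the 1.6GB of managed DMA32 memory invisible to these counters, which lines up with the suspicion above that the ARC buffers are not being reported through the usual page-cache or slab statistics.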
| Comment by Andreas Dilger [ 23/Feb/21 ] |
|
What is strange is that some slabs seem to be missing from the dump? I don't see ldlm_locks or ldlm_resources, but I thought we were running with slub_nomerge these days? It still isn't clear where all the memory has gone... |
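A quick way to check whether the missing ldlm_locks/ldlm_resources caches were merged into another SLUB cache (rather than being genuinely absent) would be something like the sketch below. It only assumes the standard SLUB sysfs layout (/sys/kernel/slab, where a merged cache appears as a symlink to the combined cache and each cache exposes an "aliases" count) and the kernel command line in /proc/cmdline.

```python
#!/usr/bin/env python3
"""Sketch: check whether SLUB cache merging could hide the ldlm_* slabs.

Assumes the standard SLUB sysfs layout under /sys/kernel/slab.
"""
import os

SLAB_DIR = "/sys/kernel/slab"

def main():
    with open("/proc/cmdline") as f:
        cmdline = f.read()
    print("slub_nomerge on kernel command line:", "slub_nomerge" in cmdline)

    for name in ("ldlm_locks", "ldlm_resources"):
        path = os.path.join(SLAB_DIR, name)
        if not os.path.exists(path):
            print(f"{name}: no sysfs entry (module not loaded, or cache not created?)")
            continue
        if os.path.islink(path):
            # merged caches are symlinks to the combined cache directory
            print(f"{name}: merged into {os.path.basename(os.path.realpath(path))}")
        else:
            with open(os.path.join(path, "aliases")) as f:
                print(f"{name}: standalone cache, aliases={f.read().strip()}")

if __name__ == "__main__":
    main()
```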
| Comment by Andreas Dilger [ 30/Mar/21 ] |
|
+7 failures in the past week: |
| Comment by Andreas Dilger [ 19/Apr/21 ] |
|
+4 failures in the past week, all ZFS. |
| Comment by Alex Zhuravlev [ 29/Apr/21 ] |
|
https://testing.whamcloud.com/test_sets/26779a73-fb4b-44d4-b7f4-0892faf505af |
| Comment by Andreas Dilger [ 13/Jul/21 ] |
|
+12 failures in the past week, all ZFS |
| Comment by Chris Horn [ 20/Jul/21 ] |
|
+1 https://testing.whamcloud.com/test_sets/f54f45ff-11d2-4d46-9063-376613c4a738 |
| Comment by Alex Zhuravlev [ 27/Jul/21 ] |
|
https://testing.whamcloud.com/test_sets/642b2113-975a-4311-9388-1e58e808a577 |
| Comment by Andreas Dilger [ 10/Aug/21 ] |
|
Crashes have stopped since the OST VM RAM was increased from 2GB to 3GB (last autotest failure 2021-07-31), but I don't think the problem is really fixed. That said, it is off our radar for now, until someone is more interested in fixing it. |