Description
During SWL for TOSS 4.6-6rc3 and also 4.7-2rc2, we found that an IOR run could trigger an OOM on an OSS node.
We were able to reproduce this issue using IOR under srun.
The following srun/ior command was used:
srun -N 70 -n 7840 /g/g0/carbonne/ior/src/ior -a MPIIO -i 5 -b 256MB -t 128MB -v -g -F -C -w -W -r -o /p/lflood/carbonne/oomtest/ior_1532/ior
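For a sense of scale, the load implied by this command line can be worked out with a short Python sketch. It assumes that IOR treats "MB" as MiB and that the segment count is the IOR default of 1 (no -s option is given); the function and key names are illustrative only and are not part of IOR.

```python
# Rough I/O footprint implied by the srun/ior command above.
# Assumptions (not stated in the ticket): "MB" means MiB, and the
# segment count is 1 because no -s option was passed.

NODES = 70          # srun -N 70
TASKS = 7840        # srun -n 7840
BLOCK_MIB = 256     # ior -b 256MB
TRANSFER_MIB = 128  # ior -t 128MB
SEGMENTS = 1        # assumed default


def ior_footprint(nodes, tasks, block_mib, transfer_mib, segments=1):
    """Return basic per-iteration figures for a file-per-process (-F) run."""
    per_task_mib = block_mib * segments
    return {
        "tasks_per_node": tasks // nodes,
        "files": tasks,  # -F: one file per task
        "transfers_per_task": per_task_mib // transfer_mib,
        "total_gib": per_task_mib * tasks / 1024,
    }


if __name__ == "__main__":
    figures = ior_footprint(NODES, TASKS, BLOCK_MIB, TRANSFER_MIB, SEGMENTS)
    for name, value in figures.items():
        print(f"{name}: {value}")
```

Under those assumptions each of the 5 iterations (-i 5) writes this total once (-w) and reads it back for the write check (-W) and the read phase (-r).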
One example occurred at 2023-10-17 12:31:28 on garter5; see the console log.
The Mem-Info section from one set of oom-killer console log messages is:
Mem-Info:
active_anon:22868 inactive_anon:69168 isolated_anon:0
 active_file:357 inactive_file:770 isolated_file:250
 unevictable:10785 dirty:0 writeback:0
 slab_reclaimable:185039 slab_unreclaimable:2082954
 mapped:12536 shmem:46663 pagetables:2485 bounce:0
 free:134668 free_pcp:203 free_cma:0
Node 0 active_anon:75888kB inactive_anon:87304kB active_file:1840kB inactive_file:1464kB unevictable:43080kB isolated(anon):0kB isolated(file):208kB mapped:19680kB dirty:0kB writeback:0kB shmem:127712kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 26624kB writeback_tmp:0kB kernel_stack:31416kB pagetables:3896kB all_unreclaimable? no
Node 0 DMA free:11264kB min:4kB low:16kB high:28kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present: 15996kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
lowmem_reserve[]: 0 1183 94839 94839 94839
Node 0 DMA32 free:375156kB min:556kB low:1764kB high:2972kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:4kB unevictable:0kB writepending:0kB present:1723228kB managed:1325704kB mlocked:0kB bounce:0kB free_pcp:260kB local_pcp:0kB free_cma:0kB
lowmem_reserve[]: 0 0 93655 93655 93655
Node 0 Normal free:46072kB min:44044kB low:139944kB high:235844kB active_anon:75888kB inactive_anon:87304kB active_file:1860kB inactive_file:1584kB unevictable: 43080kB writepending:0kB present:97517568kB managed:95912024kB mlocked:43080kB bounce:0kB free_pcp:372kB local_pcp:0kB free_cma:0kB
lowmem_reserve[]: 0 0 0 0 0
Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 1*1024kB (U) 1*2048kB (M) 2*4096kB (M) = 11264kB
Node 0 DMA32: 3*4kB (M) 66*8kB (UM) 202*16kB (UM) 152*32kB (UM) 168*64kB (UM) 85*128kB (UM) 24*256kB (UM) 20*512kB (UM) 11*1024kB (UM) 7*2048kB (UM) 74*4096kB (# M) = 375356kB
Node 0 Normal: 151*4kB (MEH) 853*8kB (UMEH) 640*16kB (MEH) 412*32kB (MEH) 132*64kB (ME) 33*128kB (UE) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 43524kB
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
53515 total pagecache pages
0 pages in swap cache
Swap cache stats: add 0, delete 0, find 0/0
Free swap = 0kB
Total swap = 0kB
49980022 pages RAM
0 pages HighMem/MovableOnly
896433 pages reserved
0 pages hwpoisoned
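The counters at the top of the Mem-Info output (slab_unreclaimable, free, and so on) and the "pages RAM" total are page counts, not kB. The following Python sketch converts a few of them to GiB for easier comparison. It assumes a 4 KiB page size, which is typical for x86_64 but is not stated in the log; the SAMPLE string is copied from the output above, and the function names are illustrative only.

```python
import re

PAGE_KIB = 4  # assumption: 4 KiB pages (typical for x86_64, not stated in the log)

# Page counters copied from the Mem-Info summary above.
SAMPLE = (
    "active_anon:22868 inactive_anon:69168 isolated_anon:0 "
    "active_file:357 inactive_file:770 isolated_file:250 "
    "unevictable:10785 dirty:0 writeback:0 "
    "slab_reclaimable:185039 slab_unreclaimable:2082954 "
    "mapped:12536 shmem:46663 pagetables:2485 bounce:0 "
    "free:134668 free_pcp:203 free_cma:0"
)
TOTAL_RAM_PAGES = 49980022  # from the "pages RAM" line


def parse_page_counters(text):
    """Return {name: pages} for name:value pairs that carry no kB suffix."""
    return {name: int(val) for name, val in re.findall(r"\b([a-z_]+):(\d+)\b", text)}


def pages_to_gib(pages, page_kib=PAGE_KIB):
    """Convert a page count to GiB."""
    return pages * page_kib / (1024 * 1024)


if __name__ == "__main__":
    counters = parse_page_counters(SAMPLE)
    for name in ("slab_unreclaimable", "slab_reclaimable", "free"):
        print(f"{name}: {pages_to_gib(counters[name]):.2f} GiB")
    print(f"total RAM: {pages_to_gib(TOTAL_RAM_PAGES):.2f} GiB")
```

The trailing word boundary in the pattern keeps the parser from picking up the per-node values that end in kB if the whole log is passed in instead of SAMPLE.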
=============================================================
local Jira ticket: TOSS-6158