Details
Description
During SWL for toss 4.6-6rc3 and also 4.7-2rc2, we found that an IOR run could trigger an OOM on an OSS node.
We were able to reproduce this issue using IOR under srun.
The following srun/ior command was used:
srun -N 70 -n 7840 /g/g0/carbonne/ior/src/ior -a MPIIO -i 5 -b 256MB -t 128MB -v -g -F -C -w -W -r -o /p/lflood/carbonne/oomtest/ior_1532/ior
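To put the scale of this reproducer in context, the load it generates can be estimated from the command-line values. The sketch below is only back-of-the-envelope arithmetic, not part of the original report. It assumes that IOR treats the `MB` suffix as 2**20 bytes and that the segment count is left at 1 (no `-s` is given), so each task writes one block to its own file (`-F`). The helper `parse_size` is a hypothetical name introduced for this example.

```python
import re

# Assumption: IOR size suffixes are binary (1 MB = 2**20 bytes).
UNITS = {"k": 2**10, "m": 2**20, "g": 2**30}


def parse_size(text):
    """Convert a size string such as '256MB' to a byte count."""
    match = re.fullmatch(r"(\d+)([kmg])b?", text.strip().lower())
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    return int(match.group(1)) * UNITS[match.group(2)]


# Values copied from the srun/ior command line.
nodes = 70                      # srun -N
tasks = 7840                    # srun -n
block = parse_size("256MB")     # ior -b
transfer = parse_size("128MB")  # ior -t
iterations = 5                  # ior -i

tasks_per_node = tasks // nodes
transfers_per_block = block // transfer
# Assumption: segment count is 1, so each task writes exactly one block.
aggregate_gib = tasks * block / 2**30

print(f"tasks per client node:        {tasks_per_node}")
print(f"transfers per block:          {transfers_per_block}")
print(f"data written per iteration:   {aggregate_gib:.0f} GiB")
print(f"data written over {iterations} iterations: {aggregate_gib * iterations:.0f} GiB")
```

Under these assumptions each iteration writes roughly 1960 GiB across 7840 files, and the `-W` and `-r` options read that data back as well.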
An example occurred at 2023-10-17 12:31:28 on garter5; see the console log.
The Mem-Info from one set of oom-killer console log messages is:
Mem-Info:
active_anon:22868 inactive_anon:69168 isolated_anon:0 active_file:357 inactive_file:770 isolated_file:250 unevictable:10785 dirty:0 writeback:0 slab_reclaimable:185039 slab_unreclaimable:2082954 mapped:12536 shmem:46663 pagetables:2485 bounce:0 free:134668 free_pcp:203 free_cma:0
Node 0 active_anon:75888kB inactive_anon:87304kB active_file:1840kB inactive_file:1464kB unevictable:43080kB isolated(anon):0kB isolated(file):208kB mapped:19680kB dirty:0kB writeback:0kB shmem:127712kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 26624kB writeback_tmp:0kB kernel_stack:31416kB pagetables:3896kB all_unreclaimable? no
Node 0 DMA free:11264kB min:4kB low:16kB high:28kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15996kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
lowmem_reserve[]: 0 1183 94839 94839 94839
Node 0 DMA32 free:375156kB min:556kB low:1764kB high:2972kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:4kB unevictable:0kB writepending:0kB present:1723228kB managed:1325704kB mlocked:0kB bounce:0kB free_pcp:260kB local_pcp:0kB free_cma:0kB
lowmem_reserve[]: 0 0 93655 93655 93655
Node 0 Normal free:46072kB min:44044kB low:139944kB high:235844kB active_anon:75888kB inactive_anon:87304kB active_file:1860kB inactive_file:1584kB unevictable:43080kB writepending:0kB present:97517568kB managed:95912024kB mlocked:43080kB bounce:0kB free_pcp:372kB local_pcp:0kB free_cma:0kB
lowmem_reserve[]: 0 0 0 0 0
Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 1*1024kB (U) 1*2048kB (M) 2*4096kB (M) = 11264kB
Node 0 DMA32: 3*4kB (M) 66*8kB (UM) 202*16kB (UM) 152*32kB (UM) 168*64kB (UM) 85*128kB (UM) 24*256kB (UM) 20*512kB (UM) 11*1024kB (UM) 7*2048kB (UM) 74*4096kB (# M) = 375356kB
Node 0 Normal: 151*4kB (MEH) 853*8kB (UMEH) 640*16kB (MEH) 412*32kB (MEH) 132*64kB (ME) 33*128kB (UE) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 43524kB
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
53515 total pagecache pages
0 pages in swap cache
Swap cache stats: add 0, delete 0, find 0/0
Free swap = 0kB
Total swap = 0kB
49980022 pages RAM
0 pages HighMem/MovableOnly
896433 pages reserved
0 pages hwpoisoned
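The global counters at the top of the Mem-Info dump are reported in pages, which makes them hard to compare with the per-zone kB figures. A minimal sketch for converting them follows. It assumes a 4 kB page size (consistent with the smallest buddy-list order shown in the dump) and uses a hand-copied subset of the counters; `PAGE_KB`, `SAMPLE`, and `parse_counters` are names introduced for this illustration only.

```python
import re

PAGE_KB = 4  # assumption: 4 kB pages, matching the "N*4kB" buddy-list entries

# Subset of the page counters copied from the Mem-Info dump.
SAMPLE = (
    "active_anon:22868 inactive_anon:69168 unevictable:10785 "
    "slab_reclaimable:185039 slab_unreclaimable:2082954 "
    "shmem:46663 free:134668"
)


def parse_counters(text):
    """Return a {name: pages} dict from 'name:value' pairs."""
    return {name: int(value) for name, value in re.findall(r"(\w+):(\d+)", text)}


counters = parse_counters(SAMPLE)

# List the counters from largest to smallest, converted to kB and GiB.
for name, pages in sorted(counters.items(), key=lambda item: item[1], reverse=True):
    kb = pages * PAGE_KB
    print(f"{name:20s} {pages:>10d} pages {kb:>10d} kB {kb / 2**20:8.2f} GiB")
```

Sorting by size makes it easier to see which counters dominate; in this sample the largest is `slab_unreclaimable`. The sketch only converts units and does not by itself identify where the rest of the node's memory went.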
=============================================================
local Jira ticket: TOSS-6158
Activity
Resolution | New: Cannot Reproduce [ 5 ] | |
Status | Original: Open [ 1 ] | New: Resolved [ 5 ] |
Labels | Original: llnl topllnl | New: llnl |
Link | Original: This issue is related to JFC-21 [ JFC-21 ] |
Description | Original and New both repeat the Description text above; New only wraps the srun/ior command and the Mem-Info dump in {noformat} blocks. Both give the local Jira ticket as [TOSS-6158|https://lc.llnl.gov/jira/browse/TOSS-6158] |
Priority | Original: Critical [ 2 ] | New: Major [ 3 ] |
Assignee | Original: WC Triage [ wc-triage ] | New: Peter Jones [ pjones ] |
Link | New: This issue is related to JFC-21 [ JFC-21 ] |