[LU-17384] OOMkiller invoked on lustre OSS nodes under IOR Created: 22/Dec/23  Updated: 09/Feb/24

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.14.0, Lustre 2.15.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Eric Carbonneau Assignee: Peter Jones
Resolution: Unresolved Votes: 0
Labels: llnl, topllnl
Environment:

Clients: Lustre 2.12
Servers: Lustre 2.14 and 2.15 (both tested and reproduced)
TOSS: 4.6-6 and 4.7-2 (both reproduced)


Issue Links:
Related
Epic/Theme: OSS
Severity: 4
Rank (Obsolete): 9223372036854775807

 Description   

During SWL testing for TOSS 4.6-6rc3 and also 4.7-2rc2, we found that an IOR run could trigger an OOM on an OSS node.

We were able to reproduce this issue using IOR under srun.

The following srun/ior command was used:

srun -N 70 -n 7840 /g/g0/carbonne/ior/src/ior -a MPIIO -i 5 -b 256MB -t 128MB -v -g -F -C -w -W -r -o /p/lflood/carbonne/oomtest/ior_1532/ior
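For readers unfamiliar with IOR, an annotated form of the reproducer follows. The option descriptions reflect IOR's standard option meanings and are not taken from the ticket itself; only the formatting differs from the command above.

# Same reproducer, reformatted, with the standard meanings of the IOR options:
#   -a MPIIO    use the MPI-IO backend
#   -i 5        run 5 iterations of the test
#   -b 256MB    block size written/read per task
#   -t 128MB    transfer size per I/O call
#   -v -g       verbose output; barriers between open/write/read/close phases
#   -F          file-per-process layout
#   -C          reorder tasks so each rank reads data written by another rank
#   -w -W -r    write, write-check, then read back
#   -o <path>   test file path
srun -N 70 -n 7840 /g/g0/carbonne/ior/src/ior -a MPIIO -i 5 -b 256MB -t 128MB \
  -v -g -F -C -w -W -r -o /p/lflood/carbonne/oomtest/ior_1532/ior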

Example at 2023-10-17 12:31:28 on garter5, see console log.

Mem-info from one oom-killer console log message set is:

 

Mem-Info:
active_anon:22868 inactive_anon:69168 isolated_anon:0
 active_file:357 inactive_file:770 isolated_file:250
 unevictable:10785 dirty:0 writeback:0
 slab_reclaimable:185039 slab_unreclaimable:2082954
 mapped:12536 shmem:46663 pagetables:2485 bounce:0
 free:134668 free_pcp:203 free_cma:0

Node 0 active_anon:75888kB inactive_anon:87304kB active_file:1840kB
 inactive_file:1464kB  unevictable:43080kB isolated(anon):0kB
 isolated(file):208kB mapped:19680kB dirty:0kB writeback:0kB
 shmem:127712kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 26624kB
 writeback_tmp:0kB kernel_stack:31416kB pagetables:3896kB
 all_unreclaimable? no

Node 0 DMA free:11264kB min:4kB low:16kB high:28kB active_anon:0kB
 inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB
 writepending:0kB present:15996kB managed:15360kB mlocked:0kB bounce:0kB
 free_pcp:0kB local_pcp:0kB free_cma:0kB
 lowmem_reserve[]: 0 1183 94839 94839 94839

Node 0 DMA32 free:375156kB min:556kB low:1764kB high:2972kB
 active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:4kB
 unevictable:0kB writepending:0kB present:1723228kB managed:1325704kB
 mlocked:0kB bounce:0kB free_pcp:260kB local_pcp:0kB free_cma:0kB
 lowmem_reserve[]: 0 0 93655 93655 93655

Node 0 Normal free:46072kB min:44044kB low:139944kB high:235844kB
 active_anon:75888kB inactive_anon:87304kB active_file:1860kB
 inactive_file:1584kB unevictable:43080kB writepending:0kB
 present:97517568kB managed:95912024kB mlocked:43080kB bounce:0kB
 free_pcp:372kB local_pcp:0kB free_cma:0kB lowmem_reserve[]: 0 0 0 0 0

Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB
 1*1024kB (U) 1*2048kB (M) 2*4096kB (M) = 11264kB

Node 0 DMA32: 3*4kB (M) 66*8kB (UM) 202*16kB (UM) 152*32kB (UM)
 168*64kB (UM) 85*128kB (UM) 24*256kB (UM) 20*512kB (UM) 11*1024kB (UM)
 7*2048kB (UM) 74*4096kB (# M) = 375356kB

Node 0 Normal: 151*4kB (MEH) 853*8kB (UMEH) 640*16kB (MEH) 412*32kB (MEH)
 132*64kB (ME) 33*128kB (UE) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 43524kB

Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB

Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB

Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB

Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB

53515 total pagecache pages
0 pages in swap cache
Swap cache stats: add 0, delete 0, find 0/0
Free swap  = 0kB
Total swap = 0kB
49980022 pages RAM
0 pages HighMem/MovableOnly
896433 pages reserved
0 pages hwpoisoned
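For scale, the slab counters in the Mem-Info block above are page counts; assuming 4 KiB pages (x86_64), a quick conversion puts the unreclaimable slab at roughly 8 GiB at the moment of the OOM:

# Convert the Mem-Info slab page counts above to MiB (assumes 4 KiB pages):
echo "slab_unreclaimable: $(( 2082954 * 4 / 1024 )) MiB"   # ~8136 MiB (~7.9 GiB)
echo "slab_reclaimable:   $(( 185039 * 4 / 1024 )) MiB"    # ~722 MiB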

=============================================================

local Jira ticket:  TOSS-6158



 Comments   
Comment by Peter Jones [ 22/Dec/23 ]

Eric

There have been some recent changes merged to master for the upcoming 2.16 release that we think could well help address this problem. Could you please retry your reproducer against a master client? If that does indeed resolve the issue, then we can look at what would need to be backported to b2_15 in order to get the same benefit there.

Regards

Peter

Comment by Olaf Faaland [ 22/Dec/23 ]

Thanks, Peter. In our case, when we reproduced this, the client was 2.12 and the server was 2.14 or 2.15.

Are you saying client patches might fix this? We're happy to test master clients, but I would think the server should be managing its memory usage without depending on the client to behave in a certain way.

Comment by Eric Carbonneau [ 22/Dec/23 ]

I forgot to mention that the issue occurs during read operations on the OSS. During the write operations, OSS memory usage was constant.

 

Comment by Andreas Dilger [ 23/Dec/23 ]

carbonneau, ofaaland,

It would be useful to include the actual stack traces from the OSS when the OOM is hit, not just the meminfo. Otherwise it is difficult to know what is actually allocating the memory. Sometimes the killed process is just an innocent bystander, but in many cases the actual offender is caught because it is the one allocating memory most frequently...

Comment by Andreas Dilger [ 23/Dec/23 ]

Originally I thought this was related to cgroups, which is a client-side issue, but I hadn't noticed the "OSS" in the summary.

The majority of memory usage looks to be in "slab_reclaimable:185039 slab_unreclaimable:2082954", or at least I can't see anything else reported in the meminfo dump. Are you able to capture /proc/slabinfo or slabtop output from the OSS while the IOR is running, and see what is using the majority of memory?
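A minimal way to capture that from the OSS during the IOR run could look like the sketch below; /proc/slabinfo and slabtop are the standard interfaces, and the log paths are only examples.

# Sample slab usage on the OSS every 10 seconds while the IOR runs
# (log paths are examples; stop with Ctrl-C once the run finishes or the OOM hits):
while true; do
    date '+%F %T' >> /tmp/oss-slabinfo.log
    cat /proc/slabinfo >> /tmp/oss-slabinfo.log
    slabtop --once --sort=c | head -20 >> /tmp/oss-slabtop.log
    sleep 10
done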

This might relate to the use of deferred fput on the server, which can accumulate over time if the server has been running for a long time. There were two recent patches related to this that landed on master, but these may only be relevant for osd-ldiskfs and not osd-zfs (which I assume is the case here):

Comment by Eric Carbonneau [ 09/Feb/24 ]

We've done more testing and gathered more information for your review:

To start, the ZFS and Lustre versions on which the OOM was reproduced:

ZFS version: 2.1.14_1llnl-1
Lustre version: lustre-2.15.4_1.llnl
Total RAM available: 187 GiB

FIRST RUN:

zfs_arc_max was left at the default (0).

I also booted the kernel with slab_nomerge to help pinpoint the culprit slab, if any.
No culprit slab was found; nothing was jumping to the top in slabtop.
We also checked arcstat to see what was going on in the ARC during testing.
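As a sanity check, these settings can be confirmed on the OSS through the usual kernel and ZFS interfaces; a small sketch (paths are the standard locations, not commands quoted from this test):

# Confirm slab_nomerge is on the kernel command line and check the ARC limit/size:
grep -o slab_nomerge /proc/cmdline
cat /sys/module/zfs/parameters/zfs_arc_max        # 0 means ZFS uses its default cap
awk '$1 ~ /^(size|c|c_max)$/ {print $1, $3}' /proc/spl/kstat/zfs/arcstats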

The command used was arcstat 1:

time read miss miss% dmis dm% pmis pm% mmis mm% size c avail
11:29:21 19K 6.8K 34 569 4 6.3K 99 0 0 93G 93G 25G
11:29:22 19K 6.9K 34 550 4 6.3K 99 0 0 93G 93G 25G
11:29:23 20K 7.0K 34 546 3 6.5K 99 0 0 93G 93G 25G
11:29:24 20K 7.2K 34 600 4 6.6K 99 0 0 93G 93G 25G
11:29:25 21K 7.5K 35 633 4 6.9K 99 0 0 94G 93G 24G
11:29:26 21K 7.6K 34 602 4 7.0K 99 0 0 93G 93G 25G
11:29:27 21K 7.8K 36 633 4 7.1K 99 0 0 94G 93G 25G
11:29:28 22K 8.0K 35 616 4 7.4K 99 0 0 94G 93G 24G
11:29:29 23K 9.0K 38 1.2K 7 7.8K 99 503 3 94G 93G 24G
11:29:30 22K 10K 45 2.7K 17 7.8K 99 2.0K 13 95G 93G 23G
11:29:31 23K 10K 46 2.7K 17 8.2K 100 2.0K 13 97G 93G 21G
11:29:32 24K 11K 46 2.8K 17 8.3K 100 2.1K 13 99G 93G 20G
11:29:33 24K 11K 46 2.8K 17 8.8K 100 2.1K 13 101G 93G 18G
11:29:34 24K 11K 46 2.7K 17 8.7K 100 2.0K 13 103G 93G 16G
11:29:35 26K 12K 47 2.9K 17 9.5K 100 2.1K 13 105G 93G 13G
11:29:36 26K 12K 47 2.8K 16 9.6K 100 2.0K 12 108G 93G 11G
11:29:37 26K 12K 47 2.8K 16 9.7K 100 2.0K 12 110G 93G 8.7G
11:29:38 27K 13K 47 2.8K 16 10K 100 2.0K 12 113G 93G 5.9G
11:29:39 27K 13K 47 2.9K 16 10K 100 2.0K 12 116G 93G 3.1G
11:29:40 27K 13K 48 2.7K 16 10K 100 1.9K 12 118G 93G 10M
11:29:41 41K 15K 37 5.7K 17 9.9K 100 4.9K 15 121G 93G -3.6G

At that point we were OOMed.
--------------------------------------------------------------------------------------------------------------------------

SECOND RUN:

For the second run, we set zfs_arc_max to 47 GiB.
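For reference, one common way to apply that cap is via the ZFS-on-Linux module parameter (a sketch, not necessarily the exact commands used here); 47 GiB is 50465865728 bytes:

# Cap the ARC at 47 GiB (50465865728 bytes) at runtime...
echo 50465865728 > /sys/module/zfs/parameters/zfs_arc_max
# ...or persistently across module reloads via modprobe configuration:
echo "options zfs zfs_arc_max=50465865728" > /etc/modprobe.d/zfs-arc.conf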

Continuing to monitor arcstat, we can see the ARC size go right through the 47 GiB limit:

arcstat 1:

time read miss miss% dmis dm% pmis pm% mmis mm% size c avail
11:57:19 19K 9.0K 47 3.1K 23 5.9K 100 2.5K 20 76G 47G 41G
11:57:20 19K 9.1K 47 3.1K 23 6.1K 100 2.5K 19 77G 47G 40G
11:57:21 20K 9.4K 47 3.0K 22 6.4K 100 2.4K 18 78G 47G 39G
11:57:22 20K 9.6K 47 3.1K 22 6.5K 100 2.5K 18 79G 47G 38G
11:57:23 21K 10.0K 47 3.2K 22 6.8K 100 2.5K 18 80G 47G 37G
11:57:24 21K 10.0K 47 3.1K 21 6.9K 100 2.4K 17 81G 47G 36G
11:57:25 21K 10K 47 3.2K 21 7.2K 100 2.5K 18 82G 47G 35G
11:57:26 22K 10K 47 3.2K 21 7.4K 100 2.5K 17 84G 47G 34G
11:57:27 22K 10K 47 3.0K 20 7.8K 100 2.4K 16 85G 47G 32G
11:57:28 23K 11K 47 3.2K 20 8.0K 100 2.5K 16 87G 47G 30G
11:57:29 23K 11K 47 3.1K 19 8.3K 100 2.4K 16 89G 47G 29G
11:57:30 24K 11K 48 3.2K 20 8.7K 100 2.5K 16 91G 47G 27G
11:57:31 24K 12K 48 3.2K 19 8.9K 100 2.5K 16 93G 47G 24G
11:57:32 25K 12K 48 3.1K 18 9.4K 100 2.4K 15 95G 47G 22G
11:57:33 26K 12K 48 3.2K 19 9.6K 100 2.4K 15 98G 47G 20G
11:57:34 26K 13K 49 3.2K 19 9.9K 100 2.5K 15 100G 47G 17G
11:57:35 27K 13K 49 3.3K 18 10K 100 2.5K 15 103G 47G 14G
11:57:36 28K 14K 49 3.4K 19 10K 100 2.6K 15 106G 47G 11G
11:57:37 28K 14K 49 3.3K 18 11K 100 2.5K 14 109G 47G 8.4G
11:57:38 29K 14K 50 3.3K 18 11K 100 2.6K 14 113G 47G 5.0G
11:57:39 30K 15K 50 3.3K 17 11K 100 2.5K 14 116G 47G 669M
11:57:40 44K 17K 39 6.1K 18 11K 100 5.3K 16 120G 47G -3.9G
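A quick way to flag this overshoot automatically from arcstat output (a rough sketch; it assumes the size and c columns carry the same unit suffix, G in the captures above, and the column order shown in the header above):

# Print any arcstat sample where the ARC size exceeds the target c
# (columns 11 and 12 are size and c, matching the header in the captures above):
arcstat 1 | awk 'NR > 1 { s = $11 + 0; c = $12 + 0; if (s > c) print $1, "ARC over target:", $11, ">", $12 }'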

------------------------------------------------------------------------------------------------------------

I will look into the ZFS side with our ZFS developers and update the ticket.

 

 
