[LU-17384] OOMkiller invoked on lustre OSS nodes under IOR Created: 22/Dec/23 Updated: 09/Feb/24 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.14.0, Lustre 2.15.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Eric Carbonneau | Assignee: | Peter Jones |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | llnl, topllnl |
| Environment: | Clients: Lustre 2.12 |
| Issue Links: | |
| Epic/Theme: | OSS |
| Severity: | 4 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
During SWL for TOSS 4.6-6rc3 and also 4.7-2rc2, we found that an IOR run could trigger an OOM on an OSS node. We were able to reproduce this issue using IOR under srun. The following srun/ior command was used:

    srun -N 70 -n 7840 /g/g0/carbonne/ior/src/ior -a MPIIO -i 5 -b 256MB -t 128MB -v -g -F -C -w -W -r -o /p/lflood/carbonne/oomtest/ior_1532/ior

Example at 2023-10-17 12:31:28 on garter5; see the console log. The Mem-Info section from one oom-killer console log message set is:
Mem-Info:
 active_anon:22868 inactive_anon:69168 isolated_anon:0 active_file:357 inactive_file:770 isolated_file:250 unevictable:10785 dirty:0 writeback:0 slab_reclaimable:185039 slab_unreclaimable:2082954 mapped:12536 shmem:46663 pagetables:2485 bounce:0 free:134668 free_pcp:203 free_cma:0
Node 0 active_anon:75888kB inactive_anon:87304kB active_file:1840kB inactive_file:1464kB unevictable:43080kB isolated(anon):0kB isolated(file):208kB mapped:19680kB dirty:0kB writeback:0kB shmem:127712kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 26624kB writeback_tmp:0kB kernel_stack:31416kB pagetables:3896kB all_unreclaimable? no
Node 0 DMA free:11264kB min:4kB low:16kB high:28kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15996kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
lowmem_reserve[]: 0 1183 94839 94839 94839
Node 0 DMA32 free:375156kB min:556kB low:1764kB high:2972kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:4kB unevictable:0kB writepending:0kB present:1723228kB managed:1325704kB mlocked:0kB bounce:0kB free_pcp:260kB local_pcp:0kB free_cma:0kB
lowmem_reserve[]: 0 0 93655 93655 93655
Node 0 Normal free:46072kB min:44044kB low:139944kB high:235844kB active_anon:75888kB inactive_anon:87304kB active_file:1860kB inactive_file:1584kB unevictable:43080kB writepending:0kB present:97517568kB managed:95912024kB mlocked:43080kB bounce:0kB free_pcp:372kB local_pcp:0kB free_cma:0kB
lowmem_reserve[]: 0 0 0 0 0
Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 1*1024kB (U) 1*2048kB (M) 2*4096kB (M) = 11264kB
Node 0 DMA32: 3*4kB (M) 66*8kB (UM) 202*16kB (UM) 152*32kB (UM) 168*64kB (UM) 85*128kB (UM) 24*256kB (UM) 20*512kB (UM) 11*1024kB (UM) 7*2048kB (UM) 74*4096kB (M) = 375356kB
Node 0 Normal: 151*4kB (MEH) 853*8kB (UMEH) 640*16kB (MEH) 412*32kB (MEH) 132*64kB (ME) 33*128kB (UE) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 43524kB
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
53515 total pagecache pages
0 pages in swap cache
Swap cache stats: add 0, delete 0, find 0/0
Free swap = 0kB
Total swap = 0kB
49980022 pages RAM
0 pages HighMem/MovableOnly
896433 pages reserved
0 pages hwpoisoned
=============================================================
local Jira ticket: TOSS-6158 |
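For reference, a minimal sketch of how one might drive this reproducer while sampling memory on the OSS in parallel. The log paths, the 10-second interval, and running the sampler directly on the OSS are assumptions for illustration, not something taken from this report:

    # On the OSS under test: sample overall memory and slab counters every 10 s
    # so growth can be correlated with the IOR read phase later.
    while true; do
        date >> /tmp/oss-mem.log
        grep -E 'MemFree|^Slab|SReclaimable|SUnreclaim' /proc/meminfo >> /tmp/oss-mem.log
        sleep 10
    done &

    # On the cluster front end: the reproducer from the description.
    srun -N 70 -n 7840 /g/g0/carbonne/ior/src/ior -a MPIIO -i 5 -b 256MB -t 128MB \
        -v -g -F -C -w -W -r -o /p/lflood/carbonne/oomtest/ior_1532/ior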
| Comments |
| Comment by Peter Jones [ 22/Dec/23 ] |
|
Eric,

There have been some recent changes merged to master for the upcoming 2.16 release that we think could well help address this problem. Could you please retry your reproducer against a master client? If that does indeed resolve the issue, then we can look at what would need to be backported to b2_15 in order to get the same benefit there.

Regards
Peter |
| Comment by Olaf Faaland [ 22/Dec/23 ] |
|
Thanks, Peter. In our case, when we reproduced this, the client was 2.12 and the server was 2.14 or 2.15. Are you saying client patches might fix this? We're happy to test master clients, but I would think the server should be managing its memory usage without depending on the client to behave in a certain way. |
| Comment by Eric Carbonneau [ 22/Dec/23 ] |
|
I forgot to mention that the issue occurs during read operations on the OSS. During the write operations the OSS memory usage was constant. A sketch of how the two phases could be isolated with the same IOR binary is shown below.
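The flag split below is an assumption about how to separate the phases (write-only pass that keeps its files, then a read-only pass), not the exact commands used at the site:

    # Write pass only: create and keep the files (-k), no read phase.
    srun -N 70 -n 7840 /g/g0/carbonne/ior/src/ior -a MPIIO -i 5 -b 256MB -t 128MB \
        -v -g -F -k -w -o /p/lflood/carbonne/oomtest/ior_1532/ior

    # Read pass only against the existing files; this is the phase where OSS memory grows.
    srun -N 70 -n 7840 /g/g0/carbonne/ior/src/ior -a MPIIO -i 5 -b 256MB -t 128MB \
        -v -g -F -C -r -o /p/lflood/carbonne/oomtest/ior_1532/ior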
|
| Comment by Andreas Dilger [ 23/Dec/23 ] |
|
It would be useful to include the actual stack traces from the OSS when the OOM is hit, not just the meminfo. Otherwise it is difficult to know what is actually allocating the memory. Sometimes it is just an innocent bystander process, but in many cases the actual offender is caught because it is the one allocating memory the most frequently... |
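As a sketch of how such traces could be gathered on the OSS (not part of the original request; the sysctl values and file names are illustrative assumptions):

    # Ensure the OOM killer prints the per-task summary and sysrq task dumps are allowed.
    sysctl -w vm.oom_dump_tasks=1
    sysctl -w kernel.sysrq=1

    # While memory is climbing (before the OOM fires), dump all task stacks to the
    # kernel log so the allocating threads, e.g. ll_ost_io service threads, are visible.
    echo t > /proc/sysrq-trigger
    dmesg -T > /tmp/oss-stacks-$(date +%s).log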
| Comment by Andreas Dilger [ 23/Dec/23 ] |
|
Originally I thought this was related to cgroups, which is a client-side issue, but I didn't notice the "OSS" in the summary. The majority of memory usage looks to be in "slab_reclaimable:185039 slab_unreclaimable:2082954", or at least I can't see anything else reported in the meminfo dump. Are you able to capture /proc/slabinfo or slabtop output from the OSS while the IOR is running, and see what is using the majority of memory? This might relate to the use of deferred fput on the server, which can accumulate over time if the server has been running a long time. There were two recent patches related to this that landed on master, but these may only be relevant for osd-ldiskfs and not osd-zfs (which I assume is the case here).
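A minimal sketch of the kind of capture being asked for here; the 30-second interval and output paths are arbitrary choices:

    # Snapshot the largest slab caches on the OSS periodically during the IOR run.
    while true; do
        date >> /tmp/slabtop.log
        slabtop -o -s c | head -n 25 >> /tmp/slabtop.log    # -o: one-shot, -s c: sort by cache size
        cat /proc/slabinfo > /tmp/slabinfo.$(date +%H%M%S)  # raw counts for later diffing
        sleep 30
    done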
|
| Comment by Eric Carbonneau [ 09/Feb/24 ] |
|
We've done more testing and gathered more information for your review.

To start, the versions of ZFS and Lustre required to reproduce the OOM:
ZFS version: 2.1.14_1llnl-1

FIRST RUN: zfs_arc_max was left at its default (0). I also booted the kernel with slab_nomerge to pinpoint the culprit slab, if any. Command used: arcstat 1

    time    read    miss    miss%   dmis    dm%     pmis    pm%     mmis    mm%     size    c       avail

At that point we were OOMed.

SECOND RUN: for the second run we set zfs_arc_max to 47 GiB. Monitoring arcstat, we can see the ARC size going right through the 47 GiB limit. arcstat 1:

    time    read    miss    miss%   dmis    dm%     pmis    pm%     mmis    mm%     size    c       avail
    ------------------------------------------------------------------------------------------------------------

I will look into zfs with our zfs developers and update the ticket.
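For reference, a sketch of how the cap used in the second run might be applied and watched. The byte value is simply 47 GiB converted (47 * 1024^3); the module-parameter path is the standard OpenZFS runtime interface, not something quoted from this run:

    # Cap the ARC at 47 GiB on the OSS; writing 0 restores the default sizing.
    echo 50465865728 > /sys/module/zfs/parameters/zfs_arc_max

    # Watch ARC size ("size") against its target ("c") once per second, as above.
    arcstat 1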
|