[LU-5809] sanity-benchmark test pios_fpp: OOM on zfs OSS Created: 27/Oct/14  Updated: 21/May/21  Resolved: 21/May/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Isaac Huang (Inactive) Assignee: Isaac Huang (Inactive)
Resolution: Won't Fix Votes: 0
Labels: RZ_LS, zfs

Attachments: File eagle-44vm1.log, File eagle-46vm1.log
Issue Links:
Related
Severity: 3
Rank (Obsolete): 16295

 Description   

I hit 3 identical OOM panics during tests on eagle this weekend; all of them happened on a zfs OSS during sanity-benchmark test pios_fpp:

Lustre: DEBUG MARKER: == sanity-benchmark test pios_fpp: pios file per process == 06:54:04 (1414418044)
Lustre: DEBUG MARKER: /usr/sbin/lctl mark \/usr\/bin\/pios  -t 1,8,40 -n 1024                          -c 1M -s 8M                             -o 16M -L fpp -p \/mnt\/lustre\/dpios_fpp.sanity-benchmark
Lustre: DEBUG MARKER: /usr/bin/pios -t 1,8,40 -n 1024 -c 1M -s 8M -o 16M -L fpp -p /mnt/lustre/dpios_fpp.sanity-benchmark
Lustre: lustre-OST0001: Slow creates, 128/256 objects created at a rate of 2/s
LNet: Service thread pid 3372 completed after 91.53s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
LNet: Skipped 15 previous similar messages
Lustre: DEBUG MARKER: /usr/sbin/lctl mark \/usr\/bin\/pios  -t 1,8,40 -n 1024                          -c 1M -s 8M                             -o 16M -L fpp --verify -p \/mnt\/lustre\/dpios_fpp.sanity-benchmark
Lustre: DEBUG MARKER: /usr/bin/pios -t 1,8,40 -n 1024 -c 1M -s 8M -o 16M -L fpp --verify -p /mnt/lustre/dpios_fpp.sanity-benchmark
spl_kmem_cache/ invoked oom-killer: gfp_mask=0x84d0, order=0, oom_adj=0, oom_score_adj=0
spl_kmem_cache/ cpuset=/ mems_allowed=0
Pid: 396, comm: spl_kmem_cache/ Tainted: P           ---------------    2.6.32-431.29.2.el6_lustre.g9835a2a.x86_64 #1
Call Trace:
 [<ffffffff810d07b1>] ? cpuset_print_task_mems_allowed+0x91/0xb0
 [<ffffffff81122b80>] ? dump_header+0x90/0x1b0
 [<ffffffff81122cee>] ? check_panic_on_oom+0x4e/0x80
 [<ffffffff811233db>] ? out_of_memory+0x1bb/0x3c0
 [<ffffffff8112fd5f>] ? __alloc_pages_nodemask+0x89f/0x8d0
 [<ffffffff81167cea>] ? alloc_pages_current+0xaa/0x110
 [<ffffffff8112d15e>] ? __get_free_pages+0xe/0x50
 [<ffffffff8104ec85>] ? pte_alloc_one_kernel+0x15/0x20
 [<ffffffff8114650b>] ? __pte_alloc_kernel+0x1b/0xc0
 [<ffffffff81157769>] ? vmap_page_range_noflush+0x309/0x370
 [<ffffffff81157802>] ? map_vm_area+0x32/0x50
 [<ffffffff81159270>] ? __vmalloc_area_node+0x100/0x190
 [<ffffffffa0115a09>] ? kv_alloc+0x59/0x60 [spl]
 [<ffffffff811590fd>] ? __vmalloc_node+0xad/0x120
 [<ffffffffa0115a09>] ? kv_alloc+0x59/0x60 [spl]
 [<ffffffff811594e2>] ? __vmalloc+0x22/0x30
 [<ffffffffa0115a09>] ? kv_alloc+0x59/0x60 [spl]
 [<ffffffffa0115a49>] ? spl_cache_grow_work+0x39/0x2d0 [spl]
 [<ffffffff81058bd3>] ? __wake_up+0x53/0x70
 [<ffffffffa01174a7>] ? taskq_thread+0x1e7/0x3f0 [spl]
 [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
 [<ffffffffa01172c0>] ? taskq_thread+0x0/0x3f0 [spl]
 [<ffffffff8109abf6>] ? kthread+0x96/0xa0
 [<ffffffff8100c20a>] ? child_rip+0xa/0x20
 [<ffffffff8109ab60>] ? kthread+0x0/0xa0
 [<ffffffff8100c200>] ? child_rip+0x0/0x20
Mem-Info:
Node 0 DMA per-cpu:
CPU    0: hi:    0, btch:   1 usd:   0
Node 0 DMA32 per-cpu:
CPU    0: hi:  186, btch:  31 usd:   0
active_anon:0 inactive_anon:0 isolated_anon:0
 active_file:11 inactive_file:0 isolated_file:0
 unevictable:0 dirty:0 writeback:0 unstable:0
 free:8559 slab_reclaimable:1482 slab_unreclaimable:12252
 mapped:1 shmem:0 pagetables:1242 bounce:0
Node 0 DMA free:8352kB min:332kB low:412kB high:496kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15348kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:20kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
lowmem_reserve[]: 0 2004 2004 2004
Node 0 DMA32 free:25884kB min:44720kB low:55900kB high:67080kB active_anon:0kB inactive_anon:0kB active_file:44kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2052308kB mlocked:0kB dirty:0kB writeback:0kB mapped:4kB shmem:0kB slab_reclaimable:5928kB slab_unreclaimable:48988kB kernel_stack:3416kB pagetables:4968kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:100 all_unreclaimable? yes
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 0*4kB 0*8kB 2*16kB 0*32kB 0*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 2*2048kB 1*4096kB = 8352kB
Node 0 DMA32: 719*4kB 340*8kB 184*16kB 84*32kB 33*64kB 16*128kB 5*256kB 2*512kB 2*1024kB 1*2048kB 1*4096kB = 25884kB
20 total pagecache pages
0 pages in swap cache
Swap cache stats: add 5121, delete 5121, find 16/25
Free swap  = 4108600kB
Total swap = 4128764kB
524284 pages RAM
43654 pages reserved
54 pages shared
465254 pages non-shared

In all cases, the panicked OSS had 1.8G of memory and was running build lustre-b2_5/96.



 Comments   
Comment by Isaac Huang (Inactive) [ 27/Oct/14 ]

The ARC was set to at most 900M by default, i.e. half of system memory. I couldn't get arcstats on the OSS due to the OOM, but I'm going to try lowering the ARC max size. The OOM was quite reproducible: 3 of my 4 test runs hit it (the only success was likely because I forgot to install pios on the client node, so the pios tests were skipped).
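A minimal sketch of the check and the tuning I have in mind, using the standard ZoL proc/sysfs paths (the 800M cap below is illustrative):

    # Inspect the current ARC size and cap (not obtainable here because of the OOM)
    grep -E '^(size|c_max)' /proc/spl/kstat/zfs/arcstats
    # Lower the ARC cap at runtime, e.g. to 800M (zfs_arc_max is in bytes)
    echo $((800*1024*1024)) > /sys/module/zfs/parameters/zfs_arc_max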

Comment by Isaac Huang (Inactive) [ 28/Oct/14 ]

OOM still reproducible with ARC max at 800M.

Comment by Isaac Huang (Inactive) [ 28/Oct/14 ]

It seems that prefetching caused the ARC to grow past its limit and eventually triggered the OOM. I've reported it to ZoL:
https://github.com/zfsonlinux/zfs/issues/2840

Once ZFS prefetching was disabled, sanity-benchmark pios_fpp completed successfully.
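For reference, a sketch of how prefetching can be turned off on the OSS via the stock ZoL tunable (the runtime and persistent forms below are the usual mechanisms; the exact method used for this run isn't recorded in the ticket):

    # Disable ZFS file-level prefetch at runtime
    echo 1 > /sys/module/zfs/parameters/zfs_prefetch_disable
    # Persist the setting across module reloads
    echo 'options zfs zfs_prefetch_disable=1' >> /etc/modprobe.d/zfs.conf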

Comment by Isaac Huang (Inactive) [ 29/Oct/14 ]

With ZFS prefetching disabled on the OSS, two more test runs (sanity-benchmark, performance-sanity, parallel-scale) completed with 0 errors.
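For anyone reproducing this, the failing subtest can be run on its own with the usual ONLY filter of the Lustre test framework (a sketch; the tests directory below is the lustre-tests RPM default and may differ on your setup):

    # Re-run just the pios_fpp subtest of sanity-benchmark
    cd /usr/lib64/lustre/tests
    ONLY=pios_fpp bash sanity-benchmark.sh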
