[LU-1261] oom-killer was invoked while running recovery-*-scale tests on VMs Created: 27/Mar/12  Updated: 29/May/17  Resolved: 29/May/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.2.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Jian Yu Assignee: WC Triage
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

Lustre Tag: v2_2_0_0_RC2
Lustre Build: http://build.whamcloud.com/job/lustre-b2_2/17/
Distro/Arch: RHEL5.7/x86_64(client), RHEL6.2/x86_64(server)
Network: TCP (1GigE)
ENABLE_QUOTA=yes
FAILURE_MODE=HARD

MGS/MDS Nodes: client-32vm5 (active), client-32vm6 (passive)
    1 combined MGS/MDT shared by the failover pair

OSS Nodes: client-32vm7 (active), client-32vm8 (active)
    OST1 (active on client-32vm7)
    OST2 (active on client-32vm8)
    OST3 (active on client-32vm7)
    OST4 (active on client-32vm8)
    OST5 (active on client-32vm7)
    OST6 (active on client-32vm8)

Client Nodes: client-32vm[1-4]
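
A minimal sketch of how the environment settings above are typically exported before invoking the recovery-*-scale suites; the installed path and the choice of suite script here are assumptions for illustration, not details recorded in this ticket:

# Assumed install location of the Lustre test scripts:
cd /usr/lib64/lustre/tests
export ENABLE_QUOTA=yes     # enable quota setup during the run
export FAILURE_MODE=HARD    # power-cycle the failed node rather than just stopping services
sh recovery-mds-scale.sh    # likewise recovery-random-scale.sh / recovery-double-scale.sh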


Severity: 3
Rank (Obsolete): 9770

 Description   

While running the recovery-*-scale tests on VMs with RHEL5.7/x86_64 clients and RHEL6.2/x86_64 servers, the OOM killer was repeatedly invoked on one of the client nodes:

init invoked oom-killer: gfp_mask=0x200d2, order=0, oomkilladj=0

Call Trace:
 [<ffffffff800c962a>] out_of_memory+0x8e/0x2f3
 [<ffffffff8000f625>] __alloc_pages+0x27f/0x308
 [<ffffffff80032903>] read_swap_cache_async+0x45/0xd8
 [<ffffffff800cf3e3>] swapin_readahead+0x60/0xd3
 [<ffffffff800092cb>] __handle_mm_fault+0xb62/0x1039
 [<ffffffff8008e430>] default_wake_function+0x0/0xe
 [<ffffffff8006720b>] do_page_fault+0x4cb/0x874
 [<ffffffff800a4931>] ktime_get_ts+0x1a/0x4e
 [<ffffffff800bfe9c>] delayacct_end+0x5d/0x86
 [<ffffffff8005dde9>] error_exit+0x0/0x84
 [<ffffffff80061e0e>] copy_user_generic_unrolled+0x86/0xac
 [<ffffffff800eb7f9>] core_sys_select+0x1f9/0x265
 [<ffffffff8002cc16>] mntput_no_expire+0x19/0x89
 [<ffffffff8001b007>] cp_new_stat+0xe5/0xfd
 [<ffffffff80016a40>] sys_select+0x153/0x17c
 [<ffffffff8005d116>] system_call+0x7e/0x83

Node 0 DMA per-cpu:
cpu 0 hot: high 0, batch 1 used:0
cpu 0 cold: high 0, batch 1 used:0
Node 0 DMA32 per-cpu:
cpu 0 hot: high 186, batch 31 used:48
cpu 0 cold: high 62, batch 15 used:61
Node 0 Normal per-cpu: empty
Node 0 HighMem per-cpu: empty
Free pages:        8656kB (0kB HighMem)
Active:6 inactive:486961 dirty:0 writeback:675 unstable:0 free:2164 slab:12585 mapped-file:1064 mapped-anon:596 pagetables:1241
Node 0 DMA free:3032kB min:24kB low:28kB high:36kB active:0kB inactive:0kB present:9736kB pages_scanned:0 all_unreclaimable? yes
lowmem_reserve[]: 0 2003 2003 2003
Node 0 DMA32 free:5624kB min:5712kB low:7140kB high:8568kB active:24kB inactive:1947844kB present:2052068kB pages_scanned:5192088 all_unreclaimable? yes
lowmem_reserve[]: 0 0 0 0
Node 0 Normal free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 0 HighMem free:0kB min:128kB low:128kB high:128kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 4*4kB 5*8kB 2*16kB 4*32kB 4*64kB 2*128kB 1*256kB 0*512kB 2*1024kB 0*2048kB 0*4096kB = 3032kB
Node 0 DMA32: 0*4kB 1*8kB 1*16kB 1*32kB 1*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 0*2048kB 1*4096kB = 5624kB
Node 0 Normal: empty
Node 0 HighMem: empty
486967 pagecache pages
Swap cache: add 9174, delete 8578, find 337/499, race 0+0
Free swap  = 4072784kB
Total swap = 4104596kB
Out of memory: Killed process 2072, UID 51, (sendmail).

Maloo report: https://maloo.whamcloud.com/test_sessions/b3c52910-77de-11e1-841d-5254004bbbd3

Another instance: https://maloo.whamcloud.com/test_sessions/1eaca93a-7800-11e1-841d-5254004bbbd3

syslogd invoked oom-killer: gfp_mask=0x201d2, order=0, oomkilladj=0

Call Trace:
 [<ffffffff800c962a>] out_of_memory+0x8e/0x2f3
 [<ffffffff8000f625>] __alloc_pages+0x27f/0x308
 [<ffffffff8001300a>] __do_page_cache_readahead+0x96/0x179
 [<ffffffff80013945>] filemap_nopage+0x14c/0x360
 [<ffffffff80008964>] __handle_mm_fault+0x1fb/0x1039
 [<ffffffff800a28fb>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8000ebd4>] find_get_pages_tag+0x34/0x89
 [<ffffffff8006720b>] do_page_fault+0x4cb/0x874
 [<ffffffff800f5a22>] sync_inode+0x24/0x33
 [<ffffffff8804c370>] :ext3:ext3_sync_file+0xcc/0xf8
 [<ffffffff8005dde9>] error_exit+0x0/0x84

Node 0 DMA per-cpu:
cpu 0 hot: high 0, batch 1 used:0
cpu 0 cold: high 0, batch 1 used:0
Node 0 DMA32 per-cpu:
cpu 0 hot: high 186, batch 31 used:55
cpu 0 cold: high 62, batch 15 used:47
Node 0 Normal per-cpu: empty
Node 0 HighMem per-cpu: empty
Free pages:        8624kB (0kB HighMem)
Active:6 inactive:486329 dirty:0 writeback:2652 unstable:0 free:2156 slab:13169 mapped-file:1064 mapped-anon:596 pagetables:1232
Node 0 DMA free:3032kB min:24kB low:28kB high:36kB active:0kB inactive:0kB present:9736kB pages_scanned:0 all_unreclaimable? yes
lowmem_reserve[]: 0 2003 2003 2003
Node 0 DMA32 free:5592kB min:5712kB low:7140kB high:8568kB active:24kB inactive:1945316kB present:2052068kB pages_scanned:4943222 all_unreclaimable? yes
lowmem_reserve[]: 0 0 0 0
Node 0 Normal free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 0 HighMem free:0kB min:128kB low:128kB high:128kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 4*4kB 5*8kB 2*16kB 4*32kB 4*64kB 2*128kB 1*256kB 0*512kB 2*1024kB 0*2048kB 0*4096kB = 3032kB
Node 0 DMA32: 0*4kB 1*8kB 1*16kB 0*32kB 1*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 0*2048kB 1*4096kB = 5592kB
Node 0 Normal: empty
Node 0 HighMem: empty
486367 pagecache pages
Swap cache: add 8131, delete 7535, find 108/139, race 0+0
Free swap  = 4072828kB
Total swap = 4104596kB
Out of memory: Killed process 2075, UID 51, (sendmail).

The total memory on each VM is about 2 GB. In both dumps nearly all of it is tied up in inactive page cache: roughly 487,000 inactive pages × 4 kB ≈ 1.9 GB, matching the DMA32 inactive figure (~1,947,000 kB of the 2,052,068 kB present), while DMA32 free memory (~5,600 kB) sits below the min watermark (5,712 kB) and the zone is reported as all_unreclaimable, so the kernel resorts to killing user processes.

Note that the same tests passed on the same VMs when both the clients and servers ran RHEL6.2/x86_64:
https://maloo.whamcloud.com/test_sessions/f4dd044e-7708-11e1-a169-5254004bbbd3



 Comments   
Comment by Jian Yu [ 29/Mar/12 ]

When the tests were run with async journal commit disabled on the OSSs, the issue above did not occur:

lctl set_param obdfilter.${FSNAME}-*.sync_journal=1
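
A minimal sketch of applying and verifying that setting on the OSS nodes (client-32vm7 and client-32vm8 in this configuration); the get_param check is an assumption about how one would confirm it, not something recorded in this ticket:

# On each OSS, disable async journal commit for all OSTs of the filesystem:
lctl set_param obdfilter.${FSNAME}-*.sync_journal=1
# Confirm the setting is in effect (expect "=1" for every OST on this node):
lctl get_param obdfilter.${FSNAME}-*.sync_journal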

Maloo reports:
https://maloo.whamcloud.com/test_sessions/d309cc0c-79a2-11e1-9d2a-5254004bbbd3
https://maloo.whamcloud.com/test_sessions/ce61e5e0-79a2-11e1-9d2a-5254004bbbd3
https://maloo.whamcloud.com/test_sessions/abd0317a-79a8-11e1-9d2a-5254004bbbd3

Comment by Andreas Dilger [ 29/May/17 ]

Close old ticket.
