Lustre / LU-4033

Failure on test suite parallel-scale-nfsv4 test_iorssf: MDS oom

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • None
    • Lustre 2.5.0
    • None
    • Environment: server and client: lustre-master build #1687;
      client is running SLES11 SP2
    • 3
    • 10835

    Description

      This issue was created by maloo for sarah <sarah@whamcloud.com>

      This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/960b8b64-2915-11e3-b598-52540035b04c.

      The sub-test test_iorssf failed with the following error:

      test failed to respond and timed out

      MDS console

      17:14:54:ptlrpcd_0: page allocation failure. order:1, mode:0x40
      17:14:55:Pid: 2780, comm: ptlrpcd_0 Not tainted 2.6.32-358.18.1.el6_lustre.x86_64 #1
      17:14:56:Call Trace:
      17:14:57: [<ffffffff8112c257>] ? __alloc_pages_nodemask+0x757/0x8d0
      17:14:58: [<ffffffff81166d92>] ? kmem_getpages+0x62/0x170
      17:14:59: [<ffffffff811679aa>] ? fallback_alloc+0x1ba/0x270
      17:14:59: [<ffffffff811673ff>] ? cache_grow+0x2cf/0x320
      17:14:59: [<ffffffff81167729>] ? ____cache_alloc_node+0x99/0x160
      17:14:59: [<ffffffffa0538ed7>] ? LNetMDAttach+0x157/0x5a0 [lnet]
      17:14:59: [<ffffffff811684f9>] ? __kmalloc+0x189/0x220
      17:14:59: [<ffffffffa0538ed7>] ? LNetMDAttach+0x157/0x5a0 [lnet]
      17:15:00: [<ffffffffa0771b35>] ? ptlrpc_register_bulk+0x265/0x9d0 [ptlrpc]
      17:15:00: [<ffffffffa0773a12>] ? ptl_send_rpc+0x232/0xc40 [ptlrpc]
      17:15:00: [<ffffffff81281b74>] ? snprintf+0x34/0x40
      17:15:01: [<ffffffffa0488761>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      17:15:01: [<ffffffffa07685f4>] ? ptlrpc_send_new_req+0x454/0x790 [ptlrpc]
      17:15:02: [<ffffffffa076c368>] ? ptlrpc_check_set+0x888/0x1b40 [ptlrpc]
      17:15:02: [<ffffffffa079801b>] ? ptlrpcd_check+0x53b/0x560 [ptlrpc]
      17:15:03: [<ffffffffa079853b>] ? ptlrpcd+0x20b/0x370 [ptlrpc]
      17:15:03: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
      17:15:03: [<ffffffffa0798330>] ? ptlrpcd+0x0/0x370 [ptlrpc]
      17:15:03: [<ffffffff81096a36>] ? kthread+0x96/0xa0
      17:15:03: [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
      17:15:04: [<ffffffff810969a0>] ? kthread+0x0/0xa0
      17:15:04: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      17:15:06:Mem-Info:
      17:15:06:Node 0 DMA per-cpu:
      17:15:06:CPU    0: hi:    0, btch:   1 usd:   0
      17:15:06:Node 0 DMA32 per-cpu:
      17:15:06:CPU    0: hi:  186, btch:  31 usd:  42
      17:15:06:active_anon:2345 inactive_anon:2732 isolated_anon:0
      17:15:07: active_file:110430 inactive_file:238985 isolated_file:0
      17:15:07: unevictable:0 dirty:3 writeback:0 unstable:0
      17:15:07: free:14257 slab_reclaimable:7260 slab_unreclaimable:76976
      17:15:07: mapped:2551 shmem:41 pagetables:794 bounce:0
      17:15:08:Node 0 DMA free:8264kB min:332kB low:412kB high:496kB active_anon:0kB inactive_anon:0kB active_file:272kB inactive_file:5444kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15324kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:36kB slab_unreclaimable:1700kB kernel_stack:16kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
      17:15:08:lowmem_reserve[]: 0 2003 2003 2003
      17:15:09:Node 0 DMA32 free:48764kB min:44720kB low:55900kB high:67080kB active_anon:9380kB inactive_anon:10928kB active_file:441448kB inactive_file:950496kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2052064kB mlocked:0kB dirty:12kB writeback:0kB mapped:10204kB shmem:164kB slab_reclaimable:29004kB slab_unreclaimable:306204kB kernel_stack:1984kB pagetables:3176kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
      17:15:09:lowmem_reserve[]: 0 0 0 0
      17:15:10:Node 0 DMA: 58*4kB 104*8kB 102*16kB 42*32kB 6*64kB 2*128kB 2*256kB 2*512kB 0*1024kB 1*2048kB 0*4096kB = 8264kB
      17:15:11:Node 0 DMA32: 10659*4kB 2*8kB 2*16kB 2*32kB 2*64kB 0*128kB 1*256kB 1*512kB 1*1024kB 0*2048kB 1*4096kB = 48764kB
      17:15:11:269122 total pagecache pages
      17:15:11:28 pages in swap cache
      17:15:11:Swap cache stats: add 62, delete 34, find 18/22
      17:15:11:Free swap  = 4128648kB
      17:15:12:Total swap = 4128760kB
      17:15:12:524284 pages RAM
      17:15:12:43669 pages reserved
      17:15:13:282260 pages shared
      17:15:13:194054 pages non-shared
      17:15:14:LNetError: 2780:0:(lib-lnet.h:457:lnet_md_alloc()) LNET: out of memory at /var/lib/jenkins/workspace/lustre-master/arch/x86_64/build_type/server/distro/el6/ib_stack/inkernel/BUILD/BUILD/lustre-2.4.93/lnet/include/lnet/lib-lnet.h:457 (tried to alloc '(md)' = 4208)
      17:15:14:LNetError: 2780:0:(lib-lnet.h:457:lnet_md_alloc()) LNET: 55064047 total bytes allocated by lnet
      17:15:15:LustreError: 2780:0:(niobuf.c:376:ptlrpc_register_bulk()) lustre-OST0002-osc-ffff88006f296400: LNetMDAttach failed x1447417177531472/0: rc = -12
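The buddy-allocator line for Node 0 DMA32 in the log above shows why an order:1 (8kB contiguous) allocation can fail even with ~48MB nominally free: almost all of the free memory sits in single 4kB pages. The snippet below is only an illustration of how to read that line; the numbers are taken directly from the console output.

```python
# Parse the "Node 0 DMA32" buddy line from the console log above.
# Each "N*SkB" entry means N free blocks of size S; the failed ptlrpcd_0
# allocation was order:1, i.e. it needed one contiguous 8kB block.
buddy_line = ("10659*4kB 2*8kB 2*16kB 2*32kB 2*64kB 0*128kB "
              "1*256kB 1*512kB 1*1024kB 0*2048kB 1*4096kB")

free_blocks = {}
for entry in buddy_line.split():
    count, size = entry.split("*")
    free_blocks[int(size.rstrip("kB"))] = int(count)

total_kb = sum(size * count for size, count in free_blocks.items())
order0_kb = free_blocks[4] * 4

print(f"total free: {total_kb}kB")   # matches the logged 48764kB
print(f"in order-0 (4kB) pages: {order0_kb}kB "
      f"({100 * order0_kb / total_kb:.0f}%)")
```

Roughly 87% of the free DMA32 memory is in order-0 pages, so under pressure there are very few higher-order blocks left for GFP-constrained allocations like the one in LNetMDAttach.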
      

      Attachments

        Issue Links

          Activity

            sarah Sarah Liu added a comment -

            hit this bug in interop testing between 2.6 server and 2.5 server

            https://maloo.whamcloud.com/test_sets/f380a6a4-5beb-11e3-8bdd-52540035b04c


            bzzz Alex Zhuravlev added a comment -

            Andreas, have a look at the proto: http://review.whamcloud.com/#/c/8003/ - the idea is to signal the client that the file has been removed, so the client can reset nlink and let the kernel drop the inode and the pages. It's not intended for landing yet, but I'd like to hear your opinion on the approach.

            bzzz Alex Zhuravlev added a comment -

            To clarify a bit: given that OST_DESTROY isn't executed immediately (but only after commit on the MDS, at least), there is a window when the MDT object has been destroyed (in cache) but the OST objects haven't. If the kernel decides to flush dirty pages at that point, the client code will try to revalidate the layout (which was invalidated by UNLINK). This way we get lu-objects in memory. They are supposed to be purged at some point (given that no access can be made after OST_DESTROY). I don't think this is a big issue, but it would be nice if we could use the layout lock for this purpose as well.
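            The window Alex describes can be sketched as a toy model (this is not Lustre code; the FID string and function names are placeholders purely for illustration):

```python
# Toy model of the race window: between the MDT unlink and the deferred
# OST_DESTROY, a dirty-page flush triggers a layout ENQUEUE, which
# re-creates an lu-object on the MDS for an already-removed file.
mds_lu_cache = set()          # lu-objects cached on the MDS (hypothetical)

def unlink(fid):
    mds_lu_cache.discard(fid)         # MDT object destroyed in cache ...
    return fid                        # ... but OST_DESTROY is deferred

def flush_dirty_pages(fid):
    # client revalidates the layout (invalidated by UNLINK) -> layout ENQUEUE
    mds_lu_cache.add(fid)             # MDS re-instantiates an lu-object

def ost_destroy(fid):
    mds_lu_cache.discard(fid)         # object finally purged after commit

fid = "0x200000401:0x1:0x0"           # placeholder FID
unlink(fid)
flush_dirty_pages(fid)                # happens inside the window
assert fid in mds_lu_cache            # lu-object pinned for a removed file
ost_destroy(fid)
assert fid not in mds_lu_cache        # purged once OST_DESTROY runs
```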

            bzzz Alex Zhuravlev added a comment -

            The root cause seems to be dirty pages remaining on the client, which cause ENQUEUEs for the layouts, which in turn populate the MDS cache with lu-objects for already-removed files.

            bzzz Alex Zhuravlev added a comment -

            Sure, I'll try to reproduce with dbench.

            adilger Andreas Dilger added a comment -

            If the VM were just retaining the slabs, then they would not be marked active, I think. Also, near the end of my 1h dbench run there was starting to be considerable memory pressure on the other slabs, so these should have been shrunk at that time if they were just in the per-cpu cache.

            It may be that a workload different from createmany/unlinkmany is needed? For example, neither of these operations does a lookup, readdir, stat, or any number of other combinations. I don't think a 3600s dbench run is needed; I was just doing that to see whether there is a long-term increase in memory use (which there is). Probably even a short run with full +malloc tracing would be enough.
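            A minimal sketch of the kind of mixed metadata workload suggested above: unlike createmany/unlinkmany, it interleaves create, stat, readdir, lookup, and unlink. The directory and file count here are placeholders (a temp dir, so the sketch runs anywhere); in practice DIR would point at a Lustre client mount to exercise the MDS lu-object cache.

```python
# Hedged sketch of a mixed metadata workload (placeholder paths/counts).
import os
import tempfile

DIR = tempfile.mkdtemp()      # placeholder; use a Lustre mount in practice
N = 100                       # placeholder file count

for i in range(N):
    path = os.path.join(DIR, f"f{i}")
    with open(path, "w") as f:        # create
        f.write("x")
    os.stat(path)                     # stat (getattr)

names = os.listdir(DIR)               # readdir
assert len(names) == N

for name in names:
    os.stat(os.path.join(DIR, name))  # lookup + stat again
    os.unlink(os.path.join(DIR, name))

assert os.listdir(DIR) == []
os.rmdir(DIR)
```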

            People

              bogl Bob Glossman (Inactive)
              maloo Maloo
              Votes: 0
              Watchers: 9
