[LU-4033] Failure on test suite parallel-scale-nfsv4 test_iorssf: MDS oom Created: 30/Sep/13 Updated: 21/May/14 Resolved: 13/Feb/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Maloo | Assignee: | Bob Glossman (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Environment: | server and client: lustre-master build # 1687 |
| Issue Links: |
| Severity: | 3 |
| Rank (Obsolete): | 10835 |
| Description |
|
This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/960b8b64-2915-11e3-b598-52540035b04c.

The sub-test test_iorssf failed with the following error:

MDS console:
17:14:54:ptlrpcd_0: page allocation failure. order:1, mode:0x40
17:14:55:Pid: 2780, comm: ptlrpcd_0 Not tainted 2.6.32-358.18.1.el6_lustre.x86_64 #1
17:14:56:Call Trace:
17:14:57: [<ffffffff8112c257>] ? __alloc_pages_nodemask+0x757/0x8d0
17:14:58: [<ffffffff81166d92>] ? kmem_getpages+0x62/0x170
17:14:59: [<ffffffff811679aa>] ? fallback_alloc+0x1ba/0x270
17:14:59: [<ffffffff811673ff>] ? cache_grow+0x2cf/0x320
17:14:59: [<ffffffff81167729>] ? ____cache_alloc_node+0x99/0x160
17:14:59: [<ffffffffa0538ed7>] ? LNetMDAttach+0x157/0x5a0 [lnet]
17:14:59: [<ffffffff811684f9>] ? __kmalloc+0x189/0x220
17:14:59: [<ffffffffa0538ed7>] ? LNetMDAttach+0x157/0x5a0 [lnet]
17:15:00: [<ffffffffa0771b35>] ? ptlrpc_register_bulk+0x265/0x9d0 [ptlrpc]
17:15:00: [<ffffffffa0773a12>] ? ptl_send_rpc+0x232/0xc40 [ptlrpc]
17:15:00: [<ffffffff81281b74>] ? snprintf+0x34/0x40
17:15:01: [<ffffffffa0488761>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
17:15:01: [<ffffffffa07685f4>] ? ptlrpc_send_new_req+0x454/0x790 [ptlrpc]
17:15:02: [<ffffffffa076c368>] ? ptlrpc_check_set+0x888/0x1b40 [ptlrpc]
17:15:02: [<ffffffffa079801b>] ? ptlrpcd_check+0x53b/0x560 [ptlrpc]
17:15:03: [<ffffffffa079853b>] ? ptlrpcd+0x20b/0x370 [ptlrpc]
17:15:03: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
17:15:03: [<ffffffffa0798330>] ? ptlrpcd+0x0/0x370 [ptlrpc]
17:15:03: [<ffffffff81096a36>] ? kthread+0x96/0xa0
17:15:03: [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
17:15:04: [<ffffffff810969a0>] ? kthread+0x0/0xa0
17:15:04: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
17:15:06:Mem-Info:
17:15:06:Node 0 DMA per-cpu:
17:15:06:CPU 0: hi: 0, btch: 1 usd: 0
17:15:06:Node 0 DMA32 per-cpu:
17:15:06:CPU 0: hi: 186, btch: 31 usd: 42
17:15:06:active_anon:2345 inactive_anon:2732 isolated_anon:0
17:15:07: active_file:110430 inactive_file:238985 isolated_file:0
17:15:07: unevictable:0 dirty:3 writeback:0 unstable:0
17:15:07: free:14257 slab_reclaimable:7260 slab_unreclaimable:76976
17:15:07: mapped:2551 shmem:41 pagetables:794 bounce:0
17:15:08:Node 0 DMA free:8264kB min:332kB low:412kB high:496kB active_anon:0kB inactive_anon:0kB active_file:272kB inactive_file:5444kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15324kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:36kB slab_unreclaimable:1700kB kernel_stack:16kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
17:15:08:lowmem_reserve[]: 0 2003 2003 2003
17:15:09:Node 0 DMA32 free:48764kB min:44720kB low:55900kB high:67080kB active_anon:9380kB inactive_anon:10928kB active_file:441448kB inactive_file:950496kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2052064kB mlocked:0kB dirty:12kB writeback:0kB mapped:10204kB shmem:164kB slab_reclaimable:29004kB slab_unreclaimable:306204kB kernel_stack:1984kB pagetables:3176kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
17:15:09:lowmem_reserve[]: 0 0 0 0
17:15:10:Node 0 DMA: 58*4kB 104*8kB 102*16kB 42*32kB 6*64kB 2*128kB 2*256kB 2*512kB 0*1024kB 1*2048kB 0*4096kB = 8264kB
17:15:11:Node 0 DMA32: 10659*4kB 2*8kB 2*16kB 2*32kB 2*64kB 0*128kB 1*256kB 1*512kB 1*1024kB 0*2048kB 1*4096kB = 48764kB
17:15:11:269122 total pagecache pages
17:15:11:28 pages in swap cache
17:15:11:Swap cache stats: add 62, delete 34, find 18/22
17:15:11:Free swap = 4128648kB
17:15:12:Total swap = 4128760kB
17:15:12:524284 pages RAM
17:15:12:43669 pages reserved
17:15:13:282260 pages shared
17:15:13:194054 pages non-shared
17:15:14:LNetError: 2780:0:(lib-lnet.h:457:lnet_md_alloc()) LNET: out of memory at /var/lib/jenkins/workspace/lustre-master/arch/x86_64/build_type/server/distro/el6/ib_stack/inkernel/BUILD/BUILD/lustre-2.4.93/lnet/include/lnet/lib-lnet.h:457 (tried to alloc '(md)' = 4208)
17:15:14:LNetError: 2780:0:(lib-lnet.h:457:lnet_md_alloc()) LNET: 55064047 total bytes allocated by lnet
17:15:15:LustreError: 2780:0:(niobuf.c:376:ptlrpc_register_bulk()) lustre-OST0002-osc-ffff88006f296400: LNetMDAttach failed x1447417177531472/0: rc = -12 |
| Comments |
| Comment by Sarah Liu [ 01/Oct/13 ] |
|
parallel-scale-nfsv3 test_connectathon failed with a similar error: https://maloo.whamcloud.com/test_sets/d951fa2a-2915-11e3-b598-52540035b04c

mds dmesg:
ptlrpcd_1: page allocation failure. order:1, mode:0x40
Pid: 2773, comm: ptlrpcd_1 Not tainted 2.6.32-358.18.1.el6_lustre.x86_64 #1
Call Trace:
[<ffffffff8112c257>] ? __alloc_pages_nodemask+0x757/0x8d0
[<ffffffff81166d92>] ? kmem_getpages+0x62/0x170
[<ffffffff811679aa>] ? fallback_alloc+0x1ba/0x270
[<ffffffff811673ff>] ? cache_grow+0x2cf/0x320
[<ffffffff81167729>] ? ____cache_alloc_node+0x99/0x160
[<ffffffffa0538ed7>] ? LNetMDAttach+0x157/0x5a0 [lnet]
[<ffffffff811684f9>] ? __kmalloc+0x189/0x220
[<ffffffffa0538ed7>] ? LNetMDAttach+0x157/0x5a0 [lnet]
[<ffffffffa0771b35>] ? ptlrpc_register_bulk+0x265/0x9d0 [ptlrpc]
[<ffffffffa0773a12>] ? ptl_send_rpc+0x232/0xc40 [ptlrpc]
[<ffffffff81281b74>] ? snprintf+0x34/0x40
[<ffffffffa0488761>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
[<ffffffffa07685f4>] ? ptlrpc_send_new_req+0x454/0x790 [ptlrpc]
[<ffffffffa076c368>] ? ptlrpc_check_set+0x888/0x1b40 [ptlrpc]
[<ffffffffa079801b>] ? ptlrpcd_check+0x53b/0x560 [ptlrpc]
[<ffffffffa079853b>] ? ptlrpcd+0x20b/0x370 [ptlrpc]
[<ffffffff81063410>] ? default_wake_function+0x0/0x20
[<ffffffffa0798330>] ? ptlrpcd+0x0/0x370 [ptlrpc]
[<ffffffff81096a36>] ? kthread+0x96/0xa0
[<ffffffff8100c0ca>] ? child_rip+0xa/0x20
[<ffffffff810969a0>] ? kthread+0x0/0xa0
[<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Mem-Info:
Node 0 DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
Node 0 DMA32 per-cpu:
CPU 0: hi: 186, btch: 31 usd: 100
active_anon:2425 inactive_anon:2637 isolated_anon:0
active_file:18661 inactive_file:310952 isolated_file:0
unevictable:0 dirty:24 writeback:0 unstable:0
free:18497 slab_reclaimable:4305 slab_unreclaimable:95385
mapped:2054 shmem:41 pagetables:790 bounce:0
Node 0 DMA free:8276kB min:332kB low:412kB high:496kB active_anon:0kB inactive_anon:0kB active_file:92kB inactive_file:5752kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15324kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:32kB slab_unreclaimable:1572kB kernel_stack:8kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 2003 2003 2003
Node 0 DMA32 free:65712kB min:44720kB low:55900kB high:67080kB active_anon:9700kB inactive_anon:10548kB active_file:74552kB inactive_file:1238056kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2052064kB mlocked:0kB dirty:96kB writeback:0kB mapped:8216kB shmem:164kB slab_reclaimable:17188kB slab_unreclaimable:379968kB kernel_stack:2040kB pagetables:3160kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 25*4kB 16*8kB 23*16kB 10*32kB 3*64kB 2*128kB 5*256kB 3*512kB 2*1024kB 1*2048kB 0*4096kB = 8276kB
Node 0 DMA32: 12818*4kB 329*8kB 116*16kB 87*32kB 20*64kB 0*128kB 1*256kB 1*512kB 1*1024kB 0*2048kB 1*4096kB = 65712kB
329671 total pagecache pages
0 pages in swap cache
Swap cache stats: add 0, delete 0, find 0/0
Free swap = 4128760kB
Total swap = 4128760kB
524284 pages RAM
43669 pages reserved
339843 pages shared
128622 pages non-shared
LNetError: 2773:0:(lib-lnet.h:457:lnet_md_alloc()) LNET: out of memory at /var/lib/jenkins/workspace/lustre-master/arch/x86_64/build_type/server/distro/el6/ib_stack/inkernel/BUILD/BUILD/lustre-2.4.93/lnet/include/lnet/lib-lnet.h:457 (tried to alloc '(md)' = 4192) |
| Comment by Jodi Levi (Inactive) [ 01/Oct/13 ] |
|
Bob, |
| Comment by Andreas Dilger [ 01/Oct/13 ] |
|
We need to get some information from the MDS, like /proc/slabinfo and "lctl get_param memused", to determine where the memory is being used. This should be saved every minute or so, since it will not be possible to collect this info once the MDS node runs out of memory. |
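A minimal collector along those lines, assuming it is run directly on the MDS node, that lctl is in the PATH, and that /tmp/mds-mem is an arbitrary output directory (a sketch only, not part of the test framework):

  #!/bin/bash
  # Snapshot MDS memory accounting every minute so the data survives an OOM.
  OUTDIR=/tmp/mds-mem            # arbitrary location; pick somewhere persistent
  mkdir -p "$OUTDIR"
  while true; do
      ts=$(date +%Y%m%d-%H%M%S)
      cat /proc/slabinfo  > "$OUTDIR/slabinfo.$ts"
      cat /proc/meminfo   > "$OUTDIR/meminfo.$ts"
      lctl get_param memused > "$OUTDIR/memused.$ts" 2>&1
      sleep 60
  done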
| Comment by Bob Glossman (Inactive) [ 01/Oct/13 ] |
|
A couple of items of info gleaned from discussion with Sarah:
1) This is a new failure. Only seen during testing on the 2.4.93 tag. Not seen in earlier tests.
2) Going to try to reproduce it with SLES11 SP3 clients to see if it happens there too. |
| Comment by Bob Glossman (Inactive) [ 02/Oct/13 ] |
|
Seen only on full tests, not review. |
| Comment by Andreas Dilger [ 02/Oct/13 ] |
|
Bob, this still concerns me, though, if either the client or the MDS is getting OOM problems. There shouldn't be a client workload or failure condition that causes the MDS to run out of memory. The fact that it is only seen under full tests and not review just means that we don't have as much testing for each patch as we really need; it doesn't at all imply that this isn't a problem that will be hit under normal usage. We reformat and restart our filesystems so often under review testing that it doesn't really resemble a normal workload for the servers. |
| Comment by Bob Glossman (Inactive) [ 02/Oct/13 ] |
|
Andreas, I agree with you. Shouldn't be seeing alloc failures in clients or servers with any reasonable kind of load. Shouldn't see effects persisting on MDS after client side failures. So far haven't been able to reproduce this failure on a small scale using only VMs. No free nodes available on Toro to test at bigger scale at the moment. No loadable profiles for SLES11 SP2 or SP3 on Rosso, so I can't test there. |
| Comment by Sarah Liu [ 02/Oct/13 ] |
|
I ran iorssf two times with SLES11 SP3 clients and cannot reproduce. Do I need to run the full test suite on SLES11 SP3? |
| Comment by Bob Glossman (Inactive) [ 02/Oct/13 ] |
|
I hate to waste the time, but I think we do need to make a best effort to run the same test sequence that failed on SP2 to see if it reproduces on SP3. So yes, please run the full test suite. The fact that the single test alone passed repeatedly is a good indication that it by itself isn't the direct cause of resource exhaustion on the MDS. I'm betting that if you just ran this test alone on SP2 a few times it would pass too. I'm worried that there may be a slow memory leak somewhere that we hit only in the case of a full test run and a particular sequence of failures in different places. Not sure how we go about tracking down such a thing. |
| Comment by Bob Glossman (Inactive) [ 03/Oct/13 ] |
|
I'm wondering if this bug may be related to the leak behavior Andreas is looking at in |
| Comment by Andreas Dilger [ 03/Oct/13 ] |
|
Running dbench for 1h on a single-node test setup in

After deleting all of the files in the filesystem and unmounting the client (freeing all of the client-side locks and slabs), the MDS still has a huge number of objects allocated. During the test run there were never more than about 1000 files in existence, but a relatively large number of files were created and deleted.

  OBJS   ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
 289460  132395  45%    0.19K  14473       20     57892K size-192
 203463  179354  88%    0.10K   5499       37     21996K buffer_head
 128894  127934  99%    0.11K   3791       34     15164K lod_obj
 128880  127934  99%    0.08K   2685       48     10740K mdd_obj
 128869  127934  99%    0.28K   9913       13     39652K mdt_obj
  71760   50205  69%    0.19K   3588       20     14352K dentry
   1491    1176  78%    1.03K    497        3      1988K ldiskfs_inode_cache

After leaving the system idle overnight, there are still a large number of objects, but some have been freed:

  OBJS   ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
 221280   91056  41%    0.19K  11064       20     44256K size-192
 203463  147927  72%    0.10K   5499       37     21996K buffer_head
  94944   86590  91%    0.08K   1978       48      7912K mdd_obj
  94418   86590  91%    0.11K   2777       34     11108K lod_obj
  93106   86590  93%    0.28K   7162       13     28648K mdt_obj
    987     955  96%    1.03K    329        3      1316K ldiskfs_inode_cache

It doesn't make sense to me that the MDT is caching objects that have been deleted. This is just causing useless memory pressure, even if they are eventually freed. |
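For reference, a quick way to watch just the caches listed above on the MDS while a test is running (a sketch; the slab names are taken from the output in this comment):

  # Print the slabinfo header plus the MDS object slabs every 30 seconds.
  watch -n 30 'head -2 /proc/slabinfo; grep -E "^(mdt_obj|mdd_obj|lod_obj|ldiskfs_inode_cache|dentry) " /proc/slabinfo'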
| Comment by Sarah Liu [ 03/Oct/13 ] |
|
I just pushed a test-only patch to run the full suite with SLES11 SP3 clients: |
| Comment by Bob Glossman (Inactive) [ 04/Oct/13 ] |
|
I'm wondering if there's a low-rate flaw in the reference counting of objects. It occurs to me that all it would take is one seldom-used code path with one too many xxx_object_get() calls, or one too few xxx_object_put() calls, and the reference count would never drop to 0. If the reference count is wrong, that could explain why objects are never freed. Over time the population of unreleased objects would grow. |
| Comment by Oleg Drokin [ 04/Oct/13 ] |
|
I think it's still not related to any real object leakage, otherwise we would have noticed that by other means. |
| Comment by Andreas Dilger [ 04/Oct/13 ] |
|
This bug is intended to track the problem with MDS-side objects not being freed (mdd_obj, lod_obj, mdt_obj slabs). I think there is just something wrong in the MDS stack, in that it is not destroying the whole lu_obj (or whatever?) when an object is unlinked, so this memory is only freed at unmount time or possibly very slowly under memory pressure. It doesn't make any sense to keep objects in memory for FIDs that have been deleted. |
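As a quick check of whether these objects are at least reclaimable, one could ask the kernel to shrink slab caches on the MDS and then re-check the counts (a sketch using standard kernel knobs, run as root on the MDS):

  # Ask the kernel to reclaim dentries, inodes, and other shrinker-backed slab objects.
  sync
  echo 2 > /proc/sys/vm/drop_caches
  grep -E "^(mdt_obj|mdd_obj|lod_obj) " /proc/slabinfo   # see whether the object counts actually drop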
| Comment by Alex Zhuravlev [ 04/Oct/13 ] |
|
I did a simple test:
then in a few dozen seconds:
Probably the MM just retains lots of actually-free pieces in per-CPU caches, or something like that? |
| Comment by Andreas Dilger [ 05/Oct/13 ] |
|
If the VM was just retaining the slabs, then they would not be marked active, I think. Also, near the end of my 1h dbench run there was starting to be considerable memory pressure on the other slabs, so these should have been shrunk at that time if they were just sitting in a per-CPU cache. It may be that a workload different from createmany/unlinkmany is needed? For example, neither of these operations does a lookup, readdir, or stat, or any number of other combinations. I don't think a 3600s dbench run is needed; I was just doing that to see if there is a long-term increase in memory use (which there is). Probably even a short run with full +malloc tracing would be enough. |
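One possible workload of that sort, sketched with plain shell tools rather than the Lustre test utilities (the mount point, directory name, and file count are arbitrary), mixes create, stat, readdir, and unlink so the lookup paths get exercised too:

  #!/bin/bash
  # Hypothetical mixed-metadata workload run on a Lustre client mount.
  DIR=/mnt/lustre/d_lu4033        # arbitrary test directory
  mkdir -p "$DIR"
  for i in $(seq 1 1000); do
      touch "$DIR/f$i"                 # create
      stat "$DIR/f$i" > /dev/null      # lookup + getattr
  done
  ls -l "$DIR" > /dev/null             # readdir plus stat of every entry
  rm -f "$DIR"/f*                      # unlink everything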
| Comment by Alex Zhuravlev [ 05/Oct/13 ] |
|
Sure, I'll try to reproduce with dbench. |
| Comment by Alex Zhuravlev [ 06/Oct/13 ] |
|
The root cause seems to be dirty pages remaining on the client, which cause ENQUEUEs for the layouts, which in turn populate the MDS cache with lu-objects for already-removed files. |
| Comment by Alex Zhuravlev [ 07/Oct/13 ] |
|
To clarify a bit: given that OST_DESTROY isn't executed immediately (but only after commit on the MDS, at least), there is a window in which the MDT object is destroyed (in cache) but the OST objects aren't. If the kernel decides to flush dirty pages at that point, the client code will try to revalidate the layout (which was invalidated by the UNLINK). This is how we get lu-objects in memory; they are supposed to be purged at some point (given that no access can be made after OST_DESTROY). I'd think this isn't a big issue, but it would be cool if we could use the layout lock for this purpose as well. |
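The window described here can be pictured with a minimal client-side sequence (a sketch only; the mount point and file name are arbitrary, and whether the race is actually hit depends on timing and on how the client handles the unlink, so treat it as an illustration rather than a reliable reproducer):

  # On a Lustre client: leave dirty pages behind, unlink, then force a flush.
  dd if=/dev/zero of=/mnt/lustre/f_lu4033 bs=1M count=4   # dirty pages stay cached on the client
  rm /mnt/lustre/f_lu4033                                 # MDT object destroyed (in cache); OST_DESTROY deferred until commit
  sync                                                    # a flush here may re-enqueue the (now invalid) layout,
                                                          # repopulating lu-objects on the MDS for the removed file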
| Comment by Alex Zhuravlev [ 18/Oct/13 ] |
|
Andreas, have a look at the proto: http://review.whamcloud.com/#/c/8003/ - the idea is to signal to the client that the file has been removed, so the client can reset nlink and let the kernel drop the inode and the pages. It's not intended for landing yet, but I'd like to hear your opinion on the approach. |
| Comment by Sarah Liu [ 04/Dec/13 ] |
|
Hit this bug in interop testing between a 2.6 server and a 2.5 server: https://maloo.whamcloud.com/test_sets/f380a6a4-5beb-11e3-8bdd-52540035b04c |