[LU-4033] Failure on test suite parallel-scale-nfsv4 test_iorssf: MDS oom Created: 30/Sep/13  Updated: 21/May/14  Resolved: 13/Feb/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.0
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Maloo Assignee: Bob Glossman (Inactive)
Resolution: Duplicate Votes: 0
Labels: None
Environment:

server and client: lustre-master build # 1687
client is running SLES11 SP2


Issue Links:
Duplicate
duplicates LU-4357 page allocation failure. mode:0x40 ca... Resolved
Related
is related to LU-4053 client leaking objects/locks during IO Resolved
Severity: 3
Rank (Obsolete): 10835

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/960b8b64-2915-11e3-b598-52540035b04c.

The sub-test test_iorssf failed with the following error:

test failed to respond and timed out

MDS console

17:14:54:ptlrpcd_0: page allocation failure. order:1, mode:0x40
17:14:55:Pid: 2780, comm: ptlrpcd_0 Not tainted 2.6.32-358.18.1.el6_lustre.x86_64 #1
17:14:56:Call Trace:
17:14:57: [<ffffffff8112c257>] ? __alloc_pages_nodemask+0x757/0x8d0
17:14:58: [<ffffffff81166d92>] ? kmem_getpages+0x62/0x170
17:14:59: [<ffffffff811679aa>] ? fallback_alloc+0x1ba/0x270
17:14:59: [<ffffffff811673ff>] ? cache_grow+0x2cf/0x320
17:14:59: [<ffffffff81167729>] ? ____cache_alloc_node+0x99/0x160
17:14:59: [<ffffffffa0538ed7>] ? LNetMDAttach+0x157/0x5a0 [lnet]
17:14:59: [<ffffffff811684f9>] ? __kmalloc+0x189/0x220
17:14:59: [<ffffffffa0538ed7>] ? LNetMDAttach+0x157/0x5a0 [lnet]
17:15:00: [<ffffffffa0771b35>] ? ptlrpc_register_bulk+0x265/0x9d0 [ptlrpc]
17:15:00: [<ffffffffa0773a12>] ? ptl_send_rpc+0x232/0xc40 [ptlrpc]
17:15:00: [<ffffffff81281b74>] ? snprintf+0x34/0x40
17:15:01: [<ffffffffa0488761>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
17:15:01: [<ffffffffa07685f4>] ? ptlrpc_send_new_req+0x454/0x790 [ptlrpc]
17:15:02: [<ffffffffa076c368>] ? ptlrpc_check_set+0x888/0x1b40 [ptlrpc]
17:15:02: [<ffffffffa079801b>] ? ptlrpcd_check+0x53b/0x560 [ptlrpc]
17:15:03: [<ffffffffa079853b>] ? ptlrpcd+0x20b/0x370 [ptlrpc]
17:15:03: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
17:15:03: [<ffffffffa0798330>] ? ptlrpcd+0x0/0x370 [ptlrpc]
17:15:03: [<ffffffff81096a36>] ? kthread+0x96/0xa0
17:15:03: [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
17:15:04: [<ffffffff810969a0>] ? kthread+0x0/0xa0
17:15:04: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
17:15:06:Mem-Info:
17:15:06:Node 0 DMA per-cpu:
17:15:06:CPU    0: hi:    0, btch:   1 usd:   0
17:15:06:Node 0 DMA32 per-cpu:
17:15:06:CPU    0: hi:  186, btch:  31 usd:  42
17:15:06:active_anon:2345 inactive_anon:2732 isolated_anon:0
17:15:07: active_file:110430 inactive_file:238985 isolated_file:0
17:15:07: unevictable:0 dirty:3 writeback:0 unstable:0
17:15:07: free:14257 slab_reclaimable:7260 slab_unreclaimable:76976
17:15:07: mapped:2551 shmem:41 pagetables:794 bounce:0
17:15:08:Node 0 DMA free:8264kB min:332kB low:412kB high:496kB active_anon:0kB inactive_anon:0kB active_file:272kB inactive_file:5444kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15324kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:36kB slab_unreclaimable:1700kB kernel_stack:16kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
17:15:08:lowmem_reserve[]: 0 2003 2003 2003
17:15:09:Node 0 DMA32 free:48764kB min:44720kB low:55900kB high:67080kB active_anon:9380kB inactive_anon:10928kB active_file:441448kB inactive_file:950496kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2052064kB mlocked:0kB dirty:12kB writeback:0kB mapped:10204kB shmem:164kB slab_reclaimable:29004kB slab_unreclaimable:306204kB kernel_stack:1984kB pagetables:3176kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
17:15:09:lowmem_reserve[]: 0 0 0 0
17:15:10:Node 0 DMA: 58*4kB 104*8kB 102*16kB 42*32kB 6*64kB 2*128kB 2*256kB 2*512kB 0*1024kB 1*2048kB 0*4096kB = 8264kB
17:15:11:Node 0 DMA32: 10659*4kB 2*8kB 2*16kB 2*32kB 2*64kB 0*128kB 1*256kB 1*512kB 1*1024kB 0*2048kB 1*4096kB = 48764kB
17:15:11:269122 total pagecache pages
17:15:11:28 pages in swap cache
17:15:11:Swap cache stats: add 62, delete 34, find 18/22
17:15:11:Free swap  = 4128648kB
17:15:12:Total swap = 4128760kB
17:15:12:524284 pages RAM
17:15:12:43669 pages reserved
17:15:13:282260 pages shared
17:15:13:194054 pages non-shared
17:15:14:LNetError: 2780:0:(lib-lnet.h:457:lnet_md_alloc()) LNET: out of memory at /var/lib/jenkins/workspace/lustre-master/arch/x86_64/build_type/server/distro/el6/ib_stack/inkernel/BUILD/BUILD/lustre-2.4.93/lnet/include/lnet/lib-lnet.h:457 (tried to alloc '(md)' = 4208)
17:15:14:LNetError: 2780:0:(lib-lnet.h:457:lnet_md_alloc()) LNET: 55064047 total bytes allocated by lnet
17:15:15:LustreError: 2780:0:(niobuf.c:376:ptlrpc_register_bulk()) lustre-OST0002-osc-ffff88006f296400: LNetMDAttach failed x1447417177531472/0: rc = -12


 Comments   
Comment by Sarah Liu [ 01/Oct/13 ]

parallel-scale-nfsv3 test_connectathon failed with a similar error:

https://maloo.whamcloud.com/test_sets/d951fa2a-2915-11e3-b598-52540035b04c

mds dmesg:

ptlrpcd_1: page allocation failure. order:1, mode:0x40
Pid: 2773, comm: ptlrpcd_1 Not tainted 2.6.32-358.18.1.el6_lustre.x86_64 #1
Call Trace:
 [<ffffffff8112c257>] ? __alloc_pages_nodemask+0x757/0x8d0
 [<ffffffff81166d92>] ? kmem_getpages+0x62/0x170
 [<ffffffff811679aa>] ? fallback_alloc+0x1ba/0x270
 [<ffffffff811673ff>] ? cache_grow+0x2cf/0x320
 [<ffffffff81167729>] ? ____cache_alloc_node+0x99/0x160
 [<ffffffffa0538ed7>] ? LNetMDAttach+0x157/0x5a0 [lnet]
 [<ffffffff811684f9>] ? __kmalloc+0x189/0x220
 [<ffffffffa0538ed7>] ? LNetMDAttach+0x157/0x5a0 [lnet]
 [<ffffffffa0771b35>] ? ptlrpc_register_bulk+0x265/0x9d0 [ptlrpc]
 [<ffffffffa0773a12>] ? ptl_send_rpc+0x232/0xc40 [ptlrpc]
 [<ffffffff81281b74>] ? snprintf+0x34/0x40
 [<ffffffffa0488761>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [<ffffffffa07685f4>] ? ptlrpc_send_new_req+0x454/0x790 [ptlrpc]
 [<ffffffffa076c368>] ? ptlrpc_check_set+0x888/0x1b40 [ptlrpc]
 [<ffffffffa079801b>] ? ptlrpcd_check+0x53b/0x560 [ptlrpc]
 [<ffffffffa079853b>] ? ptlrpcd+0x20b/0x370 [ptlrpc]
 [<ffffffff81063410>] ? default_wake_function+0x0/0x20
 [<ffffffffa0798330>] ? ptlrpcd+0x0/0x370 [ptlrpc]
 [<ffffffff81096a36>] ? kthread+0x96/0xa0
 [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
 [<ffffffff810969a0>] ? kthread+0x0/0xa0
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Mem-Info:
Node 0 DMA per-cpu:
CPU    0: hi:    0, btch:   1 usd:   0
Node 0 DMA32 per-cpu:
CPU    0: hi:  186, btch:  31 usd: 100
active_anon:2425 inactive_anon:2637 isolated_anon:0
 active_file:18661 inactive_file:310952 isolated_file:0
 unevictable:0 dirty:24 writeback:0 unstable:0
 free:18497 slab_reclaimable:4305 slab_unreclaimable:95385
 mapped:2054 shmem:41 pagetables:790 bounce:0
Node 0 DMA free:8276kB min:332kB low:412kB high:496kB active_anon:0kB inactive_anon:0kB active_file:92kB inactive_file:5752kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15324kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:32kB slab_unreclaimable:1572kB kernel_stack:8kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 2003 2003 2003
Node 0 DMA32 free:65712kB min:44720kB low:55900kB high:67080kB active_anon:9700kB inactive_anon:10548kB active_file:74552kB inactive_file:1238056kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2052064kB mlocked:0kB dirty:96kB writeback:0kB mapped:8216kB shmem:164kB slab_reclaimable:17188kB slab_unreclaimable:379968kB kernel_stack:2040kB pagetables:3160kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 25*4kB 16*8kB 23*16kB 10*32kB 3*64kB 2*128kB 5*256kB 3*512kB 2*1024kB 1*2048kB 0*4096kB = 8276kB
Node 0 DMA32: 12818*4kB 329*8kB 116*16kB 87*32kB 20*64kB 0*128kB 1*256kB 1*512kB 1*1024kB 0*2048kB 1*4096kB = 65712kB
329671 total pagecache pages
0 pages in swap cache
Swap cache stats: add 0, delete 0, find 0/0
Free swap  = 4128760kB
Total swap = 4128760kB
524284 pages RAM
43669 pages reserved
339843 pages shared
128622 pages non-shared
LNetError: 2773:0:(lib-lnet.h:457:lnet_md_alloc()) LNET: out of memory at /var/lib/jenkins/workspace/lustre-master/arch/x86_64/build_type/server/distro/el6/ib_stack/inkernel/BUILD/BUILD/lustre-2.4.93/lnet/include/lnet/lib-lnet.h:457 (tried to alloc '(md)' = 4192)
Comment by Jodi Levi (Inactive) [ 01/Oct/13 ]

Bob,
Could you have a look at this one please?
Thank you!

Comment by Andreas Dilger [ 01/Oct/13 ]

We need to get some information from the MDS, like /proc/slabinfo and "lctl get_param memused", to determine where the memory is being used. This should be saved every minute or so, since we will not be able to collect this info once the MDS node runs out of memory.
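
For example, a minimal collection loop along these lines could be left running on the MDS (a sketch only; the /tmp/memlog path and the 60-second interval are arbitrary choices, not part of any test framework):

  mkdir -p /tmp/memlog
  while true; do
      ts=$(date +%Y%m%d-%H%M%S)
      # snapshot kernel slab usage and Lustre's own allocation counter
      cat /proc/slabinfo > /tmp/memlog/slabinfo.$ts
      lctl get_param memused > /tmp/memlog/memused.$ts 2>&1
      sleep 60
  done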

Comment by Bob Glossman (Inactive) [ 01/Oct/13 ]

A couple of items of info gleaned from discussion with Sarah:

1) This is a new failure. Only seen during testing on the 2.4.93 tag. Not seen in earlier tests.

2) Going to try to reproduce it with SLES11 SP3 clients to see if it happens there too.

Comment by Bob Glossman (Inactive) [ 02/Oct/13 ]

Seen only in full testing, not in review testing.
Seen only after several other test failures, including ones with client-side alloc failures. I suspect the other failed tests aren't cleaning up after themselves nicely.

Comment by Andreas Dilger [ 02/Oct/13 ]

Bob, this still concerns me if either the client or the MDS is getting OOM problems. There shouldn't be any client workload or failure condition that causes the MDS to run out of memory. The fact that it is only seen under full tests and not review testing just means that we don't have as much testing for each patch as we really need; it doesn't at all imply that this isn't a problem that will be hit under normal usage. We reformat and restart our filesystems so often under review testing that it doesn't really resemble a normal workload for the servers.

Comment by Bob Glossman (Inactive) [ 02/Oct/13 ]

Andreas, I agree with you. We shouldn't be seeing alloc failures on clients or servers under any reasonable kind of load, and we shouldn't see effects persisting on the MDS after client-side failures.

So far I haven't been able to reproduce this failure at small scale using only VMs. There are no free nodes available on Toro to test at a bigger scale at the moment, and there are no loadable profiles for SLES11 SP2 or SP3 on Rosso, so I can't test there.

Comment by Sarah Liu [ 02/Oct/13 ]

I ran iorssf twice with a SLES11 SP3 client and cannot reproduce the failure:
https://maloo.whamcloud.com/test_sets/5f03fc72-2ba0-11e3-a203-52540035b04c

Do I need to run the full test suite on SLES11 SP3?

Comment by Bob Glossman (Inactive) [ 02/Oct/13 ]

I hate to waste the time, but I think we do need to make a best effort to run the same test sequence that failed on SP2 to see if it reproduces on SP3. So yes, please run the full test suite.

The fact that the single test passed repeatedly on its own is a good indication that it by itself isn't the direct cause of the resource exhaustion on the MDS. I'm betting that if you ran this test alone on SP2 a few times it would pass too.

I'm worried that there may be a slow memory leak somewhere that we only hit in the case of a full test run and a particular sequence of failures in different places. I'm not sure how we would go about tracking down such a thing.

Comment by Bob Glossman (Inactive) [ 03/Oct/13 ]

I'm wondering if this bug may be related to the leak behavior Andreas is looking at in LU-4053.

Comment by Andreas Dilger [ 03/Oct/13 ]

Running dbench for 1h on a single-node test setup in LU-4053 produces a very similar memory usage profile. We can use LU-4053 to track the client-side memory usage, and this bug can focus on the MDS-side memory usage.

After deleting all of the files in the filesystem and unmounting the client (freeing all of the client-side locks and slabs), the MDS still has a huge number of objects allocated. During the test run there were never more than about 1000 files in existence, but a relatively large number of files were created and deleted.

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME                   
289460 132395  45%    0.19K  14473       20     57892K size-192
203463 179354  88%    0.10K   5499       37     21996K buffer_head
128894 127934  99%    0.11K   3791       34     15164K lod_obj
128880 127934  99%    0.08K   2685       48     10740K mdd_obj
128869 127934  99%    0.28K   9913       13     39652K mdt_obj
 71760  50205  69%    0.19K   3588       20     14352K dentry
  1491   1176  78%    1.03K    497        3      1988K ldiskfs_inode_cache

After leaving the system idle overnight, there are still a large number of objects, but some have been freed:

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME                   
221280  91056  41%    0.19K  11064       20     44256K size-192
203463 147927  72%    0.10K   5499       37     21996K buffer_head
 94944  86590  91%    0.08K   1978       48      7912K mdd_obj
 94418  86590  91%    0.11K   2777       34     11108K lod_obj
 93106  86590  93%    0.28K   7162       13     28648K mdt_obj
   987    955  96%    1.03K    329        3      1316K ldiskfs_inode_cache

It doesn't make sense to me that the MDT is caching objects that have been deleted. This is just causing useless memory pressure, even if they are eventually freed.

Comment by Sarah Liu [ 03/Oct/13 ]

I just pushed a test-only patch to run the full suite with a SLES11 SP3 client:
http://review.whamcloud.com/#/c/7688/1

Comment by Bob Glossman (Inactive) [ 04/Oct/13 ]

I'm wondering if there's a low-rate flaw in the reference counting of objects. It occurs to me that all it would take is one seldom-used code path with one too many xxx_object_get() calls, or one too few xxx_object_put() calls, and the reference count would never drop to 0. If the reference count is wrong, that could explain the objects never being freed. Over time the population of unreleased objects would grow.

Comment by Oleg Drokin [ 04/Oct/13 ]

I think it's still not related to any real object leakage; otherwise we would have noticed that by other means.
Most likely it's just like LU-4053, since we stopped proactively clearing inodes from the cache in 2.3.

Comment by Andreas Dilger [ 04/Oct/13 ]

This bug is intended to track the problem with MDS-side objects not being freed (mdd_obj, lod_obj, mdt_obj slabs). The LU-4053 ticket is to track the client-side CLIO objects not being freed.

I think there is just something wrong in the MDS stack, in that it is not destroying the whole lu_obj (or whatever?) when an object is unlinked, and this is only freed at unmount time or possibly very slowly under memory pressure. It doesn't make any sense to keep objects in memory for FIDs that have been deleted.

Comment by Alex Zhuravlev [ 04/Oct/13 ]

I did a simple test:

  1. grep mdt_obj /proc/slabinfo
     mdt_obj 28 28 280 14 1 : tunables 32 16 8 : slabdata 2 2 0 : globalstat 19938 5040 1413 23
  2. ./createmany -o /mnt/lustre/d0/f 10000
     total: 10000 creates in 2.97 seconds: 3366.59 creates/second
  3. grep mdt_obj /proc/slabinfo
     mdt_obj 10038 10038 280 14 1 : tunables 32 16 8 : slabdata 717 717 0 : globalstat 29948 10038 2128 23
  4. ./unlinkmany /mnt/lustre/d0/f 10000
     unlinked 0 (time 1380765784 ; total 0 ; last 0)
  5. grep mdt_obj /proc/slabinfo
     mdt_obj 206 280 280 14 1 : tunables 32 16 8 : slabdata 17 20 128 : globalstat 39958 10038 2843 32 0 4 0 0 0 : cpustat 37165 2866 37516 2487

then, a few dozen seconds later:

  6. grep mdt_obj /proc/slabinfo
     mdt_obj 28 28 280 14 1 : tunables 32 16 8 : slabdata 2 2 0 : globalstat 39958 10038 2843 41

Probably the MM layer just retains a lot of actually-free objects in per-cpu caches or something like that?
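
One way to check that assumption might be to force the shrinkers to run and see whether the objects actually go away (a sketch; drop_caches only releases objects that registered shrinkers can free):

  grep -E 'mdt_obj|lod_obj|mdd_obj' /proc/slabinfo
  echo 2 > /proc/sys/vm/drop_caches
  grep -E 'mdt_obj|lod_obj|mdd_obj' /proc/slabinfo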

Comment by Andreas Dilger [ 05/Oct/13 ]

If the VM was just retaining the slabs, then they would not be marked active, I think. Also, near the end of my 1h dbench run considerable memory pressure was starting to build on the other slabs, so these should have been shrunk at that time if they were just sitting in a per-cpu cache.

It may be that a workload different from createmany/unlinkmany is needed? For example, neither of these operations does a lookup, readdir, stat, or any number of other combinations. I don't think a 3600s dbench run is needed; I was just doing that to see if there is a long-term increase in memory use (which there is). Probably even a short run with full +malloc tracing would be enough.
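
Something along these lines should be enough to capture the allocation trace around a short run (a sketch; the debug buffer size and output file name are arbitrary):

  lctl set_param debug=+malloc     # add malloc tracing to the debug mask
  lctl set_param debug_mb=256      # enlarge the debug buffer so the trace isn't overwritten
  lctl clear                       # start from an empty buffer
  # ... run the short workload here ...
  lctl dk /tmp/lustre-malloc.dk    # dump the kernel debug log for analysis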

Comment by Alex Zhuravlev [ 05/Oct/13 ]

Sure, I'll try to reproduce it with dbench.

Comment by Alex Zhuravlev [ 06/Oct/13 ]

The root cause seems to be dirty pages remaining on the client, which cause ENQUEUEs for the layouts, which in turn populate the MDS cache with lu_objects for already-removed files.

Comment by Alex Zhuravlev [ 07/Oct/13 ]

To clarify a bit: given that OST_DESTROY isn't executed immediately (but only after commit on the MDS, at the earliest), there is a window when the MDT object is destroyed (in cache) but the OST objects aren't. If the kernel decides to flush dirty pages at that point, the client code will try to revalidate the layout (which was invalidated by the UNLINK). This is how we end up with lu_objects in memory. They are supposed to be purged at some point (given that no access can be made after OST_DESTROY). I'd think this isn't a big issue, but it would be nice if we could use the layout lock for this purpose as well.
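
If that is the mechanism, a rough way to try to exercise the window might be something like the following (heavily timing-dependent, so it may or may not reproduce; /mnt/lustre/d0 is just an example path):

  # on the MDS: baseline
  grep mdt_obj /proc/slabinfo
  # on the client: create files with dirty pages, then unlink before writeback completes
  for i in $(seq 1 1000); do
      dd if=/dev/zero of=/mnt/lustre/d0/w$i bs=1M count=1 2>/dev/null
      rm -f /mnt/lustre/d0/w$i
  done
  sync
  # on the MDS: compare
  grep mdt_obj /proc/slabinfo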

Comment by Alex Zhuravlev [ 18/Oct/13 ]

Andreas, have a look at the prototype: http://review.whamcloud.com/#/c/8003/ - the idea is to signal to the client that the file is removed, so the client can reset nlink and let the kernel drop the inode and the pages. It's not intended for landing yet, but I'd like to hear your opinion on the approach.

Comment by Sarah Liu [ 04/Dec/13 ]

Hit this bug in interop testing between a 2.6 server and a 2.5 server:

https://maloo.whamcloud.com/test_sets/f380a6a4-5beb-11e3-8bdd-52540035b04c
