LU-4033

Failure on test suite parallel-scale-nfsv4 test_iorssf: MDS oom

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • None
    • Affects Version/s: Lustre 2.5.0
    • None
    • Environment: server and client: lustre-master build # 1687; client is running SLES11 SP2
    • Severity: 3
    • 10835

    Description

      This issue was created by maloo for sarah <sarah@whamcloud.com>

      This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/960b8b64-2915-11e3-b598-52540035b04c.

      The sub-test test_iorssf failed with the following error:

      test failed to respond and timed out

      MDS console

      17:14:54:ptlrpcd_0: page allocation failure. order:1, mode:0x40
      17:14:55:Pid: 2780, comm: ptlrpcd_0 Not tainted 2.6.32-358.18.1.el6_lustre.x86_64 #1
      17:14:56:Call Trace:
      17:14:57: [<ffffffff8112c257>] ? __alloc_pages_nodemask+0x757/0x8d0
      17:14:58: [<ffffffff81166d92>] ? kmem_getpages+0x62/0x170
      17:14:59: [<ffffffff811679aa>] ? fallback_alloc+0x1ba/0x270
      17:14:59: [<ffffffff811673ff>] ? cache_grow+0x2cf/0x320
      17:14:59: [<ffffffff81167729>] ? ____cache_alloc_node+0x99/0x160
      17:14:59: [<ffffffffa0538ed7>] ? LNetMDAttach+0x157/0x5a0 [lnet]
      17:14:59: [<ffffffff811684f9>] ? __kmalloc+0x189/0x220
      17:14:59: [<ffffffffa0538ed7>] ? LNetMDAttach+0x157/0x5a0 [lnet]
      17:15:00: [<ffffffffa0771b35>] ? ptlrpc_register_bulk+0x265/0x9d0 [ptlrpc]
      17:15:00: [<ffffffffa0773a12>] ? ptl_send_rpc+0x232/0xc40 [ptlrpc]
      17:15:00: [<ffffffff81281b74>] ? snprintf+0x34/0x40
      17:15:01: [<ffffffffa0488761>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      17:15:01: [<ffffffffa07685f4>] ? ptlrpc_send_new_req+0x454/0x790 [ptlrpc]
      17:15:02: [<ffffffffa076c368>] ? ptlrpc_check_set+0x888/0x1b40 [ptlrpc]
      17:15:02: [<ffffffffa079801b>] ? ptlrpcd_check+0x53b/0x560 [ptlrpc]
      17:15:03: [<ffffffffa079853b>] ? ptlrpcd+0x20b/0x370 [ptlrpc]
      17:15:03: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
      17:15:03: [<ffffffffa0798330>] ? ptlrpcd+0x0/0x370 [ptlrpc]
      17:15:03: [<ffffffff81096a36>] ? kthread+0x96/0xa0
      17:15:03: [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
      17:15:04: [<ffffffff810969a0>] ? kthread+0x0/0xa0
      17:15:04: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      17:15:06:Mem-Info:
      17:15:06:Node 0 DMA per-cpu:
      17:15:06:CPU    0: hi:    0, btch:   1 usd:   0
      17:15:06:Node 0 DMA32 per-cpu:
      17:15:06:CPU    0: hi:  186, btch:  31 usd:  42
      17:15:06:active_anon:2345 inactive_anon:2732 isolated_anon:0
      17:15:07: active_file:110430 inactive_file:238985 isolated_file:0
      17:15:07: unevictable:0 dirty:3 writeback:0 unstable:0
      17:15:07: free:14257 slab_reclaimable:7260 slab_unreclaimable:76976
      17:15:07: mapped:2551 shmem:41 pagetables:794 bounce:0
      17:15:08:Node 0 DMA free:8264kB min:332kB low:412kB high:496kB active_anon:0kB inactive_anon:0kB active_file:272kB inactive_file:5444kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15324kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:36kB slab_unreclaimable:1700kB kernel_stack:16kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
      17:15:08:lowmem_reserve[]: 0 2003 2003 2003
      17:15:09:Node 0 DMA32 free:48764kB min:44720kB low:55900kB high:67080kB active_anon:9380kB inactive_anon:10928kB active_file:441448kB inactive_file:950496kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2052064kB mlocked:0kB dirty:12kB writeback:0kB mapped:10204kB shmem:164kB slab_reclaimable:29004kB slab_unreclaimable:306204kB kernel_stack:1984kB pagetables:3176kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
      17:15:09:lowmem_reserve[]: 0 0 0 0
      17:15:10:Node 0 DMA: 58*4kB 104*8kB 102*16kB 42*32kB 6*64kB 2*128kB 2*256kB 2*512kB 0*1024kB 1*2048kB 0*4096kB = 8264kB
      17:15:11:Node 0 DMA32: 10659*4kB 2*8kB 2*16kB 2*32kB 2*64kB 0*128kB 1*256kB 1*512kB 1*1024kB 0*2048kB 1*4096kB = 48764kB
      17:15:11:269122 total pagecache pages
      17:15:11:28 pages in swap cache
      17:15:11:Swap cache stats: add 62, delete 34, find 18/22
      17:15:11:Free swap  = 4128648kB
      17:15:12:Total swap = 4128760kB
      17:15:12:524284 pages RAM
      17:15:12:43669 pages reserved
      17:15:13:282260 pages shared
      17:15:13:194054 pages non-shared
      17:15:14:LNetError: 2780:0:(lib-lnet.h:457:lnet_md_alloc()) LNET: out of memory at /var/lib/jenkins/workspace/lustre-master/arch/x86_64/build_type/server/distro/el6/ib_stack/inkernel/BUILD/BUILD/lustre-2.4.93/lnet/include/lnet/lib-lnet.h:457 (tried to alloc '(md)' = 4208)
      17:15:14:LNetError: 2780:0:(lib-lnet.h:457:lnet_md_alloc()) LNET: 55064047 total bytes allocated by lnet
      17:15:15:LustreError: 2780:0:(niobuf.c:376:ptlrpc_register_bulk()) lustre-OST0002-osc-ffff88006f296400: LNetMDAttach failed x1447417177531472/0: rc = -12
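
      A minimal sketch of capturing slab growth on the MDS while the suite runs, assuming root access on the MDS node (the interval and file path are illustrative):

        # snapshot free memory and the slab caches with the most objects every minute
        while true; do
            date
            grep -E 'MemFree|Slab|SReclaimable|SUnreclaim' /proc/meminfo
            sort -rn -k2 /proc/slabinfo | head -15
            sleep 60
        done > /tmp/mds-slab-history.log 2>&1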
      


          Activity


            bzzz Alex Zhuravlev added a comment -

            sure, I'll try to reproduce with dbench.

            adilger Andreas Dilger added a comment -

            If the VM was just retaining the slabs, then they would not be marked active, I think. Also, near the end of my 1h dbench run there started to be considerable memory pressure on the other slabs, so these should have been shrunk at that time if they were just in the per-CPU cache.

            It may be that a workload different from createmany/unlinkmany is needed. For example, neither of those operations does a lookup, readdir, or stat, or any number of other combinations. I don't think a 3600s dbench run is needed; I was just doing that to see whether there is a long-term increase in memory use (which there is). Probably even a short run with full +malloc tracing would be enough.
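
            A minimal sketch of such a short run with +malloc tracing, assuming the standard lctl debug controls and the leak_finder.pl helper shipped in lustre/tests (buffer size and file paths are illustrative):

              # on the MDS: enlarge the debug buffer and enable allocation tracing
              lctl set_param debug_mb=512
              lctl set_param debug=+malloc
              lctl clear

              # run a short workload, e.g. a few minutes of dbench or createmany/unlinkmany

              # dump the debug buffer and look for unmatched alloc/free pairs
              lctl dk /tmp/mds-debug.log
              perl lustre/tests/leak_finder.pl /tmp/mds-debug.log > /tmp/leaks.txt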


            bzzz Alex Zhuravlev added a comment -

            I did a simple test:

            1. grep mdt_obj /proc/slabinfo
              mdt_obj 28 28 280 14 1 : tunables 32 16 8 : slabdata 2 2 0 : globalstat 19938 5040 1413 23
            2. ./createmany -o /mnt/lustre/d0/f 10000
              total: 10000 creates in 2.97 seconds: 3366.59 creates/second
            3. grep mdt_obj /proc/slabinfo
              mdt_obj 10038 10038 280 14 1 : tunables 32 16 8 : slabdata 717 717 0 : globalstat 29948 10038 2128 23
            4. ./unlinkmany /mnt/lustre/d0/f 10000
              unlinked 0 (time 1380765784 ; total 0 ; last 0)
            5. grep mdt_obj /proc/slabinfo
              mdt_obj 206 280 280 14 1 : tunables 32 16 8 : slabdata 17 20 128 : globalstat 39958 10038 2843 32 0 4 0 0 0 : cpustat 37165 2866 37516 2487

            Then, a few dozen seconds later:

            1. grep mdt_obj /proc/slabinfo
              mdt_obj 28 28 280 14 1 : tunables 32 16 8 : slabdata 2 2 0 : globalstat 39958 10038 2843 41

            Probably the MM just retains lots of actually-free objects in per-CPU caches, or something like that?
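
            A minimal sketch of watching the MDS object slabs through a create/unlink cycle, assuming the createmany/unlinkmany helpers from lustre/tests and a mount at /mnt/lustre as in the test above (paths are illustrative):

              # report the MDS object slab counts
              check_slabs() { grep -E 'mdt_obj|mdd_obj|lod_obj' /proc/slabinfo; }

              check_slabs
              ./createmany -o /mnt/lustre/d0/f 10000
              check_slabs
              ./unlinkmany /mnt/lustre/d0/f 10000

              # watch whether the slabs shrink back on their own over the next couple of minutes
              for i in $(seq 1 12); do sleep 10; date; check_slabs; done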


            adilger Andreas Dilger added a comment -

            This bug is intended to track the problem of MDS-side objects not being freed (the mdd_obj, lod_obj, and mdt_obj slabs). The LU-4053 ticket tracks the client-side CLIO objects not being freed.

            I think there is just something wrong in the MDS stack, in that it is not destroying the whole lu_obj (or whatever?) when an object is unlinked; it is only freed at unmount time, or possibly very slowly under memory pressure. It doesn't make any sense to keep objects in memory for FIDs that have been deleted.

            green Oleg Drokin added a comment -

            I think it's still not related to any real object leakage; otherwise we would have noticed it by other means.
            Most likely it's just like LU-4053, since we stopped proactively clearing inodes from the cache in 2.3.


            bogl Bob Glossman (Inactive) added a comment -

            I'm wondering if there's a low-rate flaw in the reference counting of objects. It occurs to me that all it would take is one seldom-used code path with one too many xxx_object_get() calls, or one too few xxx_object_put() calls, and the reference count would never drop to 0. If the reference count is wrong, that could explain why objects are never freed. Over time the population of unreleased objects would grow.

            sarah Sarah Liu added a comment -

            I just pushed a test-only patch to run the full suite with a SLES11 SP3 client:
            http://review.whamcloud.com/#/c/7688/1


            adilger Andreas Dilger added a comment -

            Running dbench for 1h on a single-node test setup in LU-4053 produces a very similar memory usage profile. We can focus LU-4053 on the client-side memory usage, and this bug can focus on the MDS-side memory usage.

            After deleting all of the files in the filesystem and unmounting the client (freeing all of the client-side locks and slabs), the MDS still has a huge number of objects allocated. During the test run there were never more than about 1000 files in existence, but a relatively large number of files were created and deleted.

              OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME                   
            289460 132395  45%    0.19K  14473       20     57892K size-192
            203463 179354  88%    0.10K   5499       37     21996K buffer_head
            128894 127934  99%    0.11K   3791       34     15164K lod_obj
            128880 127934  99%    0.08K   2685       48     10740K mdd_obj
            128869 127934  99%    0.28K   9913       13     39652K mdt_obj
             71760  50205  69%    0.19K   3588       20     14352K dentry
              1491   1176  78%    1.03K    497        3      1988K ldiskfs_inode_cache
            

            After leaving the system idle overnight, there are still a large number of objects, but some have been freed:

              OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME                   
            221280  91056  41%    0.19K  11064       20     44256K size-192
            203463 147927  72%    0.10K   5499       37     21996K buffer_head
             94944  86590  91%    0.08K   1978       48      7912K mdd_obj
             94418  86590  91%    0.11K   2777       34     11108K lod_obj
             93106  86590  93%    0.28K   7162       13     28648K mdt_obj
               987    955  96%    1.03K    329        3      1316K ldiskfs_inode_cache
            

            It doesn't make sense to me that the MDT is caching objects that have been deleted. This is just causing useless memory pressure, even if they are eventually freed.
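
            A rough way to check whether these objects are merely cached or actually pinned, assuming root on the MDS (the slab names are the ones from the listings above):

              # ask the kernel to reclaim dentry, inode, and slab caches on the MDS
              sync
              echo 3 > /proc/sys/vm/drop_caches

              # if these counts stay high afterwards, the objects are pinned rather than just cached
              grep -E 'mdt_obj|mdd_obj|lod_obj|ldiskfs_inode_cache' /proc/slabinfo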


            bogl Bob Glossman (Inactive) added a comment -

            I'm wondering if this bug may be related to the leak behavior Andreas is looking at in LU-4053.

            bogl Bob Glossman (Inactive) added a comment - edited

            I hate to waste the time, but I think we do need to make a best effort to run the same test sequence that failed on SP2 to see if it reproduces on SP3. So yes, please run the full test suite.

            The fact that the single test alone passed repeatedly is a good indication that it by itself isn't the direct cause of resource exhaustion on the MDS. I'm betting that if you ran this test alone on SP2 a few times, it would pass too.

            I'm worried that there may be a slow memory leak somewhere that we hit only with a full test run and a particular sequence of failures in different places. I'm not sure how we would go about tracking down such a thing.
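
            For reference, a rough sketch of kicking off the full run from lustre/tests with the auster driver (the config name is illustrative and depends on the local cluster setup):

              cd lustre/tests
              # -f names the cluster config in cfg/, -v gives verbose output
              ./auster -v -f ncli parallel-scale-nfsv4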

            sarah Sarah Liu added a comment - edited

            I ran iorssf twice with a SLES11 SP3 client and cannot reproduce the failure:
            https://maloo.whamcloud.com/test_sets/5f03fc72-2ba0-11e3-a203-52540035b04c

            Do I need to run the full test suite on SLES11 SP3?


            People

              Assignee: bogl Bob Glossman (Inactive)
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 9
