
Interop 2.4.0<->2.5 failure on test suite parallel-scale-nfsv4 test_iorssf: MDS OOM

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major
    • Affects Version/s: Lustre 2.5.0
    • Environment: client: 2.4.0
      server: lustre-master build # 1652
    • Severity: 3
    • 10306

    Description

      This issue was created by maloo for sarah <sarah@whamcloud.com>

      This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/f9f7ec36-15c3-11e3-a83f-52540035b04c.

      The sub-test test_iorssf failed with the following error:

      test failed to respond and timed out

      MDS console:

      12:37:48:Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test iorssf: iorssf == 12:37:39 (1378323459)
      12:37:48:Lustre: DEBUG MARKER: lfs setstripe /mnt/lustre/d0.ior.ssf -c -1
      12:38:00:Lustre: MGS: Client a0cabda3-b9a7-2ed3-58fd-f9e8cebfe558 (at 10.10.4.199@tcp) reconnecting
      12:38:00:Lustre: lustre-MDT0000: Client lustre-MDT0000-lwp-OST0000_UUID (at 10.10.4.199@tcp) reconnecting
      12:43:36:ptlrpcd_0: page allocation failure. order:1, mode:0x40
      12:43:36:Pid: 2733, comm: ptlrpcd_0 Not tainted 2.6.32-358.18.1.el6_lustre.x86_64 #1
      12:43:37:Call Trace:
      12:43:37: [<ffffffff8112c257>] ? __alloc_pages_nodemask+0x757/0x8d0
      12:43:37: [<ffffffffa0767b0f>] ? ptlrpc_set_add_new_req+0xcf/0x150 [ptlrpc]
      12:43:37: [<ffffffff81166d92>] ? kmem_getpages+0x62/0x170
      12:43:37: [<ffffffff811679aa>] ? fallback_alloc+0x1ba/0x270
      12:43:37: [<ffffffff811673ff>] ? cache_grow+0x2cf/0x320
      12:43:37: [<ffffffff81167729>] ? ____cache_alloc_node+0x99/0x160
      12:43:37: [<ffffffffa053cea7>] ? LNetMDAttach+0x157/0x5a0 [lnet]
      12:43:37: [<ffffffff811684f9>] ? __kmalloc+0x189/0x220
      12:43:38: [<ffffffffa053cea7>] ? LNetMDAttach+0x157/0x5a0 [lnet]
      12:43:38: [<ffffffffa0775935>] ? ptlrpc_register_bulk+0x265/0x9d0 [ptlrpc]
      12:43:38: [<ffffffffa07777f2>] ? ptl_send_rpc+0x232/0xc40 [ptlrpc]
      12:43:38: [<ffffffff81281b74>] ? snprintf+0x34/0x40
      12:43:38: [<ffffffffa0489951>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      12:43:38: [<ffffffffa076c3f4>] ? ptlrpc_send_new_req+0x454/0x790 [ptlrpc]
      12:43:38: [<ffffffffa0770148>] ? ptlrpc_check_set+0x888/0x1b30 [ptlrpc]
      12:43:38: [<ffffffffa079bb3b>] ? ptlrpcd_check+0x53b/0x560 [ptlrpc]
      12:43:38: [<ffffffff8109715c>] ? remove_wait_queue+0x3c/0x50
      12:43:38: [<ffffffffa079bfc0>] ? ptlrpcd+0x190/0x380 [ptlrpc]
      12:43:38: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
      12:43:38: [<ffffffffa079be30>] ? ptlrpcd+0x0/0x380 [ptlrpc]
      12:43:39: [<ffffffff81096a36>] ? kthread+0x96/0xa0
      12:43:40: [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
      12:43:40: [<ffffffff810969a0>] ? kthread+0x0/0xa0
      12:43:40: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      12:43:41:Mem-Info:
      12:43:41:Node 0 DMA per-cpu:
      12:43:41:CPU    0: hi:    0, btch:   1 usd:   0
      12:43:41:Node 0 DMA32 per-cpu:
      12:43:41:CPU    0: hi:  186, btch:  31 usd: 202
      12:43:42:active_anon:2089 inactive_anon:2411 isolated_anon:0
      12:43:42: active_file:60821 inactive_file:275748 isolated_file:32
      12:43:42: unevictable:0 dirty:34820 writeback:16128 unstable:0
      12:43:42: free:17978 slab_reclaimable:5665 slab_unreclaimable:87698
      12:43:42: mapped:2558 shmem:41 pagetables:793 bounce:0
      12:43:43:Node 0 DMA free:8276kB min:332kB low:412kB high:496kB active_anon:0kB inactive_anon:0kB active_file:304kB inactive_file:5516kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15324kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:16kB slab_unreclaimable:1620kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
      12:43:43:lowmem_reserve[]: 0 2003 2003 2003
      12:43:43:Node 0 DMA32 free:63636kB min:44720kB low:55900kB high:67080kB active_anon:8356kB inactive_anon:9644kB active_file:242980kB inactive_file:1097476kB unevictable:0kB isolated(anon):0kB isolated(file):128kB present:2052064kB mlocked:0kB dirty:139280kB writeback:64512kB mapped:10232kB shmem:164kB slab_reclaimable:22644kB slab_unreclaimable:349172kB kernel_stack:2024kB pagetables:3172kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:32 all_unreclaimable? no
      12:43:43:lowmem_reserve[]: 0 0 0 0
      12:43:43:Node 0 DMA: 47*4kB 5*8kB 21*16kB 9*32kB 10*64kB 5*128kB 6*256kB 1*512kB 2*1024kB 1*2048kB 0*4096kB = 8276kB
      12:43:43:Node 0 DMA32: 13749*4kB 30*8kB 7*16kB 7*32kB 4*64kB 15*128kB 1*256kB 1*512kB 1*1024kB 0*2048kB 1*4096kB = 63636kB
      12:43:43:299076 total pagecache pages
      12:43:44:0 pages in swap cache
      12:43:44:Swap cache stats: add 0, delete 0, find 0/0
      12:43:45:Free swap  = 4128760kB
      12:43:45:Total swap = 4128760kB
      12:43:45:524284 pages RAM
      12:43:45:43669 pages reserved
      12:43:46:330512 pages shared
      12:43:46:160098 pages non-shared
      12:43:46:LNetError: 2733:0:(lib-lnet.h:457:lnet_md_alloc()) LNET: out of memory at /var/lib/jenkins/workspace/lustre-master/arch/x86_64/build_type/server/distro/el6/ib_stack/inkernel/BUILD/BUILD/lustre-2.4.92/lnet/include/lnet/lib-lnet.h:457 (tried to alloc '(md)' = 4208)
      12:43:46:LNetError: 2733:0:(lib-lnet.h:457:lnet_md_alloc()) LNET: 55454455 total bytes allocated by lnet
      12:43:46:LustreError: 2733:0:(niobuf.c:376:ptlrpc_register_bulk()) lustre-OST0001-osc-ffff88006a434c00: LNetMDAttach failed x1445262125420992/0: rc = -12
      12:43:46:Lustre: 2733:0:(client.c:1896:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 0/real 0]  req@ffff880024150400 x1445262125420992/t0(0) o4->lustre-OST0001-osc-ffff88006a434c00@10.10.4.199@tcp:6/4 lens 488/448 e 0 to 1 dl 0 ref 2 fl Rpc:X/0/ffffffff rc -12/-1
      12:43:47:Lustre: lustre-OST0001-osc-ffff88006a434c00: Connection to lustre-OST0001 (at 10.10.4.199@tcp) was lost; in progress operations using this service will wait for recovery to complete
      12:43:47:LustreError: 11-0: lustre-OST0001-osc-ffff88006a434c00: Communicating with 10.10.4.199@tcp, operation ost_connect failed with -16.
      12:43:47:LustreError: Skipped 1 previous similar message
      12:43:47:Lustre: lustre-OST0001-osc-ffff88006a434c00: Connection restored to lustre-OST0001 (at 10.10.4.199@tcp)
      12:44:19:nfsd: page allocation failure. order:1, mode:0x40
      12:44:19:Pid: 10656, comm: nfsd Not tainted 2.6.32-358.18.1.el6_lustre.x86_64 #1
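
      The buddy-allocator dump above can be parsed to show why the order-1 request struggles: the failing lnet_md_alloc() of 4208 bytes (just over one 4 KiB page) needs an order-1 (8 KiB, two contiguous pages) slab page, and only a small fraction of the free DMA32 memory sits in order >= 1 blocks. A minimal parsing sketch, not part of the original report (the dump is a snapshot taken after the failure, so it is illustrative only):

      ```python
      # Parse the "Node 0 DMA32:" buddy-allocator line from the console log
      # and split the free memory into order-0 vs order >= 1 blocks.
      import re

      def parse_buddyinfo(line):
          """Return {order: block_count} from a 'N*SIZEkB' buddy dump line."""
          counts = {}
          for count, size_kb in re.findall(r"(\d+)\*(\d+)kB", line):
              order = (int(size_kb) // 4).bit_length() - 1  # 4 kB pages -> order
              counts[order] = int(count)
          return counts

      dma32 = ("13749*4kB 30*8kB 7*16kB 7*32kB 4*64kB 15*128kB 1*256kB "
               "1*512kB 1*1024kB 0*2048kB 1*4096kB = 63636kB")
      counts = parse_buddyinfo(dma32)
      total_kb = sum(n * (4 << order) for order, n in counts.items())
      order1_plus_kb = sum(n * (4 << order) for order, n in counts.items()
                           if order >= 1)
      print(total_kb, order1_plus_kb)  # -> 63636 8640
      ```

      Of the 63636 kB free in DMA32, only 8640 kB is in contiguous blocks of order 1 or higher, i.e. most of the free memory is fragmented into single pages.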
      

Attachments

Issue Links

Activity

            [LU-3910] Interop 2.4.0<->2.5 failure on test suite parallel-scale-nfsv4 test_iorssf: MDS OOM

            adilger Andreas Dilger added a comment:

            Shows mode:0x40 == __GFP_IO, but missing __GFP_WAIT from LU-4357.
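
            The flag decoding behind that comment can be checked against the 2.6.32-era gfp_t bit values (a sketch, not part of the original report): mode 0x40 carries __GFP_IO but not __GFP_WAIT, so the allocator may not sleep to reclaim or compact memory, making an order-1 request fail easily under fragmentation.

            ```python
            # Decode the low bits of a 2.6.32-era gfp_t allocation mode.
            GFP_BITS = {  # bit values from include/linux/gfp.h in 2.6.32
                0x10: "__GFP_WAIT",
                0x20: "__GFP_HIGH",
                0x40: "__GFP_IO",
                0x80: "__GFP_FS",
            }

            def decode_gfp(mode):
                return [name for bit, name in sorted(GFP_BITS.items()) if mode & bit]

            print(decode_gfp(0x40))  # the failing allocation: ['__GFP_IO']
            print(decode_gfp(0x50))  # GFP_NOFS = __GFP_WAIT | __GFP_IO
            ```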

            adilger Andreas Dilger added a comment:

            Please note that interop testing NFS with the MDS as the NFS server is not useful, since the Lustre 2.4.0 client will not be exercised at all. It would be better to run the NFS server on the Lustre 2.4.0 client, which is closer to how it will be used in real life. This would also avoid the memory problems on the MDS.

            How is a change like this made to the testing system?

            adilger Andreas Dilger added a comment:

            There are two problems here:

            • The NFS server is running on the MDS, which has only 1.83GB of memory
            • The client on the NFS server is holding a large number of data pages pinned, as in LU-2139

            I'm going to leave this bug open to focus on changing this NFS test to run on MDS nodes with more memory, so that the testing does not fail.
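
            The 1.83GB figure can be cross-checked against the Mem-Info dump in the description (a quick arithmetic sketch, not part of the original report): usable memory is total pages minus reserved pages, at 4 KiB per x86_64 page.

            ```python
            # Cross-check "1.83GB of memory" against the MDS console Mem-Info dump.
            PAGE_SIZE = 4096          # x86_64 page size in bytes
            total_pages = 524284      # "524284 pages RAM"
            reserved_pages = 43669    # "43669 pages reserved"

            usable_gib = (total_pages - reserved_pages) * PAGE_SIZE / 2**30
            print(round(usable_gib, 2))  # -> 1.83
            ```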

            People

              Assignee: wc-triage WC Triage
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 4

              Dates

                Created:
                Updated:
                Resolved: