[LU-3910] Interop 2.4.0<->2.5 failure on test suite parallel-scale-nfsv4 test_iorssf: MDS OOM Created: 08/Sep/13  Updated: 13/Feb/14  Resolved: 13/Feb/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Maloo Assignee: WC Triage
Resolution: Duplicate Votes: 0
Labels: None
Environment:

client: 2.4.0
server: lustre-master build # 1652


Issue Links:
Duplicate
duplicates LU-4357 page allocation failure. mode:0x40 ca... Resolved
Related
is related to LU-2139 Tracking unstable pages Resolved
Severity: 3
Rank (Obsolete): 10306

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/f9f7ec36-15c3-11e3-a83f-52540035b04c.

The sub-test test_iorssf failed with the following error:

test failed to respond and timed out

MDS console:

12:37:48:Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test iorssf: iorssf == 12:37:39 (1378323459)
12:37:48:Lustre: DEBUG MARKER: lfs setstripe /mnt/lustre/d0.ior.ssf -c -1
12:38:00:Lustre: MGS: Client a0cabda3-b9a7-2ed3-58fd-f9e8cebfe558 (at 10.10.4.199@tcp) reconnecting
12:38:00:Lustre: lustre-MDT0000: Client lustre-MDT0000-lwp-OST0000_UUID (at 10.10.4.199@tcp) reconnecting
12:43:36:ptlrpcd_0: page allocation failure. order:1, mode:0x40
12:43:36:Pid: 2733, comm: ptlrpcd_0 Not tainted 2.6.32-358.18.1.el6_lustre.x86_64 #1
12:43:37:Call Trace:
12:43:37: [<ffffffff8112c257>] ? __alloc_pages_nodemask+0x757/0x8d0
12:43:37: [<ffffffffa0767b0f>] ? ptlrpc_set_add_new_req+0xcf/0x150 [ptlrpc]
12:43:37: [<ffffffff81166d92>] ? kmem_getpages+0x62/0x170
12:43:37: [<ffffffff811679aa>] ? fallback_alloc+0x1ba/0x270
12:43:37: [<ffffffff811673ff>] ? cache_grow+0x2cf/0x320
12:43:37: [<ffffffff81167729>] ? ____cache_alloc_node+0x99/0x160
12:43:37: [<ffffffffa053cea7>] ? LNetMDAttach+0x157/0x5a0 [lnet]
12:43:37: [<ffffffff811684f9>] ? __kmalloc+0x189/0x220
12:43:38: [<ffffffffa053cea7>] ? LNetMDAttach+0x157/0x5a0 [lnet]
12:43:38: [<ffffffffa0775935>] ? ptlrpc_register_bulk+0x265/0x9d0 [ptlrpc]
12:43:38: [<ffffffffa07777f2>] ? ptl_send_rpc+0x232/0xc40 [ptlrpc]
12:43:38: [<ffffffff81281b74>] ? snprintf+0x34/0x40
12:43:38: [<ffffffffa0489951>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
12:43:38: [<ffffffffa076c3f4>] ? ptlrpc_send_new_req+0x454/0x790 [ptlrpc]
12:43:38: [<ffffffffa0770148>] ? ptlrpc_check_set+0x888/0x1b30 [ptlrpc]
12:43:38: [<ffffffffa079bb3b>] ? ptlrpcd_check+0x53b/0x560 [ptlrpc]
12:43:38: [<ffffffff8109715c>] ? remove_wait_queue+0x3c/0x50
12:43:38: [<ffffffffa079bfc0>] ? ptlrpcd+0x190/0x380 [ptlrpc]
12:43:38: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
12:43:38: [<ffffffffa079be30>] ? ptlrpcd+0x0/0x380 [ptlrpc]
12:43:39: [<ffffffff81096a36>] ? kthread+0x96/0xa0
12:43:40: [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
12:43:40: [<ffffffff810969a0>] ? kthread+0x0/0xa0
12:43:40: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
12:43:41:Mem-Info:
12:43:41:Node 0 DMA per-cpu:
12:43:41:CPU    0: hi:    0, btch:   1 usd:   0
12:43:41:Node 0 DMA32 per-cpu:
12:43:41:CPU    0: hi:  186, btch:  31 usd: 202
12:43:42:active_anon:2089 inactive_anon:2411 isolated_anon:0
12:43:42: active_file:60821 inactive_file:275748 isolated_file:32
12:43:42: unevictable:0 dirty:34820 writeback:16128 unstable:0
12:43:42: free:17978 slab_reclaimable:5665 slab_unreclaimable:87698
12:43:42: mapped:2558 shmem:41 pagetables:793 bounce:0
12:43:43:Node 0 DMA free:8276kB min:332kB low:412kB high:496kB active_anon:0kB inactive_anon:0kB active_file:304kB inactive_file:5516kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15324kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:16kB slab_unreclaimable:1620kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
12:43:43:lowmem_reserve[]: 0 2003 2003 2003
12:43:43:Node 0 DMA32 free:63636kB min:44720kB low:55900kB high:67080kB active_anon:8356kB inactive_anon:9644kB active_file:242980kB inactive_file:1097476kB unevictable:0kB isolated(anon):0kB isolated(file):128kB present:2052064kB mlocked:0kB dirty:139280kB writeback:64512kB mapped:10232kB shmem:164kB slab_reclaimable:22644kB slab_unreclaimable:349172kB kernel_stack:2024kB pagetables:3172kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:32 all_unreclaimable? no
12:43:43:lowmem_reserve[]: 0 0 0 0
12:43:43:Node 0 DMA: 47*4kB 5*8kB 21*16kB 9*32kB 10*64kB 5*128kB 6*256kB 1*512kB 2*1024kB 1*2048kB 0*4096kB = 8276kB
12:43:43:Node 0 DMA32: 13749*4kB 30*8kB 7*16kB 7*32kB 4*64kB 15*128kB 1*256kB 1*512kB 1*1024kB 0*2048kB 1*4096kB = 63636kB
12:43:43:299076 total pagecache pages
12:43:44:0 pages in swap cache
12:43:44:Swap cache stats: add 0, delete 0, find 0/0
12:43:45:Free swap  = 4128760kB
12:43:45:Total swap = 4128760kB
12:43:45:524284 pages RAM
12:43:45:43669 pages reserved
12:43:46:330512 pages shared
12:43:46:160098 pages non-shared
12:43:46:LNetError: 2733:0:(lib-lnet.h:457:lnet_md_alloc()) LNET: out of memory at /var/lib/jenkins/workspace/lustre-master/arch/x86_64/build_type/server/distro/el6/ib_stack/inkernel/BUILD/BUILD/lustre-2.4.92/lnet/include/lnet/lib-lnet.h:457 (tried to alloc '(md)' = 4208)
12:43:46:LNetError: 2733:0:(lib-lnet.h:457:lnet_md_alloc()) LNET: 55454455 total bytes allocated by lnet
12:43:46:LustreError: 2733:0:(niobuf.c:376:ptlrpc_register_bulk()) lustre-OST0001-osc-ffff88006a434c00: LNetMDAttach failed x1445262125420992/0: rc = -12
12:43:46:Lustre: 2733:0:(client.c:1896:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 0/real 0]  req@ffff880024150400 x1445262125420992/t0(0) o4->lustre-OST0001-osc-ffff88006a434c00@10.10.4.199@tcp:6/4 lens 488/448 e 0 to 1 dl 0 ref 2 fl Rpc:X/0/ffffffff rc -12/-1
12:43:47:Lustre: lustre-OST0001-osc-ffff88006a434c00: Connection to lustre-OST0001 (at 10.10.4.199@tcp) was lost; in progress operations using this service will wait for recovery to complete
12:43:47:LustreError: 11-0: lustre-OST0001-osc-ffff88006a434c00: Communicating with 10.10.4.199@tcp, operation ost_connect failed with -16.
12:43:47:LustreError: Skipped 1 previous similar message
12:43:47:Lustre: lustre-OST0001-osc-ffff88006a434c00: Connection restored to lustre-OST0001 (at 10.10.4.199@tcp)
12:44:19:nfsd: page allocation failure. order:1, mode:0x40
12:44:19:Pid: 10656, comm: nfsd Not tainted 2.6.32-358.18.1.el6_lustre.x86_64 #1


 Comments   
Comment by Andreas Dilger [ 09/Sep/13 ]

There are two problems here:

  • The NFS server is running on the MDS node, which has only 1.83GB of memory
  • The Lustre client on the NFS server node is holding lots of data pages pinned, as in LU-2139

I'm going to leave this bug focused on changing this NFS test to run on MDS nodes with more memory, so that the testing does not fail.
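As a rough cross-check against the Mem-Info dump in the description, the arithmetic below (a hypothetical helper, not part of Lustre; it assumes 4 KB pages and the counters shown in the console log) reproduces the 1.83GB figure and shows where the memory is sitting:

/* decode_meminfo.c: redo the arithmetic from the MDS console log above.
 * All counters are 4 KB page counts taken verbatim from the Mem-Info dump. */
#include <stdio.h>

int main(void)
{
        const long page_kb = 4;                     /* x86_64 PAGE_SIZE / 1024 */
        const long ram = 524284, reserved = 43669;  /* "pages RAM" / "pages reserved" */
        const long inactive_file = 275748;          /* cached file data */
        const long slab_unreclaimable = 87698;

        printf("usable RAM        : %ld MB\n", (ram - reserved) * page_kb / 1024);   /* ~1877 MB, i.e. ~1.83 GB */
        printf("inactive_file     : %ld MB\n", inactive_file * page_kb / 1024);      /* ~1077 MB */
        printf("unreclaimable slab: %ld MB\n", slab_unreclaimable * page_kb / 1024); /* ~342 MB */
        return 0;
}

So well over half of the node's usable memory is tied up in file pages and unreclaimable slab at the moment the order-1 allocation fails.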

Comment by Andreas Dilger [ 09/Sep/13 ]

Please note that interop testing NFS with the MDS as the NFS server is not useful, since the Lustre 2.4.0 client will not be used for anything. It would be better to run the NFS server on the Lustre 2.4.0 client, which is closer to how it would be used in real life. This would also help avoid memory problems on the MDS.

How is a change like this made to the testing system?

Comment by Andreas Dilger [ 13/Feb/14 ]

The console log shows mode:0x40 == __GFP_IO with __GFP_WAIT missing, the same allocation mode as in LU-4357.
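For context, the '(md)' allocation in the log is 4208 bytes, which no longer fits in a single 4 KB page, so kmalloc() needs an order-1 (8 KB) slab page. On the RHEL6 2.6.32 kernel the GFP bits decode as sketched below (flag values taken from that era's include/linux/gfp.h; this is only an illustrative decode, not Lustre code):

/* gfp_decode.c: decode "page allocation failure. order:1, mode:0x40"
 * using the 2.6.32-era GFP flag values. */
#include <stdio.h>

#define __GFP_WAIT 0x10u   /* allocator may sleep and wait for reclaim */
#define __GFP_IO   0x40u   /* allocator may start physical I/O */
#define __GFP_FS   0x80u   /* allocator may recurse into the filesystem */

int main(void)
{
        unsigned int mode = 0x40;   /* from the MDS console log */

        printf("__GFP_WAIT: %s\n", (mode & __GFP_WAIT) ? "set" : "missing");  /* missing */
        printf("__GFP_IO  : %s\n", (mode & __GFP_IO)   ? "set" : "missing");  /* set */
        printf("__GFP_FS  : %s\n", (mode & __GFP_FS)   ? "set" : "missing");  /* missing */

        /* Without __GFP_WAIT the allocator cannot block for reclaim or
         * compaction, so the order-1 request fails as soon as no free
         * contiguous 8 KB block happens to be available. */
        return 0;
}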
