[LU-3910] Interop 2.4.0<->2.5 failure on test suite parallel-scale-nfsv4 test_iorssf: MDS OOM Created: 08/Sep/13 Updated: 13/Feb/14 Resolved: 13/Feb/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Maloo | Assignee: | WC Triage |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Environment: | client: 2.4.0 |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 10306 |
| Description |
|
This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/f9f7ec36-15c3-11e3-a83f-52540035b04c.

The sub-test test_iorssf failed with the following error:

MDS console:
12:37:48:Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test iorssf: iorssf == 12:37:39 (1378323459)
12:37:48:Lustre: DEBUG MARKER: lfs setstripe /mnt/lustre/d0.ior.ssf -c -1
12:38:00:Lustre: MGS: Client a0cabda3-b9a7-2ed3-58fd-f9e8cebfe558 (at 10.10.4.199@tcp) reconnecting
12:38:00:Lustre: lustre-MDT0000: Client lustre-MDT0000-lwp-OST0000_UUID (at 10.10.4.199@tcp) reconnecting
12:43:36:ptlrpcd_0: page allocation failure. order:1, mode:0x40
12:43:36:Pid: 2733, comm: ptlrpcd_0 Not tainted 2.6.32-358.18.1.el6_lustre.x86_64 #1
12:43:37:Call Trace:
12:43:37: [<ffffffff8112c257>] ? __alloc_pages_nodemask+0x757/0x8d0
12:43:37: [<ffffffffa0767b0f>] ? ptlrpc_set_add_new_req+0xcf/0x150 [ptlrpc]
12:43:37: [<ffffffff81166d92>] ? kmem_getpages+0x62/0x170
12:43:37: [<ffffffff811679aa>] ? fallback_alloc+0x1ba/0x270
12:43:37: [<ffffffff811673ff>] ? cache_grow+0x2cf/0x320
12:43:37: [<ffffffff81167729>] ? ____cache_alloc_node+0x99/0x160
12:43:37: [<ffffffffa053cea7>] ? LNetMDAttach+0x157/0x5a0 [lnet]
12:43:37: [<ffffffff811684f9>] ? __kmalloc+0x189/0x220
12:43:38: [<ffffffffa053cea7>] ? LNetMDAttach+0x157/0x5a0 [lnet]
12:43:38: [<ffffffffa0775935>] ? ptlrpc_register_bulk+0x265/0x9d0 [ptlrpc]
12:43:38: [<ffffffffa07777f2>] ? ptl_send_rpc+0x232/0xc40 [ptlrpc]
12:43:38: [<ffffffff81281b74>] ? snprintf+0x34/0x40
12:43:38: [<ffffffffa0489951>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
12:43:38: [<ffffffffa076c3f4>] ? ptlrpc_send_new_req+0x454/0x790 [ptlrpc]
12:43:38: [<ffffffffa0770148>] ? ptlrpc_check_set+0x888/0x1b30 [ptlrpc]
12:43:38: [<ffffffffa079bb3b>] ? ptlrpcd_check+0x53b/0x560 [ptlrpc]
12:43:38: [<ffffffff8109715c>] ? remove_wait_queue+0x3c/0x50
12:43:38: [<ffffffffa079bfc0>] ? ptlrpcd+0x190/0x380 [ptlrpc]
12:43:38: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
12:43:38: [<ffffffffa079be30>] ? ptlrpcd+0x0/0x380 [ptlrpc]
12:43:39: [<ffffffff81096a36>] ? kthread+0x96/0xa0
12:43:40: [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
12:43:40: [<ffffffff810969a0>] ? kthread+0x0/0xa0
12:43:40: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
12:43:41:Mem-Info:
12:43:41:Node 0 DMA per-cpu:
12:43:41:CPU 0: hi: 0, btch: 1 usd: 0
12:43:41:Node 0 DMA32 per-cpu:
12:43:41:CPU 0: hi: 186, btch: 31 usd: 202
12:43:42:active_anon:2089 inactive_anon:2411 isolated_anon:0
12:43:42: active_file:60821 inactive_file:275748 isolated_file:32
12:43:42: unevictable:0 dirty:34820 writeback:16128 unstable:0
12:43:42: free:17978 slab_reclaimable:5665 slab_unreclaimable:87698
12:43:42: mapped:2558 shmem:41 pagetables:793 bounce:0
12:43:43:Node 0 DMA free:8276kB min:332kB low:412kB high:496kB active_anon:0kB inactive_anon:0kB active_file:304kB inactive_file:5516kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15324kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:16kB slab_unreclaimable:1620kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
12:43:43:lowmem_reserve[]: 0 2003 2003 2003
12:43:43:Node 0 DMA32 free:63636kB min:44720kB low:55900kB high:67080kB active_anon:8356kB inactive_anon:9644kB active_file:242980kB inactive_file:1097476kB unevictable:0kB isolated(anon):0kB isolated(file):128kB present:2052064kB mlocked:0kB dirty:139280kB writeback:64512kB mapped:10232kB shmem:164kB slab_reclaimable:22644kB slab_unreclaimable:349172kB kernel_stack:2024kB pagetables:3172kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:32 all_unreclaimable? no
12:43:43:lowmem_reserve[]: 0 0 0 0
12:43:43:Node 0 DMA: 47*4kB 5*8kB 21*16kB 9*32kB 10*64kB 5*128kB 6*256kB 1*512kB 2*1024kB 1*2048kB 0*4096kB = 8276kB
12:43:43:Node 0 DMA32: 13749*4kB 30*8kB 7*16kB 7*32kB 4*64kB 15*128kB 1*256kB 1*512kB 1*1024kB 0*2048kB 1*4096kB = 63636kB
12:43:43:299076 total pagecache pages
12:43:44:0 pages in swap cache
12:43:44:Swap cache stats: add 0, delete 0, find 0/0
12:43:45:Free swap = 4128760kB
12:43:45:Total swap = 4128760kB
12:43:45:524284 pages RAM
12:43:45:43669 pages reserved
12:43:46:330512 pages shared
12:43:46:160098 pages non-shared
12:43:46:LNetError: 2733:0:(lib-lnet.h:457:lnet_md_alloc()) LNET: out of memory at /var/lib/jenkins/workspace/lustre-master/arch/x86_64/build_type/server/distro/el6/ib_stack/inkernel/BUILD/BUILD/lustre-2.4.92/lnet/include/lnet/lib-lnet.h:457 (tried to alloc '(md)' = 4208)
12:43:46:LNetError: 2733:0:(lib-lnet.h:457:lnet_md_alloc()) LNET: 55454455 total bytes allocated by lnet
12:43:46:LustreError: 2733:0:(niobuf.c:376:ptlrpc_register_bulk()) lustre-OST0001-osc-ffff88006a434c00: LNetMDAttach failed x1445262125420992/0: rc = -12
12:43:46:Lustre: 2733:0:(client.c:1896:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 0/real 0] req@ffff880024150400 x1445262125420992/t0(0) o4->lustre-OST0001-osc-ffff88006a434c00@10.10.4.199@tcp:6/4 lens 488/448 e 0 to 1 dl 0 ref 2 fl Rpc:X/0/ffffffff rc -12/-1
12:43:47:Lustre: lustre-OST0001-osc-ffff88006a434c00: Connection to lustre-OST0001 (at 10.10.4.199@tcp) was lost; in progress operations using this service will wait for recovery to complete
12:43:47:LustreError: 11-0: lustre-OST0001-osc-ffff88006a434c00: Communicating with 10.10.4.199@tcp, operation ost_connect failed with -16.
12:43:47:LustreError: Skipped 1 previous similar message
12:43:47:Lustre: lustre-OST0001-osc-ffff88006a434c00: Connection restored to lustre-OST0001 (at 10.10.4.199@tcp)
12:44:19:nfsd: page allocation failure. order:1, mode:0x40
12:44:19:Pid: 10656, comm: nfsd Not tainted 2.6.32-358.18.1.el6_lustre.x86_64 #1 |
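Read sequentially, the trace shows an order-1 (8 KB) slab allocation for the ~4208-byte LNet memory descriptor failing inside __kmalloc, which lnet_md_alloc() reports as out of memory and ptlrpc_register_bulk() maps to rc = -12 (-ENOMEM). A minimal kernel-style sketch of that failure shape, using hypothetical names (md_example, register_bulk_example) rather than the real Lustre structures or code paths:

    #include <linux/slab.h>
    #include <linux/gfp.h>
    #include <linux/errno.h>

    /* Hypothetical stand-in for the ~4208-byte LNet memory descriptor. */
    struct md_example { char payload[4208]; };

    /* The descriptor is larger than 4 KB, so the slab cache behind this
     * kmalloc() needs an order-1 (8 KB) contiguous page pair. With a mask
     * lacking __GFP_WAIT (mode:0x40 in the log), the allocator cannot
     * sleep for reclaim and returns NULL under memory pressure. */
    static struct md_example *md_alloc_example(gfp_t mask)
    {
        return kmalloc(sizeof(struct md_example), mask);
    }

    /* The caller turns the NULL into -ENOMEM, which is the "rc = -12"
     * seen in the ptlrpc_register_bulk() console message above. */
    static int register_bulk_example(void)
    {
        struct md_example *md = md_alloc_example(__GFP_IO);

        if (md == NULL)
            return -ENOMEM;
        /* ... attach the descriptor and send the bulk RPC ... */
        kfree(md);
        return 0;
    }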
| Comments |
| Comment by Andreas Dilger [ 09/Sep/13 ] |
|
There are two problems here:
I'm going to leave this bug to focus on changing this NFS test to run on MDS nodes with more memory, so that the testing does not fail. |
| Comment by Andreas Dilger [ 09/Sep/13 ] |
|
Please note that interop testing NFS with the MDS as NFS server is not useful, since the Lustre 2.4.0 client will not be used for anything. It would be better to run the NFS server on the Lustre 2.4.0 client, which is more likely how it will be used in real life. This will also help avoid memory problems on the MDS. How is a change like this done to the testing system? |
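A minimal sketch of the suggested topology, using generic RHEL6 NFS commands with placeholder hostnames ("mgsnode", "lustre-client-node") and export options; the actual change would be made in the parallel-scale-nfsv4 test configuration rather than by hand:

    # On the Lustre 2.4.0 client node, which would act as the NFS server:
    mount -t lustre mgsnode@tcp:/lustre /mnt/lustre
    echo '/mnt/lustre *(rw,no_root_squash,fsid=1)' >> /etc/exports
    service nfs restart
    exportfs -ra

    # On the node that runs the IOR workload, mounting over NFSv4:
    mount -t nfs -o vers=4 lustre-client-node:/mnt/lustre /mnt/nfs

With this layout the MDS only serves metadata, so the NFS server's page cache and slab pressure land on the 2.4.0 client instead of the memory-constrained MDS node.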
| Comment by Andreas Dilger [ 13/Feb/14 ] |
|
Shows mode:0x40 == __GFP_IO, but missing __GFP_WAIT from the allocation mask, so the allocation cannot sleep to wait for memory reclaim and an order-1 request fails as soon as no contiguous pages are free. |
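A small, self-contained decoding of that mode value, with the GFP flag bits copied from the RHEL6 2.6.32 kernel's include/linux/gfp.h as plain constants so the snippet builds in user space:

    #include <stdio.h>

    /* GFP flag bits as defined in the 2.6.32 (RHEL6) kernel headers. */
    #define __GFP_WAIT 0x10u   /* allocator may sleep and run reclaim */
    #define __GFP_HIGH 0x20u   /* may dip into emergency pools        */
    #define __GFP_IO   0x40u   /* may start low-level disk I/O        */
    #define __GFP_FS   0x80u   /* may call back into filesystem code  */

    int main(void)
    {
        unsigned int mode = 0x40;   /* the mode reported in the log */

        printf("IO=%u WAIT=%u HIGH=%u FS=%u\n",
               !!(mode & __GFP_IO), !!(mode & __GFP_WAIT),
               !!(mode & __GFP_HIGH), !!(mode & __GFP_FS));
        /* Prints IO=1 WAIT=0 HIGH=0 FS=0: only __GFP_IO is set, so the
         * allocation can neither block for reclaim nor use the reserve
         * pools, and an order-1 request fails under fragmentation. */
        return 0;
    }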