Details
- Type: Bug
- Resolution: Duplicate
- Priority: Critical
- Affects Version: Lustre 2.5.0
- Environment: server and client: lustre-master build # 1687; client is running SLES11 SP2
- 3
- 10835
Description
This issue was created by maloo for sarah <sarah@whamcloud.com>
This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/960b8b64-2915-11e3-b598-52540035b04c.
The sub-test test_iorssf failed with the following error:
test failed to respond and timed out
MDS console
17:14:54:ptlrpcd_0: page allocation failure. order:1, mode:0x40
17:14:55:Pid: 2780, comm: ptlrpcd_0 Not tainted 2.6.32-358.18.1.el6_lustre.x86_64 #1
17:14:56:Call Trace:
17:14:57: [<ffffffff8112c257>] ? __alloc_pages_nodemask+0x757/0x8d0
17:14:58: [<ffffffff81166d92>] ? kmem_getpages+0x62/0x170
17:14:59: [<ffffffff811679aa>] ? fallback_alloc+0x1ba/0x270
17:14:59: [<ffffffff811673ff>] ? cache_grow+0x2cf/0x320
17:14:59: [<ffffffff81167729>] ? ____cache_alloc_node+0x99/0x160
17:14:59: [<ffffffffa0538ed7>] ? LNetMDAttach+0x157/0x5a0 [lnet]
17:14:59: [<ffffffff811684f9>] ? __kmalloc+0x189/0x220
17:14:59: [<ffffffffa0538ed7>] ? LNetMDAttach+0x157/0x5a0 [lnet]
17:15:00: [<ffffffffa0771b35>] ? ptlrpc_register_bulk+0x265/0x9d0 [ptlrpc]
17:15:00: [<ffffffffa0773a12>] ? ptl_send_rpc+0x232/0xc40 [ptlrpc]
17:15:00: [<ffffffff81281b74>] ? snprintf+0x34/0x40
17:15:01: [<ffffffffa0488761>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
17:15:01: [<ffffffffa07685f4>] ? ptlrpc_send_new_req+0x454/0x790 [ptlrpc]
17:15:02: [<ffffffffa076c368>] ? ptlrpc_check_set+0x888/0x1b40 [ptlrpc]
17:15:02: [<ffffffffa079801b>] ? ptlrpcd_check+0x53b/0x560 [ptlrpc]
17:15:03: [<ffffffffa079853b>] ? ptlrpcd+0x20b/0x370 [ptlrpc]
17:15:03: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
17:15:03: [<ffffffffa0798330>] ? ptlrpcd+0x0/0x370 [ptlrpc]
17:15:03: [<ffffffff81096a36>] ? kthread+0x96/0xa0
17:15:03: [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
17:15:04: [<ffffffff810969a0>] ? kthread+0x0/0xa0
17:15:04: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
17:15:06:Mem-Info:
17:15:06:Node 0 DMA per-cpu:
17:15:06:CPU 0: hi: 0, btch: 1 usd: 0
17:15:06:Node 0 DMA32 per-cpu:
17:15:06:CPU 0: hi: 186, btch: 31 usd: 42
17:15:06:active_anon:2345 inactive_anon:2732 isolated_anon:0
17:15:07: active_file:110430 inactive_file:238985 isolated_file:0
17:15:07: unevictable:0 dirty:3 writeback:0 unstable:0
17:15:07: free:14257 slab_reclaimable:7260 slab_unreclaimable:76976
17:15:07: mapped:2551 shmem:41 pagetables:794 bounce:0
17:15:08:Node 0 DMA free:8264kB min:332kB low:412kB high:496kB active_anon:0kB inactive_anon:0kB active_file:272kB inactive_file:5444kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15324kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:36kB slab_unreclaimable:1700kB kernel_stack:16kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
17:15:08:lowmem_reserve[]: 0 2003 2003 2003
17:15:09:Node 0 DMA32 free:48764kB min:44720kB low:55900kB high:67080kB active_anon:9380kB inactive_anon:10928kB active_file:441448kB inactive_file:950496kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2052064kB mlocked:0kB dirty:12kB writeback:0kB mapped:10204kB shmem:164kB slab_reclaimable:29004kB slab_unreclaimable:306204kB kernel_stack:1984kB pagetables:3176kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
17:15:09:lowmem_reserve[]: 0 0 0 0
17:15:10:Node 0 DMA: 58*4kB 104*8kB 102*16kB 42*32kB 6*64kB 2*128kB 2*256kB 2*512kB 0*1024kB 1*2048kB 0*4096kB = 8264kB
17:15:11:Node 0 DMA32: 10659*4kB 2*8kB 2*16kB 2*32kB 2*64kB 0*128kB 1*256kB 1*512kB 1*1024kB 0*2048kB 1*4096kB = 48764kB
17:15:11:269122 total pagecache pages
17:15:11:28 pages in swap cache
17:15:11:Swap cache stats: add 62, delete 34, find 18/22
17:15:11:Free swap = 4128648kB
17:15:12:Total swap = 4128760kB
17:15:12:524284 pages RAM
17:15:12:43669 pages reserved
17:15:13:282260 pages shared
17:15:13:194054 pages non-shared
17:15:14:LNetError: 2780:0:(lib-lnet.h:457:lnet_md_alloc()) LNET: out of memory at /var/lib/jenkins/workspace/lustre-master/arch/x86_64/build_type/server/distro/el6/ib_stack/inkernel/BUILD/BUILD/lustre-2.4.93/lnet/include/lnet/lib-lnet.h:457 (tried to alloc '(md)' = 4208)
17:15:14:LNetError: 2780:0:(lib-lnet.h:457:lnet_md_alloc()) LNET: 55064047 total bytes allocated by lnet
17:15:15:LustreError: 2780:0:(niobuf.c:376:ptlrpc_register_bulk()) lustre-OST0002-osc-ffff88006f296400: LNetMDAttach failed x1447417177531472/0: rc = -12
This bug is intended to track the problem with MDS-side objects not being freed (the mdd_obj, lod_obj, and mdt_obj slabs). The LU-4053 ticket tracks the client-side CLIO objects not being freed. I think there is just something wrong in the MDS stack: it is not destroying the whole lu_obj (or whatever it is) when an object is unlinked, so the memory is only freed at unmount time, or possibly very slowly under memory pressure. It doesn't make any sense to keep objects in memory for FIDs that have been deleted.
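One way to confirm this on a live MDS (a hypothetical diagnostic, not taken from the ticket's logs) is to watch the suspect slab caches in /proc/slabinfo before and after a mass unlink: if the active-object counts for mdd_obj/lod_obj/mdt_obj stay high after the files are deleted, the objects are being cached rather than destroyed. On a real MDS the input would be `grep -E 'mdd_obj|lod_obj|mdt_obj' /proc/slabinfo`; a sample line stands in here so the snippet is self-contained:

```shell
# /proc/slabinfo format: <name> <active_objs> <num_objs> <objsize> ...
# Sample line (made-up numbers) standing in for the real file:
sample='mdt_obj  104832 104832 504 8 1 : tunables 54 27 8 : slabdata 13104 13104 0'

# On a live MDS, replace `echo "$sample"` with:
#   grep -E 'mdd_obj|lod_obj|mdt_obj' /proc/slabinfo
echo "$sample" | awk '{printf "%s active=%s total=%s\n", $1, $2, $3}'
```

If active counts only drop at unmount (or very slowly under memory pressure), that matches the behavior described above.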