Lustre / LU-4033

Failure on test suite parallel-scale-nfsv4 test_iorssf: MDS oom

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • None
    • Lustre 2.5.0
    • None
    • Environment: server and client: lustre-master build #1687;
      client is running SLES11 SP2
    • 3
    • 10835

    Description

      This issue was created by maloo for sarah <sarah@whamcloud.com>

      This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/960b8b64-2915-11e3-b598-52540035b04c.

      The sub-test test_iorssf failed with the following error:

      test failed to respond and timed out

      MDS console

      17:14:54:ptlrpcd_0: page allocation failure. order:1, mode:0x40
      17:14:55:Pid: 2780, comm: ptlrpcd_0 Not tainted 2.6.32-358.18.1.el6_lustre.x86_64 #1
      17:14:56:Call Trace:
      17:14:57: [<ffffffff8112c257>] ? __alloc_pages_nodemask+0x757/0x8d0
      17:14:58: [<ffffffff81166d92>] ? kmem_getpages+0x62/0x170
      17:14:59: [<ffffffff811679aa>] ? fallback_alloc+0x1ba/0x270
      17:14:59: [<ffffffff811673ff>] ? cache_grow+0x2cf/0x320
      17:14:59: [<ffffffff81167729>] ? ____cache_alloc_node+0x99/0x160
      17:14:59: [<ffffffffa0538ed7>] ? LNetMDAttach+0x157/0x5a0 [lnet]
      17:14:59: [<ffffffff811684f9>] ? __kmalloc+0x189/0x220
      17:14:59: [<ffffffffa0538ed7>] ? LNetMDAttach+0x157/0x5a0 [lnet]
      17:15:00: [<ffffffffa0771b35>] ? ptlrpc_register_bulk+0x265/0x9d0 [ptlrpc]
      17:15:00: [<ffffffffa0773a12>] ? ptl_send_rpc+0x232/0xc40 [ptlrpc]
      17:15:00: [<ffffffff81281b74>] ? snprintf+0x34/0x40
      17:15:01: [<ffffffffa0488761>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      17:15:01: [<ffffffffa07685f4>] ? ptlrpc_send_new_req+0x454/0x790 [ptlrpc]
      17:15:02: [<ffffffffa076c368>] ? ptlrpc_check_set+0x888/0x1b40 [ptlrpc]
      17:15:02: [<ffffffffa079801b>] ? ptlrpcd_check+0x53b/0x560 [ptlrpc]
      17:15:03: [<ffffffffa079853b>] ? ptlrpcd+0x20b/0x370 [ptlrpc]
      17:15:03: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
      17:15:03: [<ffffffffa0798330>] ? ptlrpcd+0x0/0x370 [ptlrpc]
      17:15:03: [<ffffffff81096a36>] ? kthread+0x96/0xa0
      17:15:03: [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
      17:15:04: [<ffffffff810969a0>] ? kthread+0x0/0xa0
      17:15:04: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      17:15:06:Mem-Info:
      17:15:06:Node 0 DMA per-cpu:
      17:15:06:CPU    0: hi:    0, btch:   1 usd:   0
      17:15:06:Node 0 DMA32 per-cpu:
      17:15:06:CPU    0: hi:  186, btch:  31 usd:  42
      17:15:06:active_anon:2345 inactive_anon:2732 isolated_anon:0
      17:15:07: active_file:110430 inactive_file:238985 isolated_file:0
      17:15:07: unevictable:0 dirty:3 writeback:0 unstable:0
      17:15:07: free:14257 slab_reclaimable:7260 slab_unreclaimable:76976
      17:15:07: mapped:2551 shmem:41 pagetables:794 bounce:0
      17:15:08:Node 0 DMA free:8264kB min:332kB low:412kB high:496kB active_anon:0kB inactive_anon:0kB active_file:272kB inactive_file:5444kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15324kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:36kB slab_unreclaimable:1700kB kernel_stack:16kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
      17:15:08:lowmem_reserve[]: 0 2003 2003 2003
      17:15:09:Node 0 DMA32 free:48764kB min:44720kB low:55900kB high:67080kB active_anon:9380kB inactive_anon:10928kB active_file:441448kB inactive_file:950496kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2052064kB mlocked:0kB dirty:12kB writeback:0kB mapped:10204kB shmem:164kB slab_reclaimable:29004kB slab_unreclaimable:306204kB kernel_stack:1984kB pagetables:3176kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
      17:15:09:lowmem_reserve[]: 0 0 0 0
      17:15:10:Node 0 DMA: 58*4kB 104*8kB 102*16kB 42*32kB 6*64kB 2*128kB 2*256kB 2*512kB 0*1024kB 1*2048kB 0*4096kB = 8264kB
      17:15:11:Node 0 DMA32: 10659*4kB 2*8kB 2*16kB 2*32kB 2*64kB 0*128kB 1*256kB 1*512kB 1*1024kB 0*2048kB 1*4096kB = 48764kB
      17:15:11:269122 total pagecache pages
      17:15:11:28 pages in swap cache
      17:15:11:Swap cache stats: add 62, delete 34, find 18/22
      17:15:11:Free swap  = 4128648kB
      17:15:12:Total swap = 4128760kB
      17:15:12:524284 pages RAM
      17:15:12:43669 pages reserved
      17:15:13:282260 pages shared
      17:15:13:194054 pages non-shared
      17:15:14:LNetError: 2780:0:(lib-lnet.h:457:lnet_md_alloc()) LNET: out of memory at /var/lib/jenkins/workspace/lustre-master/arch/x86_64/build_type/server/distro/el6/ib_stack/inkernel/BUILD/BUILD/lustre-2.4.93/lnet/include/lnet/lib-lnet.h:457 (tried to alloc '(md)' = 4208)
      17:15:14:LNetError: 2780:0:(lib-lnet.h:457:lnet_md_alloc()) LNET: 55064047 total bytes allocated by lnet
      17:15:15:LustreError: 2780:0:(niobuf.c:376:ptlrpc_register_bulk()) lustre-OST0002-osc-ffff88006f296400: LNetMDAttach failed x1447417177531472/0: rc = -12
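The buddy-allocator line for Node 0 DMA32 in the log above shows why an order:1 (8kB contiguous) allocation can fail even with ~48MB nominally free: almost all of the free memory sits in single 4kB pages. The snippet below is only an illustration of how to read that line; the numbers are taken directly from the console output.

```python
# Parse the "Node 0 DMA32" buddy line from the console log above.
# Each "N*SkB" entry means N free blocks of size S; the failed ptlrpcd_0
# allocation was order:1, i.e. it needed one contiguous 8kB block.
buddy_line = ("10659*4kB 2*8kB 2*16kB 2*32kB 2*64kB 0*128kB "
              "1*256kB 1*512kB 1*1024kB 0*2048kB 1*4096kB")

free_blocks = {}
for entry in buddy_line.split():
    count, size = entry.split("*")
    free_blocks[int(size.rstrip("kB"))] = int(count)

total_kb = sum(size * count for size, count in free_blocks.items())
order0_kb = free_blocks[4] * 4

print(f"total free: {total_kb}kB")   # matches the logged 48764kB
print(f"in order-0 (4kB) pages: {order0_kb}kB "
      f"({100 * order0_kb / total_kb:.0f}%)")
```

Roughly 87% of the free DMA32 memory is in order-0 pages, so under pressure there are very few higher-order blocks left for GFP-constrained allocations like the one in LNetMDAttach.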
      

      Attachments

        Issue Links

          Activity

            sarah Sarah Liu added a comment -

            hit this bug in interop testing between 2.6 server and 2.5 server

            https://maloo.whamcloud.com/test_sets/f380a6a4-5beb-11e3-8bdd-52540035b04c


            bzzz Alex Zhuravlev added a comment -

            Andreas, have a look at the proto: http://review.whamcloud.com/#/c/8003/ - the idea is to signal the client that the file has been removed, so the client can reset nlink and let the kernel drop the inode and the pages. It's not intended for landing yet, but I'd like to hear your opinion on the approach.

            bzzz Alex Zhuravlev added a comment -

            To clarify a bit: given that OST_DESTROY isn't executed immediately (but only after commit on the MDS, at least), there is a window when the MDT object has been destroyed (in cache) but the OST objects haven't. If the kernel decides to flush dirty pages at that point, the client code will try to revalidate the layout (which was invalidated by UNLINK). This way we get lu-objects in memory. They are supposed to be purged at some point (given that no access can be made after OST_DESTROY). I don't think this is a big issue, but it would be nice if we could use the layout lock for this purpose as well.
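            The window Alex describes can be sketched as a toy model (this is not Lustre code; the FID string and function names are placeholders purely for illustration):

```python
# Toy model of the race window: between the MDT unlink and the deferred
# OST_DESTROY, a dirty-page flush triggers a layout ENQUEUE, which
# re-creates an lu-object on the MDS for an already-removed file.
mds_lu_cache = set()          # lu-objects cached on the MDS (hypothetical)

def unlink(fid):
    mds_lu_cache.discard(fid)         # MDT object destroyed in cache ...
    return fid                        # ... but OST_DESTROY is deferred

def flush_dirty_pages(fid):
    # client revalidates the layout (invalidated by UNLINK) -> layout ENQUEUE
    mds_lu_cache.add(fid)             # MDS re-instantiates an lu-object

def ost_destroy(fid):
    mds_lu_cache.discard(fid)         # object finally purged after commit

fid = "0x200000401:0x1:0x0"           # placeholder FID
unlink(fid)
flush_dirty_pages(fid)                # happens inside the window
assert fid in mds_lu_cache            # lu-object pinned for a removed file
ost_destroy(fid)
assert fid not in mds_lu_cache        # purged once OST_DESTROY runs
```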

            bzzz Alex Zhuravlev added a comment -

            The root cause seems to be dirty pages remaining on the client, which cause ENQUEUEs for the layouts, which in turn populate the MDS cache with lu-objects for already-removed files.

            bzzz Alex Zhuravlev added a comment -

            Sure, I'll try to reproduce with dbench.

            adilger Andreas Dilger added a comment -

            If the VM were just retaining the slabs, then they would not be marked active, I think. Also, near the end of my 1h dbench run there was starting to be considerable memory pressure on the other slabs, so these should have been shrunk at that time if they were just in the per-cpu cache.

            It may be that a workload different from createmany/unlinkmany is needed? For example, neither of these operations does a lookup, readdir, stat, or any number of other combinations. I don't think a 3600s dbench run is needed; I was just doing that to see whether there is a long-term increase in memory use (which there is). Probably even a short run with full +malloc tracing would be enough.
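            A minimal sketch of the kind of mixed metadata workload suggested above: unlike createmany/unlinkmany, it interleaves create, stat, readdir, lookup, and unlink. The directory and file count here are placeholders (a temp dir, so the sketch runs anywhere); in practice DIR would point at a Lustre client mount to exercise the MDS lu-object cache.

```python
# Hedged sketch of a mixed metadata workload (placeholder paths/counts).
import os
import tempfile

DIR = tempfile.mkdtemp()      # placeholder; use a Lustre mount in practice
N = 100                       # placeholder file count

for i in range(N):
    path = os.path.join(DIR, f"f{i}")
    with open(path, "w") as f:        # create
        f.write("x")
    os.stat(path)                     # stat (getattr)

names = os.listdir(DIR)               # readdir
assert len(names) == N

for name in names:
    os.stat(os.path.join(DIR, name))  # lookup + stat again
    os.unlink(os.path.join(DIR, name))

assert os.listdir(DIR) == []
os.rmdir(DIR)
```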

            People

              bogl Bob Glossman (Inactive)
              maloo Maloo
              Votes: 0
              Watchers: 9
