  Lustre / LU-4033

Failure on test suite parallel-scale-nfsv4 test_iorssf: MDS oom

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.5.0
    • Component/s: None
    • Environment: server and client: lustre-master build # 1687;
      client is running SLES11 SP2
    • Severity: 3
    • Rank: 10835

    Description

      This issue was created by maloo for sarah <sarah@whamcloud.com>

      This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/960b8b64-2915-11e3-b598-52540035b04c.

      The sub-test test_iorssf failed with the following error:

      test failed to respond and timed out

      MDS console

      17:14:54:ptlrpcd_0: page allocation failure. order:1, mode:0x40
      17:14:55:Pid: 2780, comm: ptlrpcd_0 Not tainted 2.6.32-358.18.1.el6_lustre.x86_64 #1
      17:14:56:Call Trace:
      17:14:57: [<ffffffff8112c257>] ? __alloc_pages_nodemask+0x757/0x8d0
      17:14:58: [<ffffffff81166d92>] ? kmem_getpages+0x62/0x170
      17:14:59: [<ffffffff811679aa>] ? fallback_alloc+0x1ba/0x270
      17:14:59: [<ffffffff811673ff>] ? cache_grow+0x2cf/0x320
      17:14:59: [<ffffffff81167729>] ? ____cache_alloc_node+0x99/0x160
      17:14:59: [<ffffffffa0538ed7>] ? LNetMDAttach+0x157/0x5a0 [lnet]
      17:14:59: [<ffffffff811684f9>] ? __kmalloc+0x189/0x220
      17:14:59: [<ffffffffa0538ed7>] ? LNetMDAttach+0x157/0x5a0 [lnet]
      17:15:00: [<ffffffffa0771b35>] ? ptlrpc_register_bulk+0x265/0x9d0 [ptlrpc]
      17:15:00: [<ffffffffa0773a12>] ? ptl_send_rpc+0x232/0xc40 [ptlrpc]
      17:15:00: [<ffffffff81281b74>] ? snprintf+0x34/0x40
      17:15:01: [<ffffffffa0488761>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      17:15:01: [<ffffffffa07685f4>] ? ptlrpc_send_new_req+0x454/0x790 [ptlrpc]
      17:15:02: [<ffffffffa076c368>] ? ptlrpc_check_set+0x888/0x1b40 [ptlrpc]
      17:15:02: [<ffffffffa079801b>] ? ptlrpcd_check+0x53b/0x560 [ptlrpc]
      17:15:03: [<ffffffffa079853b>] ? ptlrpcd+0x20b/0x370 [ptlrpc]
      17:15:03: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
      17:15:03: [<ffffffffa0798330>] ? ptlrpcd+0x0/0x370 [ptlrpc]
      17:15:03: [<ffffffff81096a36>] ? kthread+0x96/0xa0
      17:15:03: [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
      17:15:04: [<ffffffff810969a0>] ? kthread+0x0/0xa0
      17:15:04: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      17:15:06:Mem-Info:
      17:15:06:Node 0 DMA per-cpu:
      17:15:06:CPU    0: hi:    0, btch:   1 usd:   0
      17:15:06:Node 0 DMA32 per-cpu:
      17:15:06:CPU    0: hi:  186, btch:  31 usd:  42
      17:15:06:active_anon:2345 inactive_anon:2732 isolated_anon:0
      17:15:07: active_file:110430 inactive_file:238985 isolated_file:0
      17:15:07: unevictable:0 dirty:3 writeback:0 unstable:0
      17:15:07: free:14257 slab_reclaimable:7260 slab_unreclaimable:76976
      17:15:07: mapped:2551 shmem:41 pagetables:794 bounce:0
      17:15:08:Node 0 DMA free:8264kB min:332kB low:412kB high:496kB active_anon:0kB inactive_anon:0kB active_file:272kB inactive_file:5444kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15324kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:36kB slab_unreclaimable:1700kB kernel_stack:16kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
      17:15:08:lowmem_reserve[]: 0 2003 2003 2003
      17:15:09:Node 0 DMA32 free:48764kB min:44720kB low:55900kB high:67080kB active_anon:9380kB inactive_anon:10928kB active_file:441448kB inactive_file:950496kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2052064kB mlocked:0kB dirty:12kB writeback:0kB mapped:10204kB shmem:164kB slab_reclaimable:29004kB slab_unreclaimable:306204kB kernel_stack:1984kB pagetables:3176kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
      17:15:09:lowmem_reserve[]: 0 0 0 0
      17:15:10:Node 0 DMA: 58*4kB 104*8kB 102*16kB 42*32kB 6*64kB 2*128kB 2*256kB 2*512kB 0*1024kB 1*2048kB 0*4096kB = 8264kB
      17:15:11:Node 0 DMA32: 10659*4kB 2*8kB 2*16kB 2*32kB 2*64kB 0*128kB 1*256kB 1*512kB 1*1024kB 0*2048kB 1*4096kB = 48764kB
      17:15:11:269122 total pagecache pages
      17:15:11:28 pages in swap cache
      17:15:11:Swap cache stats: add 62, delete 34, find 18/22
      17:15:11:Free swap  = 4128648kB
      17:15:12:Total swap = 4128760kB
      17:15:12:524284 pages RAM
      17:15:12:43669 pages reserved
      17:15:13:282260 pages shared
      17:15:13:194054 pages non-shared
      17:15:14:LNetError: 2780:0:(lib-lnet.h:457:lnet_md_alloc()) LNET: out of memory at /var/lib/jenkins/workspace/lustre-master/arch/x86_64/build_type/server/distro/el6/ib_stack/inkernel/BUILD/BUILD/lustre-2.4.93/lnet/include/lnet/lib-lnet.h:457 (tried to alloc '(md)' = 4208)
      17:15:14:LNetError: 2780:0:(lib-lnet.h:457:lnet_md_alloc()) LNET: 55064047 total bytes allocated by lnet
      17:15:15:LustreError: 2780:0:(niobuf.c:376:ptlrpc_register_bulk()) lustre-OST0002-osc-ffff88006f296400: LNetMDAttach failed x1447417177531472/0: rc = -12
      


          Activity

            sarah Sarah Liu added a comment -

            Hit this bug in interop testing between a 2.6 server and a 2.5 server:

            https://maloo.whamcloud.com/test_sets/f380a6a4-5beb-11e3-8bdd-52540035b04c


            bzzz Alex Zhuravlev added a comment -

            Andreas, have a look at the prototype: http://review.whamcloud.com/#/c/8003/ - the idea is to signal the client that the file has been removed, so the client can reset nlink and let the kernel drop the inode and its pages. It is not intended for landing yet, but I'd like to hear your opinion on the approach.

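            For illustration only, a minimal sketch of the general idea (not the actual change in http://review.whamcloud.com/#/c/8003/), assuming a hypothetical client-side callback ll_handle_object_removed() that runs when the server signals the file is gone: clear nlink and drop the cached pages, so the VFS evicts the inode instead of flushing pages that would trigger layout re-enqueues.

                /* Hypothetical sketch -- ll_handle_object_removed() is not an existing
                 * Lustre function; it stands in for whatever hook the prototype adds
                 * when the server reports the file as removed. */
                #include <linux/fs.h>
                #include <linux/mm.h>

                static void ll_handle_object_removed(struct inode *inode)
                {
                        /* Mark the inode as unlinked so the VFS evicts it (and the
                         * objects backing it) on the final iput(). */
                        clear_nlink(inode);

                        /* Drop any cached pages right away; once the OST objects are
                         * destroyed these pages can never be written back. */
                        truncate_inode_pages(inode->i_mapping, 0);
                }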

            bzzz Alex Zhuravlev added a comment -

            To clarify a bit: given that OST_DESTROY isn't executed immediately (only after commit on the MDS, at the earliest), there is a window in which the MDT object is already destroyed (in cache) but the OST objects are not. If the kernel decides to flush dirty pages at that point, the client code will try to revalidate the layout (which was invalidated by the UNLINK). This is how we end up with lu-objects in memory. They are supposed to be purged at some point (given that no access can be made after OST_DESTROY). I don't think this is a big issue, but it would be nice if we could use the layout lock for this purpose as well.


            bzzz Alex Zhuravlev added a comment -

            The root cause seems to be dirty pages remaining on the client, which cause ENQUEUEs for the layouts, which in turn populate the MDS cache with lu-objects for already-removed files.


            bzzz Alex Zhuravlev added a comment -

            Sure, I'll try to reproduce with dbench.


            adilger Andreas Dilger added a comment -

            If the VM were just retaining the slabs, then they would not be marked active, I think. Also, near the end of my 1h dbench run there was starting to be considerable memory pressure on the other slabs, so these should have been shrunk at that time if they were just in the per-cpu cache.

            It may be that a workload different from createmany/unlinkmany is needed? For example, neither of these operations does a lookup, readdir, or stat, or any number of other combinations. I don't think a 3600s dbench run is needed; I was just doing that to see if there is a long-term increase in memory use (which there is). Probably even a short run with full +malloc tracing would be enough.

            bzzz Alex Zhuravlev added a comment -

            I did a simple test:

            1. grep mdt_obj /proc/slabinfo
               mdt_obj 28 28 280 14 1 : tunables 32 16 8 : slabdata 2 2 0 : globalstat 19938 5040 1413 23
            2. ./createmany -o /mnt/lustre/d0/f 10000
               total: 10000 creates in 2.97 seconds: 3366.59 creates/second
            3. grep mdt_obj /proc/slabinfo
               mdt_obj 10038 10038 280 14 1 : tunables 32 16 8 : slabdata 717 717 0 : globalstat 29948 10038 2128 23
            4. ./unlinkmany /mnt/lustre/d0/f 10000
               unlinked 0 (time 1380765784 ; total 0 ; last 0)
            5. grep mdt_obj /proc/slabinfo
               mdt_obj 206 280 280 14 1 : tunables 32 16 8 : slabdata 17 20 128 : globalstat 39958 10038 2843 32 0 4 0 0 0 : cpustat 37165 2866 37516 2487

            Then, a few dozen seconds later:

            6. grep mdt_obj /proc/slabinfo
               mdt_obj 28 28 280 14 1 : tunables 32 16 8 : slabdata 2 2 0 : globalstat 39958 10038 2843 41

            Probably the MM just retains lots of actually-free pieces in per-cpu caches or something like that?

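            One way to tell "freed but still cached" slab objects from genuinely pinned ones is to force the registered shrinkers to run and then re-read /proc/slabinfo: objects that are only parked in per-cpu or partial slab pages (or on a shrinker's LRU) go away, while objects pinned by leaked references do not. A minimal sketch, assuming the standard /proc/sys/vm/drop_caches interface and run as root on the MDS:

                /* drop_slab.c - ask the kernel to reclaim slab caches, then compare
                 * "grep mdt_obj /proc/slabinfo" before and after running this. */
                #include <stdio.h>
                #include <stdlib.h>

                int main(void)
                {
                        FILE *f = fopen("/proc/sys/vm/drop_caches", "w");

                        if (f == NULL) {
                                perror("open /proc/sys/vm/drop_caches");
                                return EXIT_FAILURE;
                        }
                        /* "2" drops reclaimable slab objects (dentries, inodes, and
                         * anything released via registered shrinkers). */
                        fputs("2\n", f);
                        fclose(f);
                        return EXIT_SUCCESS;
                }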

            adilger Andreas Dilger added a comment -

            This bug is intended to track the problem of MDS-side objects not being freed (the mdd_obj, lod_obj, and mdt_obj slabs). The LU-4053 ticket tracks the client-side CLIO objects not being freed.

            I think there is just something wrong in the MDS stack such that it is not destroying the whole lu_obj (or whatever?) when an object is unlinked, and it is only freed at unmount time or possibly very slowly under memory pressure. It doesn't make any sense to keep objects in memory for FIDs that have been deleted.

            green Oleg Drokin added a comment -

            I think it's still not related to any real object leakage, otherwise we would have noticed it by other means.
            Most likely it's just like LU-4053: we stopped proactively clearing inodes from the cache in 2.3.


            bogl Bob Glossman (Inactive) added a comment -

            I'm wondering if there's a low-rate flaw in the reference counting of objects. It occurs to me that all it would take is one seldom-used code path with one too many xxx_object_get() calls, or one too few xxx_object_put() calls, and the reference count would never drop to 0. If the reference count is wrong, that could explain the objects never being freed. Over time the population of unreleased objects would grow.

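            As a toy illustration of that failure mode (hypothetical code, not from the Lustre tree): a single rarely taken error path that skips its matching put leaves the count permanently above zero, so the object is never freed and the population of stale objects grows over time.

                /* refleak.c - minimal example of an unbalanced get/put pair. */
                #include <stdio.h>
                #include <stdlib.h>

                struct obj {
                        int refcount;
                };

                static struct obj *obj_get(struct obj *o)
                {
                        o->refcount++;
                        return o;
                }

                static void obj_put(struct obj *o)
                {
                        if (--o->refcount == 0) {
                                printf("object freed\n");
                                free(o);
                        }
                }

                static void rare_path(struct obj *o, int error)
                {
                        obj_get(o);
                        if (error)
                                return;         /* BUG: early return skips obj_put() */
                        /* ... normal processing ... */
                        obj_put(o);
                }

                int main(void)
                {
                        struct obj *o = obj_get(calloc(1, sizeof(*o)));  /* creation ref */

                        rare_path(o, 1);        /* seldom-used error path leaks a ref */
                        obj_put(o);             /* drops the creation ref only */
                        printf("refcount = %d -- object leaked\n", o->refcount);
                        return 0;
                }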

            People

              Assignee: bogl Bob Glossman (Inactive)
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 9
