
Interop 2.4.0<->2.5 failure on test suite parallel-scale-nfsv4 test_iorssf: MDS OOM

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major
    • Affects Version/s: Lustre 2.5.0
    • Environment: client: 2.4.0
      server: lustre-master build # 1652
    • Severity: 3
    • 10306

    Description

      This issue was created by maloo for sarah <sarah@whamcloud.com>

      This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/f9f7ec36-15c3-11e3-a83f-52540035b04c.

      The sub-test test_iorssf failed with the following error:

      test failed to respond and timed out

      MDS console:

      12:37:48:Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test iorssf: iorssf == 12:37:39 (1378323459)
      12:37:48:Lustre: DEBUG MARKER: lfs setstripe /mnt/lustre/d0.ior.ssf -c -1
      12:38:00:Lustre: MGS: Client a0cabda3-b9a7-2ed3-58fd-f9e8cebfe558 (at 10.10.4.199@tcp) reconnecting
      12:38:00:Lustre: lustre-MDT0000: Client lustre-MDT0000-lwp-OST0000_UUID (at 10.10.4.199@tcp) reconnecting
      12:43:36:ptlrpcd_0: page allocation failure. order:1, mode:0x40
      12:43:36:Pid: 2733, comm: ptlrpcd_0 Not tainted 2.6.32-358.18.1.el6_lustre.x86_64 #1
      12:43:37:Call Trace:
      12:43:37: [<ffffffff8112c257>] ? __alloc_pages_nodemask+0x757/0x8d0
      12:43:37: [<ffffffffa0767b0f>] ? ptlrpc_set_add_new_req+0xcf/0x150 [ptlrpc]
      12:43:37: [<ffffffff81166d92>] ? kmem_getpages+0x62/0x170
      12:43:37: [<ffffffff811679aa>] ? fallback_alloc+0x1ba/0x270
      12:43:37: [<ffffffff811673ff>] ? cache_grow+0x2cf/0x320
      12:43:37: [<ffffffff81167729>] ? ____cache_alloc_node+0x99/0x160
      12:43:37: [<ffffffffa053cea7>] ? LNetMDAttach+0x157/0x5a0 [lnet]
      12:43:37: [<ffffffff811684f9>] ? __kmalloc+0x189/0x220
      12:43:38: [<ffffffffa053cea7>] ? LNetMDAttach+0x157/0x5a0 [lnet]
      12:43:38: [<ffffffffa0775935>] ? ptlrpc_register_bulk+0x265/0x9d0 [ptlrpc]
      12:43:38: [<ffffffffa07777f2>] ? ptl_send_rpc+0x232/0xc40 [ptlrpc]
      12:43:38: [<ffffffff81281b74>] ? snprintf+0x34/0x40
      12:43:38: [<ffffffffa0489951>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      12:43:38: [<ffffffffa076c3f4>] ? ptlrpc_send_new_req+0x454/0x790 [ptlrpc]
      12:43:38: [<ffffffffa0770148>] ? ptlrpc_check_set+0x888/0x1b30 [ptlrpc]
      12:43:38: [<ffffffffa079bb3b>] ? ptlrpcd_check+0x53b/0x560 [ptlrpc]
      12:43:38: [<ffffffff8109715c>] ? remove_wait_queue+0x3c/0x50
      12:43:38: [<ffffffffa079bfc0>] ? ptlrpcd+0x190/0x380 [ptlrpc]
      12:43:38: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
      12:43:38: [<ffffffffa079be30>] ? ptlrpcd+0x0/0x380 [ptlrpc]
      12:43:39: [<ffffffff81096a36>] ? kthread+0x96/0xa0
      12:43:40: [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
      12:43:40: [<ffffffff810969a0>] ? kthread+0x0/0xa0
      12:43:40: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      12:43:41:Mem-Info:
      12:43:41:Node 0 DMA per-cpu:
      12:43:41:CPU    0: hi:    0, btch:   1 usd:   0
      12:43:41:Node 0 DMA32 per-cpu:
      12:43:41:CPU    0: hi:  186, btch:  31 usd: 202
      12:43:42:active_anon:2089 inactive_anon:2411 isolated_anon:0
      12:43:42: active_file:60821 inactive_file:275748 isolated_file:32
      12:43:42: unevictable:0 dirty:34820 writeback:16128 unstable:0
      12:43:42: free:17978 slab_reclaimable:5665 slab_unreclaimable:87698
      12:43:42: mapped:2558 shmem:41 pagetables:793 bounce:0
      12:43:43:Node 0 DMA free:8276kB min:332kB low:412kB high:496kB active_anon:0kB inactive_anon:0kB active_file:304kB inactive_file:5516kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15324kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:16kB slab_unreclaimable:1620kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
      12:43:43:lowmem_reserve[]: 0 2003 2003 2003
      12:43:43:Node 0 DMA32 free:63636kB min:44720kB low:55900kB high:67080kB active_anon:8356kB inactive_anon:9644kB active_file:242980kB inactive_file:1097476kB unevictable:0kB isolated(anon):0kB isolated(file):128kB present:2052064kB mlocked:0kB dirty:139280kB writeback:64512kB mapped:10232kB shmem:164kB slab_reclaimable:22644kB slab_unreclaimable:349172kB kernel_stack:2024kB pagetables:3172kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:32 all_unreclaimable? no
      12:43:43:lowmem_reserve[]: 0 0 0 0
      12:43:43:Node 0 DMA: 47*4kB 5*8kB 21*16kB 9*32kB 10*64kB 5*128kB 6*256kB 1*512kB 2*1024kB 1*2048kB 0*4096kB = 8276kB
      12:43:43:Node 0 DMA32: 13749*4kB 30*8kB 7*16kB 7*32kB 4*64kB 15*128kB 1*256kB 1*512kB 1*1024kB 0*2048kB 1*4096kB = 63636kB
      12:43:43:299076 total pagecache pages
      12:43:44:0 pages in swap cache
      12:43:44:Swap cache stats: add 0, delete 0, find 0/0
      12:43:45:Free swap  = 4128760kB
      12:43:45:Total swap = 4128760kB
      12:43:45:524284 pages RAM
      12:43:45:43669 pages reserved
      12:43:46:330512 pages shared
      12:43:46:160098 pages non-shared
      12:43:46:LNetError: 2733:0:(lib-lnet.h:457:lnet_md_alloc()) LNET: out of memory at /var/lib/jenkins/workspace/lustre-master/arch/x86_64/build_type/server/distro/el6/ib_stack/inkernel/BUILD/BUILD/lustre-2.4.92/lnet/include/lnet/lib-lnet.h:457 (tried to alloc '(md)' = 4208)
      12:43:46:LNetError: 2733:0:(lib-lnet.h:457:lnet_md_alloc()) LNET: 55454455 total bytes allocated by lnet
      12:43:46:LustreError: 2733:0:(niobuf.c:376:ptlrpc_register_bulk()) lustre-OST0001-osc-ffff88006a434c00: LNetMDAttach failed x1445262125420992/0: rc = -12
      12:43:46:Lustre: 2733:0:(client.c:1896:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 0/real 0]  req@ffff880024150400 x1445262125420992/t0(0) o4->lustre-OST0001-osc-ffff88006a434c00@10.10.4.199@tcp:6/4 lens 488/448 e 0 to 1 dl 0 ref 2 fl Rpc:X/0/ffffffff rc -12/-1
      12:43:47:Lustre: lustre-OST0001-osc-ffff88006a434c00: Connection to lustre-OST0001 (at 10.10.4.199@tcp) was lost; in progress operations using this service will wait for recovery to complete
      12:43:47:LustreError: 11-0: lustre-OST0001-osc-ffff88006a434c00: Communicating with 10.10.4.199@tcp, operation ost_connect failed with -16.
      12:43:47:LustreError: Skipped 1 previous similar message
      12:43:47:Lustre: lustre-OST0001-osc-ffff88006a434c00: Connection restored to lustre-OST0001 (at 10.10.4.199@tcp)
      12:44:19:nfsd: page allocation failure. order:1, mode:0x40
      12:44:19:Pid: 10656, comm: nfsd Not tainted 2.6.32-358.18.1.el6_lustre.x86_64 #1
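
      The buddy-allocator dump above can be parsed to show why the order-1 request struggles: the failing lnet_md_alloc() of 4208 bytes (just over one 4 KiB page) needs an order-1 (8 KiB, two contiguous pages) slab page, and only a small fraction of the free DMA32 memory sits in order >= 1 blocks. A minimal parsing sketch, not part of the original report (the dump is a snapshot taken after the failure, so it is illustrative only):

      ```python
      # Parse the "Node 0 DMA32:" buddy-allocator line from the console log
      # and split the free memory into order-0 vs order >= 1 blocks.
      import re

      def parse_buddyinfo(line):
          """Return {order: block_count} from a 'N*SIZEkB' buddy dump line."""
          counts = {}
          for count, size_kb in re.findall(r"(\d+)\*(\d+)kB", line):
              order = (int(size_kb) // 4).bit_length() - 1  # 4 kB pages -> order
              counts[order] = int(count)
          return counts

      dma32 = ("13749*4kB 30*8kB 7*16kB 7*32kB 4*64kB 15*128kB 1*256kB "
               "1*512kB 1*1024kB 0*2048kB 1*4096kB = 63636kB")
      counts = parse_buddyinfo(dma32)
      total_kb = sum(n * (4 << order) for order, n in counts.items())
      order1_plus_kb = sum(n * (4 << order) for order, n in counts.items()
                           if order >= 1)
      print(total_kb, order1_plus_kb)  # -> 63636 8640
      ```

      Of the 63636 kB free in DMA32, only 8640 kB is in contiguous blocks of order 1 or higher, i.e. most of the free memory is fragmented into single pages.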
      

Attachments

Issue Links

Activity

            [LU-3910] Interop 2.4.0<->2.5 failure on test suite parallel-scale-nfsv4 test_iorssf: MDS OOM

            adilger Andreas Dilger added a comment:

            Shows mode:0x40 == __GFP_IO, but missing __GFP_WAIT from LU-4357.
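
            The flag decoding behind that comment can be checked against the 2.6.32-era gfp_t bit values (a sketch, not part of the original report): mode 0x40 carries __GFP_IO but not __GFP_WAIT, so the allocator may not sleep to reclaim or compact memory, making an order-1 request fail easily under fragmentation.

            ```python
            # Decode the low bits of a 2.6.32-era gfp_t allocation mode.
            GFP_BITS = {  # bit values from include/linux/gfp.h in 2.6.32
                0x10: "__GFP_WAIT",
                0x20: "__GFP_HIGH",
                0x40: "__GFP_IO",
                0x80: "__GFP_FS",
            }

            def decode_gfp(mode):
                return [name for bit, name in sorted(GFP_BITS.items()) if mode & bit]

            print(decode_gfp(0x40))  # the failing allocation: ['__GFP_IO']
            print(decode_gfp(0x50))  # GFP_NOFS = __GFP_WAIT | __GFP_IO
            ```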

            adilger Andreas Dilger added a comment:

            Please note that interop testing NFS with the MDS as the NFS server is not useful, since the Lustre 2.4.0 client will not be exercised at all. It would be better to run the NFS server on the Lustre 2.4.0 client, which is closer to how it will be used in real life. This would also avoid the memory problems on the MDS.

            How is a change like this made to the testing system?

            adilger Andreas Dilger added a comment:

            There are two problems here:

            • The NFS server is running on the MDS, which has only 1.83GB of memory
            • The client on the NFS server is holding a large number of data pages pinned, as in LU-2139

            I'm going to leave this bug open to focus on changing this NFS test to run on MDS nodes with more memory, so that the testing does not fail.
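
            The 1.83GB figure can be cross-checked against the Mem-Info dump in the description (a quick arithmetic sketch, not part of the original report): usable memory is total pages minus reserved pages, at 4 KiB per x86_64 page.

            ```python
            # Cross-check "1.83GB of memory" against the MDS console Mem-Info dump.
            PAGE_SIZE = 4096          # x86_64 page size in bytes
            total_pages = 524284      # "524284 pages RAM"
            reserved_pages = 43669    # "43669 pages reserved"

            usable_gib = (total_pages - reserved_pages) * PAGE_SIZE / 2**30
            print(round(usable_gib, 2))  # -> 1.83
            ```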

            People

              Assignee: wc-triage WC Triage
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 4

              Dates

                Created:
                Updated:
                Resolved: