
LU-2468: MDS out of memory, blocked in ldlm_pools_shrink()

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Lustre 2.4.0
    • Lustre 2.4.0
    • None
    • 3
    • 5814

    Description

      This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

      This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/35a3873c-4166-11e2-af91-52540035b04c.

      The sub-test test_24a failed with the following error on the MDS console:

      test failed to respond and timed out
      11:14:17:Lustre: DEBUG MARKER: == conf-sanity test 24a: Multiple MDTs on a single node == 11:13:50 (1355166830)
      11:14:17:Lustre: DEBUG MARKER: grep -c /mnt/fs2mds' ' /proc/mounts
      11:14:17:Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
      11:14:17:Lustre: DEBUG MARKER: mkfs.lustre --mgsnode=client-29vm3@tcp --fsname=lustre --mdt --index=0 --param=sys.timeout=20 --param=lov.stripesize=1048576 --param=lov.stripecount=0 --param=mdt.identity_upcall=/usr/sbin/l_getidentity --backfstype=ldiskfs --device-size=2097152 --mkfsopt
      11:14:17:LDISKFS-fs (dm-1): mounted filesystem with ordered data mode. quota=on. Opts:
      11:14:17:Lustre: DEBUG MARKER: mkdir -p /mnt/mds1
      11:14:17:Lustre: DEBUG MARKER: test -b /dev/lvm-MDS/P1
      11:14:17:Lustre: DEBUG MARKER: mkdir -p /mnt/mds1; mount -t lustre -o user_xattr,acl /dev/lvm-MDS/P1 /mnt/mds1
      11:14:17:LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. quota=on. Opts:
      11:14:17:Lustre: MGC10.10.4.174@tcp: Reactivating import
      11:14:17:Lustre: lustre-MDT0000: used disk, loading
      11:14:17:__ratelimit: 582 callbacks suppressed
      11:14:17:cannot allocate a tage (0)
      11:14:17:cannot allocate a tage (0)
      11:14:17:cannot allocate a tage (0)
      11:14:17:cannot allocate a tage (0)
      11:14:17:cannot allocate a tage (0)
      11:14:17:cannot allocate a tage (0)
      11:14:17:cannot allocate a tage (0)

      11:15:55:LNet: Service thread pid 16764 was inactive for 40.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      11:15:55:Pid: 16764, comm: mdt01_001
      11:15:55:
      11:15:55:Call Trace:
      11:15:55: [<ffffffffa0e19d97>] ? cfs_hash_bd_lookup_intent+0x37/0x130 [libcfs]
      11:15:55: [<ffffffff814ffa2e>] ? __mutex_lock_slowpath+0x13e/0x180
      11:15:55: [<ffffffffa051a691>] ? ldlm_cli_pool_shrink+0x71/0x130 [ptlrpc]
      11:15:55: [<ffffffffa0e14591>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      11:15:55: [<ffffffff814ff8cb>] ? mutex_lock+0x2b/0x50
      11:15:55: [<ffffffffa0568e40>] ? enc_pools_shrink+0x3d0/0x560 [ptlrpc]
      11:15:55: [<ffffffffa05198c3>] ? ldlm_pools_srv_shrink+0x13/0x20 [ptlrpc]
      11:15:55: [<ffffffff8112d34a>] ? shrink_slab+0x8a/0x1a0
      11:15:55: [<ffffffff8112f36f>] ? do_try_to_free_pages+0x2ff/0x520
      11:15:55: [<ffffffff8100bc0e>] ? apic_timer_interrupt+0xe/0x20
      11:15:55: [<ffffffff8112f77d>] ? try_to_free_pages+0x9d/0x130
      11:15:55: [<ffffffff811308d0>] ? isolate_pages_global+0x0/0x350
      11:15:55: [<ffffffff8112758d>] ? __alloc_pages_nodemask+0x40d/0x940
      11:15:55: [<ffffffff81162372>] ? kmem_getpages+0x62/0x170
      11:15:55: [<ffffffff81162f8a>] ? fallback_alloc+0x1ba/0x270
      11:15:55: [<ffffffff811629df>] ? cache_grow+0x2cf/0x320
      11:15:55: [<ffffffff81162d09>] ? ____cache_alloc_node+0x99/0x160
      11:15:55: [<ffffffff81163aeb>] ? kmem_cache_alloc+0x11b/0x190
      11:15:55: [<ffffffffa0e04af2>] ? cfs_mem_cache_alloc+0x22/0x30 [libcfs]
      11:15:55: [<ffffffffa071b00a>] ? osc_session_init+0x3a/0x200 [osc]
      11:15:55: [<ffffffffa0ee0baf>] ? keys_fill+0x6f/0x1a0 [obdclass]
      11:15:55: [<ffffffffa0ee499b>] ? lu_context_init+0xab/0x260 [obdclass]
      11:15:55: [<ffffffffa0542db4>] ? ptlrpc_server_handle_request+0x194/0xe00 [ptlrpc]
      11:15:55: [<ffffffffa0e0465e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
      11:15:55: [<ffffffffa0e160ef>] ? lc_watchdog_touch+0x6f/0x180 [libcfs]
      11:15:55: [<ffffffffa053a429>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
      11:15:55: [<ffffffffa0e14591>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      11:15:55: [<ffffffff81053463>] ? __wake_up+0x53/0x70
      11:15:55: [<ffffffffa05445d5>] ? ptlrpc_main+0xbb5/0x1970 [ptlrpc]
      11:15:55: [<ffffffffa0543a20>] ? ptlrpc_main+0x0/0x1970 [ptlrpc]

      11:31:32:INFO: task mount.lustre:17152 blocked for more than 120 seconds.
      11:31:32:mount.lustre D 0000000000000000 0 17152 1 0x00100084
      11:31:32: ffff88001e843698 0000000000000082 0000000000000000 ffffffff81060af3
      11:31:32: 0000000050c63493 0000000000000282 0000000000800500 0000000000000000
      11:31:32: ffff88001a3e3098 ffff88001e843fd8 000000000000fb88 ffff88001a3e3098
      11:31:32:Call Trace:
      11:31:32: [<ffffffff81060af3>] ? wake_up_new_task+0xd3/0x120
      11:31:32: [<ffffffff814ff245>] schedule_timeout+0x215/0x2e0
      11:31:32: [<ffffffffa0e840b0>] ? llog_process_thread_daemonize+0x0/0x80 [obdclass]
      11:31:32: [<ffffffff8100c0e2>] ? kernel_thread+0x82/0xe0
      11:31:32: [<ffffffffa0e840b0>] ? llog_process_thread_daemonize+0x0/0x80 [obdclass]
      11:31:32: [<ffffffff814feec3>] wait_for_common+0x123/0x180
      11:31:32: [<ffffffff810602c0>] ? default_wake_function+0x0/0x20
      11:31:32: [<ffffffffa0e0b77a>] ? cfs_create_thread+0x7a/0xa0 [libcfs]
      11:31:32: [<ffffffffa0ec8650>] ? class_config_llog_handler+0x0/0x1850 [obdclass]
      11:31:32: [<ffffffff814fefdd>] wait_for_completion+0x1d/0x20
      11:31:32: [<ffffffffa0e85ae3>] llog_process_or_fork+0x333/0x660 [obdclass]
      11:31:32: [<ffffffffa0e85e24>] llog_process+0x14/0x20 [obdclass]
      11:31:32: [<ffffffffa0ebdd64>] class_config_parse_llog+0x1e4/0x340 [obdclass]
      11:31:32: [<ffffffffa0821ced>] mgc_process_cfg_log+0x5cd/0x1600 [mgc]
      11:31:32: [<ffffffffa0823163>] mgc_process_log+0x443/0x1350 [mgc]
      11:31:32: [<ffffffffa0e14591>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      11:31:32: [<ffffffffa0e14591>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      11:31:32: [<ffffffffa081da80>] ? mgc_blocking_ast+0x0/0x780 [mgc]
      11:31:32: [<ffffffffa0509930>] ? ldlm_completion_ast+0x0/0x980 [ptlrpc]
      11:31:32: [<ffffffffa0825974>] mgc_process_config+0x594/0xee0 [mgc]
      11:31:32: [<ffffffffa0ecfc1c>] lustre_process_log+0x25c/0xad0 [obdclass]
      11:31:32: [<ffffffffa0e14591>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      11:31:32: [<ffffffffa0e0f028>] ? libcfs_log_return+0x28/0x40 [libcfs]
      11:31:32: [<ffffffffa0ed1691>] ? server_register_mount+0x551/0x8f0 [obdclass]
      11:31:32: [<ffffffffa0edd607>] server_start_targets+0x5c7/0x18f0 [obdclass]
      11:31:32: [<ffffffffa0e0f028>] ? libcfs_log_return+0x28/0x40 [libcfs]
      11:31:32: [<ffffffffa0ed86a0>] ? lustre_start_mgc+0x4e0/0x1bc0 [obdclass]
      11:31:32: [<ffffffffa0e14591>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      11:31:32: [<ffffffffa0ec8650>] ? class_config_llog_handler+0x0/0x1850 [obdclass]
      11:31:32: [<ffffffffa0edfd87>] lustre_fill_super+0x1457/0x1b00 [obdclass]
      11:31:32: [<ffffffff8117d200>] ? set_anon_super+0x0/0x100
      11:31:32: [<ffffffffa0ede930>] ? lustre_fill_super+0x0/0x1b00 [obdclass]
      11:31:32: [<ffffffff8117e66f>] get_sb_nodev+0x5f/0xa0
      11:31:32: [<ffffffffa0ec9fa5>] lustre_get_sb+0x25/0x30 [obdclass]
      11:31:32: [<ffffffff8117e2cb>] vfs_kern_mount+0x7b/0x1b0
      11:31:32: [<ffffffff8117e472>] do_kern_mount+0x52/0x130
      11:31:32: [<ffffffff8119cb42>] do_mount+0x2d2/0x8d0
      11:31:32: [<ffffffff8119d1d0>] sys_mount+0x90/0xe0

      It is likely that this is only a symptom of something else consuming memory on the MDS, and not the root cause.
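
      One way to narrow down what is consuming the memory is to sample overall and Lustre-related usage on the MDS while the test runs. A minimal sketch using standard /proc files and a commonly available ldlm parameter (names assumed, not verified on this setup):

          # Sample overall and Lustre-related memory on the MDS every 10 seconds
          # (standard /proc files; the ldlm parameter name is assumed, adjust as needed).
          while true; do
              date
              grep -E '^(MemFree|Slab|SReclaimable|SUnreclaim):' /proc/meminfo
              grep -E 'ldlm|ptlrpc' /proc/slabinfo | awk '{print $1, $2, $3}'
              lctl get_param -n 'ldlm.namespaces.*.lock_count' 2>/dev/null
              sleep 10
          done

      Comparing successive samples should show whether the growth is in reclaimable slab, unreclaimable slab, or lock counts.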

      Info required for matching: conf-sanity 24a

    Attachments

    Issue Links

    Activity

            [LU-2468] MDS out of memory, blocked in ldlm_pools_shrink()

            adilger Andreas Dilger added a comment -

            This was fixed for Lustre 2.4.0 with the landing of http://review.whamcloud.com/4940.

            sebastien.buisson Sebastien Buisson (Inactive) added a comment -

            Hi,

            Do you have any news on this one?
            This issue is occurring more and more often on various customer clusters.

            Each time, we can see a lot of OOM messages in the syslog of the dump files.

            TIA,
            Sebastien.

            patrick.valentin Patrick Valentin (Inactive) added a comment -

            A Bull customer running Lustre 2.1.3 hit the same hang on a Lustre client (login node).
            The system was hung and a dump was initiated from the BMC by sending an NMI.
            The dump shows there was no more activity on the system: all 12 CPUs are idle (swapper).
            A lot of processes are in page_fault(), blocked in ldlm_pools_shrink().
            I have attached the output of the "foreach bt" crash command. Let me know if you need the vmcore file.
            I have asked the support team to provide more information on the frequency of this hang, the syslog, and the activity when the hang occurred.
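
            For reference, a listing like the attached one is typically captured with the crash utility along these lines (the vmlinux and vmcore paths are illustrative):

                # Open the dump with the matching debuginfo kernel (paths are illustrative)
                crash /usr/lib/debug/lib/modules/<kernel-version>/vmlinux /var/crash/<date>/vmcore

                # Inside crash, dump a backtrace for every task and redirect it to a file
                crash> foreach bt > foreach_bt.txt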

            liang Liang Zhen (Inactive) added a comment -

            I guess we didn't submit anything to 2.3, which means the recent patches can't help a 2.3 MDS?

            sarah Sarah Liu added a comment -

            Andreas, got it, thanks!


            adilger Andreas Dilger added a comment -

            Sarah to submit a patch to skip conf-sanity test_24a:

                    [ $(lustre_version_code mds) -eq $(version_code 2.3.0) ] &&
                            skip "skipping test for 2.3.0 MDS due to LU-2468" && return 0
            
            sarah Sarah Liu added a comment -

            I think the priority of this should be raised, since it has affected many interop tests.

            sarah Sarah Liu added a comment (edited) -

            Found an OOM in the interop test between a 2.3 server and the latest tag 2.3.62.

            https://maloo.whamcloud.com/test_sets/422ac24e-8b95-11e2-abec-52540035b04c

            MDS console:

            00:03:57:Lustre: DEBUG MARKER: == conf-sanity test 24a: Multiple MDTs on a single node == 00:03:40 (1363071820)
            00:03:57:Lustre: DEBUG MARKER: grep -c /mnt/fs2mds' ' /proc/mounts
            00:03:57:Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
            00:03:57:Lustre: DEBUG MARKER: mkfs.lustre --mgsnode=client-31vm3@tcp --fsname=lustre --mdt --index=0 --param=sys.timeout=20 --param=lov.stripesize=1048576 --param=lov.stripecount=0 --param=mdt.identity_upcall=/usr/sbin/l_getidentity --backfstype=ldiskfs --device-size=2097152 --mkfsopt
            00:03:57:LDISKFS-fs (dm-1): mounted filesystem with ordered data mode. quota=off. Opts: 
            00:03:57:Lustre: DEBUG MARKER: mkdir -p /mnt/mds1
            00:03:57:Lustre: DEBUG MARKER: test -b /dev/lvm-MDS/P1
            00:03:57:Lustre: DEBUG MARKER: mkdir -p /mnt/mds1; mount -t lustre -o user_xattr,acl  		                   /dev/lvm-MDS/P1 /mnt/mds1
            00:03:57:LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. quota=off. Opts: 
            00:03:57:LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. quota=off. Opts: 
            00:03:57:Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/usr/lib64/lustre/tests//usr/lib64/lustre/tests:/usr/lib64/lustre/tests:/usr/lib64/lustre/tests/../utils:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lust
            00:03:57:Lustre: DEBUG MARKER: e2label /dev/lvm-MDS/P1 2>/dev/null
            00:04:08:Lustre: 3472:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1363071835/real 1363071835]  req@ffff880066563800 x1429282250294701/t0(0) o8->lustre-OST0000-osc-MDT0000@10.10.4.191@tcp:28/4 lens 400/544 e 0 to 1 dl 1363071840 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
            00:04:08:Lustre: 3472:0:(client.c:1917:ptlrpc_expire_one_request()) Skipped 12 previous similar messages
            00:04:10:Lustre: DEBUG MARKER: mkdir -p /mnt/fs2mds
            00:04:11:Lustre: DEBUG MARKER: test -b /dev/lvm-MDS/S1
            00:04:12:Lustre: DEBUG MARKER: mkdir -p /mnt/fs2mds; mount -t lustre -o user_xattr,acl  		                   /dev/lvm-MDS/S1 /mnt/fs2mds
            00:04:12:LDISKFS-fs (dm-1): mounted filesystem with ordered data mode. quota=off. Opts: 
            00:04:12:LDISKFS-fs (dm-1): mounted filesystem with ordered data mode. quota=off. Opts: 
            00:04:12:Lustre: Setting parameter test1234-MDT0000-mdtlov.lov.stripesize in log test1234-MDT0000
            00:04:12:Lustre: Skipped 4 previous similar messages
            00:04:12:Lustre: test1234-MDT0000: new disk, initializing
            00:04:12:__ratelimit: 23 callbacks suppressed
            00:04:12:llog_process_th invoked oom-killer: gfp_mask=0xd2, order=0, oom_adj=-17, oom_score_adj=0
            00:04:12:llog_process_th cpuset=/ mems_allowed=0
            00:04:12:Pid: 900, comm: llog_process_th Not tainted 2.6.32-279.5.1.el6_lustre.gb16fe80.x86_64 #1
            00:04:12:Call Trace:
            00:04:12: [<ffffffff810c4aa1>] ? cpuset_print_task_mems_allowed+0x91/0xb0
            00:04:12: [<ffffffff81117210>] ? dump_header+0x90/0x1b0
            00:04:12: [<ffffffff81117692>] ? oom_kill_process+0x82/0x2a0
            00:04:12: [<ffffffff8111758e>] ? select_bad_process+0x9e/0x120
            00:04:12: [<ffffffff81117ad0>] ? out_of_memory+0x220/0x3c0
            00:04:12: [<ffffffff811277ee>] ? __alloc_pages_nodemask+0x89e/0x940
            00:04:12: [<ffffffff8114d789>] ? __vmalloc_area_node+0xb9/0x190
            00:04:12: [<ffffffffa04df9e0>] ? cfs_cpt_vmalloc+0x20/0x30 [libcfs]
            00:04:12: [<ffffffff8114d65d>] ? __vmalloc_node+0xad/0x120
            00:04:12: [<ffffffffa04df9e0>] ? cfs_cpt_vmalloc+0x20/0x30 [libcfs]
            00:04:12: [<ffffffff8114d9a9>] ? vmalloc_node+0x29/0x30
            00:04:12: [<ffffffffa04df9e0>] ? cfs_cpt_vmalloc+0x20/0x30 [libcfs]
            00:04:12: [<ffffffffa07e726e>] ? ptlrpc_alloc_rqbd+0x13e/0x690 [ptlrpc]
            00:04:12: [<ffffffffa07e7825>] ? ptlrpc_grow_req_bufs+0x65/0x1b0 [ptlrpc]
            00:04:12: [<ffffffffa07e8e07>] ? ptlrpc_register_service+0x1497/0x1770 [ptlrpc]
            00:04:12: [<ffffffffa0e38c62>] ? mdt_init0+0x1202/0x23f0 [mdt]
            00:04:12: [<ffffffffa0e39f43>] ? mdt_device_alloc+0xf3/0x220 [mdt]
            00:04:12: [<ffffffffa064b0d7>] ? obd_setup+0x1d7/0x2f0 [obdclass]
            00:04:12: [<ffffffffa064b3f8>] ? class_setup+0x208/0x890 [obdclass]
            00:04:12: [<ffffffffa065308c>] ? class_process_config+0xc0c/0x1c30 [obdclass]
            00:04:12: [<ffffffffa04ea088>] ? libcfs_log_return+0x28/0x40 [libcfs]
            00:04:12: [<ffffffffa064cef1>] ? lustre_cfg_new+0x391/0x7e0 [obdclass]
            00:04:12: [<ffffffffa065515b>] ? class_config_llog_handler+0x9bb/0x1610 [obdclass]
            00:04:12: [<ffffffffa061e1f8>] ? llog_process_thread+0x888/0xd00 [obdclass]
            00:04:12: [<ffffffffa061d970>] ? llog_process_thread+0x0/0xd00 [obdclass]
            00:04:12: [<ffffffff8100c14a>] ? child_rip+0xa/0x20
            00:04:12: [<ffffffffa061d970>] ? llog_process_thread+0x0/0xd00 [obdclass]
            00:04:12: [<ffffffffa061d970>] ? llog_process_thread+0x0/0xd00 [obdclass]
            00:04:12: [<ffffffff8100c140>] ? child_rip+0x0/0x20
            00:04:12:Mem-Info:
            00:04:12:Node 0 DMA per-cpu:
            00:04:12:CPU    0: hi:    0, btch:   1 usd:   0
            00:04:12:CPU    1: hi:    0, btch:   1 usd:   0
            00:04:12:CPU    2: hi:    0, btch:   1 usd:   0
            00:04:12:CPU    3: hi:    0, btch:   1 usd:   0
            00:04:12:CPU    4: hi:    0, btch:   1 usd:   0
            00:04:12:CPU    5: hi:    0, btch:   1 usd:   0
            00:04:12:CPU    6: hi:    0, btch:   1 usd:   0
            00:04:12:CPU    7: hi:    0, btch:   1 usd:   0
            00:04:12:CPU    8: hi:    0, btch:   1 usd:   0
            00:04:12:CPU    9: hi:    0, btch:   1 usd:   0
            00:04:12:CPU   10: hi:    0, btch:   1 usd:   0
            00:04:12:CPU   11: hi:    0, btch:   1 usd:   0
            00:04:12:CPU   12: hi:    0, btch:   1 usd:   0
            00:04:12:CPU   13: hi:    0, btch:   1 usd:   0
            00:04:12:CPU   14: hi:    0, btch:   1 usd:   0
            00:04:12:CPU   15: hi:    0, btch:   1 usd:   0
            00:04:12:CPU   16: hi:    0, btch:   1 usd:   0
            00:04:12:CPU   17: hi:    0, btch:   1 usd:   0
            00:04:12:CPU   18: hi:    0, btch:   1 usd:   0
            00:04:12:CPU   19: hi:    0, btch:   1 usd:   0
            00:04:12:CPU   20: hi:    0, btch:   1 usd:   0
            00:04:12:CPU   21: hi:    0, btch:   1 usd:   0
            00:04:12:CPU   22: hi:    0, btch:   1 usd:   0
            00:04:13:CPU   23: hi:    0, btch:   1 usd:   0
            00:04:13:Node 0 DMA32 per-cpu:
            00:04:13:CPU    0: hi:  186, btch:  31 usd:  64
            00:04:13:CPU    1: hi:  186, btch:  31 usd:   0
            00:04:13:CPU    2: hi:  186, btch:  31 usd:   0
            00:04:13:CPU    3: hi:  186, btch:  31 usd:   0
            00:04:13:CPU    4: hi:  186, btch:  31 usd:   0
            00:04:13:CPU    5: hi:  186, btch:  31 usd:   0
            00:04:13:CPU    6: hi:  186, btch:  31 usd:   0
            00:04:13:CPU    7: hi:  186, btch:  31 usd:   0
            00:04:13:CPU    8: hi:  186, btch:  31 usd:   0
            00:04:13:CPU    9: hi:  186, btch:  31 usd:   0
            00:04:13:CPU   10: hi:  186, btch:  31 usd:   0
            00:04:13:CPU   11: hi:  186, btch:  31 usd:   0
            00:04:13:CPU   12: hi:  186, btch:  31 usd:   0
            00:04:13:CPU   13: hi:  186, btch:  31 usd:   0
            00:04:13:CPU   14: hi:  186, btch:  31 usd:   0
            00:04:13:CPU   15: hi:  186, btch:  31 usd:   0
            00:04:13:CPU   16: hi:  186, btch:  31 usd:   0
            00:04:13:CPU   17: hi:  186, btch:  31 usd:   0
            00:04:13:CPU   18: hi:  186, btch:  31 usd:   0
            00:04:13:CPU   19: hi:  186, btch:  31 usd:   0
            00:04:13:CPU   20: hi:  186, btch:  31 usd:   0
            00:04:13:CPU   21: hi:  186, btch:  31 usd:   0
            00:04:13:CPU   22: hi:  186, btch:  31 usd:   0
            00:04:13:CPU   23: hi:  186, btch:  31 usd:  58
            00:04:13:active_anon:3696 inactive_anon:1307 isolated_anon:0
            00:04:13: active_file:105 inactive_file:47 isolated_file:0
            00:04:13: unevictable:0 dirty:72 writeback:927 unstable:0
            00:04:13: free:13329 slab_reclaimable:2530 slab_unreclaimable:28916
            00:04:13: mapped:0 shmem:41 pagetables:818 bounce:0
            00:04:13:Node 0 DMA free:8348kB min:332kB low:412kB high:496kB active_anon:0kB inactive_anon:8kB active_file:188kB inactive_file:196kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15324kB mlocked:0kB dirty:4kB writeback:208kB mapped:0kB shmem:0kB slab_reclaimable:176kB slab_unreclaimable:452kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:4552 all_unreclaimable? yes
            00:04:13:lowmem_reserve[]: 0 2003 2003 2003
            00:04:13:Node 0 DMA32 free:44968kB min:44720kB low:55900kB high:67080kB active_anon:14784kB inactive_anon:5220kB active_file:232kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2052060kB mlocked:0kB dirty:284kB writeback:3500kB mapped:0kB shmem:164kB slab_reclaimable:9944kB slab_unreclaimable:115212kB kernel_stack:5760kB pagetables:3272kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
            00:04:13:lowmem_reserve[]: 0 0 0 0
            00:04:13:Node 0 DMA: 49*4kB 43*8kB 38*16kB 23*32kB 11*64kB 5*128kB 2*256kB 3*512kB 1*1024kB 1*2048kB 0*4096kB = 8348kB
            00:04:13:Node 0 DMA32: 1692*4kB 1057*8kB 583*16kB 269*32kB 91*64kB 16*128kB 5*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 45896kB
            00:04:13:1404 total pagecache pages
            00:04:13:1080 pages in swap cache
            00:04:13:Swap cache stats: add 1506, delete 426, find 0/0
            00:04:13:Free swap  = 4122736kB
            00:04:13:Total swap = 4128760kB
            00:04:13:524283 pages RAM
            00:04:13:44356 pages reserved
            00:04:13:301 pages shared
            00:04:13:455288 pages non-shared
            00:04:13:[ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
            00:04:13:[  723]     0   723     2694       35  22     -17         -1000 udevd
            00:04:13:[ 1779]     0  1779     6909       33   7     -17         -1000 auditd
            00:04:13:[ 1795]     0  1795    62270      121   0       0             0 rsyslogd
            00:04:13:[ 1824]     0  1824     2286       42  10       0             0 irqbalance
            00:04:13:[ 1838]    32  1838     4743       17   1       0             0 rpcbind
            00:04:13:[ 1856]    29  1856     5836        3   0       0             0 rpc.statd
            00:04:13:[ 2005]    81  2005     5868       34   0       0             0 dbus-daemon
            00:04:13:[ 2038]     0  2038     1019        0  13       0             0 acpid
            00:04:13:[ 2047]    68  2047     6805      118   0       0             0 hald
            00:04:13:[ 2048]     0  2048     4525        1   5       0             0 hald-runner
            00:04:13:[ 2076]     0  2076     5054        1   0       0             0 hald-addon-inpu
            00:04:13:[ 2083]    68  2083     4450        1   0       0             0 hald-addon-acpi
            00:04:13:[ 2097]     0  2097    45434       37   0       0             0 sssd
            00:04:13:[ 2100]     0  2100    50963      186   0       0             0 sssd_be
            00:04:13:[ 2109]     0  2109    43037      114   1       0             0 sssd_nss
            00:04:13:[ 2110]     0  2110    42892        1   2       0             0 sssd_pam
            00:04:13:[ 2121]     0  2121   167752      323   0       0             0 automount
            00:04:13:[ 2162]     0  2162    26826        1   5       0             0 rpc.rquotad
            00:04:13:[ 2166]     0  2166     5413      144  11       0             0 rpc.mountd
            00:04:13:[ 2215]     0  2215     6290       62   1       0             0 rpc.idmapd
            00:04:13:[ 2258]   498  2258    56709      255   2       0             0 munged
            00:04:13:[ 2273]     0  2273    16017      166   1     -17         -1000 sshd
            00:04:13:[ 2281]     0  2281     5522       60   0       0             0 xinetd
            00:04:13:[ 2289]    38  2289     7004      113   0       0             0 ntpd
            00:04:13:[ 2305]     0  2305    22183      415   1       0             0 sendmail
            00:04:13:[ 2313]    51  2313    19528      348   1       0             0 sendmail
            00:04:13:[ 2335]     0  2335    27017       60   3       0             0 abrt-dump-oops
            00:04:13:[ 2343]     0  2343    29303      152   6       0             0 crond
            00:04:13:[ 2354]     0  2354     5363       45   0       0             0 atd
            00:04:13:[ 2380]     0  2380     1018       23   0       0             0 agetty
            00:04:13:[ 2381]     0  2381     1015       22   0       0             0 mingetty
            00:04:13:[ 2383]     0  2383     1015       22   0       0             0 mingetty
            00:04:13:[ 2385]     0  2385     1015       21   6       0             0 mingetty
            00:04:14:[ 2387]     0  2387     1015       22   9       0             0 mingetty
            00:04:15:[ 2389]     0  2389     1015       21   6       0             0 mingetty
            00:04:15:[ 2390]     0  2390     2706       46   2     -17         -1000 udevd
            00:04:15:[ 2391]     0  2391     2706       39   5     -17         -1000 udevd
            00:04:17:[ 2393]     0  2393     1015       21   6       0             0 mingetty
            00:04:17:[  806]     0   806    15354      173   0       0             0 in.mrshd
            00:04:17:[  807]     0   807    26514       66  15       0             0 bash
            00:04:17:[  828]     0   828    26514       66  16       0             0 bash
            00:04:17:[  829]     0   829    26515       49  20       0             0 sh
            00:04:17:[  831]     0   831    27935       39  22       0             0 mount
            00:04:17:[  832]     0   832     2092       42   0       0             0 mount.lustre
            00:04:18:Out of memory: Kill process 1795 (rsyslogd) score 1 or sacrifice child
            00:04:18:Killed process 1795, UID 0, (rsyslogd) total-vm:249080kB, anon-rss:480kB, file-rss:4kB
            00:04:18:llog_process_th invoked oom-killer: gfp_mask=0xd2, order=0, oom_adj=-17, oom_score_adj=0
            00:04:18:llog_process_th cpuset=/ mems_allowed=0
            00:04:18:Pid: 900, comm: llog_process_th Not tainted 2.6.32-279.5.1.el6_lustre.gb16fe80.x86_64 #1
            00:04:19:Call Trace:
            00:04:19: [<ffffffff810c4aa1>] ? cpuset_print_task_mems_allowed+0x91/0xb0
            00:04:19: [<ffffffff81117210>] ? dump_header+0x90/0x1b0
            00:04:19: [<ffffffff810e368e>] ? __delayacct_freepages_end+0x2e/0x30
            00:04:19: [<ffffffff8121489c>] ? security_real_capable_noaudit+0x3c/0x70
            00:04:19: [<ffffffff81117692>] ? oom_kill_process+0x82/0x2a0
            00:04:20: [<ffffffff8111758e>] ? select_bad_process+0x9e/0x120
            00:04:20: [<ffffffff81117ad0>] ? out_of_memory+0x220/0x3c0
            00:04:20: [<ffffffff811277ee>] ? __alloc_pages_nodemask+0x89e/0x940
            00:04:20: [<ffffffff8114d789>] ? __vmalloc_area_node+0xb9/0x190
            00:04:20: [<ffffffffa04df9e0>] ? cfs_cpt_vmalloc+0x20/0x30 [libcfs]
            00:04:20: [<ffffffff8114d65d>] ? __vmalloc_node+0xad/0x120
            00:04:20: [<ffffffffa04df9e0>] ? cfs_cpt_vmalloc+0x20/0x30 [libcfs]
            00:04:20: [<ffffffff8114d9a9>] ? vmalloc_node+0x29/0x30
            00:04:20: [<ffffffffa04df9e0>] ? cfs_cpt_vmalloc+0x20/0x30 [libcfs]
            00:04:20: [<ffffffffa07e726e>] ? ptlrpc_alloc_rqbd+0x13e/0x690 [ptlrpc]
            00:04:20: [<ffffffffa07e7825>] ? ptlrpc_grow_req_bufs+0x65/0x1b0 [ptlrpc]
            00:04:20: [<ffffffffa07e8e07>] ? ptlrpc_register_service+0x1497/0x1770 [ptlrpc]
            00:04:20: [<ffffffffa0e38c62>] ? mdt_init0+0x1202/0x23f0 [mdt]
            00:04:20: [<ffffffffa0e39f43>] ? mdt_device_alloc+0xf3/0x220 [mdt]
            00:04:20: [<ffffffffa064b0d7>] ? obd_setup+0x1d7/0x2f0 [obdclass]
            00:04:20: [<ffffffffa064b3f8>] ? class_setup+0x208/0x890 [obdclass]
            00:04:20: [<ffffffffa065308c>] ? class_process_config+0xc0c/0x1c30 [obdclass]
            00:04:20: [<ffffffffa04ea088>] ? libcfs_log_return+0x28/0x40 [libcfs]
            00:04:20: [<ffffffffa064cef1>] ? lustre_cfg_new+0x391/0x7e0 [obdclass]
            00:04:20: [<ffffffffa065515b>] ? class_config_llog_handler+0x9bb/0x1610 [obdclass]
            00:04:20: [<ffffffffa061e1f8>] ? llog_process_thread+0x888/0xd00 [obdclass]
            00:04:20: [<ffffffffa061d970>] ? llog_process_thread+0x0/0xd00 [obdclass]
            00:04:20: [<ffffffff8100c14a>] ? child_rip+0xa/0x20
            00:04:20: [<ffffffffa061d970>] ? llog_process_thread+0x0/0xd00 [obdclass]
            00:04:20: [<ffffffffa061d970>] ? llog_process_thread+0x0/0xd00 [obdclass]
            00:04:20: [<ffffffff8100c140>] ? child_rip+0x0/0x20
            

            Additional failures:
            https://maloo.whamcloud.com/test_sets/ebd2a076-8b97-11e2-abec-52540035b04c
            https://maloo.whamcloud.com/test_sets/89f10fd6-8b98-11e2-abec-52540035b04c
            https://maloo.whamcloud.com/test_sets/3a168224-8b99-11e2-abec-52540035b04c


            jlevi Jodi Levi (Inactive) added a comment -

            Per discussions with BobiJam, lowering the priority since http://review.whamcloud.com/4940 has landed and is believed to help. Will raise the priority again if further testing proves otherwise.

            adilger Andreas Dilger added a comment -

            I pushed http://review.whamcloud.com/5470 to quiet the spurious "cannot allocate a tage" message, but this does not solve the root of the problem, which is that the MDS is consuming more and more memory.

            People

              sarah Sarah Liu
              maloo Maloo
              Votes: 0
              Watchers: 16
