Lustre / LU-2468

MDS out of memory, blocked in ldlm_pools_shrink()

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: Lustre 2.4.0
    • Fix Version/s: Lustre 2.4.0
    • Labels: None
    • Severity: 3
    • Rank (Obsolete): 5814

      Description

      This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

      This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/35a3873c-4166-11e2-af91-52540035b04c.

      The sub-test test_24a failed with the following error on the MDS console:

      test failed to respond and timed out
      11:14:17:Lustre: DEBUG MARKER: == conf-sanity test 24a: Multiple MDTs on a single node == 11:13:50 (1355166830)
      11:14:17:Lustre: DEBUG MARKER: grep -c /mnt/fs2mds' ' /proc/mounts
      11:14:17:Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
      11:14:17:Lustre: DEBUG MARKER: mkfs.lustre --mgsnode=client-29vm3@tcp --fsname=lustre --mdt --index=0 --param=sys.timeout=20 --param=lov.stripesize=1048576 --param=lov.stripecount=0 --param=mdt.identity_upcall=/usr/sbin/l_getidentity --backfstype=ldiskfs --device-size=2097152 --mkfsopt
      11:14:17:LDISKFS-fs (dm-1): mounted filesystem with ordered data mode. quota=on. Opts:
      11:14:17:Lustre: DEBUG MARKER: mkdir -p /mnt/mds1
      11:14:17:Lustre: DEBUG MARKER: test -b /dev/lvm-MDS/P1
      11:14:17:Lustre: DEBUG MARKER: mkdir -p /mnt/mds1; mount -t lustre -o user_xattr,acl /dev/lvm-MDS/P1 /mnt/mds1
      11:14:17:LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. quota=on. Opts:
      11:14:17:Lustre: MGC10.10.4.174@tcp: Reactivating import
      11:14:17:Lustre: lustre-MDT0000: used disk, loading
      11:14:17:__ratelimit: 582 callbacks suppressed
      11:14:17:cannot allocate a tage (0)
      11:14:17:cannot allocate a tage (0)
      11:14:17:cannot allocate a tage (0)
      11:14:17:cannot allocate a tage (0)
      11:14:17:cannot allocate a tage (0)
      11:14:17:cannot allocate a tage (0)
      11:14:17:cannot allocate a tage (0)

      11:15:55:LNet: Service thread pid 16764 was inactive for 40.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      11:15:55:Pid: 16764, comm: mdt01_001
      11:15:55:
      11:15:55:Call Trace:
      11:15:55: [<ffffffffa0e19d97>] ? cfs_hash_bd_lookup_intent+0x37/0x130 [libcfs]
      11:15:55: [<ffffffff814ffa2e>] ? __mutex_lock_slowpath+0x13e/0x180
      11:15:55: [<ffffffffa051a691>] ? ldlm_cli_pool_shrink+0x71/0x130 [ptlrpc]
      11:15:55: [<ffffffffa0e14591>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      11:15:55: [<ffffffff814ff8cb>] ? mutex_lock+0x2b/0x50
      11:15:55: [<ffffffffa0568e40>] ? enc_pools_shrink+0x3d0/0x560 [ptlrpc]
      11:15:55: [<ffffffffa05198c3>] ? ldlm_pools_srv_shrink+0x13/0x20 [ptlrpc]
      11:15:55: [<ffffffff8112d34a>] ? shrink_slab+0x8a/0x1a0
      11:15:55: [<ffffffff8112f36f>] ? do_try_to_free_pages+0x2ff/0x520
      11:15:55: [<ffffffff8100bc0e>] ? apic_timer_interrupt+0xe/0x20
      11:15:55: [<ffffffff8112f77d>] ? try_to_free_pages+0x9d/0x130
      11:15:55: [<ffffffff811308d0>] ? isolate_pages_global+0x0/0x350
      11:15:55: [<ffffffff8112758d>] ? __alloc_pages_nodemask+0x40d/0x940
      11:15:55: [<ffffffff81162372>] ? kmem_getpages+0x62/0x170
      11:15:55: [<ffffffff81162f8a>] ? fallback_alloc+0x1ba/0x270
      11:15:55: [<ffffffff811629df>] ? cache_grow+0x2cf/0x320
      11:15:55: [<ffffffff81162d09>] ? ____cache_alloc_node+0x99/0x160
      11:15:55: [<ffffffff81163aeb>] ? kmem_cache_alloc+0x11b/0x190
      11:15:55: [<ffffffffa0e04af2>] ? cfs_mem_cache_alloc+0x22/0x30 [libcfs]
      11:15:55: [<ffffffffa071b00a>] ? osc_session_init+0x3a/0x200 [osc]
      11:15:55: [<ffffffffa0ee0baf>] ? keys_fill+0x6f/0x1a0 [obdclass]
      11:15:55: [<ffffffffa0ee499b>] ? lu_context_init+0xab/0x260 [obdclass]
      11:15:55: [<ffffffffa0542db4>] ? ptlrpc_server_handle_request+0x194/0xe00 [ptlrpc]
      11:15:55: [<ffffffffa0e0465e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
      11:15:55: [<ffffffffa0e160ef>] ? lc_watchdog_touch+0x6f/0x180 [libcfs]
      11:15:55: [<ffffffffa053a429>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
      11:15:55: [<ffffffffa0e14591>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      11:15:55: [<ffffffff81053463>] ? __wake_up+0x53/0x70
      11:15:55: [<ffffffffa05445d5>] ? ptlrpc_main+0xbb5/0x1970 [ptlrpc]
      11:15:55: [<ffffffffa0543a20>] ? ptlrpc_main+0x0/0x1970 [ptlrpc]

      11:31:32:INFO: task mount.lustre:17152 blocked for more than 120 seconds.
      11:31:32:mount.lustre D 0000000000000000 0 17152 1 0x00100084
      11:31:32: ffff88001e843698 0000000000000082 0000000000000000 ffffffff81060af3
      11:31:32: 0000000050c63493 0000000000000282 0000000000800500 0000000000000000
      11:31:32: ffff88001a3e3098 ffff88001e843fd8 000000000000fb88 ffff88001a3e3098
      11:31:32:Call Trace:
      11:31:32: [<ffffffff81060af3>] ? wake_up_new_task+0xd3/0x120
      11:31:32: [<ffffffff814ff245>] schedule_timeout+0x215/0x2e0
      11:31:32: [<ffffffffa0e840b0>] ? llog_process_thread_daemonize+0x0/0x80 [obdclass]
      11:31:32: [<ffffffff8100c0e2>] ? kernel_thread+0x82/0xe0
      11:31:32: [<ffffffffa0e840b0>] ? llog_process_thread_daemonize+0x0/0x80 [obdclass]
      11:31:32: [<ffffffff814feec3>] wait_for_common+0x123/0x180
      11:31:32: [<ffffffff810602c0>] ? default_wake_function+0x0/0x20
      11:31:32: [<ffffffffa0e0b77a>] ? cfs_create_thread+0x7a/0xa0 [libcfs]
      11:31:32: [<ffffffffa0ec8650>] ? class_config_llog_handler+0x0/0x1850 [obdclass]
      11:31:32: [<ffffffff814fefdd>] wait_for_completion+0x1d/0x20
      11:31:32: [<ffffffffa0e85ae3>] llog_process_or_fork+0x333/0x660 [obdclass]
      11:31:32: [<ffffffffa0e85e24>] llog_process+0x14/0x20 [obdclass]
      11:31:32: [<ffffffffa0ebdd64>] class_config_parse_llog+0x1e4/0x340 [obdclass]
      11:31:32: [<ffffffffa0821ced>] mgc_process_cfg_log+0x5cd/0x1600 [mgc]
      11:31:32: [<ffffffffa0823163>] mgc_process_log+0x443/0x1350 [mgc]
      11:31:32: [<ffffffffa0e14591>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      11:31:32: [<ffffffffa0e14591>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      11:31:32: [<ffffffffa081da80>] ? mgc_blocking_ast+0x0/0x780 [mgc]
      11:31:32: [<ffffffffa0509930>] ? ldlm_completion_ast+0x0/0x980 [ptlrpc]
      11:31:32: [<ffffffffa0825974>] mgc_process_config+0x594/0xee0 [mgc]
      11:31:32: [<ffffffffa0ecfc1c>] lustre_process_log+0x25c/0xad0 [obdclass]
      11:31:32: [<ffffffffa0e14591>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      11:31:32: [<ffffffffa0e0f028>] ? libcfs_log_return+0x28/0x40 [libcfs]
      11:31:32: [<ffffffffa0ed1691>] ? server_register_mount+0x551/0x8f0 [obdclass]
      11:31:32: [<ffffffffa0edd607>] server_start_targets+0x5c7/0x18f0 [obdclass]
      11:31:32: [<ffffffffa0e0f028>] ? libcfs_log_return+0x28/0x40 [libcfs]
      11:31:32: [<ffffffffa0ed86a0>] ? lustre_start_mgc+0x4e0/0x1bc0 [obdclass]
      11:31:32: [<ffffffffa0e14591>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      11:31:32: [<ffffffffa0ec8650>] ? class_config_llog_handler+0x0/0x1850 [obdclass]
      11:31:32: [<ffffffffa0edfd87>] lustre_fill_super+0x1457/0x1b00 [obdclass]
      11:31:32: [<ffffffff8117d200>] ? set_anon_super+0x0/0x100
      11:31:32: [<ffffffffa0ede930>] ? lustre_fill_super+0x0/0x1b00 [obdclass]
      11:31:32: [<ffffffff8117e66f>] get_sb_nodev+0x5f/0xa0
      11:31:32: [<ffffffffa0ec9fa5>] lustre_get_sb+0x25/0x30 [obdclass]
      11:31:32: [<ffffffff8117e2cb>] vfs_kern_mount+0x7b/0x1b0
      11:31:32: [<ffffffff8117e472>] do_kern_mount+0x52/0x130
      11:31:32: [<ffffffff8119cb42>] do_mount+0x2d2/0x8d0
      11:31:32: [<ffffffff8119d1d0>] sys_mount+0x90/0xe0

      The blocked shrinker is likely only a symptom of something else consuming memory on the MDS, not the root cause.
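
      To make the first trace easier to read, the following is a minimal sketch (not part of the original report) that extracts symbol names from console lines in the format shown above and checks whether a thread entered memory reclaim from a page allocation. The helper names parse_frames and in_direct_reclaim are hypothetical, and the sample lines are copied from the mdt01_001 trace. A "?" before a symbol means the kernel found that address on the stack but could not verify it as part of the call chain, so those frames are only a hint.

```python
import re

# Matches stack frames in the format seen in the console log above, e.g.
#   11:15:55: [<ffffffff8112d34a>] ? shrink_slab+0x8a/0x1a0
#   11:31:32: [<ffffffff814ff245>] schedule_timeout+0x215/0x2e0
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+(?P<unreliable>\?\s+)?"
    r"(?P<symbol>[A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def parse_frames(log_text):
    """Return (symbol, module, reliable) tuples, innermost frame first."""
    frames = []
    for line in log_text.splitlines():
        m = FRAME_RE.search(line)
        if m:
            frames.append((
                m.group("symbol"),
                m.group("module") or "kernel",  # no [module] tag: core kernel
                m.group("unreliable") is None,  # False when prefixed by "?"
            ))
    return frames


def in_direct_reclaim(frames):
    """True if the trace shows an allocation that fell into slab shrinking."""
    symbols = {symbol for symbol, _, _ in frames}
    return "__alloc_pages_nodemask" in symbols and "shrink_slab" in symbols


# Sample lines copied from the mdt01_001 trace in this report.
SAMPLE = """\
11:15:55: [<ffffffff814ffa2e>] ? __mutex_lock_slowpath+0x13e/0x180
11:15:55: [<ffffffffa051a691>] ? ldlm_cli_pool_shrink+0x71/0x130 [ptlrpc]
11:15:55: [<ffffffffa05198c3>] ? ldlm_pools_srv_shrink+0x13/0x20 [ptlrpc]
11:15:55: [<ffffffff8112d34a>] ? shrink_slab+0x8a/0x1a0
11:15:55: [<ffffffff8112f77d>] ? try_to_free_pages+0x9d/0x130
11:15:55: [<ffffffff8112758d>] ? __alloc_pages_nodemask+0x40d/0x940
11:15:55: [<ffffffff81163aeb>] ? kmem_cache_alloc+0x11b/0x190
"""

if __name__ == "__main__":
    frames = parse_frames(SAMPLE)
    for symbol, module, reliable in frames:
        marker = " " if reliable else "?"
        print(f"{marker} {symbol} [{module}]")
    print("entered reclaim from allocation:", in_direct_reclaim(frames))
```

      Run against the full console log, this gives a quick list of which threads were allocating memory when they stalled. It does not identify what consumed the memory in the first place.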

      Info required for matching: conf-sanity 24a

              People

              • Assignee: Sarah Liu (sarah)
              • Reporter: Maloo (maloo)
              • Votes: 0
              • Watchers: 16
