
LU-2468: MDS out of memory, blocked in ldlm_pools_shrink()

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Lustre 2.4.0
    • Lustre 2.4.0
    • None
    • 3
    • 5814

    Description

      This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

      This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/35a3873c-4166-11e2-af91-52540035b04c.

      The sub-test test_24a failed with the following error on the MDS console:

      test failed to respond and timed out
      11:14:17:Lustre: DEBUG MARKER: == conf-sanity test 24a: Multiple MDTs on a single node == 11:13:50 (1355166830)
      11:14:17:Lustre: DEBUG MARKER: grep -c /mnt/fs2mds' ' /proc/mounts
      11:14:17:Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
      11:14:17:Lustre: DEBUG MARKER: mkfs.lustre --mgsnode=client-29vm3@tcp --fsname=lustre --mdt --index=0 --param=sys.timeout=20 --param=lov.stripesize=1048576 --param=lov.stripecount=0 --param=mdt.identity_upcall=/usr/sbin/l_getidentity --backfstype=ldiskfs --device-size=2097152 --mkfsopt
      11:14:17:LDISKFS-fs (dm-1): mounted filesystem with ordered data mode. quota=on. Opts:
      11:14:17:Lustre: DEBUG MARKER: mkdir -p /mnt/mds1
      11:14:17:Lustre: DEBUG MARKER: test -b /dev/lvm-MDS/P1
      11:14:17:Lustre: DEBUG MARKER: mkdir -p /mnt/mds1; mount -t lustre -o user_xattr,acl /dev/lvm-MDS/P1 /mnt/mds1
      11:14:17:LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. quota=on. Opts:
      11:14:17:Lustre: MGC10.10.4.174@tcp: Reactivating import
      11:14:17:Lustre: lustre-MDT0000: used disk, loading
      11:14:17:__ratelimit: 582 callbacks suppressed
      11:14:17:cannot allocate a tage (0)
      11:14:17:cannot allocate a tage (0)
      11:14:17:cannot allocate a tage (0)
      11:14:17:cannot allocate a tage (0)
      11:14:17:cannot allocate a tage (0)
      11:14:17:cannot allocate a tage (0)
      11:14:17:cannot allocate a tage (0)

      11:15:55:LNet: Service thread pid 16764 was inactive for 40.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      11:15:55:Pid: 16764, comm: mdt01_001
      11:15:55:
      11:15:55:Call Trace:
      11:15:55: [<ffffffffa0e19d97>] ? cfs_hash_bd_lookup_intent+0x37/0x130 [libcfs]
      11:15:55: [<ffffffff814ffa2e>] ? __mutex_lock_slowpath+0x13e/0x180
      11:15:55: [<ffffffffa051a691>] ? ldlm_cli_pool_shrink+0x71/0x130 [ptlrpc]
      11:15:55: [<ffffffffa0e14591>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      11:15:55: [<ffffffff814ff8cb>] ? mutex_lock+0x2b/0x50
      11:15:55: [<ffffffffa0568e40>] ? enc_pools_shrink+0x3d0/0x560 [ptlrpc]
      11:15:55: [<ffffffffa05198c3>] ? ldlm_pools_srv_shrink+0x13/0x20 [ptlrpc]
      11:15:55: [<ffffffff8112d34a>] ? shrink_slab+0x8a/0x1a0
      11:15:55: [<ffffffff8112f36f>] ? do_try_to_free_pages+0x2ff/0x520
      11:15:55: [<ffffffff8100bc0e>] ? apic_timer_interrupt+0xe/0x20
      11:15:55: [<ffffffff8112f77d>] ? try_to_free_pages+0x9d/0x130
      11:15:55: [<ffffffff811308d0>] ? isolate_pages_global+0x0/0x350
      11:15:55: [<ffffffff8112758d>] ? __alloc_pages_nodemask+0x40d/0x940
      11:15:55: [<ffffffff81162372>] ? kmem_getpages+0x62/0x170
      11:15:55: [<ffffffff81162f8a>] ? fallback_alloc+0x1ba/0x270
      11:15:55: [<ffffffff811629df>] ? cache_grow+0x2cf/0x320
      11:15:55: [<ffffffff81162d09>] ? ____cache_alloc_node+0x99/0x160
      11:15:55: [<ffffffff81163aeb>] ? kmem_cache_alloc+0x11b/0x190
      11:15:55: [<ffffffffa0e04af2>] ? cfs_mem_cache_alloc+0x22/0x30 [libcfs]
      11:15:55: [<ffffffffa071b00a>] ? osc_session_init+0x3a/0x200 [osc]
      11:15:55: [<ffffffffa0ee0baf>] ? keys_fill+0x6f/0x1a0 [obdclass]
      11:15:55: [<ffffffffa0ee499b>] ? lu_context_init+0xab/0x260 [obdclass]
      11:15:55: [<ffffffffa0542db4>] ? ptlrpc_server_handle_request+0x194/0xe00 [ptlrpc]
      11:15:55: [<ffffffffa0e0465e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
      11:15:55: [<ffffffffa0e160ef>] ? lc_watchdog_touch+0x6f/0x180 [libcfs]
      11:15:55: [<ffffffffa053a429>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
      11:15:55: [<ffffffffa0e14591>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      11:15:55: [<ffffffff81053463>] ? __wake_up+0x53/0x70
      11:15:55: [<ffffffffa05445d5>] ? ptlrpc_main+0xbb5/0x1970 [ptlrpc]
      11:15:55: [<ffffffffa0543a20>] ? ptlrpc_main+0x0/0x1970 [ptlrpc]

      11:31:32:INFO: task mount.lustre:17152 blocked for more than 120 seconds.
      11:31:32:mount.lustre D 0000000000000000 0 17152 1 0x00100084
      11:31:32: ffff88001e843698 0000000000000082 0000000000000000 ffffffff81060af3
      11:31:32: 0000000050c63493 0000000000000282 0000000000800500 0000000000000000
      11:31:32: ffff88001a3e3098 ffff88001e843fd8 000000000000fb88 ffff88001a3e3098
      11:31:32:Call Trace:
      11:31:32: [<ffffffff81060af3>] ? wake_up_new_task+0xd3/0x120
      11:31:32: [<ffffffff814ff245>] schedule_timeout+0x215/0x2e0
      11:31:32: [<ffffffffa0e840b0>] ? llog_process_thread_daemonize+0x0/0x80 [obdclass]
      11:31:32: [<ffffffff8100c0e2>] ? kernel_thread+0x82/0xe0
      11:31:32: [<ffffffffa0e840b0>] ? llog_process_thread_daemonize+0x0/0x80 [obdclass]
      11:31:32: [<ffffffff814feec3>] wait_for_common+0x123/0x180
      11:31:32: [<ffffffff810602c0>] ? default_wake_function+0x0/0x20
      11:31:32: [<ffffffffa0e0b77a>] ? cfs_create_thread+0x7a/0xa0 [libcfs]
      11:31:32: [<ffffffffa0ec8650>] ? class_config_llog_handler+0x0/0x1850 [obdclass]
      11:31:32: [<ffffffff814fefdd>] wait_for_completion+0x1d/0x20
      11:31:32: [<ffffffffa0e85ae3>] llog_process_or_fork+0x333/0x660 [obdclass]
      11:31:32: [<ffffffffa0e85e24>] llog_process+0x14/0x20 [obdclass]
      11:31:32: [<ffffffffa0ebdd64>] class_config_parse_llog+0x1e4/0x340 [obdclass]
      11:31:32: [<ffffffffa0821ced>] mgc_process_cfg_log+0x5cd/0x1600 [mgc]
      11:31:32: [<ffffffffa0823163>] mgc_process_log+0x443/0x1350 [mgc]
      11:31:32: [<ffffffffa0e14591>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      11:31:32: [<ffffffffa0e14591>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      11:31:32: [<ffffffffa081da80>] ? mgc_blocking_ast+0x0/0x780 [mgc]
      11:31:32: [<ffffffffa0509930>] ? ldlm_completion_ast+0x0/0x980 [ptlrpc]
      11:31:32: [<ffffffffa0825974>] mgc_process_config+0x594/0xee0 [mgc]
      11:31:32: [<ffffffffa0ecfc1c>] lustre_process_log+0x25c/0xad0 [obdclass]
      11:31:32: [<ffffffffa0e14591>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      11:31:32: [<ffffffffa0e0f028>] ? libcfs_log_return+0x28/0x40 [libcfs]
      11:31:32: [<ffffffffa0ed1691>] ? server_register_mount+0x551/0x8f0 [obdclass]
      11:31:32: [<ffffffffa0edd607>] server_start_targets+0x5c7/0x18f0 [obdclass]
      11:31:32: [<ffffffffa0e0f028>] ? libcfs_log_return+0x28/0x40 [libcfs]
      11:31:32: [<ffffffffa0ed86a0>] ? lustre_start_mgc+0x4e0/0x1bc0 [obdclass]
      11:31:32: [<ffffffffa0e14591>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      11:31:32: [<ffffffffa0ec8650>] ? class_config_llog_handler+0x0/0x1850 [obdclass]
      11:31:32: [<ffffffffa0edfd87>] lustre_fill_super+0x1457/0x1b00 [obdclass]
      11:31:32: [<ffffffff8117d200>] ? set_anon_super+0x0/0x100
      11:31:32: [<ffffffffa0ede930>] ? lustre_fill_super+0x0/0x1b00 [obdclass]
      11:31:32: [<ffffffff8117e66f>] get_sb_nodev+0x5f/0xa0
      11:31:32: [<ffffffffa0ec9fa5>] lustre_get_sb+0x25/0x30 [obdclass]
      11:31:32: [<ffffffff8117e2cb>] vfs_kern_mount+0x7b/0x1b0
      11:31:32: [<ffffffff8117e472>] do_kern_mount+0x52/0x130
      11:31:32: [<ffffffff8119cb42>] do_mount+0x2d2/0x8d0
      11:31:32: [<ffffffff8119d1d0>] sys_mount+0x90/0xe0

      It is likely that this is only a symptom of something else consuming memory on the MDS, and not the root cause.
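
      One way to narrow down what is consuming the memory is to sample overall and Lustre-related usage on the MDS while the test runs. A minimal sketch using standard /proc files and a commonly available ldlm parameter (names assumed, not verified on this setup):

          # Sample overall and Lustre-related memory on the MDS every 10 seconds
          # (standard /proc files; the ldlm parameter name is assumed, adjust as needed).
          while true; do
              date
              grep -E '^(MemFree|Slab|SReclaimable|SUnreclaim):' /proc/meminfo
              grep -E 'ldlm|ptlrpc' /proc/slabinfo | awk '{print $1, $2, $3}'
              lctl get_param -n 'ldlm.namespaces.*.lock_count' 2>/dev/null
              sleep 10
          done

      Comparing successive samples should show whether the growth is in reclaimable slab, unreclaimable slab, or lock counts.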

      Info required for matching: conf-sanity 24a

    Attachments

    Issue Links

    Activity

            [LU-2468] MDS out of memory, blocked in ldlm_pools_shrink()

            adilger Andreas Dilger added a comment -

            This was fixed for Lustre 2.4.0 with the landing of http://review.whamcloud.com/4940.

            sebastien.buisson Sebastien Buisson (Inactive) added a comment -

            Hi,

            Do you have any news on this one?
            This issue is occurring more and more often on various customer clusters.

            Each time, we can see a lot of OOM messages in the syslog of the dump files.

            TIA,
            Sebastien.

            patrick.valentin Patrick Valentin (Inactive) added a comment -

            A Bull customer running Lustre 2.1.3 hit the same hang on a Lustre client (login node).
            The system was hung and a dump was initiated from the BMC by sending an NMI.
            The dump shows there was no more activity on the system: all 12 CPUs are idle (swapper).
            A lot of processes are in page_fault(), blocked in ldlm_pools_shrink().
            I have attached the output of the "foreach bt" crash command. Let me know if you need the vmcore file.
            I have asked the support team to provide more information on the frequency of this hang, the syslog, and the activity when the hang occurred.
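
            For reference, a listing like the attached one is typically captured with the crash utility along these lines (the vmlinux and vmcore paths are illustrative):

                # Open the dump with the matching debuginfo kernel (paths are illustrative)
                crash /usr/lib/debug/lib/modules/<kernel-version>/vmlinux /var/crash/<date>/vmcore

                # Inside crash, dump a backtrace for every task and redirect it to a file
                crash> foreach bt > foreach_bt.txt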

            liang Liang Zhen (Inactive) added a comment -

            I guess we didn't submit anything to 2.3, which means the recent patches can't help a 2.3 MDS?

            sarah Sarah Liu added a comment -

            Andreas, got it, thanks!


            adilger Andreas Dilger added a comment -

            Sarah to submit a patch to skip conf-sanity test_24a:

                    [ $(lustre_version_code mds) -eq $(version_code 2.3.0) ] &&
                            skip "skipping test for 2.3.0 MDS due to LU-2468" && return 0
            
            sarah Sarah Liu added a comment -

            I think the priority of this should be raised, since it has affected many interop tests.

            sarah Sarah Liu added a comment (edited) -

            Found an OOM in the interop test between a 2.3 server and the latest tag 2.3.62.

            https://maloo.whamcloud.com/test_sets/422ac24e-8b95-11e2-abec-52540035b04c

            MDS console:

            00:03:57:Lustre: DEBUG MARKER: == conf-sanity test 24a: Multiple MDTs on a single node == 00:03:40 (1363071820)
            00:03:57:Lustre: DEBUG MARKER: grep -c /mnt/fs2mds' ' /proc/mounts
            00:03:57:Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
            00:03:57:Lustre: DEBUG MARKER: mkfs.lustre --mgsnode=client-31vm3@tcp --fsname=lustre --mdt --index=0 --param=sys.timeout=20 --param=lov.stripesize=1048576 --param=lov.stripecount=0 --param=mdt.identity_upcall=/usr/sbin/l_getidentity --backfstype=ldiskfs --device-size=2097152 --mkfsopt
            00:03:57:LDISKFS-fs (dm-1): mounted filesystem with ordered data mode. quota=off. Opts: 
            00:03:57:Lustre: DEBUG MARKER: mkdir -p /mnt/mds1
            00:03:57:Lustre: DEBUG MARKER: test -b /dev/lvm-MDS/P1
            00:03:57:Lustre: DEBUG MARKER: mkdir -p /mnt/mds1; mount -t lustre -o user_xattr,acl  		                   /dev/lvm-MDS/P1 /mnt/mds1
            00:03:57:LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. quota=off. Opts: 
            00:03:57:LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. quota=off. Opts: 
            00:03:57:Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/usr/lib64/lustre/tests//usr/lib64/lustre/tests:/usr/lib64/lustre/tests:/usr/lib64/lustre/tests/../utils:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lust
            00:03:57:Lustre: DEBUG MARKER: e2label /dev/lvm-MDS/P1 2>/dev/null
            00:04:08:Lustre: 3472:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1363071835/real 1363071835]  req@ffff880066563800 x1429282250294701/t0(0) o8->lustre-OST0000-osc-MDT0000@10.10.4.191@tcp:28/4 lens 400/544 e 0 to 1 dl 1363071840 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
            00:04:08:Lustre: 3472:0:(client.c:1917:ptlrpc_expire_one_request()) Skipped 12 previous similar messages
            00:04:10:Lustre: DEBUG MARKER: mkdir -p /mnt/fs2mds
            00:04:11:Lustre: DEBUG MARKER: test -b /dev/lvm-MDS/S1
            00:04:12:Lustre: DEBUG MARKER: mkdir -p /mnt/fs2mds; mount -t lustre -o user_xattr,acl  		                   /dev/lvm-MDS/S1 /mnt/fs2mds
            00:04:12:LDISKFS-fs (dm-1): mounted filesystem with ordered data mode. quota=off. Opts: 
            00:04:12:LDISKFS-fs (dm-1): mounted filesystem with ordered data mode. quota=off. Opts: 
            00:04:12:Lustre: Setting parameter test1234-MDT0000-mdtlov.lov.stripesize in log test1234-MDT0000
            00:04:12:Lustre: Skipped 4 previous similar messages
            00:04:12:Lustre: test1234-MDT0000: new disk, initializing
            00:04:12:__ratelimit: 23 callbacks suppressed
            00:04:12:llog_process_th invoked oom-killer: gfp_mask=0xd2, order=0, oom_adj=-17, oom_score_adj=0
            00:04:12:llog_process_th cpuset=/ mems_allowed=0
            00:04:12:Pid: 900, comm: llog_process_th Not tainted 2.6.32-279.5.1.el6_lustre.gb16fe80.x86_64 #1
            00:04:12:Call Trace:
            00:04:12: [<ffffffff810c4aa1>] ? cpuset_print_task_mems_allowed+0x91/0xb0
            00:04:12: [<ffffffff81117210>] ? dump_header+0x90/0x1b0
            00:04:12: [<ffffffff81117692>] ? oom_kill_process+0x82/0x2a0
            00:04:12: [<ffffffff8111758e>] ? select_bad_process+0x9e/0x120
            00:04:12: [<ffffffff81117ad0>] ? out_of_memory+0x220/0x3c0
            00:04:12: [<ffffffff811277ee>] ? __alloc_pages_nodemask+0x89e/0x940
            00:04:12: [<ffffffff8114d789>] ? __vmalloc_area_node+0xb9/0x190
            00:04:12: [<ffffffffa04df9e0>] ? cfs_cpt_vmalloc+0x20/0x30 [libcfs]
            00:04:12: [<ffffffff8114d65d>] ? __vmalloc_node+0xad/0x120
            00:04:12: [<ffffffffa04df9e0>] ? cfs_cpt_vmalloc+0x20/0x30 [libcfs]
            00:04:12: [<ffffffff8114d9a9>] ? vmalloc_node+0x29/0x30
            00:04:12: [<ffffffffa04df9e0>] ? cfs_cpt_vmalloc+0x20/0x30 [libcfs]
            00:04:12: [<ffffffffa07e726e>] ? ptlrpc_alloc_rqbd+0x13e/0x690 [ptlrpc]
            00:04:12: [<ffffffffa07e7825>] ? ptlrpc_grow_req_bufs+0x65/0x1b0 [ptlrpc]
            00:04:12: [<ffffffffa07e8e07>] ? ptlrpc_register_service+0x1497/0x1770 [ptlrpc]
            00:04:12: [<ffffffffa0e38c62>] ? mdt_init0+0x1202/0x23f0 [mdt]
            00:04:12: [<ffffffffa0e39f43>] ? mdt_device_alloc+0xf3/0x220 [mdt]
            00:04:12: [<ffffffffa064b0d7>] ? obd_setup+0x1d7/0x2f0 [obdclass]
            00:04:12: [<ffffffffa064b3f8>] ? class_setup+0x208/0x890 [obdclass]
            00:04:12: [<ffffffffa065308c>] ? class_process_config+0xc0c/0x1c30 [obdclass]
            00:04:12: [<ffffffffa04ea088>] ? libcfs_log_return+0x28/0x40 [libcfs]
            00:04:12: [<ffffffffa064cef1>] ? lustre_cfg_new+0x391/0x7e0 [obdclass]
            00:04:12: [<ffffffffa065515b>] ? class_config_llog_handler+0x9bb/0x1610 [obdclass]
            00:04:12: [<ffffffffa061e1f8>] ? llog_process_thread+0x888/0xd00 [obdclass]
            00:04:12: [<ffffffffa061d970>] ? llog_process_thread+0x0/0xd00 [obdclass]
            00:04:12: [<ffffffff8100c14a>] ? child_rip+0xa/0x20
            00:04:12: [<ffffffffa061d970>] ? llog_process_thread+0x0/0xd00 [obdclass]
            00:04:12: [<ffffffffa061d970>] ? llog_process_thread+0x0/0xd00 [obdclass]
            00:04:12: [<ffffffff8100c140>] ? child_rip+0x0/0x20
            00:04:12:Mem-Info:
            00:04:12:Node 0 DMA per-cpu:
            00:04:12:CPU    0: hi:    0, btch:   1 usd:   0
            00:04:12:CPU    1: hi:    0, btch:   1 usd:   0
            00:04:12:CPU    2: hi:    0, btch:   1 usd:   0
            00:04:12:CPU    3: hi:    0, btch:   1 usd:   0
            00:04:12:CPU    4: hi:    0, btch:   1 usd:   0
            00:04:12:CPU    5: hi:    0, btch:   1 usd:   0
            00:04:12:CPU    6: hi:    0, btch:   1 usd:   0
            00:04:12:CPU    7: hi:    0, btch:   1 usd:   0
            00:04:12:CPU    8: hi:    0, btch:   1 usd:   0
            00:04:12:CPU    9: hi:    0, btch:   1 usd:   0
            00:04:12:CPU   10: hi:    0, btch:   1 usd:   0
            00:04:12:CPU   11: hi:    0, btch:   1 usd:   0
            00:04:12:CPU   12: hi:    0, btch:   1 usd:   0
            00:04:12:CPU   13: hi:    0, btch:   1 usd:   0
            00:04:12:CPU   14: hi:    0, btch:   1 usd:   0
            00:04:12:CPU   15: hi:    0, btch:   1 usd:   0
            00:04:12:CPU   16: hi:    0, btch:   1 usd:   0
            00:04:12:CPU   17: hi:    0, btch:   1 usd:   0
            00:04:12:CPU   18: hi:    0, btch:   1 usd:   0
            00:04:12:CPU   19: hi:    0, btch:   1 usd:   0
            00:04:12:CPU   20: hi:    0, btch:   1 usd:   0
            00:04:12:CPU   21: hi:    0, btch:   1 usd:   0
            00:04:12:CPU   22: hi:    0, btch:   1 usd:   0
            00:04:13:CPU   23: hi:    0, btch:   1 usd:   0
            00:04:13:Node 0 DMA32 per-cpu:
            00:04:13:CPU    0: hi:  186, btch:  31 usd:  64
            00:04:13:CPU    1: hi:  186, btch:  31 usd:   0
            00:04:13:CPU    2: hi:  186, btch:  31 usd:   0
            00:04:13:CPU    3: hi:  186, btch:  31 usd:   0
            00:04:13:CPU    4: hi:  186, btch:  31 usd:   0
            00:04:13:CPU    5: hi:  186, btch:  31 usd:   0
            00:04:13:CPU    6: hi:  186, btch:  31 usd:   0
            00:04:13:CPU    7: hi:  186, btch:  31 usd:   0
            00:04:13:CPU    8: hi:  186, btch:  31 usd:   0
            00:04:13:CPU    9: hi:  186, btch:  31 usd:   0
            00:04:13:CPU   10: hi:  186, btch:  31 usd:   0
            00:04:13:CPU   11: hi:  186, btch:  31 usd:   0
            00:04:13:CPU   12: hi:  186, btch:  31 usd:   0
            00:04:13:CPU   13: hi:  186, btch:  31 usd:   0
            00:04:13:CPU   14: hi:  186, btch:  31 usd:   0
            00:04:13:CPU   15: hi:  186, btch:  31 usd:   0
            00:04:13:CPU   16: hi:  186, btch:  31 usd:   0
            00:04:13:CPU   17: hi:  186, btch:  31 usd:   0
            00:04:13:CPU   18: hi:  186, btch:  31 usd:   0
            00:04:13:CPU   19: hi:  186, btch:  31 usd:   0
            00:04:13:CPU   20: hi:  186, btch:  31 usd:   0
            00:04:13:CPU   21: hi:  186, btch:  31 usd:   0
            00:04:13:CPU   22: hi:  186, btch:  31 usd:   0
            00:04:13:CPU   23: hi:  186, btch:  31 usd:  58
            00:04:13:active_anon:3696 inactive_anon:1307 isolated_anon:0
            00:04:13: active_file:105 inactive_file:47 isolated_file:0
            00:04:13: unevictable:0 dirty:72 writeback:927 unstable:0
            00:04:13: free:13329 slab_reclaimable:2530 slab_unreclaimable:28916
            00:04:13: mapped:0 shmem:41 pagetables:818 bounce:0
            00:04:13:Node 0 DMA free:8348kB min:332kB low:412kB high:496kB active_anon:0kB inactive_anon:8kB active_file:188kB inactive_file:196kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15324kB mlocked:0kB dirty:4kB writeback:208kB mapped:0kB shmem:0kB slab_reclaimable:176kB slab_unreclaimable:452kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:4552 all_unreclaimable? yes
            00:04:13:lowmem_reserve[]: 0 2003 2003 2003
            00:04:13:Node 0 DMA32 free:44968kB min:44720kB low:55900kB high:67080kB active_anon:14784kB inactive_anon:5220kB active_file:232kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2052060kB mlocked:0kB dirty:284kB writeback:3500kB mapped:0kB shmem:164kB slab_reclaimable:9944kB slab_unreclaimable:115212kB kernel_stack:5760kB pagetables:3272kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
            00:04:13:lowmem_reserve[]: 0 0 0 0
            00:04:13:Node 0 DMA: 49*4kB 43*8kB 38*16kB 23*32kB 11*64kB 5*128kB 2*256kB 3*512kB 1*1024kB 1*2048kB 0*4096kB = 8348kB
            00:04:13:Node 0 DMA32: 1692*4kB 1057*8kB 583*16kB 269*32kB 91*64kB 16*128kB 5*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 45896kB
            00:04:13:1404 total pagecache pages
            00:04:13:1080 pages in swap cache
            00:04:13:Swap cache stats: add 1506, delete 426, find 0/0
            00:04:13:Free swap  = 4122736kB
            00:04:13:Total swap = 4128760kB
            00:04:13:524283 pages RAM
            00:04:13:44356 pages reserved
            00:04:13:301 pages shared
            00:04:13:455288 pages non-shared
            00:04:13:[ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
            00:04:13:[  723]     0   723     2694       35  22     -17         -1000 udevd
            00:04:13:[ 1779]     0  1779     6909       33   7     -17         -1000 auditd
            00:04:13:[ 1795]     0  1795    62270      121   0       0             0 rsyslogd
            00:04:13:[ 1824]     0  1824     2286       42  10       0             0 irqbalance
            00:04:13:[ 1838]    32  1838     4743       17   1       0             0 rpcbind
            00:04:13:[ 1856]    29  1856     5836        3   0       0             0 rpc.statd
            00:04:13:[ 2005]    81  2005     5868       34   0       0             0 dbus-daemon
            00:04:13:[ 2038]     0  2038     1019        0  13       0             0 acpid
            00:04:13:[ 2047]    68  2047     6805      118   0       0             0 hald
            00:04:13:[ 2048]     0  2048     4525        1   5       0             0 hald-runner
            00:04:13:[ 2076]     0  2076     5054        1   0       0             0 hald-addon-inpu
            00:04:13:[ 2083]    68  2083     4450        1   0       0             0 hald-addon-acpi
            00:04:13:[ 2097]     0  2097    45434       37   0       0             0 sssd
            00:04:13:[ 2100]     0  2100    50963      186   0       0             0 sssd_be
            00:04:13:[ 2109]     0  2109    43037      114   1       0             0 sssd_nss
            00:04:13:[ 2110]     0  2110    42892        1   2       0             0 sssd_pam
            00:04:13:[ 2121]     0  2121   167752      323   0       0             0 automount
            00:04:13:[ 2162]     0  2162    26826        1   5       0             0 rpc.rquotad
            00:04:13:[ 2166]     0  2166     5413      144  11       0             0 rpc.mountd
            00:04:13:[ 2215]     0  2215     6290       62   1       0             0 rpc.idmapd
            00:04:13:[ 2258]   498  2258    56709      255   2       0             0 munged
            00:04:13:[ 2273]     0  2273    16017      166   1     -17         -1000 sshd
            00:04:13:[ 2281]     0  2281     5522       60   0       0             0 xinetd
            00:04:13:[ 2289]    38  2289     7004      113   0       0             0 ntpd
            00:04:13:[ 2305]     0  2305    22183      415   1       0             0 sendmail
            00:04:13:[ 2313]    51  2313    19528      348   1       0             0 sendmail
            00:04:13:[ 2335]     0  2335    27017       60   3       0             0 abrt-dump-oops
            00:04:13:[ 2343]     0  2343    29303      152   6       0             0 crond
            00:04:13:[ 2354]     0  2354     5363       45   0       0             0 atd
            00:04:13:[ 2380]     0  2380     1018       23   0       0             0 agetty
            00:04:13:[ 2381]     0  2381     1015       22   0       0             0 mingetty
            00:04:13:[ 2383]     0  2383     1015       22   0       0             0 mingetty
            00:04:13:[ 2385]     0  2385     1015       21   6       0             0 mingetty
            00:04:14:[ 2387]     0  2387     1015       22   9       0             0 mingetty
            00:04:15:[ 2389]     0  2389     1015       21   6       0             0 mingetty
            00:04:15:[ 2390]     0  2390     2706       46   2     -17         -1000 udevd
            00:04:15:[ 2391]     0  2391     2706       39   5     -17         -1000 udevd
            00:04:17:[ 2393]     0  2393     1015       21   6       0             0 mingetty
            00:04:17:[  806]     0   806    15354      173   0       0             0 in.mrshd
            00:04:17:[  807]     0   807    26514       66  15       0             0 bash
            00:04:17:[  828]     0   828    26514       66  16       0             0 bash
            00:04:17:[  829]     0   829    26515       49  20       0             0 sh
            00:04:17:[  831]     0   831    27935       39  22       0             0 mount
            00:04:17:[  832]     0   832     2092       42   0       0             0 mount.lustre
            00:04:18:Out of memory: Kill process 1795 (rsyslogd) score 1 or sacrifice child
            00:04:18:Killed process 1795, UID 0, (rsyslogd) total-vm:249080kB, anon-rss:480kB, file-rss:4kB
            00:04:18:llog_process_th invoked oom-killer: gfp_mask=0xd2, order=0, oom_adj=-17, oom_score_adj=0
            00:04:18:llog_process_th cpuset=/ mems_allowed=0
            00:04:18:Pid: 900, comm: llog_process_th Not tainted 2.6.32-279.5.1.el6_lustre.gb16fe80.x86_64 #1
            00:04:19:Call Trace:
            00:04:19: [<ffffffff810c4aa1>] ? cpuset_print_task_mems_allowed+0x91/0xb0
            00:04:19: [<ffffffff81117210>] ? dump_header+0x90/0x1b0
            00:04:19: [<ffffffff810e368e>] ? __delayacct_freepages_end+0x2e/0x30
            00:04:19: [<ffffffff8121489c>] ? security_real_capable_noaudit+0x3c/0x70
            00:04:19: [<ffffffff81117692>] ? oom_kill_process+0x82/0x2a0
            00:04:20: [<ffffffff8111758e>] ? select_bad_process+0x9e/0x120
            00:04:20: [<ffffffff81117ad0>] ? out_of_memory+0x220/0x3c0
            00:04:20: [<ffffffff811277ee>] ? __alloc_pages_nodemask+0x89e/0x940
            00:04:20: [<ffffffff8114d789>] ? __vmalloc_area_node+0xb9/0x190
            00:04:20: [<ffffffffa04df9e0>] ? cfs_cpt_vmalloc+0x20/0x30 [libcfs]
            00:04:20: [<ffffffff8114d65d>] ? __vmalloc_node+0xad/0x120
            00:04:20: [<ffffffffa04df9e0>] ? cfs_cpt_vmalloc+0x20/0x30 [libcfs]
            00:04:20: [<ffffffff8114d9a9>] ? vmalloc_node+0x29/0x30
            00:04:20: [<ffffffffa04df9e0>] ? cfs_cpt_vmalloc+0x20/0x30 [libcfs]
            00:04:20: [<ffffffffa07e726e>] ? ptlrpc_alloc_rqbd+0x13e/0x690 [ptlrpc]
            00:04:20: [<ffffffffa07e7825>] ? ptlrpc_grow_req_bufs+0x65/0x1b0 [ptlrpc]
            00:04:20: [<ffffffffa07e8e07>] ? ptlrpc_register_service+0x1497/0x1770 [ptlrpc]
            00:04:20: [<ffffffffa0e38c62>] ? mdt_init0+0x1202/0x23f0 [mdt]
            00:04:20: [<ffffffffa0e39f43>] ? mdt_device_alloc+0xf3/0x220 [mdt]
            00:04:20: [<ffffffffa064b0d7>] ? obd_setup+0x1d7/0x2f0 [obdclass]
            00:04:20: [<ffffffffa064b3f8>] ? class_setup+0x208/0x890 [obdclass]
            00:04:20: [<ffffffffa065308c>] ? class_process_config+0xc0c/0x1c30 [obdclass]
            00:04:20: [<ffffffffa04ea088>] ? libcfs_log_return+0x28/0x40 [libcfs]
            00:04:20: [<ffffffffa064cef1>] ? lustre_cfg_new+0x391/0x7e0 [obdclass]
            00:04:20: [<ffffffffa065515b>] ? class_config_llog_handler+0x9bb/0x1610 [obdclass]
            00:04:20: [<ffffffffa061e1f8>] ? llog_process_thread+0x888/0xd00 [obdclass]
            00:04:20: [<ffffffffa061d970>] ? llog_process_thread+0x0/0xd00 [obdclass]
            00:04:20: [<ffffffff8100c14a>] ? child_rip+0xa/0x20
            00:04:20: [<ffffffffa061d970>] ? llog_process_thread+0x0/0xd00 [obdclass]
            00:04:20: [<ffffffffa061d970>] ? llog_process_thread+0x0/0xd00 [obdclass]
            00:04:20: [<ffffffff8100c140>] ? child_rip+0x0/0x20
            

            Additional failures:
            https://maloo.whamcloud.com/test_sets/ebd2a076-8b97-11e2-abec-52540035b04c
            https://maloo.whamcloud.com/test_sets/89f10fd6-8b98-11e2-abec-52540035b04c
            https://maloo.whamcloud.com/test_sets/3a168224-8b99-11e2-abec-52540035b04c


            jlevi Jodi Levi (Inactive) added a comment -

            Per discussions with BobiJam, lowering the priority since http://review.whamcloud.com/4940 has landed and is believed to help. Will raise the priority again if further testing proves otherwise.

            adilger Andreas Dilger added a comment -

            I pushed http://review.whamcloud.com/5470 to quiet the spurious "cannot allocate a tage" message, but this does not solve the root of the problem, which is that the MDS is consuming more and more memory.

            People

              sarah Sarah Liu
              maloo Maloo
              Votes: 0
              Watchers: 16
