Details
Type: Bug
Resolution: Duplicate
Priority: Critical
Description
This issue was created by maloo for John Hammond <john.hammond@intel.com>
This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/e9a29482-365e-11e4-a823-5254006e85c2.
The sub-test test_38 failed with the following error:
import is not in FULL state
MDS 1 triggered the OOM killer in test_38.
https://testing.hpdd.intel.com/test_logs/7bd1c0b2-365f-11e4-a823-5254006e85c2
Info required for matching: replay-single 38
Attachments
Issue Links
Activity
I've gotten some pretty useful information about what memory was allocated on my clients via:
# lctl set_param debug=+malloc debug_mb=256
# umount $LUSTRE
# lctl dk /tmp/umount.debug
# perl lustre/tests/leak_finder.pl /tmp/umount.debug > /tmp/umount.leak 2>&1
# awk '/Free without malloc/ { print $2,$5,$6,$9 }' /tmp/umount.leak | sort | uniq -c | sort -nr > /tmp/umount.hist
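A cross-check that does not depend on the Lustre debug log is to snapshot the generic slab counters around the run (plain /proc/slabinfo, nothing Lustre-specific assumed; paths are just examples):

# awk '{ print $1, $2 }' /proc/slabinfo > /tmp/slab.before
(run the workload and umount as above)
# awk '{ print $1, $2 }' /proc/slabinfo > /tmp/slab.after
# diff /tmp/slab.before /tmp/slab.after

Any cache whose active-object count stays high after the unmount is a candidate to look for in the leak_finder output.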
I noticed today that at the end of several tests we unload the modules on the client that runs the test script but not on the other nodes. So for example there could be weird reply states leftover from replay-mgs-triple or some other random test.
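One quick way to confirm that is to check for leftover Lustre modules on the remote nodes once a suite finishes; for example (the node list here is hypothetical):

# pdsh -w mds1,mds2,oss1 'lsmod | grep -w ptlrpc'

Nodes that were fully cleaned up should print nothing.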
Thanks John, it's a good catch. We enable FULL logs right before the test with OOM.
start_full_debug_logging
run_test 37 "abort recovery before client does replay (test mds_cleanup_orphans)"
stop_full_debug_logging

test_38() {
...
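For reference, start_full_debug_logging amounts to enabling the full debug mask and a much larger debug buffer on every node, roughly (the exact values live in test-framework.sh, so treat this as a sketch):

# lctl set_param debug=-1 debug_mb=150

and stop_full_debug_logging restores the saved settings afterwards, so test_38 itself should start with the normal debug configuration again.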
I will try to reproduce this.
I got a crash dump from a recent replay-single OOM. See https://testing.hpdd.intel.com/test_logs/ab38fbc4-4a0d-11e4-880b-5254006e85c2.
There is a lot of size-512 and size-128 slab usage:
crash> kmem -i
                 PAGES        TOTAL      PERCENTAGE
    TOTAL MEM   480589       1.8 GB         ----
         FREE    13243      51.7 MB    2% of TOTAL MEM
         USED   467346       1.8 GB   97% of TOTAL MEM
       SHARED      181       724 KB    0% of TOTAL MEM
      BUFFERS      115       460 KB    0% of TOTAL MEM
       CACHED       -3  70368744177664 GB  38383616923628% of TOTAL MEM
         SLAB   417875       1.6 GB   86% of TOTAL MEM

   TOTAL SWAP  1032191       3.9 GB         ----
    SWAP USED     3829        15 MB    0% of TOTAL SWAP
    SWAP FREE  1028362       3.9 GB   99% of TOTAL SWAP

crash> kmem -s | awk '$4 != 0' | sort -k2
CACHE  NAME  OBJSIZE  ALLOCATED  TOTAL  SLABS  SSIZE
ffff88007e4011c0  Acpi-Namespace  40  400  460  5  4k
ffff88007e4412c0  Acpi-Operand  72  1204  1219  23  4k
ffff88007f310ac0  anon_vma  40  1397  1656  18  4k
ffff88007f320b00  anon_vma_chain  48  1779  2079  27  4k
ffff88007e201c80  arp_cache  256  6  15  1  4k
ffff88007efb1080  bdev_cache  832  35  36  9  4k
ffff88007e681700  bio-0  192  177  460  23  4k
ffff88007e6716c0  biovec-256  4096  98  98  98  4k
ffff88007e6315c0  bip-256  4224  2  2  2  8k
ffff88007e6c17c0  blkdev_ioc  80  1  48  1  4k
ffff88007e701840  blkdev_queue  2864  31  32  16  8k
ffff88007e6d1800  blkdev_requests  352  5  33  3  4k
ffff88007f3d0dc0  buffer_head  104  441  2775  75  4k
ffff8800798c3a00  ccc_session_kmem  224  4  34  2  4k
ffff880059e539c0  ccc_thread_kmem  360  8  22  2  4k
ffff88007c5333c0  cl_env_kmem  168  1  23  1  4k
ffff88007f330b40  cred_jar  192  273  380  19  4k
ffff88007ec30f40  dentry  192  5574  8200  410  4k
ffff880037b525c0  dm_io  40  1536  1564  17  4k
ffff880037b62600  dm_target_io  24  1536  1584  11  4k
ffff880037832240  dnotify_mark_entry  112  6  34  1  4k
ffff880037822200  dnotify_struct  32  6  112  1  4k
ffff880037882380  dquot  256  2  15  1  4k
ffff88007e161a80  eventpoll_epi  128  5  30  1  4k
ffff88007e171ac0  eventpoll_pwq  72  5  53  1  4k
ffff8800377d2980  ext3_inode_cache  800  303  715  143  4k
ffff8800377c2940  ext3_xattr  88  1  44  1  4k
ffff880079b92c40  fib6_nodes  64  22  59  1  4k
ffff88007e571380  file_lock_cache  176  6  22  1  4k
ffff88007f380c80  files_cache  704  35  88  8  8k
ffff88007ef60fc0  filp  192  545  940  47  4k
ffff88007f390cc0  fs_cache  64  76  118  2  4k
ffff88007d612d40  fscache_cookie_jar  80  4  48  1  4k
ffff8800378923c0  hugetlbfs_inode_cache  608  1  6  1  4k
ffff88007fa809c0  idr_layer_cache  544  280  294  42  4k
ffff88007e301e00  inet_peer_cache  64  1  59  1  4k
ffff88007ee50f80  inode_cache  592  4471  4608  768  4k
ffff880037842280  inotify_inode_mark_entry  112  115  136  4  4k
ffff880079b52c00  ip6_dst_cache  384  13  20  2  4k
ffff88007e211cc0  ip_dst_cache  384  18  40  4  4k
ffff88007e2c1d00  ip_fib_hash  72  10  53  1  4k
ffff880057953080  jbd2_revoke_table  16  2  202  1  4k
ffff88007f3e0e00  key_jar  192  5  20  1  4k
ffff8800793e2140  khugepaged_mm_slot  40  1  92  1  4k
ffffffff81ac5420  kmem_cache  32896  243  243  243  64k
ffff88005b4c3280  ldiskfs_inode_cache  1048  260  264  88  4k
ffff88005a993180  ldiskfs_prealloc_space  112  1  34  1  4k
ffff880060b734c0  ldlm_locks  512  69  91  13  4k
ffff880060b63480  ldlm_resources  320  6  60  5  4k
ffff88007b9e32c0  ll_obd_dev_cache  3592  20  20  20  4k
ffff88006bfc3c40  lod_obj  120  31  64  2  4k
ffff8800659c3600  lov_session_kmem  400  4  20  2  4k
ffff88005ca935c0  lov_thread_kmem  384  8  20  2  4k
ffff880079f03ac0  lqe_kmem  200  4  19  1  4k
ffff880059d33c80  mdd_obj  80  31  96  2  4k
ffff88005b473b80  mdt_obj  248  31  48  3  4k
ffff88007f3a0d00  mm_struct  1408  34  55  11  8k
ffff88007ef71000  mnt_cache  256  31  45  3  4k
ffff8800378d2440  mqueue_inode_cache  896  1  4  1  4k
ffff880079ae2bc0  ndisc_cache  256  1  15  1  4k
ffff88007c5b2dc0  nfs_inode_cache  1048  2  3  1  4k
ffff88007c5d2e40  nfs_write_data  960  36  36  9  4k
ffff88007fab0a00  numa_policy  136  1  28  1  4k
ffff88007f300a80  pid  128  221  270  9  4k
ffff88007efd1100  proc_inode_cache  656  29  444  74  4k
ffff88007bdd3440  ptlrpc_cache  1024  115  124  31  4k
ffff88007f3f0e40  radix_tree_node  560  96  644  92  4k
ffff88007e1e1c00  RAW  832  3  9  1  8k
ffff8800798a2b40  RAWv6  1024  5  8  2  4k
ffff880037752880  revoke_table  16  2  202  1  4k
ffff88007d602d00  rpc_buffers  2048  8  8  4  4k
ffff88007dbc2c80  rpc_inode_cache  832  22  24  6  4k
ffff88007dbd2cc0  rpc_tasks  256  8  15  1  4k
ffff8800375927c0  scsi_cmd_cache  256  2  15  1  4k
ffff8800375a2800  scsi_sense_cache  128  2  30  1  4k
ffff88007ec00e80  selinux_inode_security  72  7240  7314  138  4k
ffff88007e7f1a00  sgpool-128  4096  2  2  2  4k
ffff88007e7c1940  sgpool-16  512  2  8  1  4k
ffff88007e001a40  sgpool-256  8192  2  2  2  8k
ffff88007e7d1980  sgpool-32  1024  2  4  1  4k
ffff88007e7e19c0  sgpool-64  2048  2  2  1  4k
ffff88007e7b1900  sgpool-8  256  2  15  1  4k
ffff88007e531300  shmem_inode_cache  784  674  685  137  4k
ffff88007f360c00  sighand_cache  2112  202  222  74  8k
ffff88007f370c40  signal_cache  1088  202  224  32  8k
ffff88007efc10c0  sigqueue  160  1  24  1  4k
ffff88007f8d0340  size-1024  1024  1166  1168  292  4k
ffff88007f810080  size-128  128  1746292  1746390  58213  4k
ffff88007f950540  size-16384  16384  22  22  22  16k
ffff88007f8701c0  size-192  192  2482  2660  133  4k
ffff88007f8f03c0  size-2048  2048  1155  1180  590  4k
ffff88007f890240  size-256  256  420  525  35  4k
ffff88007f800040  size-32  32  42008  42784  382  4k
ffff88007f9705c0  size-32768  32768  2  2  2  32k
ffff88007f910440  size-4096  4096  488  582  582  4k
ffff88007f8b02c0  size-512  512  2788504  2788536  348567  4k
ffff88007f840100  size-64  64  11193  11682  198  4k
ffff88007f990640  size-65536  65536  1  1  1  64k
ffff88007f9304c0  size-8192  8192  686  686  686  8k
ffff88007e591400  skbuff_fclone_cache  512  3  301  43  4k
ffff88007e5813c0  skbuff_head_cache  256  68  210  14  4k
ffff88007e5a1440  sock_inode_cache  704  129  150  30  4k
ffff88007ef81040  sysfs_dir_cache  144  8229  8262  306  4k
ffff88007eff1180  task_delay_info  112  213  272  8  4k
ffff88007efe1140  taskstats  328  1  12  1  4k
ffff88007f340b80  task_struct  2656  212  225  75  8k
ffff88007f350bc0  task_xstate  512  44  104  13  4k
ffff88007e1a1b00  TCP  1728  24  28  7  8k
ffff88007e311e40  tcp_bind_bucket  64  96  472  8  4k
ffff88007c692a00  TCPv6  1920  12  14  7  4k
ffff88007e1c1b80  tw_sock_TCP  256  75  210  14  4k
ffff88007bcf2a80  tw_sock_TCPv6  320  15  60  5  4k
ffff88007e1d1bc0  UDP  896  16  28  7  4k
ffff88007b652ac0  UDPv6  1024  12  12  3  4k
ffff880079341f40  uid_cache  128  7  30  1  4k
ffff88007d831f00  UNIX  832  38  63  7  8k
ffff88007f3b0d40  vm_area_struct  200  1736  2033  107  4k
ffff88007b063900  vvp_session_kmem  96  4  40  1  4k
ffff88005cb538c0  vvp_thread_kmem  376  8  20  2  4k
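Those two generic caches alone account for nearly all of the SLAB figure above:

# echo $(( 2788504 * 512 + 1746292 * 128 )) bytes
1651239424 bytes

That is roughly 1.5 GiB (about 1.33 GiB in size-512 plus 0.21 GiB in size-128) out of 1.6 GB of SLAB on a 1.8 GB node.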
crash> bt
PID: 14843  TASK: ffff88005832d540  CPU: 1  COMMAND: "mdt_out00_000"
 #0 [ffff88000e41d538] machine_kexec at ffffffff81038f3b
 #1 [ffff88000e41d598] crash_kexec at ffffffff810c5f42
 #2 [ffff88000e41d668] panic at ffffffff81529003
 #3 [ffff88000e41d6e8] check_panic_on_oom at ffffffff81122d1c
 #4 [ffff88000e41d718] out_of_memory at ffffffff811233db
 #5 [ffff88000e41d798] __alloc_pages_nodemask at ffffffff8112fd5f
 #6 [ffff88000e41d8d8] kmem_getpages at ffffffff8116e6d2
 #7 [ffff88000e41d908] fallback_alloc at ffffffff8116f2ea
 #8 [ffff88000e41d988] ____cache_alloc_node at ffffffff8116f069
 #9 [ffff88000e41d9e8] __kmalloc at ffffffff8116fe39
#10 [ffff88000e41da38] crypto_create_tfm at ffffffff8124cd7c
#11 [ffff88000e41da68] crypto_init_shash_ops at ffffffff81253718
#12 [ffff88000e41da98] __crypto_alloc_tfm at ffffffff8124ce8a
#13 [ffff88000e41dac8] crypto_alloc_base at ffffffff8124d6fa
#14 [ffff88000e41db18] cfs_crypto_hash_alloc at ffffffffa048d107 [libcfs]
#15 [ffff88000e41db68] cfs_crypto_hash_digest at ffffffffa048d7e6 [libcfs]
#16 [ffff88000e41dbe8] lustre_msg_calc_cksum at ffffffffa081f3e3 [ptlrpc]
#17 [ffff88000e41dc38] null_authorize at ffffffffa0858e01 [ptlrpc]
#18 [ffff88000e41dc58] sptlrpc_svc_wrap_reply at ffffffffa0847e96 [ptlrpc]
#19 [ffff88000e41dc78] ptlrpc_send_reply at ffffffffa081780c [ptlrpc]
#20 [ffff88000e41dcf8] ptlrpc_at_check_timed at ffffffffa082ee85 [ptlrpc]
#21 [ffff88000e41dda8] ptlrpc_main at ffffffffa0830808 [ptlrpc]
#22 [ffff88000e41dee8] kthread at ffffffff8109abf6
#23 [ffff88000e41df48] kernel_thread at ffffffff8100c20a
I wonder if the trace from ldlm_pools_{shrink,count}() is part of the problem here. The Lustre logs on mds1 for the last second before the OOM are just from these two functions:
$ grep :1412212833 1.df
00010000:00010000:1.0:1412212833.013589:0:14843:0:(ldlm_pool.c:1191:ldlm_pools_shrink()) Request to shrink 0 server locks from all pools
00010000:00010000:1.0:1412212833.013591:0:14843:0:(ldlm_pool.c:1057:ldlm_pools_count()) Request to count server locks from all pools
00010000:00010000:0.0:1412212833.013594:0:14836:0:(ldlm_pool.c:1191:ldlm_pools_shrink()) Request to shrink 0 server locks from all pools
00010000:00010000:1.0:1412212833.013594:0:14843:0:(ldlm_pool.c:1191:ldlm_pools_shrink()) Request to shrink 0 client locks from all pools
00010000:00010000:0.0:1412212833.013596:0:14836:0:(ldlm_pool.c:1057:ldlm_pools_count()) Request to count server locks from all pools
...
$ grep :1412212833 1.df | wc
  60378  583523 8190686
That's 88MB of ASCII. We could have run ourselves OOM with all of this shrinker chatter while trying to satisfy an order-2 allocation, likely from the same OUT thread above, whose reply buffers are sized by:

#define OUT_UPDATE_REPLY_SIZE 8192
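To see how much of the dumped log is this shrinker chatter versus anything else, a quick per-function histogram works (assuming the usual debug log layout, where the source function appears as "name()"):

$ grep -o '[A-Za-z0-9_]*()' 1.df | sort | uniq -c | sort -nr | head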
The core file is at /export/scratch/dumps/shadow-13vm12.shadow.whamcloud.com/10.1.4.157-2014-10-02-01:20:37/vmcore.
Another failure instance:
https://testing.hpdd.intel.com/test_sets/e60a718e-466a-11e4-b520-5254006e85c2
Another failure instance:
https://testing.hpdd.intel.com/test_sets/7c7023ca-419e-11e4-9398-5254006e85c2
It doesn't look quite the same as LU-5077, since it doesn't have the same crypto_create_tfm() allocation in the stack that I've seen in many other cases.
That said, I think there is some kind of general problem where the MDS is allocating some data structures on the slab, but not freeing them until the MDS is unmounted. While this is not strictly a "leak", it is definitely causing memory shortages during runtime.
It would be useful to run replay-single while watching slabtop on the MDS to see in which slab the memory is being allocated and not freed. If it is a generic slab (e.g. size-128), then the MDS needs to run with "lctl set_param debug=+malloc" so we can see what the memory is actually being allocated for, and it would be useful to run that log through lustre/tests/leak_finder.pl. The output isn't 100% consistent on a running system, but it may still show which structures are always being allocated and never freed.
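That procedure can be scripted on the MDS along the same lines as the client commands earlier in this ticket (the slab names and the leak_finder.pl invocation are taken from above; file paths are just examples):

# lctl set_param debug=+malloc debug_mb=256                       # on the MDS, before the run
# watch -n 10 'grep -E "^(size-128|size-512) " /proc/slabinfo'    # while replay-single runs
# lctl dk /tmp/mds.debug                                          # after the failure
# perl lustre/tests/leak_finder.pl /tmp/mds.debug > /tmp/mds.leak 2>&1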
Closing as a duplicate of LU-5077.