[LU-5595] replay-single test_38: MDS 1 OOM Created: 08/Sep/14  Updated: 19/Feb/15  Resolved: 21/Oct/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Maloo Assignee: Dmitry Eremin (Inactive)
Resolution: Duplicate Votes: 0
Labels: None

Issue Links:
Related
is related to LU-5077 insanity test_1: out of memory on MDT... Resolved
is related to LU-5585 MDS became unresponsive, clients hang... Resolved
Severity: 3
Rank (Obsolete): 15619

 Description   

This issue was created by maloo for John Hammond <john.hammond@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/e9a29482-365e-11e4-a823-5254006e85c2.

The sub-test test_38 failed with the following error:

import is not in FULL state

MDS 1 triggered the OOM killer in test_38.

https://testing.hpdd.intel.com/test_logs/7bd1c0b2-365f-11e4-a823-5254006e85c2

Info required for matching: replay-single 38



 Comments   
Comment by Dmitry Eremin (Inactive) [ 08/Sep/14 ]

Most likely it's related to LU-5077.

Comment by Andreas Dilger [ 12/Sep/14 ]

It doesn't look quite the same as LU-5077, since it doesn't have the same crypto_create_tfm() allocation in the stack that I've seen in many other cases.

That said, I think there is some kind of general problem where the MDS is allocating some data structures on the slab, but not freeing them until the MDS is unmounted. While this is not strictly a "leak", it is definitely causing memory shortages during runtime.

It would be useful to run replay-single while watching slabtop on the MDS to see which slab the memory is being allocated from and not freed. If it is a generic slab (e.g. size-128) then the MDS needs to run with "lctl set_param debug=+malloc" to see what the memory is actually being allocated for. It would also be useful to run the +malloc log through lustre/tests/leak_finder.pl. While it isn't 100% consistent on a running system, it may be possible to see which structures are always being added but not freed.
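
As a rough illustration of that workflow (a sketch only; the log file names are placeholders, and the commands are run as root on the MDS node):

# slabtop -o -s c | head -20                  (one-shot snapshot of the largest slab caches)
# lctl set_param debug=+malloc debug_mb=256   (trace allocations/frees in a larger debug buffer)
  ... run replay-single from the test node ...
# lctl dk /tmp/mds.debug                      (dump the kernel debug log to a file)
# perl lustre/tests/leak_finder.pl /tmp/mds.debug > /tmp/mds.leak 2>&1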

Comment by Peter Jones [ 12/Sep/14 ]

Dmitry

Could you please dig into this issue?

Thanks

Peter

Comment by nasf (Inactive) [ 22/Sep/14 ]

Another failure instance:
https://testing.hpdd.intel.com/test_sets/7c7023ca-419e-11e4-9398-5254006e85c2

Comment by nasf (Inactive) [ 28/Sep/14 ]

Another failure instance:
https://testing.hpdd.intel.com/test_sets/e60a718e-466a-11e4-b520-5254006e85c2

Comment by John Hammond [ 02/Oct/14 ]

I got a crash dump from a recent replay-single OOM. See https://testing.hpdd.intel.com/test_logs/ab38fbc4-4a0d-11e4-880b-5254006e85c2.

There is a lot of size-512 and size-128 slab usage:

crash> kmem -i
              PAGES        TOTAL      PERCENTAGE
 TOTAL MEM   480589       1.8 GB         ----
      FREE    13243      51.7 MB    2% of TOTAL MEM
      USED   467346       1.8 GB   97% of TOTAL MEM
    SHARED      181       724 KB    0% of TOTAL MEM
   BUFFERS      115       460 KB    0% of TOTAL MEM
    CACHED       -3  70368744177664 GB  38383616923628% of TOTAL MEM
      SLAB   417875       1.6 GB   86% of TOTAL MEM

TOTAL SWAP  1032191       3.9 GB         ----
 SWAP USED     3829        15 MB    0% of TOTAL SWAP
 SWAP FREE  1028362       3.9 GB   99% of TOTAL SWAP



crash> kmem -s | awk '$4 != 0' | sort -k2
CACHE            NAME                 OBJSIZE  ALLOCATED     TOTAL  SLABS  SSIZE
ffff88007e4011c0 Acpi-Namespace            40        400       460      5     4k
ffff88007e4412c0 Acpi-Operand              72       1204      1219     23     4k
ffff88007f310ac0 anon_vma                  40       1397      1656     18     4k
ffff88007f320b00 anon_vma_chain            48       1779      2079     27     4k
ffff88007e201c80 arp_cache                256          6        15      1     4k
ffff88007efb1080 bdev_cache               832         35        36      9     4k
ffff88007e681700 bio-0                    192        177       460     23     4k
ffff88007e6716c0 biovec-256              4096         98        98     98     4k
ffff88007e6315c0 bip-256                 4224          2         2      2     8k
ffff88007e6c17c0 blkdev_ioc                80          1        48      1     4k
ffff88007e701840 blkdev_queue            2864         31        32     16     8k
ffff88007e6d1800 blkdev_requests          352          5        33      3     4k
ffff88007f3d0dc0 buffer_head              104        441      2775     75     4k
ffff8800798c3a00 ccc_session_kmem         224          4        34      2     4k
ffff880059e539c0 ccc_thread_kmem          360          8        22      2     4k
ffff88007c5333c0 cl_env_kmem              168          1        23      1     4k
ffff88007f330b40 cred_jar                 192        273       380     19     4k
ffff88007ec30f40 dentry                   192       5574      8200    410     4k
ffff880037b525c0 dm_io                     40       1536      1564     17     4k
ffff880037b62600 dm_target_io              24       1536      1584     11     4k
ffff880037832240 dnotify_mark_entry       112          6        34      1     4k
ffff880037822200 dnotify_struct            32          6       112      1     4k
ffff880037882380 dquot                    256          2        15      1     4k
ffff88007e161a80 eventpoll_epi            128          5        30      1     4k
ffff88007e171ac0 eventpoll_pwq             72          5        53      1     4k
ffff8800377d2980 ext3_inode_cache         800        303       715    143     4k
ffff8800377c2940 ext3_xattr                88          1        44      1     4k
ffff880079b92c40 fib6_nodes                64         22        59      1     4k
ffff88007e571380 file_lock_cache          176          6        22      1     4k
ffff88007f380c80 files_cache              704         35        88      8     8k
ffff88007ef60fc0 filp                     192        545       940     47     4k
ffff88007f390cc0 fs_cache                  64         76       118      2     4k
ffff88007d612d40 fscache_cookie_jar        80          4        48      1     4k
ffff8800378923c0 hugetlbfs_inode_cache    608          1         6      1     4k
ffff88007fa809c0 idr_layer_cache          544        280       294     42     4k
ffff88007e301e00 inet_peer_cache           64          1        59      1     4k
ffff88007ee50f80 inode_cache              592       4471      4608    768     4k
ffff880037842280 inotify_inode_mark_entry 112        115       136      4     4k
ffff880079b52c00 ip6_dst_cache            384         13        20      2     4k
ffff88007e211cc0 ip_dst_cache             384         18        40      4     4k
ffff88007e2c1d00 ip_fib_hash               72         10        53      1     4k
ffff880057953080 jbd2_revoke_table         16          2       202      1     4k
ffff88007f3e0e00 key_jar                  192          5        20      1     4k
ffff8800793e2140 khugepaged_mm_slot        40          1        92      1     4k
ffffffff81ac5420 kmem_cache             32896        243       243    243    64k
ffff88005b4c3280 ldiskfs_inode_cache     1048        260       264     88     4k
ffff88005a993180 ldiskfs_prealloc_space   112          1        34      1     4k
ffff880060b734c0 ldlm_locks               512         69        91     13     4k
ffff880060b63480 ldlm_resources           320          6        60      5     4k
ffff88007b9e32c0 ll_obd_dev_cache        3592         20        20     20     4k
ffff88006bfc3c40 lod_obj                  120         31        64      2     4k
ffff8800659c3600 lov_session_kmem         400          4        20      2     4k
ffff88005ca935c0 lov_thread_kmem          384          8        20      2     4k
ffff880079f03ac0 lqe_kmem                 200          4        19      1     4k
ffff880059d33c80 mdd_obj                   80         31        96      2     4k
ffff88005b473b80 mdt_obj                  248         31        48      3     4k
ffff88007f3a0d00 mm_struct               1408         34        55     11     8k
ffff88007ef71000 mnt_cache                256         31        45      3     4k
ffff8800378d2440 mqueue_inode_cache       896          1         4      1     4k
ffff880079ae2bc0 ndisc_cache              256          1        15      1     4k
ffff88007c5b2dc0 nfs_inode_cache         1048          2         3      1     4k
ffff88007c5d2e40 nfs_write_data           960         36        36      9     4k
ffff88007fab0a00 numa_policy              136          1        28      1     4k
ffff88007f300a80 pid                      128        221       270      9     4k
ffff88007efd1100 proc_inode_cache         656         29       444     74     4k
ffff88007bdd3440 ptlrpc_cache            1024        115       124     31     4k
ffff88007f3f0e40 radix_tree_node          560         96       644     92     4k
ffff88007e1e1c00 RAW                      832          3         9      1     8k
ffff8800798a2b40 RAWv6                   1024          5         8      2     4k
ffff880037752880 revoke_table              16          2       202      1     4k
ffff88007d602d00 rpc_buffers             2048          8         8      4     4k
ffff88007dbc2c80 rpc_inode_cache          832         22        24      6     4k
ffff88007dbd2cc0 rpc_tasks                256          8        15      1     4k
ffff8800375927c0 scsi_cmd_cache           256          2        15      1     4k
ffff8800375a2800 scsi_sense_cache         128          2        30      1     4k
ffff88007ec00e80 selinux_inode_security    72       7240      7314    138     4k
ffff88007e7f1a00 sgpool-128              4096          2         2      2     4k
ffff88007e7c1940 sgpool-16                512          2         8      1     4k
ffff88007e001a40 sgpool-256              8192          2         2      2     8k
ffff88007e7d1980 sgpool-32               1024          2         4      1     4k
ffff88007e7e19c0 sgpool-64               2048          2         2      1     4k
ffff88007e7b1900 sgpool-8                 256          2        15      1     4k
ffff88007e531300 shmem_inode_cache        784        674       685    137     4k
ffff88007f360c00 sighand_cache           2112        202       222     74     8k
ffff88007f370c40 signal_cache            1088        202       224     32     8k
ffff88007efc10c0 sigqueue                 160          1        24      1     4k
ffff88007f8d0340 size-1024               1024       1166      1168    292     4k
ffff88007f810080 size-128                 128    1746292   1746390  58213     4k
ffff88007f950540 size-16384             16384         22        22     22    16k
ffff88007f8701c0 size-192                 192       2482      2660    133     4k
ffff88007f8f03c0 size-2048               2048       1155      1180    590     4k
ffff88007f890240 size-256                 256        420       525     35     4k
ffff88007f800040 size-32                   32      42008     42784    382     4k
ffff88007f9705c0 size-32768             32768          2         2      2    32k
ffff88007f910440 size-4096               4096        488       582    582     4k
ffff88007f8b02c0 size-512                 512    2788504   2788536 348567     4k
ffff88007f840100 size-64                   64      11193     11682    198     4k
ffff88007f990640 size-65536             65536          1         1      1    64k
ffff88007f9304c0 size-8192               8192        686       686    686     8k
ffff88007e591400 skbuff_fclone_cache      512          3       301     43     4k
ffff88007e5813c0 skbuff_head_cache        256         68       210     14     4k
ffff88007e5a1440 sock_inode_cache         704        129       150     30     4k
ffff88007ef81040 sysfs_dir_cache          144       8229      8262    306     4k
ffff88007eff1180 task_delay_info          112        213       272      8     4k
ffff88007efe1140 taskstats                328          1        12      1     4k
ffff88007f340b80 task_struct             2656        212       225     75     8k
ffff88007f350bc0 task_xstate              512         44       104     13     4k
ffff88007e1a1b00 TCP                     1728         24        28      7     8k
ffff88007e311e40 tcp_bind_bucket           64         96       472      8     4k
ffff88007c692a00 TCPv6                   1920         12        14      7     4k
ffff88007e1c1b80 tw_sock_TCP              256         75       210     14     4k
ffff88007bcf2a80 tw_sock_TCPv6            320         15        60      5     4k
ffff88007e1d1bc0 UDP                      896         16        28      7     4k
ffff88007b652ac0 UDPv6                   1024         12        12      3     4k
ffff880079341f40 uid_cache                128          7        30      1     4k
ffff88007d831f00 UNIX                     832         38        63      7     8k
ffff88007f3b0d40 vm_area_struct           200       1736      2033    107     4k
ffff88007b063900 vvp_session_kmem          96          4        40      1     4k
ffff88005cb538c0 vvp_thread_kmem          376          8        20      2     4k
crash> bt
PID: 14843  TASK: ffff88005832d540  CPU: 1   COMMAND: "mdt_out00_000"
 #0 [ffff88000e41d538] machine_kexec at ffffffff81038f3b
 #1 [ffff88000e41d598] crash_kexec at ffffffff810c5f42
 #2 [ffff88000e41d668] panic at ffffffff81529003
 #3 [ffff88000e41d6e8] check_panic_on_oom at ffffffff81122d1c
 #4 [ffff88000e41d718] out_of_memory at ffffffff811233db
 #5 [ffff88000e41d798] __alloc_pages_nodemask at ffffffff8112fd5f
 #6 [ffff88000e41d8d8] kmem_getpages at ffffffff8116e6d2
 #7 [ffff88000e41d908] fallback_alloc at ffffffff8116f2ea
 #8 [ffff88000e41d988] ____cache_alloc_node at ffffffff8116f069
 #9 [ffff88000e41d9e8] __kmalloc at ffffffff8116fe39
#10 [ffff88000e41da38] crypto_create_tfm at ffffffff8124cd7c
#11 [ffff88000e41da68] crypto_init_shash_ops at ffffffff81253718
#12 [ffff88000e41da98] __crypto_alloc_tfm at ffffffff8124ce8a
#13 [ffff88000e41dac8] crypto_alloc_base at ffffffff8124d6fa
#14 [ffff88000e41db18] cfs_crypto_hash_alloc at ffffffffa048d107 [libcfs]
#15 [ffff88000e41db68] cfs_crypto_hash_digest at ffffffffa048d7e6 [libcfs]
#16 [ffff88000e41dbe8] lustre_msg_calc_cksum at ffffffffa081f3e3 [ptlrpc]
#17 [ffff88000e41dc38] null_authorize at ffffffffa0858e01 [ptlrpc]
#18 [ffff88000e41dc58] sptlrpc_svc_wrap_reply at ffffffffa0847e96 [ptlrpc]
#19 [ffff88000e41dc78] ptlrpc_send_reply at ffffffffa081780c [ptlrpc]
#20 [ffff88000e41dcf8] ptlrpc_at_check_timed at ffffffffa082ee85 [ptlrpc]
#21 [ffff88000e41dda8] ptlrpc_main at ffffffffa0830808 [ptlrpc]
#22 [ffff88000e41dee8] kthread at ffffffff8109abf6
#23 [ffff88000e41df48] kernel_thread at ffffffff8100c20a

I wonder if the trace from ldlm_pools_{shrink,count}() is part of the problem here. The lustre logs on mds1 for the last 1 second before the OOM are just from these two functions:

$ grep :1412212833 1.df
00010000:00010000:1.0:1412212833.013589:0:14843:0:(ldlm_pool.c:1191:ldlm_pools_shrink()) Request to shrink 0 server locks from all pools
00010000:00010000:1.0:1412212833.013591:0:14843:0:(ldlm_pool.c:1057:ldlm_pools_count()) Request to count server locks from all pools
00010000:00010000:0.0:1412212833.013594:0:14836:0:(ldlm_pool.c:1191:ldlm_pools_shrink()) Request to shrink 0 server locks from all pools
00010000:00010000:1.0:1412212833.013594:0:14843:0:(ldlm_pool.c:1191:ldlm_pools_shrink()) Request to shrink 0 client locks from all pools
00010000:00010000:0.0:1412212833.013596:0:14836:0:(ldlm_pool.c:1057:ldlm_pools_count()) Request to count server locks from all pools
...
$ grep :1412212833 1.df | wc
  60378  583523 8190686

That's 88MB of ASCII. We could have run ourselves OOM with all of this shrinker chatter while trying to do an order-2 allocation, likely from the same OUT thread shown above.

#define OUT_UPDATE_REPLY_SIZE 8192

The core file is at /export/scratch/dumps/shadow-13vm12.shadow.whamcloud.com/10.1.4.157-2014-10-02-01:20:37/vmcore.
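
If the full-debug chatter really is what pinned the memory, a quick mitigation to try on a reproducer MDS would be to cap the debug buffer and drop the dlmtrace mask (judging by the 00010000 mask on the log lines above, the ldlm_pools_{shrink,count}() messages are dlmtrace). This is only a sketch; the size and mask values are illustrative:

# lctl get_param debug debug_mb      (current debug mask and per-node buffer size, in MB)
# lctl set_param debug_mb=64         (bound how much RAM the debug buffer can pin)
# lctl set_param debug=-dlmtrace     (silence the pool shrink/count messages)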

Comment by Dmitry Eremin (Inactive) [ 03/Oct/14 ]

Thanks John, that's a good catch. We enable full debug logging right before the test that hits the OOM:

start_full_debug_logging
run_test 37 "abort recovery before client does replay (test mds_cleanup_orphans)"
stop_full_debug_logging

test_38() {
...

I will try to reproduce this.
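
For reference, running just these two tests through the harness should exercise the same full-debug window (a sketch, assuming the usual in-tree test setup; ONLY is the standard test selector):

# cd lustre/tests
# ONLY="37 38" sh replay-single.sh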

Comment by John Hammond [ 03/Oct/14 ]

I noticed today that at the end of several tests we unload the modules on the client that runs the test script, but not on the other nodes. So, for example, there could be weird reply states left over from replay-mgs-triple or some other random test.

Comment by Andreas Dilger [ 03/Oct/14 ]

I've gotten some pretty useful information about what memory was allocated on my clients via:

# lctl set_param debug=+malloc debug_mb=256
# umount $LUSTRE
# lctl dk /tmp/umount.debug
# perl lustre/tests/leak_finder.pl /tmp/umount.debug > /tmp/umount.leak 2>&1
# awk '/Free without malloc/ { print $2,$5,$6,$9 }' /tmp/umount.leak | sort | uniq -c | sort -nr > /tmp/umount.hist

Comment by Andreas Dilger [ 21/Oct/14 ]

Closing as a duplicate of LU-5077.
