[LU-10846] OSS hits OOM-killer Created: 23/Mar/18  Updated: 29/Mar/18

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Cliff White (Inactive) Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None
Environment:

Soak stress cluster version=2.11.0_RC1


Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Running soak test, soak-4 was failed over. soak attempted to mount OSTs on soak-5. 

Soak-5 hit OOM killer, died

 

Mar 23 15:08:39 soak-5 systemd-logind: Removed session 10218.
Mar 23 15:09:28 soak-5 kernel: Lustre: 193003:0:(service.c:1364:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply#012 req@ffff8800ad21d050 x1595414941673184/t0(0) o4->d0120907-cbb4-c39b-57ca-9962ace8f63f@192.168.1.120@o2ib:513/0 lens 608/0 e 0 to 0 dl 1521817773 ref 2 fl New:/2/ffffffff rc 0/-1
Mar 23 15:09:28 soak-5 kernel: Lustre: 193003:0:(service.c:1364:ptlrpc_at_send_early_reply()) Skipped 467 previous similar messages
Mar 23 15:09:34 soak-5 kernel: Lustre: soaked-OST0015: Client 5fa89b09-17bb-b4fc-96c2-e93717c840d0 (at 192.168.1.141@o2ib) reconnecting
Mar 23 15:09:34 soak-5 kernel: Lustre: soaked-OST0015: Connection restored to (at 192.168.1.130@o2ib)
Mar 23 15:09:34 soak-5 kernel: Lustre: Skipped 73 previous similar messages
Mar 23 15:09:34 soak-5 kernel: Lustre: Skipped 70 previous similar messages
Mar 23 15:09:47 soak-5 sshd[145991]: Accepted publickey for root from 10.10.1.116 port 51278 ssh2: RSA SHA256:VGwjPuk53LIsLKjhGizbClh9X4HNRiAOs+XaQdKAWxM
Mar 23 15:09:48 soak-5 systemd: Started Session 10219 of user root.
Mar 23 15:09:28 soak-5 kernel: Lustre: 193003:0:(service.c:1364:ptlrpc_at_send_early_reply()) Skipped 467 previous similar messages
Mar 23 15:09:34 soak-5 kernel: Lustre: soaked-OST0015: Client 5fa89b09-17bb-b4fc-96c2-e93717c840d0 (at 192.168.1.141@o2ib) reconnecting
Mar 23 15:09:34 soak-5 kernel: Lustre: soaked-OST0015: Connection restored to (at 192.168.1.130@o2ib)
Mar 23 15:09:34 soak-5 kernel: Lustre: Skipped 73 previous similar messages
Mar 23 15:09:34 soak-5 kernel: Lustre: Skipped 70 previous similar messages
Mar 23 15:09:47 soak-5 sshd[145991]: Accepted publickey for root from 10.10.1.116 port 51278 ssh2: RSA SHA256:VGwjPuk53LIsLKjhGizbClh9X4HNRiAOs+XaQdKAWxM
Mar 23 15:09:48 soak-5 systemd: Started Session 10219 of user root.
Mar 23 15:09:49 soak-5 sshd[145991]: pam_unix(sshd:session): session opened for user root by (uid=0)
Mar 23 15:09:56 soak-5 kernel: ll_ost_out01_01 invoked oom-killer: gfp_mask=0x82d2, order=0, oom_score_adj=0
Mar 23 15:09:59 soak-5 kernel: ll_ost_out01_01 cpuset=/ mems_allowed=1
Mar 23 15:09:59 soak-5 kernel: CPU: 8 PID: 25625 Comm: ll_ost_out01_01 Tainted: P OE ------------ 3.10.0-693.21.1.el7_lustre.x86_64 #1
Mar 23 15:09:59 soak-5 kernel: Hardware name: Intel Corporation SandyBridge Platform/To be filled by O.E.M., BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
Mar 23 15:09:59 soak-5 kernel: Call Trace:
Mar 23 15:09:59 soak-5 kernel: [<ffffffff816ae7c8>] dump_stack+0x19/0x1b
Mar 23 15:09:59 soak-5 kernel: [<ffffffff816a9b90>] dump_header+0x90/0x229
Mar 23 15:09:59 soak-5 kernel: [<ffffffff810ecec2>] ? ktime_get_ts64+0x52/0xf0
Mar 23 15:09:59 soak-5 kernel: [<ffffffff8114140f>] ? delayacct_end+0x8f/0xb0
Mar 23 15:09:59 soak-5 kernel: [<ffffffff8118a884>] oom_kill_process+0x254/0x3d0
Mar 23 15:09:59 soak-5 kernel: [<ffffffff8118a32d>] ? oom_unkillable_task+0xcd/0x120
Mar 23 15:09:59 soak-5 kernel: [<ffffffff8118a3d6>] ? find_lock_task_mm+0x56/0xc0
Mar 23 15:09:59 soak-5 kernel: [<ffffffff8118b0c6>] out_of_memory+0x4b6/0x4f0
Mar 23 15:09:59 soak-5 kernel: [<ffffffff816aa694>] __alloc_pages_slowpath+0x5d6/0x724
Mar 23 15:10:00 soak-5 kernel: [<ffffffff811912a5>] __alloc_pages_nodemask+0x405/0x420
Mar 23 15:10:00 soak-5 kernel: [<ffffffff811c5dcf>] __vmalloc_node_range+0x12f/0x270
Mar 23 15:10:00 soak-5 kernel: [<ffffffffc0fca0ab>] ? ptlrpc_alloc_rqbd+0x1eb/0x590 [ptlrpc]
Mar 23 15:10:00 soak-5 kernel: [<ffffffff811c5fa7>] vzalloc_node+0x47/0x50
Mar 23 15:10:00 soak-5 kernel: [<ffffffffc0fca0ab>] ? ptlrpc_alloc_rqbd+0x1eb/0x590 [ptlrpc]
Mar 23 15:10:00 soak-5 kernel: [<ffffffffc0fca0ab>] ptlrpc_alloc_rqbd+0x1eb/0x590 [ptlrpc]
Mar 23 15:10:00 soak-5 kernel: [<ffffffffc0fca547>] ptlrpc_grow_req_bufs+0xf7/0x2b0 [ptlrpc]
Mar 23 15:10:00 soak-5 kernel: [<ffffffffc0fd22fd>] ptlrpc_main+0xdcd/0x1e40 [ptlrpc]
Mar 23 15:10:00 soak-5 kernel: [<ffffffffc0fd1530>] ? ptlrpc_register_service+0xe90/0xe90 [ptlrpc]
Mar 23 15:10:00 soak-5 kernel: [<ffffffff810b4031>] kthread+0xd1/0xe0
Mar 23 15:10:00 soak-5 kernel: [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
Mar 23 15:10:00 soak-5 kernel: [<ffffffff816c0577>] ret_from_fork+0x77/0xb0
Mar 23 15:10:00 soak-5 kernel: [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
Mar 23 15:10:00 soak-5 kernel: Mem-Info:
Mar 23 15:10:00 soak-5 kernel: active_anon:0 inactive_anon:27 isolated_anon:0#012 active_file:146 inactive_file:205 isolated_file:0#012 unevictable:6842 dirty:0 writeback:1 unstable:0#012 slab_reclaimable:10265 slab_unreclaimable:347309#012 mapped:1568 shmem:0 pagetables:1894 bounce:0#012 free:39311 free_pcp:0 free_cma:0
Mar 23 15:10:00 soak-5 kernel: Node 1 Normal free:45692kB min:45728kB low:57160kB high:68592kB active_anon:0kB inactive_anon:40kB active_file:296kB inactive_file:512kB unevictable:25532kB isolated(anon):0kB isolated(file):0kB present:16777216kB managed:16497548kB mlocked:25532kB dirty:0kB writeback:4kB mapped:4388kB shmem:0kB slab_reclaimable:18588kB slab_unreclaimable:847360kB kernel_stack:16784kB pagetables:3576kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:1673 all_unreclaimable? yes
Mar 23 15:10:00 soak-5 kernel: lowmem_reserve[]: 0 0 0 0
Mar 23 15:10:00 soak-5 kernel: Node 1 Normal: 921*4kB (EM) 593*8kB (EM) 338*16kB (UEM) 249*32kB (UEM) 360*64kB (UEM) 9*128kB (EM) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 45996kB
Mar 23 15:10:02 soak-5 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Mar 23 15:10:02 soak-5 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Mar 23 15:10:02 soak-5 kernel: Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Mar 23 15:10:02 soak-5 kernel: Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Mar 23 15:10:02 soak-5 kernel: 1960 total pagecache pages
Mar 23 15:10:02 soak-5 kernel: 10 pages in swap cache
Mar 23 15:10:03 soak-5 kernel: Swap cache stats: add 262398, delete 262388, find 107033/136376
Mar 23 15:10:05 soak-5 kernel: Free swap = 16128112kB
Mar 23 15:10:05 soak-5 kernel: Total swap = 16319484kB
Mar 23 15:10:05 soak-5 kernel: 8369063 pages RAM
Mar 23 15:10:05 soak-5 kernel: 0 pages HighMem/MovableOnly
Mar 23 15:10:05 soak-5 kernel: 241758 pages reserved
Mar 23 15:10:05 soak-5 kernel: [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
Mar 23 15:10:05 soak-5 kernel: [ 1012] 0 1012 27640 208 60 89 0 systemd-journal
Mar 23 15:10:05 soak-5 kernel: [ 1038] 0 1038 29712 215 28 85 0 lvmetad
Mar 23 15:10:05 soak-5 kernel: [ 1058] 0 1058 11020 233 24 163 -1000 systemd-udevd
Mar 23 15:10:05 soak-5 kernel: [ 1078] 0 1078 1572324 3746 130 0 -1000 multipathd
Mar 23 15:10:05 soak-5 kernel: [ 2525] 0 2525 13863 131 28 131 -1000 auditd
Mar 23 15:10:05 soak-5 kernel: [ 2551] 999 2551 136594 260 61 1642 0 polkitd
Mar 23 15:10:05 soak-5 kernel: [ 2556] 0 2556 64919 323 78 293 0 sssd
Mar 23 15:10:05 soak-5 kernel: [ 2557] 81 2557 9197 196 19 137 -900 dbus-daemon
Mar 23 15:10:05 soak-5 kernel: [ 2698] 0 2698 118079 296 82 746 0 NetworkManager
Mar 23 15:10:05 soak-5 kernel: [ 2741] 0 2741 5461 187 15 147 0 irqbalance
Mar 23 15:10:05 soak-5 kernel: [ 2742] 0 2742 85798 251 87 1978 0 rsyslogd
Mar 23 15:10:06 soak-5 kernel: [ 2891] 0 2891 1770 27 8 37 0 mcelog
Mar 23 15:10:06 soak-5 kernel: [ 2899] 998 2899 29553 230 27 133 0 chronyd
Mar 23 15:10:06 soak-5 kernel: [ 2900] 0 2900 67755 195 45 192 0 gssproxy
Mar 23 15:10:06 soak-5 kernel: [ 2910] 0 2910 93403 329 125 648 0 sssd_be
Mar 23 15:10:06 soak-5 kernel: [ 3152] 0 3152 66912 286 86 815 0 sssd_nss
Mar 23 15:10:06 soak-5 kernel: [ 3153] 0 3153 61252 272 75 256 0 sssd_pam
Mar 23 15:10:06 soak-5 kernel: [ 3154] 0 3154 59613 270 73 222 0 sssd_ssh
Mar 23 15:10:06 soak-5 kernel: [ 3155] 0 3155 68163 270 85 350 0 sssd_pac
Mar 23 15:10:06 soak-5 kernel: [ 3172] 0 3172 6051 182 15 79 0 systemd-logind
Mar 23 15:10:06 soak-5 kernel: [ 3675] 0 3675 26499 278 53 255 -1000 sshd
Mar 23 15:10:06 soak-5 kernel: [ 3676] 996 3676 24211 167 21 140 0 munged
Mar 23 15:10:06 soak-5 kernel: [ 3681] 0 3681 6779 147 18 63 0 xinetd
Mar 23 15:10:06 soak-5 kernel: [ 3687] 0 3687 140597 287 95 2760 0 tuned
Mar 23 15:10:06 soak-5 kernel: [ 3847] 0 3847 22907 173 44 280 0 master
Mar 23 15:10:06 soak-5 kernel: [ 3859] 89 3859 25491 214 47 258 0 qmgr
Mar 23 15:10:06 soak-5 kernel: [ 3879] 0 3879 151288 214 69 407 0 automount
Mar 23 15:10:06 soak-5 kernel: [ 3889] 0 3889 6464 155 19 54 0 atd
Mar 23 15:10:06 soak-5 kernel: [ 3907] 0 3907 31558 200 20 163 0 crond
Mar 23 15:10:06 soak-5 kernel: [ 3960] 0 3960 27511 156 13 31 0 agetty
Mar 23 15:10:06 soak-5 kernel: [ 3961] 0 3961 27511 160 10 31 0 agetty
Mar 23 15:10:06 soak-5 kernel: [111720] 0 111720 40526 300 79 326 0 sshd
Mar 23 15:10:06 soak-5 kernel: [111722] 0 111722 30868 147 18 46 0 mount
Mar 23 15:10:06 soak-5 kernel: [111740] 0 111740 20431 242 43 261 0 mount.lustre
Mar 23 15:10:06 soak-5 kernel: [140178] 89 140178 25474 219 44 261 0 pickup
Mar 23 15:10:06 soak-5 kernel: [145991] 0 145991 40526 338 76 324 0 sshd
Mar 23 15:10:06 soak-5 kernel: [145993] 0 145993 28275 122 13 38 0 bash
Mar 23 15:10:06 soak-5 kernel: Out of memory: Kill process 3687 (tuned) score 0 or sacrifice child
Mar 23 15:10:06 soak-5 kernel: Killed process 3687 (tuned) total-vm:562388kB, anon-rss:0kB, file-rss:1148kB, shmem-rss:0kB


Logs are available on soak



 Comments   
Comment by Cliff White (Inactive) [ 29/Mar/18 ]

From observing slabtob, kmalloc appears to be the most used, example:

 Active / Total Objects (% used)    : 1449782 / 1623717 (89.3%)
 Active / Total Slabs (% used)      : 32986 / 32986 (100.0%)
 Active / Total Caches (% used)     : 100 / 133 (75.2%)
 Active / Total Size (% used)       : 487113.80K / 521990.49K (93.3%)
 Minimum / Average / Maximum Object : 0.01K / 0.32K / 16.00K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME                   
 25440  25272  99%    4.00K   3180        8    101760K kmalloc-4096
 10000   9979  99%    8.00K   2500        4     80000K kmalloc-8192
 28512  26032  91%    1.00K    891       32     28512K kmalloc-1024
 50880  28439  55%    0.50K    795       64     25440K kmalloc-512
 54912  34822  63%    0.33K   1144       48     18304K arc_buf_hdr_t_full
 30294  30294 100%    0.59K    561       54     17952K inode_cache
 52864  51149  96%    0.25K    826       64     13216K kmalloc-256
114444 114165  99%    0.11K   3179       36     12716K kernfs_node_cache
 65604  65592  99%    0.19K   1562       42     12496K dentry
  6144   5350  87%    2.00K    384       16     12288K kmalloc-2048

Continuing to observe

Generated at Sat Feb 10 02:38:42 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.