Details
- Type: Bug
- Resolution: Unresolved
- Priority: Minor
- Fix Version/s: None
- Affects Version/s: Lustre 2.10.7
- Labels: None
- Environment: SLES client failover testing
- Severity: 3
- Rank: 9223372036854775807
Description
recovery-mds-scale test_failover_mds crashes with an OOM during SLES client failover testing.
Looking at the suite_log for the failed test suite https://testing.whamcloud.com/test_sets/87ea5270-429d-11e9-a256-52540065bddc, we can see that one MDS failover takes place and that the client loads are being checked after the failover. The last thing seen in the suite_log is:
Started lustre-MDT0000
==== Checking the clients loads AFTER failover -- failure NOT OK
01:14:58 (1552122898) waiting for trevis-34vm7 network 5 secs ...
01:14:58 (1552122898) network interface is UP
CMD: trevis-34vm7 rc=0; val=\$(/usr/sbin/lctl get_param -n catastrophe 2>&1); if [[ \$? -eq 0 && \$val -ne 0 ]]; then echo \$(hostname -s): \$val; rc=\$val; fi; exit \$rc
CMD: trevis-34vm7 ps auxwww | grep -v grep | grep -q run_dd.sh
01:14:58 (1552122898) waiting for trevis-34vm8 network 5 secs ...
01:14:58 (1552122898) network interface is UP
CMD: trevis-34vm8 rc=0; val=\$(/usr/sbin/lctl get_param -n catastrophe 2>&1); if [[ \$? -eq 0 && \$val -ne 0 ]]; then echo \$(hostname -s): \$val; rc=\$val; fi; exit \$rc
CMD: trevis-34vm8 ps auxwww | grep -v grep | grep -q run_tar.sh
mds1 has failed over 1 times, and counting...
sleeping 1125 seconds...
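For readability, the per-client check the harness is running above amounts to the following; this is a sketch reconstructed from the escaped CMD lines in the suite_log (the two remote commands folded into one script), not the exact test-framework code:

# 1) Fail if the client has tripped the Lustre "catastrophe" flag (set after an LBUG):
rc=0
val=$(/usr/sbin/lctl get_param -n catastrophe 2>&1)
if [[ $? -eq 0 && $val -ne 0 ]]; then
    echo $(hostname -s): $val
    rc=$val
fi
# 2) Fail if the client load script is no longer running
#    (run_dd.sh on trevis-34vm7, run_tar.sh on trevis-34vm8):
ps auxwww | grep -v grep | grep -q run_dd.sh || rc=1
exit $rc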
Looking at the kernel crash, we see:
[ 1358.821727] jbd2/vda1-8 invoked oom-killer: gfp_mask=0x1420848(GFP_NOFS|__GFP_NOFAIL|__GFP_HARDWALL|__GFP_MOVABLE), nodemask=0, order=0, oom_score_adj=0
[ 1358.821760] jbd2/vda1-8 cpuset=/ mems_allowed=0
[ 1358.821775] CPU: 0 PID: 273 Comm: jbd2/vda1-8 Tainted: G OE N 4.4.162-94.69-default #1
[ 1358.821775] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 1358.821782] 0000000000000000 ffffffff8132cdc0 ffff88003733fb88 0000000000000000
[ 1358.821784] ffffffff8120b20e 0000000000000000 0000000000000000 0000000000000000
[ 1358.821785] 0000000000000000 ffffffff810a1fb7 ffffffff81e9aa20 0000000000000000
[ 1358.821786] Call Trace:
[ 1358.821886] [<ffffffff81019b09>] dump_trace+0x59/0x340
[ 1358.821894] [<ffffffff81019eda>] show_stack_log_lvl+0xea/0x170
[ 1358.821897] [<ffffffff8101acb1>] show_stack+0x21/0x40
[ 1358.821921] [<ffffffff8132cdc0>] dump_stack+0x5c/0x7c
[ 1358.821953] [<ffffffff8120b20e>] dump_header+0x82/0x215
[ 1358.821981] [<ffffffff81199d39>] check_panic_on_oom+0x29/0x50
[ 1358.821993] [<ffffffff81199eda>] out_of_memory+0x17a/0x4a0
[ 1358.822000] [<ffffffff8119e849>] __alloc_pages_nodemask+0xa19/0xb70
[ 1358.822019] [<ffffffff811e6caf>] alloc_pages_current+0x7f/0x100
[ 1358.822036] [<ffffffff81196dfd>] pagecache_get_page+0x4d/0x1c0
[ 1358.822046] [<ffffffff812443ce>] __getblk_slow+0xce/0x2e0
[ 1358.822106] [<ffffffffa01bda15>] jbd2_journal_get_descriptor_buffer+0x35/0x90 [jbd2]
[ 1358.822127] [<ffffffffa01b689d>] jbd2_journal_commit_transaction+0x8ed/0x1970 [jbd2]
[ 1358.822136] [<ffffffffa01bb3b2>] kjournald2+0xb2/0x260 [jbd2]
[ 1358.822150] [<ffffffff810a0e29>] kthread+0xc9/0xe0
[ 1358.822190] [<ffffffff8161e1f5>] ret_from_fork+0x55/0x80
[ 1358.825553] DWARF2 unwinder stuck at ret_from_fork+0x55/0x80
[ 1358.825554]
[ 1358.825558] Leftover inexact backtrace:
[ 1358.825573] [<ffffffff810a0d60>] ? kthread_park+0x50/0x50
[ 1358.825583] Mem-Info:
[ 1358.825592] active_anon:1696 inactive_anon:1761 isolated_anon:0 active_file:219719 inactive_file:219874 isolated_file:0 unevictable:20 dirty:0 writeback:0 unstable:0 slab_reclaimable:2700 slab_unreclaimable:17655 mapped:5110 shmem:2179 pagetables:987 bounce:0 free:4229 free_pcp:48 free_cma:0
[ 1358.825612] Node 0 DMA free:7480kB min:376kB low:468kB high:560kB active_anon:0kB inactive_anon:100kB active_file:3640kB inactive_file:3680kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:172kB shmem:100kB slab_reclaimable:32kB slab_unreclaimable:592kB kernel_stack:48kB pagetables:60kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:2866048 all_unreclaimable? yes
[ 1358.825613] lowmem_reserve[]: 0 1843 1843 1843 1843
[ 1358.825622] Node 0 DMA32 free:9436kB min:44676kB low:55844kB high:67012kB active_anon:6784kB inactive_anon:6944kB active_file:875236kB inactive_file:875816kB unevictable:80kB isolated(anon):0kB isolated(file):0kB present:2080744kB managed:1900752kB mlocked:80kB dirty:0kB writeback:0kB mapped:20268kB shmem:8616kB slab_reclaimable:10768kB slab_unreclaimable:70028kB kernel_stack:2656kB pagetables:3888kB unstable:0kB bounce:0kB free_pcp:192kB local_pcp:120kB free_cma:0kB writeback_tmp:0kB pages_scanned:17166780 all_unreclaimable? yes
[ 1358.825624] lowmem_reserve[]: 0 0 0 0 0
[ 1358.825639] Node 0 DMA: 2*4kB (E) 4*8kB (ME) 1*16kB (U) 2*32kB (UM) 3*64kB (UME) 4*128kB (UME) 2*256kB (UE) 2*512kB (ME) 1*1024kB (E) 2*2048kB (ME) 0*4096kB = 7480kB
[ 1358.825645] Node 0 DMA32: 481*4kB (UME) 239*8kB (UME) 78*16kB (ME) 26*32kB (ME) 11*64kB (ME) 22*128kB (UE) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 9436kB
[ 1358.825659] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[ 1358.825676] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 1358.825677] 7018 total pagecache pages
[ 1358.825678] 27 pages in swap cache
[ 1358.825682] Swap cache stats: add 6097, delete 6070, find 51/94
[ 1358.825683] Free swap = 14314056kB
[ 1358.825683] Total swap = 14338044kB
[ 1358.825684] 524184 pages RAM
[ 1358.825684] 0 pages HighMem/MovableOnly
[ 1358.825685] 45020 pages reserved
[ 1358.825685] 0 pages hwpoisoned
[ 1358.825685] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
[ 1358.825896] [ 349] 0 349 10933 1003 22 3 1114 0 systemd-journal
[ 1358.825902] [ 413] 495 413 13124 890 29 4 111 0 rpcbind
[ 1358.825913] [ 415] 0 415 9267 660 20 3 214 -1000 systemd-udevd
[ 1358.825919] [ 462] 0 462 4814 455 14 3 38 0 irqbalance
[ 1358.825921] [ 464] 0 464 29706 1245 59 4 197 0 sssd
[ 1358.825934] [ 476] 499 476 13452 785 28 3 150 -900 dbus-daemon
[ 1358.825943] [ 535] 0 535 7447 1016 19 3 261 0 wickedd-dhcp6
[ 1358.825945] [ 554] 0 554 36530 1435 70 3 264 0 sssd_be
[ 1358.825953] [ 563] 0 563 7448 1054 20 3 265 0 wickedd-dhcp4
[ 1358.825962] [ 564] 0 564 7448 1018 20 3 261 0 wickedd-auto4
[ 1358.825970] [ 565] 0 565 84317 749 37 4 259 0 rsyslogd
[ 1358.825972] [ 571] 0 571 31711 1112 65 3 175 0 sssd_nss
[ 1358.825974] [ 572] 0 572 26059 1058 55 3 169 0 sssd_pam
[ 1358.825980] [ 573] 0 573 24977 1041 51 3 161 0 sssd_ssh
[ 1358.826054] [ 761] 0 761 7480 1030 18 3 299 0 wickedd
[ 1358.826060] [ 764] 0 764 7455 974 21 3 276 0 wickedd-nanny
[ 1358.826069] [ 1418] 0 1418 2141 455 9 3 26 0 xinetd
[ 1358.826077] [ 1464] 0 1464 16586 1551 37 3 181 -1000 sshd
[ 1358.826079] [ 1477] 74 1477 8408 842 18 3 131 0 ntpd
[ 1358.826094] [ 1490] 74 1490 9461 497 21 3 150 0 ntpd
[ 1358.826099] [ 1511] 493 1511 55352 609 21 3 149 0 munged
[ 1358.826107] [ 1527] 0 1527 1664 365 8 3 29 0 agetty
[ 1358.826115] [ 1529] 0 1529 1664 407 9 3 29 0 agetty
[ 1358.826117] [ 1536] 0 1536 147212 1080 60 3 346 0 automount
[ 1358.826125] [ 1611] 0 1611 5513 494 16 3 65 0 systemd-logind
[ 1358.826127] [ 1828] 0 1828 8861 812 20 3 98 0 master
[ 1358.826130] [ 1853] 51 1853 12439 1000 24 3 106 0 pickup
[ 1358.826135] [ 1854] 51 1854 12529 1309 25 3 173 0 qmgr
[ 1358.826143] [ 1883] 0 1883 5197 531 17 3 144 0 cron
[ 1358.826250] [15652] 0 15652 17465 844 35 4 0 0 in.mrshd
[ 1358.826258] [15653] 0 15653 2894 653 11 3 0 0 bash
[ 1358.826266] [15658] 0 15658 2894 492 11 3 0 0 bash
[ 1358.826274] [15659] 0 15659 3034 755 10 3 0 0 run_dd.sh
[ 1358.826313] [16412] 51 16412 16918 1924 32 3 0 0 smtp
[ 1358.826314] [16422] 0 16422 1062 198 8 3 0 0 dd
[ 1358.826316] [16425] 51 16425 12447 1130 23 3 0 0 bounce
[ 1358.826322] Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled
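The reason the OOM produces a full node crash rather than a single killed process is the last line above: panic_on_oom is enabled system-wide on the client. For anyone who wants to confirm or temporarily relax that setting while debugging, the standard kernel knobs are shown below; this is generic sysctl usage, not something the test framework is known to change:

# Show the current setting (0 = oom-kill a task, 1/2 = panic the node on a system-wide OOM):
sysctl vm.panic_on_oom
cat /proc/sys/vm/panic_on_oom

# Temporarily disable the panic so the node survives an OOM for inspection:
sysctl -w vm.panic_on_oom=0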
We have seen several OOM kernel crashes for SLES in past testing, such as LU-10319 and LU-9601.
Attachments
Issue Links
- is related to:
  - LU-11410 recovery-mds-scale test failover_mds crashes with ‘ntpd invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE)’ (Open)
  - LU-9601 recovery-mds-scale test_failover_mds: test_failover_mds returned 1 (Reopened)
  - LU-10319 recovery-random-scale, test_fail_client_mds: test_fail_client_mds returned 4 (Resolved)