[LU-12067] recovery-mds-scale test failover_mds crashes with OOM
Created: 13/Mar/19  Updated: 03/Feb/23
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.7 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Nunez (Inactive) | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Environment: | SLES client failover testing |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |

recovery-mds-scale test_failover_mds crashes with OOM for SLES client failover testing. Looking at the suite_log for the failed test suite https://testing.whamcloud.com/test_sets/87ea5270-429d-11e9-a256-52540065bddc, we can see that one MDS failover takes place and the clients' loads are then checked after the failover. The last thing seen in the suite_log is

Started lustre-MDT0000
==== Checking the clients loads AFTER failover -- failure NOT OK
01:14:58 (1552122898) waiting for trevis-34vm7 network 5 secs ...
01:14:58 (1552122898) network interface is UP
CMD: trevis-34vm7 rc=0; val=\$(/usr/sbin/lctl get_param -n catastrophe 2>&1); if [[ \$? -eq 0 && \$val -ne 0 ]]; then echo \$(hostname -s): \$val; rc=\$val; fi; exit \$rc
CMD: trevis-34vm7 ps auxwww | grep -v grep | grep -q run_dd.sh
01:14:58 (1552122898) waiting for trevis-34vm8 network 5 secs ...
01:14:58 (1552122898) network interface is UP
CMD: trevis-34vm8 rc=0; val=\$(/usr/sbin/lctl get_param -n catastrophe 2>&1); if [[ \$? -eq 0 && \$val -ne 0 ]]; then echo \$(hostname -s): \$val; rc=\$val; fi; exit \$rc
CMD: trevis-34vm8 ps auxwww | grep -v grep | grep -q run_tar.sh
mds1 has failed over 1 times, and counting...
sleeping 1125 seconds...

Looking at the kernel crash, we see

[ 1358.821727] jbd2/vda1-8 invoked oom-killer: gfp_mask=0x1420848(GFP_NOFS|__GFP_NOFAIL|__GFP_HARDWALL|__GFP_MOVABLE), nodemask=0, order=0, oom_score_adj=0
[ 1358.821760] jbd2/vda1-8 cpuset=/ mems_allowed=0
[ 1358.821775] CPU: 0 PID: 273 Comm: jbd2/vda1-8 Tainted: G OE N 4.4.162-94.69-default #1
[ 1358.821775] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 1358.821782] 0000000000000000 ffffffff8132cdc0 ffff88003733fb88 0000000000000000
[ 1358.821784] ffffffff8120b20e 0000000000000000 0000000000000000 0000000000000000
[ 1358.821785] 0000000000000000 ffffffff810a1fb7 ffffffff81e9aa20 0000000000000000
[ 1358.821786] Call Trace:
[ 1358.821886] [<ffffffff81019b09>] dump_trace+0x59/0x340
[ 1358.821894] [<ffffffff81019eda>] show_stack_log_lvl+0xea/0x170
[ 1358.821897] [<ffffffff8101acb1>] show_stack+0x21/0x40
[ 1358.821921] [<ffffffff8132cdc0>] dump_stack+0x5c/0x7c
[ 1358.821953] [<ffffffff8120b20e>] dump_header+0x82/0x215
[ 1358.821981] [<ffffffff81199d39>] check_panic_on_oom+0x29/0x50
[ 1358.821993] [<ffffffff81199eda>] out_of_memory+0x17a/0x4a0
[ 1358.822000] [<ffffffff8119e849>] __alloc_pages_nodemask+0xa19/0xb70
[ 1358.822019] [<ffffffff811e6caf>] alloc_pages_current+0x7f/0x100
[ 1358.822036] [<ffffffff81196dfd>] pagecache_get_page+0x4d/0x1c0
[ 1358.822046] [<ffffffff812443ce>] __getblk_slow+0xce/0x2e0
[ 1358.822106] [<ffffffffa01bda15>] jbd2_journal_get_descriptor_buffer+0x35/0x90 [jbd2]
[ 1358.822127] [<ffffffffa01b689d>] jbd2_journal_commit_transaction+0x8ed/0x1970 [jbd2]
[ 1358.822136] [<ffffffffa01bb3b2>] kjournald2+0xb2/0x260 [jbd2]
[ 1358.822150] [<ffffffff810a0e29>] kthread+0xc9/0xe0
[ 1358.822190] [<ffffffff8161e1f5>] ret_from_fork+0x55/0x80
[ 1358.825553] DWARF2 unwinder stuck at ret_from_fork+0x55/0x80
[ 1358.825554]
[ 1358.825558] Leftover inexact backtrace:
[ 1358.825573] [<ffffffff810a0d60>] ? kthread_park+0x50/0x50
[ 1358.825583] Mem-Info:
[ 1358.825592] active_anon:1696 inactive_anon:1761 isolated_anon:0
active_file:219719 inactive_file:219874 isolated_file:0
unevictable:20 dirty:0 writeback:0 unstable:0
slab_reclaimable:2700 slab_unreclaimable:17655
mapped:5110 shmem:2179 pagetables:987 bounce:0
free:4229 free_pcp:48 free_cma:0
[ 1358.825612] Node 0 DMA free:7480kB min:376kB low:468kB high:560kB active_anon:0kB inactive_anon:100kB active_file:3640kB inactive_file:3680kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:172kB shmem:100kB slab_reclaimable:32kB slab_unreclaimable:592kB kernel_stack:48kB pagetables:60kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:2866048 all_unreclaimable? yes
[ 1358.825613] lowmem_reserve[]: 0 1843 1843 1843 1843
[ 1358.825622] Node 0 DMA32 free:9436kB min:44676kB low:55844kB high:67012kB active_anon:6784kB inactive_anon:6944kB active_file:875236kB inactive_file:875816kB unevictable:80kB isolated(anon):0kB isolated(file):0kB present:2080744kB managed:1900752kB mlocked:80kB dirty:0kB writeback:0kB mapped:20268kB shmem:8616kB slab_reclaimable:10768kB slab_unreclaimable:70028kB kernel_stack:2656kB pagetables:3888kB unstable:0kB bounce:0kB free_pcp:192kB local_pcp:120kB free_cma:0kB writeback_tmp:0kB pages_scanned:17166780 all_unreclaimable? yes
[ 1358.825624] lowmem_reserve[]: 0 0 0 0 0
[ 1358.825639] Node 0 DMA: 2*4kB (E) 4*8kB (ME) 1*16kB (U) 2*32kB (UM) 3*64kB (UME) 4*128kB (UME) 2*256kB (UE) 2*512kB (ME) 1*1024kB (E) 2*2048kB (ME) 0*4096kB = 7480kB
[ 1358.825645] Node 0 DMA32: 481*4kB (UME) 239*8kB (UME) 78*16kB (ME) 26*32kB (ME) 11*64kB (ME) 22*128kB (UE) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 9436kB
[ 1358.825659] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[ 1358.825676] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 1358.825677] 7018 total pagecache pages
[ 1358.825678] 27 pages in swap cache
[ 1358.825682] Swap cache stats: add 6097, delete 6070, find 51/94
[ 1358.825683] Free swap = 14314056kB
[ 1358.825683] Total swap = 14338044kB
[ 1358.825684] 524184 pages RAM
[ 1358.825684] 0 pages HighMem/MovableOnly
[ 1358.825685] 45020 pages reserved
[ 1358.825685] 0 pages hwpoisoned
[ 1358.825685] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
[ 1358.825896] [ 349] 0 349 10933 1003 22 3 1114 0 systemd-journal
[ 1358.825902] [ 413] 495 413 13124 890 29 4 111 0 rpcbind
[ 1358.825913] [ 415] 0 415 9267 660 20 3 214 -1000 systemd-udevd
[ 1358.825919] [ 462] 0 462 4814 455 14 3 38 0 irqbalance
[ 1358.825921] [ 464] 0 464 29706 1245 59 4 197 0 sssd
[ 1358.825934] [ 476] 499 476 13452 785 28 3 150 -900 dbus-daemon
[ 1358.825943] [ 535] 0 535 7447 1016 19 3 261 0 wickedd-dhcp6
[ 1358.825945] [ 554] 0 554 36530 1435 70 3 264 0 sssd_be
[ 1358.825953] [ 563] 0 563 7448 1054 20 3 265 0 wickedd-dhcp4
[ 1358.825962] [ 564] 0 564 7448 1018 20 3 261 0 wickedd-auto4
[ 1358.825970] [ 565] 0 565 84317 749 37 4 259 0 rsyslogd
[ 1358.825972] [ 571] 0 571 31711 1112 65 3 175 0 sssd_nss
[ 1358.825974] [ 572] 0 572 26059 1058 55 3 169 0 sssd_pam
[ 1358.825980] [ 573] 0 573 24977 1041 51 3 161 0 sssd_ssh
[ 1358.826054] [ 761] 0 761 7480 1030 18 3 299 0 wickedd
[ 1358.826060] [ 764] 0 764 7455 974 21 3 276 0 wickedd-nanny
[ 1358.826069] [ 1418] 0 1418 2141 455 9 3 26 0 xinetd
[ 1358.826077] [ 1464] 0 1464 16586 1551 37 3 181 -1000 sshd
[ 1358.826079] [ 1477] 74 1477 8408 842 18 3 131 0 ntpd
[ 1358.826094] [ 1490] 74 1490 9461 497 21 3 150 0 ntpd
[ 1358.826099] [ 1511] 493 1511 55352 609 21 3 149 0 munged
[ 1358.826107] [ 1527] 0 1527 1664 365 8 3 29 0 agetty
[ 1358.826115] [ 1529] 0 1529 1664 407 9 3 29 0 agetty
[ 1358.826117] [ 1536] 0 1536 147212 1080 60 3 346 0 automount
[ 1358.826125] [ 1611] 0 1611 5513 494 16 3 65 0 systemd-logind
[ 1358.826127] [ 1828] 0 1828 8861 812 20 3 98 0 master
[ 1358.826130] [ 1853] 51 1853 12439 1000 24 3 106 0 pickup
[ 1358.826135] [ 1854] 51 1854 12529 1309 25 3 173 0 qmgr
[ 1358.826143] [ 1883] 0 1883 5197 531 17 3 144 0 cron
[ 1358.826250] [15652] 0 15652 17465 844 35 4 0 0 in.mrshd
[ 1358.826258] [15653] 0 15653 2894 653 11 3 0 0 bash
[ 1358.826266] [15658] 0 15658 2894 492 11 3 0 0 bash
[ 1358.826274] [15659] 0 15659 3034 755 10 3 0 0 run_dd.sh
[ 1358.826313] [16412] 51 16412 16918 1924 32 3 0 0 smtp
[ 1358.826314] [16422] 0 16422 1062 198 8 3 0 0 dd
[ 1358.826316] [16425] 51 16425 12447 1130 23 3 0 0 bounce
[ 1358.826322] Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled
We have seen several OOM kernel crashes for SLES in past testing.
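
For reference, here is a minimal sketch of the per-client health check the suite is running, reconstructed from the CMD lines in the suite_log above. The hostnames are the test VMs from the log; running the commands over ssh is an assumption for illustration only (the harness dispatches them over its own remote-shell transport, mrsh, judging by in.mrshd in the process table):

    #!/bin/bash
    # Check each client the way the suite_log shows: first the Lustre
    # "catastrophe" flag (nonzero after an LBUG/panic on the client),
    # then whether the client load script is still running.
    for client in trevis-34vm7 trevis-34vm8; do
        ssh "$client" 'rc=0; val=$(/usr/sbin/lctl get_param -n catastrophe 2>&1);
            if [[ $? -eq 0 && $val -ne 0 ]]; then
                echo $(hostname -s): $val; rc=$val;
            fi; exit $rc' || echo "$client: catastrophe flag set"
        ssh "$client" 'ps auxwww | grep -v grep | grep -q run_dd.sh' \
            || echo "$client: client load (run_dd.sh) no longer running"
    done

In the failed run above, both clients pass this check after the first failover; the OOM panic happens later, during the 1125-second sleep before the next failover.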
| Comments |
| Comment by James Nunez (Inactive) [ 29/Apr/19 ] |

Moved comment to LU-11410
| Comment by James Nunez (Inactive) [ 14/Oct/19 ] |

We have many kernel crashes due to OOM in recovery-mds-scale test_failover_mds. Here's another one, https://testing.whamcloud.com/test_sets/50fbe26e-ea6e-11e9-be86-52540065bddc, with crash info

[ 1520.960032] LustreError: 13429:0:(client.c:2020:ptlrpc_check_set()) @@@ bulk transfer failed req@ffff88002e249b40 x1646846100197008/t4294968624(4294968624) o4->lustre-OST0003-osc-ffff88007b6d0800@10.9.6.23@tcp:6/4 lens 488/416 e 0 to 1 dl 1570555498 ref 3 fl Bulk:ReX/4/0 rc 0/0
[ 1520.960038] LustreError: 13429:0:(osc_request.c:1924:osc_brw_redo_request()) @@@ redo for recoverable error -5 req@ffff88002e249b40 x1646846100197008/t4294968624(4294968624) o4->lustre-OST0003-osc-ffff88007b6d0800@10.9.6.23@tcp:6/4 lens 488/416 e 0 to 1 dl 1570555498 ref 3 fl Interpret:ReX/4/0 rc -5/0
[ 1525.630849] irqbalance invoked oom-killer: gfp_mask=0x16040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK), nodemask=0, order=0, oom_score_adj=0
[ 1525.630862] irqbalance cpuset=/ mems_allowed=0
[ 1525.630869] CPU: 0 PID: 506 Comm: irqbalance Tainted: G OE N 4.4.180-94.97-default #1
[ 1525.630869] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 1525.630872] 0000000000000000 ffffffff813303b0 ffff88007a18fd68 0000000000000000
[ 1525.630873] ffffffff8120d66e 0000000000000000 0000000000000000 0000000000000000
[ 1525.630875] 0000000000000000 ffffffff810a2ad7 ffffffff81e9aae0 0000000000000000
[ 1525.630875] Call Trace:
[ 1525.630985] [<ffffffff81019b39>] dump_trace+0x59/0x340
[ 1525.630993] [<ffffffff81019f0a>] show_stack_log_lvl+0xea/0x170
[ 1525.630995] [<ffffffff8101ace1>] show_stack+0x21/0x40
[ 1525.631011] [<ffffffff813303b0>] dump_stack+0x5c/0x7c
[ 1525.631044] [<ffffffff8120d66e>] dump_header+0x82/0x215
[ 1525.631070] [<ffffffff8119bb79>] check_panic_on_oom+0x29/0x50
[ 1525.631082] [<ffffffff8119bd1a>] out_of_memory+0x17a/0x4a0
[ 1525.631087] [<ffffffff811a0758>] __alloc_pages_nodemask+0xaf8/0xb70
[ 1525.631098] [<ffffffff811eff5d>] kmem_getpages+0x4d/0xf0
[ 1525.631108] [<ffffffff811f1b6b>] fallback_alloc+0x19b/0x240
[ 1525.631110] [<ffffffff811f33d0>] kmem_cache_alloc+0x240/0x470
[ 1525.631124] [<ffffffff812207ec>] getname_flags+0x4c/0x1f0
[ 1525.631131] [<ffffffff8121114e>] do_sys_open+0xfe/0x200
[ 1525.631157] [<ffffffff81627361>] entry_SYSCALL_64_fastpath+0x20/0xe9
[ 1525.633784] DWARF2 unwinder stuck at entry_SYSCALL_64_fastpath+0x20/0xe9
[ 1525.633784]
[ 1525.633785] Leftover inexact backtrace:
[ 1525.633808] Mem-Info:
[ 1525.633816] active_anon:15 inactive_anon:0 isolated_anon:0
active_file:161804 inactive_file:264038 isolated_file:64
unevictable:20 dirty:0 writeback:46780 unstable:0
slab_reclaimable:2735 slab_unreclaimable:26502
mapped:985 shmem:1 pagetables:960 bounce:0
free:10229 free_pcp:23 free_cma:0
[ 1525.633825] Node 0 DMA free:7568kB min:376kB low:468kB high:560kB active_anon:0kB inactive_anon:8kB active_file:2144kB inactive_file:5256kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15904kB mlocked:0kB dirty:0kB writeback:500kB mapped:176kB shmem:4kB slab_reclaimable:52kB slab_unreclaimable:412kB kernel_stack:16kB pagetables:16kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:44808 all_unreclaimable? yes
[ 1525.633827] lowmem_reserve[]: 0 1843 1843 1843 1843
[ 1525.633835] Node 0 DMA32 free:33348kB min:44676kB low:55844kB high:67012kB active_anon:60kB inactive_anon:0kB active_file:645072kB inactive_file:1050812kB unevictable:80kB isolated(anon):0kB isolated(file):384kB present:2080744kB managed:1900776kB mlocked:80kB dirty:0kB writeback:186620kB mapped:3764kB shmem:0kB slab_reclaimable:10888kB slab_unreclaimable:105596kB kernel_stack:2608kB pagetables:3824kB unstable:0kB bounce:0kB free_pcp:92kB local_pcp:68kB free_cma:0kB writeback_tmp:0kB pages_scanned:10890568 all_unreclaimable? yes
[ 1525.633836] lowmem_reserve[]: 0 0 0 0 0
[ 1525.633843] Node 0 DMA: 12*4kB (UE) 8*8kB (UE) 8*16kB (UME) 1*32kB (M) 4*64kB (UM) 1*128kB (E) 1*256kB (E) 3*512kB (UME) 1*1024kB (E) 2*2048kB (ME) 0*4096kB = 7568kB
[ 1525.633849] Node 0 DMA32: 1231*4kB (UME) 259*8kB (ME) 79*16kB (UME) 66*32kB (UME) 31*64kB (UME) 46*128kB (UM) 25*256kB (UM) 17*512kB (M) 0*1024kB 0*2048kB 0*4096kB = 33348kB
[ 1525.633867] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[ 1525.633884] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 1525.633884] 49223 total pagecache pages
[ 1525.633886] 1 pages in swap cache
[ 1525.633886] Swap cache stats: add 9520, delete 9519, find 197/379
[ 1525.633887] Free swap = 14301256kB
[ 1525.633887] Total swap = 14338044kB
[ 1525.633888] 524184 pages RAM
[ 1525.633888] 0 pages HighMem/MovableOnly
[ 1525.633888] 45014 pages reserved
[ 1525.633888] 0 pages hwpoisoned
[ 1525.633889] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
[ 1525.634095] [ 358] 0 358 10933 321 24 3 1114 0 systemd-journal
[ 1525.634104] [ 395] 0 395 9229 325 20 3 174 -1000 systemd-udevd
[ 1525.634105] [ 397] 495 397 13126 311 31 3 117 0 rpcbind
[ 1525.634117] [ 467] 499 467 12922 268 27 3 150 -900 dbus-daemon
[ 1525.634129] [ 484] 0 484 10883 336 26 3 290 0 wickedd-dhcp4
[ 1525.634131] [ 504] 0 504 29175 327 59 3 236 0 sssd
[ 1525.634137] [ 505] 0 505 10882 305 25 3 286 0 wickedd-auto4
[ 1525.634138] [ 506] 0 506 4815 317 15 3 58 0 irqbalance
[ 1525.634150] [ 507] 0 507 10883 351 26 3 287 0 wickedd-dhcp6
[ 1525.634155] [ 547] 0 547 36535 404 69 4 302 0 sssd_be
[ 1525.634161] [ 567] 0 567 83783 339 37 3 269 0 rsyslogd
[ 1525.634163] [ 582] 0 582 31181 375 62 3 207 0 sssd_nss
[ 1525.634168] [ 583] 0 583 25530 245 53 3 202 0 sssd_pam
[ 1525.634170] [ 584] 0 584 24446 254 52 3 201 0 sssd_ssh
[ 1525.634267] [ 768] 0 768 10913 360 27 3 331 0 wickedd
[ 1525.634275] [ 774] 0 774 10889 290 27 3 292 0 wickedd-nanny
[ 1525.634283] [ 1430] 0 1430 2142 296 10 3 40 0 xinetd
[ 1525.634295] [ 1469] 0 1469 16594 318 35 3 180 -1000 sshd
[ 1525.634297] [ 1472] 74 1472 8412 342 17 3 152 0 ntpd
[ 1525.634304] [ 1482] 74 1482 9465 306 18 3 153 0 ntpd
[ 1525.634313] [ 1512] 490 1512 55357 0 21 3 264 0 munged
[ 1525.634325] [ 1530] 0 1530 1665 286 8 3 29 0 agetty
[ 1525.634330] [ 1532] 0 1532 147216 292 62 3 374 0 automount
[ 1525.634338] [ 1533] 0 1533 1665 295 9 3 30 0 agetty
[ 1525.634340] [ 1601] 0 1601 5516 339 17 3 79 0 systemd-logind
[ 1525.634342] [ 1857] 0 1857 8864 332 20 3 121 0 master
[ 1525.634344] [ 1869] 51 1869 12442 289 24 3 129 0 pickup
[ 1525.634352] [ 1870] 51 1870 12539 290 26 3 166 0 qmgr
[ 1525.634354] [ 1915] 0 1915 5202 313 15 3 167 0 cron
[ 1525.634417] [15802] 0 15802 17469 305 35 3 175 0 in.mrshd
[ 1525.634425] [15803] 0 15803 2895 361 10 3 78 0 bash
[ 1525.634433] [15808] 0 15808 2895 278 10 3 79 0 bash
[ 1525.634435] [15809] 0 15809 3035 365 12 3 214 0 run_dd.sh
[ 1525.634449] [16592] 0 16592 1063 203 7 3 21 0 dd
[ 1525.634450] Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled
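
Note that the panic itself is expected behavior rather than a crash in the usual sense: both dumps end with "Out of memory: system-wide panic_on_oom is enabled", the message the kernel prints when vm.panic_on_oom=1, so an OOM condition produces a vmcore for triage instead of an OOM-killed process. A quick way to confirm the setting on a node (a sketch; 1 panics on a system-wide OOM, 2 panics on any OOM, 0 is the default kill-a-victim behavior):

    # Show the current setting; "vm.panic_on_oom = 1" matches the panics above.
    sysctl vm.panic_on_oom
    cat /proc/sys/vm/panic_on_oom

    # To let the OOM killer pick a victim instead of panicking (not what the
    # failover tests want, since they rely on the crash dump for analysis):
    # sysctl -w vm.panic_on_oom=0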