Details
-
Bug
-
Resolution: Unresolved
-
Minor
-
None
-
Lustre 2.10.7, Lustre 2.12.1, Lustre 2.10.8, Lustre 2.12.3
-
SLES12 SP3 server and client failover testing
-
3
Description
So far, we are seeing recovery-double-scale test_pairwise_fail crash with an out-of-memory (OOM) panic only in SLES 12 SP3 failover testing.
Looking at the kernel-crash log for https://testing.whamcloud.com/test_sets/bf0a7c40-7523-11e9-a6f2-52540065bddc, we see:
[ 752.114008] Lustre: DEBUG MARKER: == recovery-double-scale test pairwise_fail: pairwise combination of clients, MDS, and OST failures == 08:37:36 (1557675456)
[ 752.143553] Lustre: DEBUG MARKER: PATH=/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/mpi/gcc/openmpi/bin:/sbin:
[ 752.199829] Lustre: DEBUG MARKER: /usr/sbin/lctl mark Started client load: dd on trevis-42vm3
[ 752.240449] Lustre: DEBUG MARKER: Started client load: dd on trevis-42vm3
[ 752.998214] Lustre: DEBUG MARKER: /usr/sbin/lctl mark Started client load: tar on trevis-42vm4
[ 753.060478] Lustre: DEBUG MARKER: Started client load: tar on trevis-42vm4
[ 755.135366] Lustre: DEBUG MARKER: cat /tmp/client-load.pid
[ 831.669265] irqbalance invoked oom-killer: gfp_mask=0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=0, order=0, oom_score_adj=0
[ 831.669300] irqbalance cpuset=/ mems_allowed=0
[ 831.669310] CPU: 0 PID: 493 Comm: irqbalance Tainted: G OE N 4.4.162-94.69-default #1
[ 831.669310] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 831.669317] 0000000000000000 ffffffff8132cdc0 ffff88007b2c7ac0 0000000000000000
[ 831.669318] ffffffff8120b20e 0000000000000000 0000000000000000 0000000000000000
[ 831.669320] 0000000000000000 ffffffff810a1fb7 ffffffff81e9aa20 0000000000000000
[ 831.669320] Call Trace:
[ 831.669423] [<ffffffff81019b09>] dump_trace+0x59/0x340
[ 831.669430] [<ffffffff81019eda>] show_stack_log_lvl+0xea/0x170
[ 831.669435] [<ffffffff8101acb1>] show_stack+0x21/0x40
[ 831.669455] [<ffffffff8132cdc0>] dump_stack+0x5c/0x7c
[ 831.669490] [<ffffffff8120b20e>] dump_header+0x82/0x215
[ 831.669519] [<ffffffff81199d39>] check_panic_on_oom+0x29/0x50
[ 831.669530] [<ffffffff81199eda>] out_of_memory+0x17a/0x4a0
[ 831.669537] [<ffffffff8119e928>] __alloc_pages_nodemask+0xaf8/0xb70
[ 831.669555] [<ffffffff811e8b24>] alloc_pages_vma+0xa4/0x220
[ 831.669569] [<ffffffff811c5063>] handle_pte_fault+0xe63/0x1660
[ 831.669577] [<ffffffff811c741a>] handle_mm_fault+0x2fa/0x640
[ 831.669593] [<ffffffff81068df7>] __do_page_fault+0x217/0x4c0
[ 831.669608] [<ffffffff8106914c>] trace_do_page_fault+0x3c/0x120
[ 831.669627] [<ffffffff81621382>] async_page_fault+0x32/0x60
[ 831.672967] DWARF2 unwinder stuck at async_page_fault+0x32/0x60
[ 831.672967]
[ 831.672971] Leftover inexact backtrace:
[ 831.672989] [<ffffffff8133945c>] ? copy_user_generic_string+0x2c/0x40
[ 831.673000] [<ffffffff8127fb70>] ? int_seq_next+0x20/0x20
[ 831.673009] [<ffffffff81231d84>] ? seq_read+0x2a4/0x3a0
[ 831.673015] [<ffffffff81276e4c>] ? proc_reg_read+0x3c/0x70
[ 831.673017] [<ffffffff8120f676>] ? __vfs_read+0x26/0x140
[ 831.673019] [<ffffffff8120fb09>] ? rw_verify_area+0x49/0xc0
[ 831.673021] [<ffffffff8120fbfa>] ? vfs_read+0x7a/0x120
[ 831.673022] [<ffffffff81210d12>] ? SyS_read+0x42/0xa0
[ 831.673029] [<ffffffff8161de61>] ? entry_SYSCALL_64_fastpath+0x20/0xe9
[ 831.673044] Mem-Info:
[ 831.673053] active_anon:1289 inactive_anon:1830 isolated_anon:0 active_file:103761 inactive_file:336285 isolated_file:31 unevictable:20 dirty:0 writeback:0 unstable:0 slab_reclaimable:2630 slab_unreclaimable:9715 mapped:7320 shmem:2179 pagetables:945 bounce:0 free:12214 free_pcp:31 free_cma:0
[ 831.673076] Node 0 DMA free:7724kB min:376kB low:468kB high:560kB active_anon:40kB inactive_anon:60kB active_file:3444kB inactive_file:3492kB unevictable:0kB isolated(anon):0kB isolated(file):124kB present:15992kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:308kB shmem:100kB slab_reclaimable:28kB slab_unreclaimable:404kB kernel_stack:16kB pagetables:108kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:281104 all_unreclaimable? yes
[ 831.673078] lowmem_reserve[]: 0 1843 1843 1843 1843
[ 831.673083] Node 0 DMA32 free:41132kB min:44676kB low:55844kB high:67012kB active_anon:5116kB inactive_anon:7260kB active_file:411600kB inactive_file:1341648kB unevictable:80kB isolated(anon):0kB isolated(file):0kB present:2080744kB managed:1900752kB mlocked:80kB dirty:0kB writeback:0kB mapped:28972kB shmem:8616kB slab_reclaimable:10492kB slab_unreclaimable:38456kB kernel_stack:2560kB pagetables:3672kB unstable:0kB bounce:0kB free_pcp:124kB local_pcp:120kB free_cma:0kB writeback_tmp:0kB pages_scanned:11405892 all_unreclaimable? yes
[ 831.673085] lowmem_reserve[]: 0 0 0 0 0
[ 831.673093] Node 0 DMA: 15*4kB (UE) 12*8kB (UME) 13*16kB (UE) 6*32kB (U) 2*64kB (U) 1*128kB (E) 1*256kB (E) 1*512kB (E) 2*1024kB (ME) 2*2048kB (ME) 0*4096kB = 7724kB
[ 831.673100] Node 0 DMA32: 369*4kB (UME) 171*8kB (ME) 333*16kB (UE) 248*32kB (UME) 155*64kB (UE) 72*128kB (UE) 11*256kB (UM) 4*512kB (UM) 1*1024kB (U) 0*2048kB 0*4096kB = 41132kB
[ 831.673117] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[ 831.673130] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 831.673131] 14562 total pagecache pages
[ 831.673132] 89 pages in swap cache
[ 831.673136] Swap cache stats: add 6258, delete 6169, find 134/194
[ 831.673137] Free swap = 14313840kB
[ 831.673137] Total swap = 14338044kB
[ 831.673137] 524184 pages RAM
[ 831.673138] 0 pages HighMem/MovableOnly
[ 831.673138] 45020 pages reserved
[ 831.673138] 0 pages hwpoisoned
[ 831.673139] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
[ 831.673373] [ 359] 0 359 10933 1055 25 3 1114 0 systemd-journal
[ 831.673382] [ 402] 495 402 13124 928 30 3 116 0 rpcbind
[ 831.673403] [ 405] 0 405 9267 848 21 3 100 -1000 systemd-udevd
[ 831.673409] [ 493] 0 493 4815 630 14 3 38 0 irqbalance
[ 831.673411] [ 500] 0 500 29707 1804 57 3 204 0 sssd
[ 831.673421] [ 510] 499 510 13452 892 29 3 150 -900 dbus-daemon
[ 831.673423] [ 536] 0 536 36531 2054 71 3 289 0 sssd_be
[ 831.673435] [ 539] 0 539 7448 1061 20 4 265 0 wickedd-dhcp4
[ 831.673444] [ 540] 0 540 7448 1033 20 3 261 0 wickedd-auto4
[ 831.673452] [ 541] 0 541 7447 1061 20 3 260 0 wickedd-dhcp6
[ 831.673481] [ 560] 0 560 84317 916 37 3 264 0 rsyslogd
[ 831.673483] [ 576] 0 576 31712 1810 67 3 186 0 sssd_nss
[ 831.673485] [ 577] 0 577 26060 1524 55 3 171 0 sssd_pam
[ 831.673487] [ 578] 0 578 24978 1424 52 3 189 0 sssd_ssh
[ 831.673574] [ 767] 0 767 7480 1064 18 3 300 0 wickedd
[ 831.673580] [ 770] 0 770 7455 990 19 3 276 0 wickedd-nanny
[ 831.673589] [ 1429] 0 1429 2141 457 10 3 24 0 xinetd
[ 831.673597] [ 1471] 74 1471 8408 993 17 3 128 0 ntpd
[ 831.673608] [ 1478] 74 1478 9461 558 18 3 148 0 ntpd
[ 831.673629] [ 1491] 0 1491 16586 1569 37 3 181 -1000 sshd
[ 831.673634] [ 1503] 493 1503 55352 609 20 3 144 0 munged
[ 831.673643] [ 1543] 0 1543 1664 436 9 3 26 0 agetty
[ 831.673660] [ 1547] 0 1547 1664 459 9 3 30 0 agetty
[ 831.673662] [ 1557] 0 1557 147212 1543 61 4 345 0 automount
[ 831.673671] [ 1604] 0 1604 5513 603 16 4 64 0 systemd-logind
[ 831.673673] [ 1826] 0 1826 8861 838 21 3 98 0 master
[ 831.673675] [ 1846] 51 1846 12439 1043 27 3 106 0 pickup
[ 831.673683] [ 1847] 51 1847 12536 1329 27 3 176 0 qmgr
[ 831.673691] [ 1893] 0 1893 5198 564 17 3 151 0 cron
[ 831.673798] [15007] 0 15007 17465 869 36 3 0 0 in.mrshd
[ 831.673809] [15008] 0 15008 2894 661 11 3 12 0 bash
[ 831.673817] [15013] 0 15013 2894 494 11 3 2 0 bash
[ 831.673831] [15014] 0 15014 3034 751 12 3 0 0 run_dd.sh
[ 831.673833] [15047] 0 15047 1062 191 7 3 0 0 dd
[ 831.673838] Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled
[ 831.673840] CPU: 0 PID: 493 Comm: irqbalance Tainted: G OE N 4.4.162-94.69-default #1
[ 831.673840] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 831.673842] 0000000000000000 ffffffff8132cdc0 ffffffff81a28bf0 ffff88007b2c79e8
[ 831.673843] ffffffff81193c21 0000000000000010 ffff88007b2c79f8 ffff88007b2c7998
[ 831.673845] 0000000000000426 ffffffff81a2cec3 000000000000004f 0000000000000000
[ 831.673845] Call Trace:
[ 831.673851] [<ffffffff81019b09>] dump_trace+0x59/0x340
[ 831.673854] [<ffffffff81019eda>] show_stack_log_lvl+0xea/0x170
[ 831.673856] [<ffffffff8101acb1>] show_stack+0x21/0x40
[ 831.673859] [<ffffffff8132cdc0>] dump_stack+0x5c/0x7c
[ 831.673867] [<ffffffff81193c21>] panic+0xd2/0x232
[ 831.673870] [<ffffffff81199d60>] check_panic_on_oom+0x50/0x50
[ 831.673873] [<ffffffff81199eda>] out_of_memory+0x17a/0x4a0
[ 831.673879] [<ffffffff8119e928>] __alloc_pages_nodemask+0xaf8/0xb70
[ 831.673883] [<ffffffff811e8b24>] alloc_pages_vma+0xa4/0x220
[ 831.673886] [<ffffffff811c5063>] handle_pte_fault+0xe63/0x1660
[ 831.673890] [<ffffffff811c741a>] handle_mm_fault+0x2fa/0x640
[ 831.673893] [<ffffffff81068df7>] __do_page_fault+0x217/0x4c0
[ 831.673897] [<ffffffff8106914c>] trace_do_page_fault+0x3c/0x120
[ 831.673900] [<ffffffff81621382>] async_page_fault+0x32/0x60
[ 831.676117] DWARF2 unwinder stuck at async_page_fault+0x32/0x60
[ 831.676118]
[ 831.676118] Leftover inexact backtrace:
[ 831.676121] [<ffffffff8133945c>] ? copy_user_generic_string+0x2c/0x40
[ 831.676122] [<ffffffff8127fb70>] ? int_seq_next+0x20/0x20
[ 831.676124] [<ffffffff81231d84>] ? seq_read+0x2a4/0x3a0
[ 831.676126] [<ffffffff81276e4c>] ? proc_reg_read+0x3c/0x70
[ 831.676128] [<ffffffff8120f676>] ? __vfs_read+0x26/0x140
[ 831.676130] [<ffffffff8120fb09>] ? rw_verify_area+0x49/0xc0
[ 831.676131] [<ffffffff8120fbfa>] ? vfs_read+0x7a/0x120
[ 831.676133] [<ffffffff81210d12>] ? SyS_read+0x42/0xa0
[ 831.676135] [<ffffffff8161de61>] ? entry_SYSCALL_64_fastpath+0x20/0xe9
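To make the Mem-Info counters in the log above easier to read, here is a rough sketch (not part of the original report) that converts them from pages to MiB and checks the DMA32 zone against its min watermark. The helper names are hypothetical, and the 4 KiB page size is an assumption, although it agrees with the per-zone kB figures in the same log.

```python
import re

# Counters copied verbatim from the Mem-Info entry at [ 831.673053].
MEM_INFO = (
    "active_anon:1289 inactive_anon:1830 isolated_anon:0 "
    "active_file:103761 inactive_file:336285 isolated_file:31 "
    "unevictable:20 dirty:0 writeback:0 unstable:0 "
    "slab_reclaimable:2630 slab_unreclaimable:9715 "
    "mapped:7320 shmem:2179 pagetables:945 bounce:0 "
    "free:12214 free_pcp:31 free_cma:0"
)
TOTAL_RAM_PAGES = 524184  # from "524184 pages RAM"
PAGE_KIB = 4              # assumption: 4 KiB pages

# DMA32 zone values (kB) copied from the entry at [ 831.673083].
DMA32_FREE_KB = 41132
DMA32_MIN_KB = 44676


def parse_counters(text):
    """Turn 'name:value' pairs into a dict of integer page counts."""
    return {name: int(value) for name, value in re.findall(r"(\w+):(\d+)", text)}


def summarize(counters, total_pages=TOTAL_RAM_PAGES):
    """Convert the main counters to MiB and compute the file-page share of RAM."""
    file_pages = counters["active_file"] + counters["inactive_file"]
    anon_pages = counters["active_anon"] + counters["inactive_anon"]
    return {
        "file_mib": file_pages * PAGE_KIB / 1024,
        "anon_mib": anon_pages * PAGE_KIB / 1024,
        "free_mib": counters["free"] * PAGE_KIB / 1024,
        "file_share": file_pages / total_pages,
    }


def below_min_watermark(free_kb, min_kb):
    """True when a zone's free memory is under its min watermark."""
    return free_kb < min_kb


if __name__ == "__main__":
    summary = summarize(parse_counters(MEM_INFO))
    for key, value in summary.items():
        print(f"{key}: {value:.2f}")
    print("DMA32 below min:", below_min_watermark(DMA32_FREE_KB, DMA32_MIN_KB))
```

On these numbers, file-backed LRU pages come to roughly 1.7 GiB of the roughly 2 GiB of RAM, anonymous memory is only about 12 MiB, and DMA32 free memory is under its min watermark while swap is almost entirely unused. That is only a reading of the counters, not a root-cause analysis.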
Here's another example of this crash in recovery-double-scale test_pairwise_fail: https://testing.whamcloud.com/test_sets/e2461cc8-5485-11e9-9646-52540065bddc