Lustre / LU-12311

recovery-double-scale test pairwise_fail crashed with OOM


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.10.7, Lustre 2.12.1, Lustre 2.10.8, Lustre 2.12.3
    • Environment: SLES12 SP3 server and client failover testing
    • Severity: 3

    Description

      So far, we are seeing recovery-double-scale test_pairwise_fail crash with an OOM panic only in SLES 12 SP3 failover testing.
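
      For reference, a minimal sketch of how this subtest can be driven by hand through the standard lustre/tests harness (the install path matches the PATH in the log below; the client list and other variables are placeholders and depend on the local failover configuration):

          # run only the pairwise_fail subtest of recovery-double-scale;
          # requires configured failover pairs and at least two clients
          cd /usr/lib64/lustre/tests
          CLIENTS="trevis-42vm3,trevis-42vm4" ONLY=pairwise_fail \
              bash recovery-double-scale.sh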

      Looking at the kernel-crash log for https://testing.whamcloud.com/test_sets/bf0a7c40-7523-11e9-a6f2-52540065bddc , we see

      [  752.114008] Lustre: DEBUG MARKER: == recovery-double-scale test pairwise_fail: pairwise combination of clients, MDS, and OST failures == 08:37:36 (1557675456)
      [  752.143553] Lustre: DEBUG MARKER: PATH=/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/mpi/gcc/openmpi/bin:/sbin:
      [  752.199829] Lustre: DEBUG MARKER: /usr/sbin/lctl mark Started client load: dd on trevis-42vm3
      [  752.240449] Lustre: DEBUG MARKER: Started client load: dd on trevis-42vm3
      [  752.998214] Lustre: DEBUG MARKER: /usr/sbin/lctl mark Started client load: tar on trevis-42vm4
      [  753.060478] Lustre: DEBUG MARKER: Started client load: tar on trevis-42vm4
      [  755.135366] Lustre: DEBUG MARKER: cat /tmp/client-load.pid
      [  831.669265] irqbalance invoked oom-killer: gfp_mask=0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=0, order=0, oom_score_adj=0
      [  831.669300] irqbalance cpuset=/ mems_allowed=0
      [  831.669310] CPU: 0 PID: 493 Comm: irqbalance Tainted: G           OE   N  4.4.162-94.69-default #1
      [  831.669310] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [  831.669317]  0000000000000000 ffffffff8132cdc0 ffff88007b2c7ac0 0000000000000000
      [  831.669318]  ffffffff8120b20e 0000000000000000 0000000000000000 0000000000000000
      [  831.669320]  0000000000000000 ffffffff810a1fb7 ffffffff81e9aa20 0000000000000000
      [  831.669320] Call Trace:
      [  831.669423]  [<ffffffff81019b09>] dump_trace+0x59/0x340
      [  831.669430]  [<ffffffff81019eda>] show_stack_log_lvl+0xea/0x170
      [  831.669435]  [<ffffffff8101acb1>] show_stack+0x21/0x40
      [  831.669455]  [<ffffffff8132cdc0>] dump_stack+0x5c/0x7c
      [  831.669490]  [<ffffffff8120b20e>] dump_header+0x82/0x215
      [  831.669519]  [<ffffffff81199d39>] check_panic_on_oom+0x29/0x50
      [  831.669530]  [<ffffffff81199eda>] out_of_memory+0x17a/0x4a0
      [  831.669537]  [<ffffffff8119e928>] __alloc_pages_nodemask+0xaf8/0xb70
      [  831.669555]  [<ffffffff811e8b24>] alloc_pages_vma+0xa4/0x220
      [  831.669569]  [<ffffffff811c5063>] handle_pte_fault+0xe63/0x1660
      [  831.669577]  [<ffffffff811c741a>] handle_mm_fault+0x2fa/0x640
      [  831.669593]  [<ffffffff81068df7>] __do_page_fault+0x217/0x4c0
      [  831.669608]  [<ffffffff8106914c>] trace_do_page_fault+0x3c/0x120
      [  831.669627]  [<ffffffff81621382>] async_page_fault+0x32/0x60
      [  831.672967] DWARF2 unwinder stuck at async_page_fault+0x32/0x60
      [  831.672967] 
      [  831.672971] Leftover inexact backtrace:
                     
      [  831.672989]  [<ffffffff8133945c>] ? copy_user_generic_string+0x2c/0x40
      [  831.673000]  [<ffffffff8127fb70>] ? int_seq_next+0x20/0x20
      [  831.673009]  [<ffffffff81231d84>] ? seq_read+0x2a4/0x3a0
      [  831.673015]  [<ffffffff81276e4c>] ? proc_reg_read+0x3c/0x70
      [  831.673017]  [<ffffffff8120f676>] ? __vfs_read+0x26/0x140
      [  831.673019]  [<ffffffff8120fb09>] ? rw_verify_area+0x49/0xc0
      [  831.673021]  [<ffffffff8120fbfa>] ? vfs_read+0x7a/0x120
      [  831.673022]  [<ffffffff81210d12>] ? SyS_read+0x42/0xa0
      [  831.673029]  [<ffffffff8161de61>] ? entry_SYSCALL_64_fastpath+0x20/0xe9
      [  831.673044] Mem-Info:
      [  831.673053] active_anon:1289 inactive_anon:1830 isolated_anon:0
                      active_file:103761 inactive_file:336285 isolated_file:31
                      unevictable:20 dirty:0 writeback:0 unstable:0
                      slab_reclaimable:2630 slab_unreclaimable:9715
                      mapped:7320 shmem:2179 pagetables:945 bounce:0
                      free:12214 free_pcp:31 free_cma:0
      [  831.673076] Node 0 DMA free:7724kB min:376kB low:468kB high:560kB active_anon:40kB inactive_anon:60kB active_file:3444kB inactive_file:3492kB unevictable:0kB isolated(anon):0kB isolated(file):124kB present:15992kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:308kB shmem:100kB slab_reclaimable:28kB slab_unreclaimable:404kB kernel_stack:16kB pagetables:108kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:281104 all_unreclaimable? yes
      [  831.673078] lowmem_reserve[]: 0 1843 1843 1843 1843
      [  831.673083] Node 0 DMA32 free:41132kB min:44676kB low:55844kB high:67012kB active_anon:5116kB inactive_anon:7260kB active_file:411600kB inactive_file:1341648kB unevictable:80kB isolated(anon):0kB isolated(file):0kB present:2080744kB managed:1900752kB mlocked:80kB dirty:0kB writeback:0kB mapped:28972kB shmem:8616kB slab_reclaimable:10492kB slab_unreclaimable:38456kB kernel_stack:2560kB pagetables:3672kB unstable:0kB bounce:0kB free_pcp:124kB local_pcp:120kB free_cma:0kB writeback_tmp:0kB pages_scanned:11405892 all_unreclaimable? yes
      [  831.673085] lowmem_reserve[]: 0 0 0 0 0
      [  831.673093] Node 0 DMA: 15*4kB (UE) 12*8kB (UME) 13*16kB (UE) 6*32kB (U) 2*64kB (U) 1*128kB (E) 1*256kB (E) 1*512kB (E) 2*1024kB (ME) 2*2048kB (ME) 0*4096kB = 7724kB
      [  831.673100] Node 0 DMA32: 369*4kB (UME) 171*8kB (ME) 333*16kB (UE) 248*32kB (UME) 155*64kB (UE) 72*128kB (UE) 11*256kB (UM) 4*512kB (UM) 1*1024kB (U) 0*2048kB 0*4096kB = 41132kB
      [  831.673117] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
      [  831.673130] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
      [  831.673131] 14562 total pagecache pages
      [  831.673132] 89 pages in swap cache
      [  831.673136] Swap cache stats: add 6258, delete 6169, find 134/194
      [  831.673137] Free swap  = 14313840kB
      [  831.673137] Total swap = 14338044kB
      [  831.673137] 524184 pages RAM
      [  831.673138] 0 pages HighMem/MovableOnly
      [  831.673138] 45020 pages reserved
      [  831.673138] 0 pages hwpoisoned
      [  831.673139] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
      [  831.673373] [  359]     0   359    10933     1055      25       3     1114             0 systemd-journal
      [  831.673382] [  402]   495   402    13124      928      30       3      116             0 rpcbind
      [  831.673403] [  405]     0   405     9267      848      21       3      100         -1000 systemd-udevd
      [  831.673409] [  493]     0   493     4815      630      14       3       38             0 irqbalance
      [  831.673411] [  500]     0   500    29707     1804      57       3      204             0 sssd
      [  831.673421] [  510]   499   510    13452      892      29       3      150          -900 dbus-daemon
      [  831.673423] [  536]     0   536    36531     2054      71       3      289             0 sssd_be
      [  831.673435] [  539]     0   539     7448     1061      20       4      265             0 wickedd-dhcp4
      [  831.673444] [  540]     0   540     7448     1033      20       3      261             0 wickedd-auto4
      [  831.673452] [  541]     0   541     7447     1061      20       3      260             0 wickedd-dhcp6
      [  831.673481] [  560]     0   560    84317      916      37       3      264             0 rsyslogd
      [  831.673483] [  576]     0   576    31712     1810      67       3      186             0 sssd_nss
      [  831.673485] [  577]     0   577    26060     1524      55       3      171             0 sssd_pam
      [  831.673487] [  578]     0   578    24978     1424      52       3      189             0 sssd_ssh
      [  831.673574] [  767]     0   767     7480     1064      18       3      300             0 wickedd
      [  831.673580] [  770]     0   770     7455      990      19       3      276             0 wickedd-nanny
      [  831.673589] [ 1429]     0  1429     2141      457      10       3       24             0 xinetd
      [  831.673597] [ 1471]    74  1471     8408      993      17       3      128             0 ntpd
      [  831.673608] [ 1478]    74  1478     9461      558      18       3      148             0 ntpd
      [  831.673629] [ 1491]     0  1491    16586     1569      37       3      181         -1000 sshd
      [  831.673634] [ 1503]   493  1503    55352      609      20       3      144             0 munged
      [  831.673643] [ 1543]     0  1543     1664      436       9       3       26             0 agetty
      [  831.673660] [ 1547]     0  1547     1664      459       9       3       30             0 agetty
      [  831.673662] [ 1557]     0  1557   147212     1543      61       4      345             0 automount
      [  831.673671] [ 1604]     0  1604     5513      603      16       4       64             0 systemd-logind
      [  831.673673] [ 1826]     0  1826     8861      838      21       3       98             0 master
      [  831.673675] [ 1846]    51  1846    12439     1043      27       3      106             0 pickup
      [  831.673683] [ 1847]    51  1847    12536     1329      27       3      176             0 qmgr
      [  831.673691] [ 1893]     0  1893     5198      564      17       3      151             0 cron
      [  831.673798] [15007]     0 15007    17465      869      36       3        0             0 in.mrshd
      [  831.673809] [15008]     0 15008     2894      661      11       3       12             0 bash
      [  831.673817] [15013]     0 15013     2894      494      11       3        2             0 bash
      [  831.673831] [15014]     0 15014     3034      751      12       3        0             0 run_dd.sh
      [  831.673833] [15047]     0 15047     1062      191       7       3        0             0 dd
      [  831.673838] Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled
                     
      [  831.673840] CPU: 0 PID: 493 Comm: irqbalance Tainted: G           OE   N  4.4.162-94.69-default #1
      [  831.673840] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [  831.673842]  0000000000000000 ffffffff8132cdc0 ffffffff81a28bf0 ffff88007b2c79e8
      [  831.673843]  ffffffff81193c21 0000000000000010 ffff88007b2c79f8 ffff88007b2c7998
      [  831.673845]  0000000000000426 ffffffff81a2cec3 000000000000004f 0000000000000000
      [  831.673845] Call Trace:
      [  831.673851]  [<ffffffff81019b09>] dump_trace+0x59/0x340
      [  831.673854]  [<ffffffff81019eda>] show_stack_log_lvl+0xea/0x170
      [  831.673856]  [<ffffffff8101acb1>] show_stack+0x21/0x40
      [  831.673859]  [<ffffffff8132cdc0>] dump_stack+0x5c/0x7c
      [  831.673867]  [<ffffffff81193c21>] panic+0xd2/0x232
      [  831.673870]  [<ffffffff81199d60>] check_panic_on_oom+0x50/0x50
      [  831.673873]  [<ffffffff81199eda>] out_of_memory+0x17a/0x4a0
      [  831.673879]  [<ffffffff8119e928>] __alloc_pages_nodemask+0xaf8/0xb70
      [  831.673883]  [<ffffffff811e8b24>] alloc_pages_vma+0xa4/0x220
      [  831.673886]  [<ffffffff811c5063>] handle_pte_fault+0xe63/0x1660
      [  831.673890]  [<ffffffff811c741a>] handle_mm_fault+0x2fa/0x640
      [  831.673893]  [<ffffffff81068df7>] __do_page_fault+0x217/0x4c0
      [  831.673897]  [<ffffffff8106914c>] trace_do_page_fault+0x3c/0x120
      [  831.673900]  [<ffffffff81621382>] async_page_fault+0x32/0x60
      [  831.676117] DWARF2 unwinder stuck at async_page_fault+0x32/0x60
      [  831.676118] 
      [  831.676118] Leftover inexact backtrace:
                     
      [  831.676121]  [<ffffffff8133945c>] ? copy_user_generic_string+0x2c/0x40
      [  831.676122]  [<ffffffff8127fb70>] ? int_seq_next+0x20/0x20
      [  831.676124]  [<ffffffff81231d84>] ? seq_read+0x2a4/0x3a0
      [  831.676126]  [<ffffffff81276e4c>] ? proc_reg_read+0x3c/0x70
      [  831.676128]  [<ffffffff8120f676>] ? __vfs_read+0x26/0x140
      [  831.676130]  [<ffffffff8120fb09>] ? rw_verify_area+0x49/0xc0
      [  831.676131]  [<ffffffff8120fbfa>] ? vfs_read+0x7a/0x120
      [  831.676133]  [<ffffffff81210d12>] ? SyS_read+0x42/0xa0
      [  831.676135]  [<ffffffff8161de61>] ? entry_SYSCALL_64_fastpath+0x20/0xe9
      

      Here's another example of this crash: https://testing.whamcloud.com/test_sets/e2461cc8-5485-11e9-9646-52540065bddc .
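
      Note that the node panics, rather than just OOM-killing a task, because the test nodes run with vm.panic_on_oom set ("system-wide panic_on_oom is enabled" in the log above). At the time of the panic, the Mem-Info dump shows nearly all of the ~1.9 GB of managed memory sitting in file cache that the zone reports as unreclaimable, with free memory below the min watermark. A quick sketch of standard commands (not part of the test framework) for checking the policy and watching memory pressure on a client while the dd/tar loads run:

          # 0 = OOM-kill a task; 1 or 2 = panic the whole node on OOM
          sysctl vm.panic_on_oom
          # watch free memory, page cache, and dirty/writeback pages
          grep -E 'MemFree|^Cached|Dirty|Writeback' /proc/meminfo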


            People

              Assignee: WC Triage (wc-triage)
              Reporter: James Nunez (jamesanunez) (Inactive)