LU-12067

recovery-mds-scale test failover_mds crashes with OOM


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.10.7
    • Labels: None
    • Environment: SLES client failover testing
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      recovery-mds-scale test_failover_mds crashes with OOM for SLES client failover testing.

      Looking at the suite_log for the failed test suite https://testing.whamcloud.com/test_sets/87ea5270-429d-11e9-a256-52540065bddc, we can see that one MDS failover takes place and that the test is checking the client loads after the failover. The last thing seen in the suite_log is

      Started lustre-MDT0000
      ==== Checking the clients loads AFTER failover -- failure NOT OK
      01:14:58 (1552122898) waiting for trevis-34vm7 network 5 secs ...
      01:14:58 (1552122898) network interface is UP
      CMD: trevis-34vm7 rc=0;
      			val=\$(/usr/sbin/lctl get_param -n catastrophe 2>&1);
      			if [[ \$? -eq 0 && \$val -ne 0 ]]; then
      				echo \$(hostname -s): \$val;
      				rc=\$val;
      			fi;
      			exit \$rc
      CMD: trevis-34vm7 ps auxwww | grep -v grep | grep -q run_dd.sh
      01:14:58 (1552122898) waiting for trevis-34vm8 network 5 secs ...
      01:14:58 (1552122898) network interface is UP
      CMD: trevis-34vm8 rc=0;
      			val=\$(/usr/sbin/lctl get_param -n catastrophe 2>&1);
      			if [[ \$? -eq 0 && \$val -ne 0 ]]; then
      				echo \$(hostname -s): \$val;
      				rc=\$val;
      			fi;
      			exit \$rc
      CMD: trevis-34vm8 ps auxwww | grep -v grep | grep -q run_tar.sh
      mds1 has failed over 1 times, and counting...
      sleeping 1125 seconds... 
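
      The two CMD blocks above are the per-client health check that recovery-mds-scale runs after each failover: a non-zero "catastrophe" value indicates the node has hit an LBUG, and the ps | grep confirms that the client load script (run_dd.sh on trevis-34vm7, run_tar.sh on trevis-34vm8) is still running. As a rough stand-alone sketch of that check, assuming plain ssh access to the clients (the test framework uses its own remote-execution helpers, so this is illustrative only):

      for client in trevis-34vm7 trevis-34vm8; do
      	# LBUG check: 'catastrophe' becomes non-zero once the node has LBUGed
      	ssh "$client" 'val=$(/usr/sbin/lctl get_param -n catastrophe 2>&1);
      		[[ $? -eq 0 && $val -ne 0 ]] && { echo "$(hostname -s): $val"; exit 1; };
      		exit 0' || echo "$client: catastrophe flag set"
      	# load check: the client load script must still be running
      	ssh "$client" "ps auxwww | grep -v grep | grep -qE 'run_(dd|tar).sh'" \
      		|| echo "$client: client load is no longer running"
      done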
      

      Looking at the console log for the kernel crash, we see

      [ 1358.821727] jbd2/vda1-8 invoked oom-killer: gfp_mask=0x1420848(GFP_NOFS|__GFP_NOFAIL|__GFP_HARDWALL|__GFP_MOVABLE), nodemask=0, order=0, oom_score_adj=0
      [ 1358.821760] jbd2/vda1-8 cpuset=/ mems_allowed=0
      [ 1358.821775] CPU: 0 PID: 273 Comm: jbd2/vda1-8 Tainted: G           OE   N  4.4.162-94.69-default #1
      [ 1358.821775] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [ 1358.821782]  0000000000000000 ffffffff8132cdc0 ffff88003733fb88 0000000000000000
      [ 1358.821784]  ffffffff8120b20e 0000000000000000 0000000000000000 0000000000000000
      [ 1358.821785]  0000000000000000 ffffffff810a1fb7 ffffffff81e9aa20 0000000000000000
      [ 1358.821786] Call Trace:
      [ 1358.821886]  [<ffffffff81019b09>] dump_trace+0x59/0x340
      [ 1358.821894]  [<ffffffff81019eda>] show_stack_log_lvl+0xea/0x170
      [ 1358.821897]  [<ffffffff8101acb1>] show_stack+0x21/0x40
      [ 1358.821921]  [<ffffffff8132cdc0>] dump_stack+0x5c/0x7c
      [ 1358.821953]  [<ffffffff8120b20e>] dump_header+0x82/0x215
      [ 1358.821981]  [<ffffffff81199d39>] check_panic_on_oom+0x29/0x50
      [ 1358.821993]  [<ffffffff81199eda>] out_of_memory+0x17a/0x4a0
      [ 1358.822000]  [<ffffffff8119e849>] __alloc_pages_nodemask+0xa19/0xb70
      [ 1358.822019]  [<ffffffff811e6caf>] alloc_pages_current+0x7f/0x100
      [ 1358.822036]  [<ffffffff81196dfd>] pagecache_get_page+0x4d/0x1c0
      [ 1358.822046]  [<ffffffff812443ce>] __getblk_slow+0xce/0x2e0
      [ 1358.822106]  [<ffffffffa01bda15>] jbd2_journal_get_descriptor_buffer+0x35/0x90 [jbd2]
      [ 1358.822127]  [<ffffffffa01b689d>] jbd2_journal_commit_transaction+0x8ed/0x1970 [jbd2]
      [ 1358.822136]  [<ffffffffa01bb3b2>] kjournald2+0xb2/0x260 [jbd2]
      [ 1358.822150]  [<ffffffff810a0e29>] kthread+0xc9/0xe0
      [ 1358.822190]  [<ffffffff8161e1f5>] ret_from_fork+0x55/0x80
      [ 1358.825553] DWARF2 unwinder stuck at ret_from_fork+0x55/0x80
      [ 1358.825554] 
      [ 1358.825558] Leftover inexact backtrace:
                     
      [ 1358.825573]  [<ffffffff810a0d60>] ? kthread_park+0x50/0x50
      [ 1358.825583] Mem-Info:
      [ 1358.825592] active_anon:1696 inactive_anon:1761 isolated_anon:0
                      active_file:219719 inactive_file:219874 isolated_file:0
                      unevictable:20 dirty:0 writeback:0 unstable:0
                      slab_reclaimable:2700 slab_unreclaimable:17655
                      mapped:5110 shmem:2179 pagetables:987 bounce:0
                      free:4229 free_pcp:48 free_cma:0
      [ 1358.825612] Node 0 DMA free:7480kB min:376kB low:468kB high:560kB active_anon:0kB inactive_anon:100kB active_file:3640kB inactive_file:3680kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:172kB shmem:100kB slab_reclaimable:32kB slab_unreclaimable:592kB kernel_stack:48kB pagetables:60kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:2866048 all_unreclaimable? yes
      [ 1358.825613] lowmem_reserve[]: 0 1843 1843 1843 1843
      [ 1358.825622] Node 0 DMA32 free:9436kB min:44676kB low:55844kB high:67012kB active_anon:6784kB inactive_anon:6944kB active_file:875236kB inactive_file:875816kB unevictable:80kB isolated(anon):0kB isolated(file):0kB present:2080744kB managed:1900752kB mlocked:80kB dirty:0kB writeback:0kB mapped:20268kB shmem:8616kB slab_reclaimable:10768kB slab_unreclaimable:70028kB kernel_stack:2656kB pagetables:3888kB unstable:0kB bounce:0kB free_pcp:192kB local_pcp:120kB free_cma:0kB writeback_tmp:0kB pages_scanned:17166780 all_unreclaimable? yes
      [ 1358.825624] lowmem_reserve[]: 0 0 0 0 0
      [ 1358.825639] Node 0 DMA: 2*4kB (E) 4*8kB (ME) 1*16kB (U) 2*32kB (UM) 3*64kB (UME) 4*128kB (UME) 2*256kB (UE) 2*512kB (ME) 1*1024kB (E) 2*2048kB (ME) 0*4096kB = 7480kB
      [ 1358.825645] Node 0 DMA32: 481*4kB (UME) 239*8kB (UME) 78*16kB (ME) 26*32kB (ME) 11*64kB (ME) 22*128kB (UE) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 9436kB
      [ 1358.825659] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
      [ 1358.825676] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
      [ 1358.825677] 7018 total pagecache pages
      [ 1358.825678] 27 pages in swap cache
      [ 1358.825682] Swap cache stats: add 6097, delete 6070, find 51/94
      [ 1358.825683] Free swap  = 14314056kB
      [ 1358.825683] Total swap = 14338044kB
      [ 1358.825684] 524184 pages RAM
      [ 1358.825684] 0 pages HighMem/MovableOnly
      [ 1358.825685] 45020 pages reserved
      [ 1358.825685] 0 pages hwpoisoned
      [ 1358.825685] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
      [ 1358.825896] [  349]     0   349    10933     1003      22       3     1114             0 systemd-journal
      [ 1358.825902] [  413]   495   413    13124      890      29       4      111             0 rpcbind
      [ 1358.825913] [  415]     0   415     9267      660      20       3      214         -1000 systemd-udevd
      [ 1358.825919] [  462]     0   462     4814      455      14       3       38             0 irqbalance
      [ 1358.825921] [  464]     0   464    29706     1245      59       4      197             0 sssd
      [ 1358.825934] [  476]   499   476    13452      785      28       3      150          -900 dbus-daemon
      [ 1358.825943] [  535]     0   535     7447     1016      19       3      261             0 wickedd-dhcp6
      [ 1358.825945] [  554]     0   554    36530     1435      70       3      264             0 sssd_be
      [ 1358.825953] [  563]     0   563     7448     1054      20       3      265             0 wickedd-dhcp4
      [ 1358.825962] [  564]     0   564     7448     1018      20       3      261             0 wickedd-auto4
      [ 1358.825970] [  565]     0   565    84317      749      37       4      259             0 rsyslogd
      [ 1358.825972] [  571]     0   571    31711     1112      65       3      175             0 sssd_nss
      [ 1358.825974] [  572]     0   572    26059     1058      55       3      169             0 sssd_pam
      [ 1358.825980] [  573]     0   573    24977     1041      51       3      161             0 sssd_ssh
      [ 1358.826054] [  761]     0   761     7480     1030      18       3      299             0 wickedd
      [ 1358.826060] [  764]     0   764     7455      974      21       3      276             0 wickedd-nanny
      [ 1358.826069] [ 1418]     0  1418     2141      455       9       3       26             0 xinetd
      [ 1358.826077] [ 1464]     0  1464    16586     1551      37       3      181         -1000 sshd
      [ 1358.826079] [ 1477]    74  1477     8408      842      18       3      131             0 ntpd
      [ 1358.826094] [ 1490]    74  1490     9461      497      21       3      150             0 ntpd
      [ 1358.826099] [ 1511]   493  1511    55352      609      21       3      149             0 munged
      [ 1358.826107] [ 1527]     0  1527     1664      365       8       3       29             0 agetty
      [ 1358.826115] [ 1529]     0  1529     1664      407       9       3       29             0 agetty
      [ 1358.826117] [ 1536]     0  1536   147212     1080      60       3      346             0 automount
      [ 1358.826125] [ 1611]     0  1611     5513      494      16       3       65             0 systemd-logind
      [ 1358.826127] [ 1828]     0  1828     8861      812      20       3       98             0 master
      [ 1358.826130] [ 1853]    51  1853    12439     1000      24       3      106             0 pickup
      [ 1358.826135] [ 1854]    51  1854    12529     1309      25       3      173             0 qmgr
      [ 1358.826143] [ 1883]     0  1883     5197      531      17       3      144             0 cron
      [ 1358.826250] [15652]     0 15652    17465      844      35       4        0             0 in.mrshd
      [ 1358.826258] [15653]     0 15653     2894      653      11       3        0             0 bash
      [ 1358.826266] [15658]     0 15658     2894      492      11       3        0             0 bash
      [ 1358.826274] [15659]     0 15659     3034      755      10       3        0             0 run_dd.sh
      [ 1358.826313] [16412]    51 16412    16918     1924      32       3        0             0 smtp
      [ 1358.826314] [16422]     0 16422     1062      198       8       3        0             0 dd
      [ 1358.826316] [16425]    51 16425    12447     1130      23       3        0             0 bounce
      [ 1358.826322] Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled
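
      A couple of notes on the numbers above: nearly all of the ~1.9 GB of managed memory is sitting in page cache (active_file + inactive_file is roughly (219719 + 219874) * 4 kB, about 1.7 GB), anonymous memory and swap usage are tiny, and yet both zones report all_unreclaimable while jbd2 fails an order-0 GFP_NOFS allocation. The node panics, rather than just OOM-killing a process, because panic_on_oom is enabled system-wide (last line above), presumably so that OOM conditions on the test nodes produce a crash dump we can collect. For anyone reproducing this interactively, that knob is the standard vm.panic_on_oom sysctl (shown here only as a pointer, not as a suggested configuration change):

      sysctl vm.panic_on_oom        # 0 = just run the OOM killer; 1 or 2 = panic, as seen here
      sysctl -w vm.panic_on_oom=0   # example: fall back to the OOM killer while debugging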
      

      We have seen several OOM kernel crashes for SLES in past testing; see, for example, LU-10319 and LU-9601.
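
      If this shows up again, it may be worth sampling memory on the clients while the loads run, so the growth leading up to the OOM is visible alongside the client logs. A throwaway sketch (the interval and output path are arbitrary and not part of the test suite):

      # sample /proc/meminfo on a client for as long as the load scripts are running
      while ps auxwww | grep -v grep | grep -qE 'run_(dd|tar).sh'; do
      	{ date; grep -E 'MemFree|^Cached|Dirty|Writeback|Slab' /proc/meminfo; } >> /tmp/client-meminfo.log
      	sleep 60
      done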

            People

              wc-triage WC Triage
              jamesanunez James Nunez (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated: