Lustre / LU-11410

recovery-mds-scale test failover_mds crashes with ‘ntpd invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE)’


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.0, Lustre 2.10.6, Lustre 2.12.1, Lustre 2.12.3, Lustre 2.12.4, Lustre 2.12.9
    • Environment: SLES12 SP3 clients
    • Severity: 3

    Description

      A client crashes in recovery-mds-scale test_failover_mds. Looking at https://testing.whamcloud.com/test_sets/3e369a98-b8bf-11e8-a7de-52540065bddc, we see the following in the kernel crash log:

      [  766.998879] Lustre: DEBUG MARKER: mds1 has failed over 1 times, and counting...
      [  782.297602] Lustre: Evicted from MGS (at MGC10.9.6.25@tcp_1) after server handle changed from 0x66e2519c6be9cc2 to 0x89680f107ea4b814
      [  782.299262] Lustre: MGC10.9.6.25@tcp: Connection restored to MGC10.9.6.25@tcp_1 (at 10.9.6.26@tcp)
      [  782.362888] LustreError: 13367:0:(client.c:3000:ptlrpc_replay_interpret()) @@@ status 301, old was 0  req@ffff88006690c940 x1611617063142688/t4294967305(4294967305) o101->lustre-MDT0000-mdc-ffff88007bb5e800@10.9.6.26@tcp:12/10 lens 952/560 e 0 to 0 dl 1536957913 ref 2 fl Interpret:RP/4/0 rc 301/301
      [  845.630602] ntpd invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=0, order=0, oom_score_adj=0
      [  845.630613] ntpd cpuset=/ mems_allowed=0
      [  845.630628] CPU: 1 PID: 1461 Comm: ntpd Tainted: G           OE   N  4.4.143-94.47-default #1
      [  845.630629] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [  845.630634]  0000000000000000 ffffffff8132ad80 ffff88007be939a0 0000000000000000
      [  845.630637]  ffffffff8120935e 0000000000000000 0000000000000000 0000000000000000
      [  845.630639]  0000000000000000 ffffffff810a0927 ffffffff81e9aa20 0000000000000000
      [  845.630639] Call Trace:
      [  845.630702]  [<ffffffff81019ac9>] dump_trace+0x59/0x340
      [  845.630711]  [<ffffffff81019e9a>] show_stack_log_lvl+0xea/0x170
      [  845.630714]  [<ffffffff8101ac71>] show_stack+0x21/0x40
      [  845.630727]  [<ffffffff8132ad80>] dump_stack+0x5c/0x7c
      [  845.630748]  [<ffffffff8120935e>] dump_header+0x82/0x215
      [  845.630762]  [<ffffffff81198079>] check_panic_on_oom+0x29/0x50
      [  845.630770]  [<ffffffff8119821a>] out_of_memory+0x17a/0x4a0
      [  845.630777]  [<ffffffff8119cc48>] __alloc_pages_nodemask+0xaf8/0xb70
      [  845.630786]  [<ffffffff811e6cc4>] alloc_pages_vma+0xa4/0x220
      [  845.630799]  [<ffffffff811d70f0>] __read_swap_cache_async+0xf0/0x150
      [  845.630805]  [<ffffffff811d7164>] read_swap_cache_async+0x14/0x30
      [  845.630808]  [<ffffffff811d727d>] swapin_readahead+0xfd/0x190
      [  845.630814]  [<ffffffff811c3771>] handle_pte_fault+0x12b1/0x1670
      [  845.630820]  [<ffffffff811c56aa>] handle_mm_fault+0x2fa/0x640
      [  845.630828]  [<ffffffff81067d7a>] __do_page_fault+0x23a/0x4b0
      [  845.630838]  [<ffffffff8106809c>] trace_do_page_fault+0x3c/0x120
      [  845.630850]  [<ffffffff8161da62>] async_page_fault+0x32/0x60
      [  845.633602] DWARF2 unwinder stuck at async_page_fault+0x32/0x60
      [  845.633602] 
      [  845.633603] Leftover inexact backtrace:
                     
      [  845.633621]  [<ffffffff81338d61>] ? __clear_user+0x21/0x50
      [  845.633624]  [<ffffffff810230f2>] ? copy_fpstate_to_sigframe+0x112/0x1a0
      [  845.633625]  [<ffffffff810176d1>] ? do_signal+0x511/0x5b0
      [  845.633627]  [<ffffffff81067d9a>] ? __do_page_fault+0x25a/0x4b0
      [  845.633634]  [<ffffffff8107bf4e>] ? exit_to_usermode_loop+0x70/0xc2
      [  845.633638]  [<ffffffff81003ae5>] ? syscall_return_slowpath+0x85/0xa0
      [  845.633644]  [<ffffffff8161aa3a>] ? int_ret_from_sys_call+0x25/0xa3
      [  845.633661] Mem-Info:
      [  845.633668] active_anon:25 inactive_anon:41 isolated_anon:0
                      active_file:69559 inactive_file:371186 isolated_file:0
                      unevictable:20 dirty:67 writeback:850 unstable:0
                      slab_reclaimable:2733 slab_unreclaimable:8762
                      mapped:7090 shmem:26 pagetables:966 bounce:0
                      free:13103 free_pcp:0 free_cma:0
      [  845.633676] Node 0 DMA free:7736kB min:376kB low:468kB high:560kB active_anon:100kB inactive_anon:104kB active_file:744kB inactive_file:6464kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15904kB mlocked:0kB dirty:0kB writeback:96kB mapped:256kB shmem:104kB slab_reclaimable:20kB slab_unreclaimable:284kB kernel_stack:32kB pagetables:12kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:46916 all_unreclaimable? yes
      [  845.633678] lowmem_reserve[]: 0 1843 1843 1843 1843
      [  845.633683] Node 0 DMA32 free:44676kB min:44676kB low:55844kB high:67012kB active_anon:0kB inactive_anon:60kB active_file:277492kB inactive_file:1478280kB unevictable:80kB isolated(anon):0kB isolated(file):0kB present:2080744kB managed:1900772kB mlocked:80kB dirty:268kB writeback:3304kB mapped:28104kB shmem:0kB slab_reclaimable:10912kB slab_unreclaimable:34764kB kernel_stack:2608kB pagetables:3852kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:11784060 all_unreclaimable? yes
      [  845.633686] lowmem_reserve[]: 0 0 0 0 0
      [  845.633694] Node 0 DMA: 8*4kB (UME) 5*8kB (ME) 3*16kB (UE) 2*32kB (UE) 2*64kB (U) 2*128kB (UE) 2*256kB (ME) 3*512kB (UME) 1*1024kB (E) 2*2048kB (ME) 0*4096kB = 7736kB
      [  845.633700] Node 0 DMA32: 916*4kB (UME) 559*8kB (UME) 471*16kB (UME) 268*32kB (UME) 162*64kB (UE) 65*128kB (UM) 7*256kB (UM) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 44728kB
      [  845.633713] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
      [  845.633720] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
      [  845.633721] 13843 total pagecache pages
      [  845.633722] 0 pages in swap cache
      [  845.633722] Swap cache stats: add 10178, delete 10178, find 72/95
      [  845.633723] Free swap  = 14297524kB
      [  845.633725] Total swap = 14338044kB
      [  845.633726] 524184 pages RAM
      [  845.633726] 0 pages HighMem/MovableOnly
      [  845.633726] 45015 pages reserved
      [  845.633727] 0 pages hwpoisoned
      [  845.633727] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
      [  845.633824] [  361]     0   361    10933      686      24       3     1623             0 systemd-journal
      [  845.633827] [  400]   495   400    13124      940      29       3      116             0 rpcbind
      [  845.633836] [  404]     0   404     9268      710      21       3      218         -1000 systemd-udevd
      [  845.633839] [  479]     0   479     4814      612      14       3       58             0 irqbalance
      [  845.633847] [  484]   499   484    13452      876      26       3      146          -900 dbus-daemon
      [  845.633849] [  528]     0   528    25126     1025      50       3      207             0 sssd
      [  845.633851] [  533]     0   533    32270     1881      64       3      288             0 sssd_be
      [  845.633857] [  538]     0   538     7447     1060      19       3      260             0 wickedd-dhcp6
      [  845.633860] [  548]     0   548    25845     1531      54       4      180             0 sssd_nss
      [  845.633862] [  549]     0   549    20713     1198      45       3      175             0 sssd_pam
      [  845.633863] [  550]     0   550    19112     1161      43       3      166             0 sssd_ssh
      [  845.633870] [  552]     0   552     7448     1028      18       3      256             0 wickedd-auto4
      [  845.633876] [  553]     0   553     7448     1085      20       3      265             0 wickedd-dhcp4
      [  845.633880] [  556]     0   556    84318      905      38       3      269             0 rsyslogd
      [  845.633927] [  761]     0   761     7480     1056      20       3      287             0 wickedd
      [  845.633934] [  764]     0   764     7455     1032      18       3      276             0 wickedd-nanny
      [  845.633939] [ 1418]     0  1418     2141      422      10       3       40             0 xinetd
      [  845.633944] [ 1461]    74  1461     8408      974      17       3      164             0 ntpd
      [  845.633953] [ 1473]    74  1473     9461      590      18       3      153             0 ntpd
      [  845.633957] [ 1477]     0  1477    16586     1539      35       3      180         -1000 sshd
      [  845.633964] [ 1492]   493  1492    55352      616      20       3      231             0 munged
      [  845.633972] [ 1517]     0  1517     1664      438       8       3       30             0 agetty
      [  845.633977] [ 1518]     0  1518     1664      419       9       3       29             0 agetty
      [  845.633981] [ 1534]     0  1534   147220     1574      59       3      347             0 automount
      [  845.633986] [ 1570]     0  1570     5513      629      16       3       64             0 systemd-logind
      [  845.633988] [ 1809]     0  1809     8861      822      20       3      109             0 master
      [  845.633990] [ 1820]    51  1820    12439     1046      25       3      108             0 pickup
      [  845.633992] [ 1823]    51  1823    12536     1354      26       3      174             0 qmgr
      [  845.633994] [ 1864]     0  1864     5197      532      18       3      150             0 cron
      [  845.634043] [15714]     0 15714    17465      669      35       3      174             0 in.mrshd
      [  845.634047] [15715]     0 15715     2894      572      10       3       77             0 bash
      [  845.634051] [15720]     0 15720     2894      427      10       3       78             0 bash
      [  845.634053] [15721]     0 15721     3034      585      12       3      219             0 run_dd.sh
      [  845.634057] [16387]    51 16387    12675     1323      25       3      335             0 trivial-rewrite
      [  845.634059] [16388]    51 16388    16918     1732      35       3      244             0 smtp
      [  845.634064] [16434]     0 16434     1062      182       8       3       26             0 dd
      [  845.634068] [16437]    51 16437    12447     1022      24       3      109             0 bounce
      [  845.634074] Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled
                     
      [  845.634075] CPU: 1 PID: 1461 Comm: ntpd Tainted: G           OE   N  4.4.143-94.47-default #1
      [  845.634076] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [  845.634078]  0000000000000000 ffffffff8132ad80 ffffffff81a28298 ffff88007be938c8
      [  845.634079]  ffffffff81191f31 0000000000000010 ffff88007be938d8 ffff88007be93878
      [  845.634081]  000000000000309f ffffffff81a2c56b 0000000000000000 0000000000000000
      [  845.634081] Call Trace:
      [  845.634087]  [<ffffffff81019ac9>] dump_trace+0x59/0x340
      [  845.634090]  [<ffffffff81019e9a>] show_stack_log_lvl+0xea/0x170
      [  845.634092]  [<ffffffff8101ac71>] show_stack+0x21/0x40
      [  845.634095]  [<ffffffff8132ad80>] dump_stack+0x5c/0x7c
      [  845.634101]  [<ffffffff81191f31>] panic+0xd2/0x232
      [  845.634104]  [<ffffffff811980a0>] check_panic_on_oom+0x50/0x50
      [  845.634106]  [<ffffffff8119821a>] out_of_memory+0x17a/0x4a0
      [  845.634112]  [<ffffffff8119cc48>] __alloc_pages_nodemask+0xaf8/0xb70
      [  845.634116]  [<ffffffff811e6cc4>] alloc_pages_vma+0xa4/0x220
      [  845.634119]  [<ffffffff811d70f0>] __read_swap_cache_async+0xf0/0x150
      [  845.634123]  [<ffffffff811d7164>] read_swap_cache_async+0x14/0x30
      [  845.634125]  [<ffffffff811d727d>] swapin_readahead+0xfd/0x190
      [  845.634128]  [<ffffffff811c3771>] handle_pte_fault+0x12b1/0x1670
      [  845.634132]  [<ffffffff811c56aa>] handle_mm_fault+0x2fa/0x640
      [  845.634135]  [<ffffffff81067d7a>] __do_page_fault+0x23a/0x4b0
      [  845.634139]  [<ffffffff8106809c>] trace_do_page_fault+0x3c/0x120
      [  845.634141]  [<ffffffff8161da62>] async_page_fault+0x32/0x60
      [  845.636233] DWARF2 unwinder stuck at async_page_fault+0x32/0x60
      [  845.636233] 
      [  845.636234] Leftover inexact backtrace:
                     
      [  845.636236]  [<ffffffff81338d61>] ? __clear_user+0x21/0x50
      [  845.636238]  [<ffffffff810230f2>] ? copy_fpstate_to_sigframe+0x112/0x1a0
      [  845.636239]  [<ffffffff810176d1>] ? do_signal+0x511/0x5b0
      [  845.636241]  [<ffffffff81067d9a>] ? __do_page_fault+0x25a/0x4b0
      [  845.636243]  [<ffffffff8107bf4e>] ? exit_to_usermode_loop+0x70/0xc2
      [  845.636246]  [<ffffffff81003ae5>] ? syscall_return_slowpath+0x85/0xa0
      [  845.636248]  [<ffffffff8161aa3a>] ? int_ret_from_sys_call+0x25/0xa3
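
      The Mem-Info block above already shows where the memory went: nearly all of the ~1.9 GB of managed RAM is sitting in (mostly inactive) file page cache while anonymous memory is negligible, yet both zones report "all_unreclaimable? yes". As a minimal sketch (Python; the counts are copied from the Mem-Info lines above, and each page is 4 KiB), the counters convert to MiB as follows:

      # Convert the page counters from the Mem-Info dump above into MiB.
      PAGE_KB = 4  # page size on this x86_64 node, in KiB

      meminfo_pages = {
          "active_anon": 25,
          "inactive_anon": 41,
          "active_file": 69559,
          "inactive_file": 371186,
          "slab_reclaimable": 2733,
          "slab_unreclaimable": 8762,
          "free": 13103,
      }

      for name, pages in sorted(meminfo_pages.items(), key=lambda kv: -kv[1]):
          print(f"{name:20s} {pages * PAGE_KB / 1024:8.1f} MiB")

      # inactive_file works out to ~1450 MiB and active_file to ~272 MiB, so
      # roughly 1.7 GiB of the ~1.9 GB managed (15904 kB + 1900772 kB) is file
      # page cache, while anonymous pages total well under 1 MiB.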
      

      In the client (vm3) console log, we can see where the vmcore is located:

      [  782.297602] Lustre: Evicted from MGS (at MGC10.9.6.25@tcp_1) after server handle changed from 0x66e2519c6be9cc2 to 0x89680f107ea4b814
      [  782.299262] Lustre: MGC10.9.6.25@tcp: Connection restored to MGC10.9.6.25@tcp_1 (at 10.9.6.26@tcp)
      [  782.362888] LustreError: 13367:0:(client.c:3000:ptlrpc_replay_interpret()) @@@ status 301, old was 0  req@ffff88006690c940 x1611617063142688/t4294967305(4294967305) o101->lustre-MDT0000-mdc-ffff88007bb5e800@10.9.6.26@tcp:12/10 lens 952/560 e 0 to 0 dl 1536957913 ref 2 fl Interpret:RP/4/0 rc 301/301
      [  845.630602] ntpd invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=0, order=0, oom_score_adj=0
      [  845.630613] ntpd cpuset=/ mems_allowed=0
      [  845.630628] CPU: 1 PID: 1461 Comm: ntpd Tainted: G           OE   N  4.4.143-94.47-default #1
      [ [    2.826867] RPC: Registered named UNIX socket transport module.
      [    2.826869] RPC: Registered udp transport module.
      …
      The dumpfile is saved to /mnt/trevis-2.trevis.whamcloud.com/export/scratch/dumps/trevis-45vm3.trevis.whamcloud.com/10.9.6.21-2018-09-14-13:48/vmcore.
      
      makedumpfile Completed.
      -------------------------------------------------------------------------------
      

      All failures are seen in testing with SLES12 SP3 servers and clients, and with SLES12 SP3 clients against CentOS 7 servers.
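
      Note that the OOM kill only takes the whole client down because panic_on_oom is set on the node (the panic line above reads "system-wide panic_on_oom is enabled"). As a minimal sketch of a quick check on a node (Python, assuming the standard procfs path; this is not part of any test script):

      # Print the sysctl that turns an OOM kill into a full kernel panic.
      # 0 = kill a task and continue, 1 or 2 = panic the node (see
      # Documentation/sysctl/vm.txt for the exact semantics).
      with open("/proc/sys/vm/panic_on_oom") as f:
          print("vm.panic_on_oom =", f.read().strip())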

      We’ve seen this crash a few times in the past:
      https://testing.whamcloud.com/test_sets/9a10e716-b882-11e8-b86b-52540065bddc
      https://testing.whamcloud.com/test_sets/74705770-b9f5-11e8-8c12-52540065bddc
      https://testing.whamcloud.com/test_sets/6f5b3134-bc38-11e8-a7de-52540065bddc

    People

      Assignee: WC Triage (wc-triage)
      Reporter: James Nunez (Inactive) (jamesanunez)
      Votes: 0
      Watchers: 3
