Lustre / LU-11794

recovery-mds-scale test_failover_mds crashes with '<hostname> crashed during recovery-mds-scale test_failover_mds'


Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.0
    • Severity: 3

    Description

      recovery-mds-scale test_failover_mds crashes with 'trevis-16vm3 crashed during recovery-mds-scale test_failover_mds'

      Looking at the kernel crash from https://testing.whamcloud.com/test_sets/4db0c228-fd1b-11e8-8512-52540065bddc, we see:

      [  733.722957] Lustre: DEBUG MARKER: mds1 has failed over 1 times, and counting...
      [  746.732703] Lustre: Evicted from MGS (at MGC10.9.4.191@tcp_1) after server handle changed from 0xd73c2b185ff898b4 to 0x7b37c6365056206
      [  746.734484] Lustre: MGC10.9.4.191@tcp: Connection restored to MGC10.9.4.191@tcp_1 (at 10.9.4.192@tcp)
      [  746.736017] LustreError: 13384:0:(client.c:3023:ptlrpc_replay_interpret()) @@@ status 301, old was 0  req@ffff88006542b980 x1619485946545808/t4294967305(4294967305) o101->lustre-MDT0000-mdc-ffff88007ae8a000@10.9.4.192@tcp:12/10 lens 712/560 e 0 to 0 dl 1544462286 ref 2 fl Interpret:RP/4/0 rc 301/301
      [  747.973965] Lustre: lustre-MDT0000-mdc-ffff88007ae8a000: Connection restored to 10.9.4.192@tcp (at 10.9.4.192@tcp)
      [  794.635505] Lustre: 13385:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1544462204/real 1544462204]  req@ffff88003aa1acc0 x1619485946562912/t0(0) o400->lustre-MDT0000-mdc-ffff88007ae8a000@10.9.4.192@tcp:12/10 lens 224/224 e 0 to 1 dl 1544462212 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      [  794.635528] Lustre: 13385:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 1 previous similar message
      [ 1438.045823] kworker/0:2H: page allocation failure: order:0, mode:0x1284020(GFP_ATOMIC|__GFP_COMP|__GFP_NOTRACK)
      [ 1438.045839] CPU: 0 PID: 1516 Comm: kworker/0:2H Tainted: G           OE   N  4.4.162-94.69-default #1
      [ 1438.045840] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [ 1438.045875] Workqueue: kblockd blk_mq_run_work_fn
      [ 1438.045878]  0000000000000000 ffffffff8132cdc0 0000000000000000 ffff88007b6479a8
      [ 1438.045879]  ffffffff8119ddc2 0128402000000030 0000000000000046 002c422000000000
      [ 1438.045881]  00000001005d4200 0000000000000000 0000000000000000 0000000000000020
      [ 1438.045881] Call Trace:
      [ 1438.045928]  [<ffffffff81019b09>] dump_trace+0x59/0x340
      [ 1438.045934]  [<ffffffff81019eda>] show_stack_log_lvl+0xea/0x170
      [ 1438.045936]  [<ffffffff8101acb1>] show_stack+0x21/0x40
      [ 1438.045944]  [<ffffffff8132cdc0>] dump_stack+0x5c/0x7c
      [ 1438.045964]  [<ffffffff8119ddc2>] warn_alloc_failed+0xe2/0x150
      [ 1438.045975]  [<ffffffff8119e23b>] __alloc_pages_nodemask+0x40b/0xb70
      [ 1438.045988]  [<ffffffff811edbed>] kmem_getpages+0x4d/0xf0
      [ 1438.045995]  [<ffffffff811ef3fb>] fallback_alloc+0x19b/0x240
      [ 1438.045998]  [<ffffffff811f13fa>] __kmalloc+0x26a/0x4b0
      [ 1438.046022]  [<ffffffffa0184708>] alloc_indirect.isra.4+0x18/0x50 [virtio_ring]
      [ 1438.046031]  [<ffffffffa0184a02>] virtqueue_add_sgs+0x2c2/0x410 [virtio_ring]
      [ 1438.046044]  [<ffffffffa0053417>] __virtblk_add_req+0xa7/0x170 [virtio_blk]
      [ 1438.046048]  [<ffffffffa00535fa>] virtio_queue_rq+0x11a/0x270 [virtio_blk]
      [ 1438.046051]  [<ffffffff813073e5>] blk_mq_dispatch_rq_list+0xd5/0x1e0
      [ 1438.046057]  [<ffffffff8130761e>] blk_mq_process_rq_list+0x12e/0x150
      [ 1438.046065]  [<ffffffff8109ad14>] process_one_work+0x154/0x420
      [ 1438.046073]  [<ffffffff8109b906>] worker_thread+0x116/0x4a0
      [ 1438.046079]  [<ffffffff810a0e29>] kthread+0xc9/0xe0
      [ 1438.046099]  [<ffffffff8161e1f5>] ret_from_fork+0x55/0x80
      [ 1438.048865] DWARF2 unwinder stuck at ret_from_fork+0x55/0x80
      [ 1438.048865] 
      [ 1438.048868] Leftover inexact backtrace:
                     
      [ 1438.048877]  [<ffffffff810a0d60>] ? kthread_park+0x50/0x50
      [ 1438.048878] Mem-Info:
      [ 1438.048882] active_anon:961 inactive_anon:982 isolated_anon:64
                      active_file:179814 inactive_file:247899 isolated_file:32
                      unevictable:20 dirty:3366 writeback:60306 unstable:0
                      slab_reclaimable:3317 slab_unreclaimable:29811
                      mapped:7232 shmem:752 pagetables:890 bounce:0
                      free:2433 free_pcp:169 free_cma:0
      [ 1438.048896] Node 0 DMA free:364kB min:376kB low:468kB high:560kB active_anon:76kB inactive_anon:80kB active_file:1836kB inactive_file:5360kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15904kB mlocked:0kB dirty:40kB writeback:920kB mapped:292kB shmem:80kB slab_reclaimable:68kB slab_unreclaimable:7712kB kernel_stack:16kB pagetables:16kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:45812 all_unreclaimable? yes
      [ 1438.048898] lowmem_reserve[]: 0 1843 1843 1843 1843
      [ 1438.048907] Node 0 DMA32 free:9368kB min:44676kB low:55844kB high:67012kB active_anon:3768kB inactive_anon:3848kB active_file:717420kB inactive_file:986236kB unevictable:80kB isolated(anon):256kB isolated(file):128kB present:2080744kB managed:1900752kB mlocked:80kB dirty:13424kB writeback:240304kB mapped:28636kB shmem:2928kB slab_reclaimable:13200kB slab_unreclaimable:111532kB kernel_stack:2608kB pagetables:3544kB unstable:0kB bounce:0kB free_pcp:676kB local_pcp:676kB free_cma:0kB writeback_tmp:0kB pages_scanned:103620 all_unreclaimable? no
      [ 1438.048908] lowmem_reserve[]: 0 0 0 0 0
      [ 1438.048915] Node 0 DMA: 2*4kB (M) 0*8kB 0*16kB 1*32kB (U) 1*64kB (U) 0*128kB 1*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 360kB
      [ 1438.048921] Node 0 DMA32: 140*4kB (UME) 69*8kB (UM) 42*16kB (UM) 77*32kB (UM) 29*64kB (UM) 2*128kB (M) 4*256kB (M) 4*512kB (M) 0*1024kB 0*2048kB 0*4096kB = 9432kB
      [ 1438.048932] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
      [ 1438.048939] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
      [ 1438.048940] 76857 total pagecache pages
      [ 1438.048941] 344 pages in swap cache
      [ 1438.048943] Swap cache stats: add 7967, delete 7623, find 157/240
      [ 1438.048944] Free swap  = 14306948kB
      [ 1438.048944] Total swap = 14338044kB
      [ 1438.048945] 524184 pages RAM
      [ 1438.048945] 0 pages HighMem/MovableOnly
      [ 1438.048945] 45020 pages reserved
      [ 1438.048946] 0 pages hwpoisoned
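
      The allocation that fails is the GFP_ATOMIC kmalloc() of a virtio indirect descriptor table (__kmalloc <- alloc_indirect [virtio_ring] <- virtqueue_add_sgs, reached from the blk-mq dispatch path), so it cannot sleep to wait for reclaim or writeback. The Mem-Info dump shows why it fails: almost all of Node 0 DMA32 is page cache with heavy writeback in flight, and free memory (9368kB) is far below the zone's min watermark (44676kB). Below is a minimal sketch of that arithmetic using the numbers from the dump; the watermark relaxation that __zone_watermark_ok() grants atomic callers is approximated here with assumed fractions and is not taken from this log.

      #include <stdio.h>

      /* Values from the "Node 0 DMA32" line of the Mem-Info dump above, in kB. */
      #define DMA32_FREE_KB          9368
      #define DMA32_MIN_KB          44676
      #define DMA32_WRITEBACK_KB   240304

      int main(void)
      {
              /*
               * Rough approximation of the slack __zone_watermark_ok() gives
               * atomic allocations: __GFP_HIGH roughly halves the min watermark
               * and ALLOC_HARDER trims roughly another quarter.  These fractions
               * are assumptions for illustration; the real check lives in
               * mm/page_alloc.c and also subtracts lowmem_reserve.
               */
              int atomic_min_kb = DMA32_MIN_KB / 2;

              atomic_min_kb -= atomic_min_kb / 4;

              printf("free              : %6d kB\n", DMA32_FREE_KB);
              printf("min watermark     : %6d kB\n", DMA32_MIN_KB);
              printf("~atomic watermark : %6d kB\n", atomic_min_kb);
              printf("under writeback   : %6d kB\n", DMA32_WRITEBACK_KB);

              /* 9368kB is below even the relaxed atomic threshold. */
              if (DMA32_FREE_KB < atomic_min_kb)
                      printf("GFP_ATOMIC order-0 request fails: the caller cannot "
                             "wait for writeback to free pages.\n");

              return 0;
      }

      Note that the node is not out of memory overall (swap is almost untouched and most of RAM is page cache); the order-0 atomic request simply caught the DMA32 zone below its watermark while a large amount of writeback (240304kB in DMA32) was outstanding, which is what warn_alloc_failed() reports above.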
      

      There are no other recovery-mds-scale test_failover_mds crashes with the same stack trace in the past month.

            People

              Assignee: WC Triage (wc-triage)
              Reporter: James Nunez (jamesanunez) (Inactive)
