Lustre / LU-11228

recovery-mds-scale test failover_mds crashes with ‘socknal_sd00_01: page allocation failure: order:0, mode:0x1284020(GFP_ATOMIC|__GFP_COMP|__GFP_NOTRACK)’


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Affects Version/s: Lustre 2.10.5, Lustre 2.12.8
    • Severity: 3

    Description

      recovery-mds-scale test_failover_mds crashes. The following are the last lines seen in the test_log:

      ==== Checking the clients loads AFTER failover -- failure NOT OK
      14:03:26 (1533416606) waiting for trevis-13vm3 network 5 secs ...
      14:03:26 (1533416606) network interface is UP
      CMD: trevis-13vm3 rc=0;
      			val=\$(/usr/sbin/lctl get_param -n catastrophe 2>&1);
      			if [[ \$? -eq 0 && \$val -ne 0 ]]; then
      				echo \$(hostname -s): \$val;
      				rc=\$val;
      			fi;
      			exit \$rc
      CMD: trevis-13vm3 ps auxwww | grep -v grep | grep -q run_dd.sh
      14:03:26 (1533416606) waiting for trevis-13vm4 network 5 secs ...
      14:03:26 (1533416606) network interface is UP
      CMD: trevis-13vm4 rc=0;
      			val=\$(/usr/sbin/lctl get_param -n catastrophe 2>&1);
      			if [[ \$? -eq 0 && \$val -ne 0 ]]; then
      				echo \$(hostname -s): \$val;
      				rc=\$val;
      			fi;
      			exit \$rc
      CMD: trevis-13vm4 ps auxwww | grep -v grep | grep -q run_tar.sh
      mds1 has failed over 2 times, and counting...
      sleeping 1100 seconds... 
      

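      For readability, the per-client health check that the framework runs over ssh (shown above with backslash escaping for the remote invocation, alongside a ps check that the run_dd.sh / run_tar.sh loads are still running) boils down to the following. This is a sketch of the same logic that can be run directly on a client node, assuming lctl is installed at /usr/sbin/lctl:

      rc=0
      # 'catastrophe' is set non-zero once the node has hit a fatal Lustre
      # error (LBUG); if the read succeeds and the value is non-zero, report
      # the hostname and return the value as the exit code.
      val=$(/usr/sbin/lctl get_param -n catastrophe 2>&1)
      if [[ $? -eq 0 && $val -ne 0 ]]; then
          echo "$(hostname -s): $val"
          rc=$val
      fi
      exit $rc
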
      For the crash at https://testing.whamcloud.com/test_sets/9ce4d6de-9896-11e8-a9f7-52540065bddc, we see the following stack trace in the client (vm3) console log:

      [ 1852.682252] Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
      [ 1852.962495] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 has failed over 2 times, and counting...
      [ 1853.137368] Lustre: lustre-MDT0000-mdc-ffff88007a72e000: Connection restored to 10.9.4.152@tcp (at 10.9.4.152@tcp)
      [ 1853.259454] Lustre: DEBUG MARKER: mds1 has failed over 2 times, and counting...
      [ 1892.312595] socknal_sd00_01: page allocation failure: order:0, mode:0x1284020(GFP_ATOMIC|__GFP_COMP|__GFP_NOTRACK)
      [ 1892.312602] CPU: 0 PID: 13369 Comm: socknal_sd00_01 Tainted: G           OE   N  4.4.132-94.33-default #1
      [ 1892.312603] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [ 1892.312607]  0000000000000000 ffffffff813284a0 0000000000000000 ffff88007b713750
      [ 1892.312608]  ffffffff8119b1e2 0128402000000004 0000000400000000 ffff88007fc1d828
      [ 1892.312609]  ffff88007ffcf490 ffff88007fc1d828 0000000400000000 ffff88007b7137d8
      [ 1892.312610] Call Trace:
      [ 1892.312669]  [<ffffffff81019b59>] dump_trace+0x59/0x340
      [ 1892.312673]  [<ffffffff81019f2a>] show_stack_log_lvl+0xea/0x170
      [ 1892.312675]  [<ffffffff8101ad01>] show_stack+0x21/0x40
      [ 1892.312686]  [<ffffffff813284a0>] dump_stack+0x5c/0x7c
      [ 1892.312707]  [<ffffffff8119b1e2>] warn_alloc_failed+0xe2/0x150
      [ 1892.312718]  [<ffffffff8119b659>] __alloc_pages_nodemask+0x409/0xb80
      [ 1892.312730]  [<ffffffff811ea06d>] kmem_getpages+0x4d/0xf0
      [ 1892.312737]  [<ffffffff811eb8d5>] fallback_alloc+0x205/0x260
      [ 1892.312741]  [<ffffffff811ec3f6>] kmem_cache_alloc_trace+0x1f6/0x460
      [ 1892.312748]  [<ffffffff812384fb>] wb_start_writeback+0x3b/0xe0
      [ 1892.312756]  [<ffffffff81238a96>] wakeup_flusher_threads+0xc6/0x150
      [ 1892.312759]  [<ffffffff811a9571>] do_try_to_free_pages+0x241/0x450
      [ 1892.312765]  [<ffffffff811a983a>] try_to_free_pages+0xba/0x170
      [ 1892.312767]  [<ffffffff8119b843>] __alloc_pages_nodemask+0x5f3/0xb80
      [ 1892.312770]  [<ffffffff811ea06d>] kmem_getpages+0x4d/0xf0
      [ 1892.312772]  [<ffffffff811eb869>] fallback_alloc+0x199/0x260
      [ 1892.312775]  [<ffffffff811ec6e0>] kmem_cache_alloc_node_trace+0x80/0x490
      [ 1892.312792]  [<ffffffff815075ce>] __kmalloc_reserve.isra.34+0x2e/0x80
      [ 1892.312806]  [<ffffffff81508b13>] __alloc_skb+0x73/0x270
      [ 1892.312815]  [<ffffffff81567194>] sk_stream_alloc_skb+0x44/0x170
      [ 1892.312824]  [<ffffffff815676b8>] tcp_sendpage+0x3f8/0x610
      [ 1892.312863]  [<ffffffffa0be7df9>] ksocknal_lib_send_kiov+0x99/0x240 [ksocklnd]
      [ 1892.312885]  [<ffffffffa0be1da7>] ksocknal_process_transmit+0x2b7/0xb60 [ksocklnd]
      [ 1892.312891]  [<ffffffffa0be64b1>] ksocknal_scheduler+0x231/0x660 [ksocklnd]
      [ 1892.312901]  [<ffffffff8109ebc9>] kthread+0xc9/0xe0
      [ 1892.312919]  [<ffffffff81617805>] ret_from_fork+0x55/0x80
      [ 1892.315523] DWARF2 unwinder stuck at ret_from_fork+0x55/0x80
      [ 1892.315524] 
      [ 1892.315527] Leftover inexact backtrace:
      [ 1892.315527] 
      [ 1892.315534]  [<ffffffff8109eb00>] ? kthread_park+0x50/0x50
      [ 1892.315535] Mem-Info:
      [ 1892.315539] active_anon:1643 inactive_anon:1687 isolated_anon:0
      [ 1892.315539]  active_file:127143 inactive_file:312793 isolated_file:0
      [ 1892.315539]  unevictable:20 dirty:0 writeback:31 unstable:0
      [ 1892.315539]  slab_reclaimable:2757 slab_unreclaimable:21743
      [ 1892.315539]  mapped:7385 shmem:2179 pagetables:887 bounce:0
      [ 1892.315539]  free:7 free_pcp:0 free_cma:0
      [ 1892.315551] Node 0 DMA free:40kB min:376kB low:468kB high:560kB active_anon:28kB inactive_anon:96kB active_file:1408kB inactive_file:5712kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:456kB shmem:104kB slab_reclaimable:28kB slab_unreclaimable:7892kB kernel_stack:0kB pagetables:28kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:1191008 all_unreclaimable? yes
      [ 1892.315553] lowmem_reserve[]: 0 1836 1836 1836 1836
      [ 1892.315560] Node 0 DMA32 free:0kB min:44676kB low:55844kB high:67012kB active_anon:6544kB inactive_anon:6652kB active_file:507164kB inactive_file:1245460kB unevictable:80kB isolated(anon):0kB isolated(file):0kB present:2080744kB managed:1900784kB mlocked:80kB dirty:0kB writeback:124kB mapped:29084kB shmem:8612kB slab_reclaimable:11000kB slab_unreclaimable:79080kB kernel_stack:2672kB pagetables:3520kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:16195004 all_unreclaimable? yes
      [ 1892.315562] lowmem_reserve[]: 0 0 0 0 0
      [ 1892.315567] Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
      [ 1892.315571] Node 0 DMA32: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
      [ 1892.315582] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
      [ 1892.315591] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
      [ 1892.315591] 14900 total pagecache pages
      [ 1892.315593] 60 pages in swap cache
      [ 1892.315595] Swap cache stats: add 5726, delete 5666, find 66/70
      [ 1892.315596] Free swap  = 14315232kB
      [ 1892.315596] Total swap = 14338044kB
      [ 1892.315597] 524184 pages RAM
      [ 1892.315597] 0 pages HighMem/MovableOnly
      [ 1892.315598] 45012 pages reserved
      [ 1892.315598] 0 pages hwpoisoned
      

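      Reading the trace: the reported order-0 GFP_ATOMIC failure is a small allocation made by wb_start_writeback() while direct reclaim (try_to_free_pages) was already running on behalf of an skb allocation in the socklnd send path (ksocknal_lib_send_kiov -> tcp_sendpage -> sk_stream_alloc_skb -> __alloc_skb), and the Mem-Info dump shows 0kB free in both the DMA and DMA32 zones with all_unreclaimable set. In other words, the dd/tar client loads have driven the roughly 2 GB client VM completely out of free pages, so even an atomic order-0 allocation cannot be satisfied. For anyone reproducing this, a simple way to watch the client approach that state (a suggestion of mine, not something recovery-mds-scale does) is a monitoring loop like:

      # Hypothetical monitoring loop, not part of the test framework: sample
      # overall memory and the per-order free lists on the client while the
      # run_dd.sh / run_tar.sh loads are running, to see the zones drain.
      while sleep 5; do
          date
          free -m
          cat /proc/buddyinfo    # free pages per zone, grouped by order
      done
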
      There are many recent recovery-mds-scale crashes, but the following have a stack trace similar to the one described above:
      https://testing.whamcloud.com/test_sets/7c353280-9882-11e8-b0aa-52540065bddc
      https://testing.whamcloud.com/test_sets/739182d0-9645-11e8-a9f7-52540065bddc

    People

      Assignee: Zhenyu Xu (bobijam)
      Reporter: James Nunez (jamesanunez) (Inactive)
      Votes: 0
      Watchers: 5
