Details
-
Bug
-
Resolution: Unresolved
-
Minor
-
None
-
Lustre 2.10.5, Lustre 2.12.8
-
None
-
3
-
9223372036854775807
Description
recovery-mds-scale test_failover_mds crashes. The following are the last lines seen in the test_log
==== Checking the clients loads AFTER failover -- failure NOT OK 14:03:26 (1533416606) waiting for trevis-13vm3 network 5 secs ... 14:03:26 (1533416606) network interface is UP CMD: trevis-13vm3 rc=0; val=\$(/usr/sbin/lctl get_param -n catastrophe 2>&1); if [[ \$? -eq 0 && \$val -ne 0 ]]; then echo \$(hostname -s): \$val; rc=\$val; fi; exit \$rc CMD: trevis-13vm3 ps auxwww | grep -v grep | grep -q run_dd.sh 14:03:26 (1533416606) waiting for trevis-13vm4 network 5 secs ... 14:03:26 (1533416606) network interface is UP CMD: trevis-13vm4 rc=0; val=\$(/usr/sbin/lctl get_param -n catastrophe 2>&1); if [[ \$? -eq 0 && \$val -ne 0 ]]; then echo \$(hostname -s): \$val; rc=\$val; fi; exit \$rc CMD: trevis-13vm4 ps auxwww | grep -v grep | grep -q run_tar.sh mds1 has failed over 2 times, and counting... sleeping 1100 seconds...
For the crash at https://testing.whamcloud.com/test_sets/9ce4d6de-9896-11e8-a9f7-52540065bddc, we see the following stack trace from the client(vm3) console log
[ 1852.682252] Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh [ 1852.962495] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 has failed over 2 times, and counting... [ 1853.137368] Lustre: lustre-MDT0000-mdc-ffff88007a72e000: Connection restored to 10.9.4.152@tcp (at 10.9.4.152@tcp) [ 1853.259454] Lustre: DEBUG MARKER: mds1 has failed over 2 times, and counting... [ 1892.312595] socknal_sd00_01: page allocation failure: order:0, mode:0x1284020(GFP_ATOMIC|__GFP_COMP|__GFP_NOTRACK) [ 1892.312602] CPU: 0 PID: 13369 Comm: socknal_sd00_01 Tainted: G OE N 4.4.132-94.33-default #1 [ 1892.312603] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 [ 1892.312607] 0000000000000000 ffffffff813284a0 0000000000000000 ffff88007b713750 [ 1892.312608] ffffffff8119b1e2 0128402000000004 0000000400000000 ffff88007fc1d828 [ 1892.312609] ffff88007ffcf490 ffff88007fc1d828 0000000400000000 ffff88007b7137d8 [ 1892.312610] Call Trace: [ 1892.312669] [<ffffffff81019b59>] dump_trace+0x59/0x340 [ 1892.312673] [<ffffffff81019f2a>] show_stack_log_lvl+0xea/0x170 [ 1892.312675] [<ffffffff8101ad01>] show_stack+0x21/0x40 [ 1892.312686] [<ffffffff813284a0>] dump_stack+0x5c/0x7c [ 1892.312707] [<ffffffff8119b1e2>] warn_alloc_failed+0xe2/0x150 [ 1892.312718] [<ffffffff8119b659>] __alloc_pages_nodemask+0x409/0xb80 [ 1892.312730] [<ffffffff811ea06d>] kmem_getpages+0x4d/0xf0 [ 1892.312737] [<ffffffff811eb8d5>] fallback_alloc+0x205/0x260 [ 1892.312741] [<ffffffff811ec3f6>] kmem_cache_alloc_trace+0x1f6/0x460 [ 1892.312748] [<ffffffff812384fb>] wb_start_writeback+0x3b/0xe0 [ 1892.312756] [<ffffffff81238a96>] wakeup_flusher_threads+0xc6/0x150 [ 1892.312759] [<ffffffff811a9571>] do_try_to_free_pages+0x241/0x450 [ 1892.312765] [<ffffffff811a983a>] try_to_free_pages+0xba/0x170 [ 1892.312767] [<ffffffff8119b843>] __alloc_pages_nodemask+0x5f3/0xb80 [ 1892.312770] [<ffffffff811ea06d>] kmem_getpages+0x4d/0xf0 [ 1892.312772] [<ffffffff811eb869>] fallback_alloc+0x199/0x260 [ 1892.312775] [<ffffffff811ec6e0>] kmem_cache_alloc_node_trace+0x80/0x490 [ 1892.312792] [<ffffffff815075ce>] __kmalloc_reserve.isra.34+0x2e/0x80 [ 1892.312806] [<ffffffff81508b13>] __alloc_skb+0x73/0x270 [ 1892.312815] [<ffffffff81567194>] sk_stream_alloc_skb+0x44/0x170 [ 1892.312824] [<ffffffff815676b8>] tcp_sendpage+0x3f8/0x610 [ 1892.312863] [<ffffffffa0be7df9>] ksocknal_lib_send_kiov+0x99/0x240 [ksocklnd] [ 1892.312885] [<ffffffffa0be1da7>] ksocknal_process_transmit+0x2b7/0xb60 [ksocklnd] [ 1892.312891] [<ffffffffa0be64b1>] ksocknal_scheduler+0x231/0x660 [ksocklnd] [ 1892.312901] [<ffffffff8109ebc9>] kthread+0xc9/0xe0 [ 1892.312919] [<ffffffff81617805>] ret_from_fork+0x55/0x80 [ 1892.315523] DWARF2 unwinder stuck at ret_from_fork+0x55/0x80 [ 1892.315524] [ 1892.315527] Leftover inexact backtrace: [ 1892.315527] [ 1892.315534] [<ffffffff8109eb00>] ? kthread_park+0x50/0x50 [ 1892.315535] Mem-Info: [ 1892.315539] active_anon:1643 inactive_anon:1687 isolated_anon:0 [ 1892.315539] active_file:127143 inactive_file:312793 isolated_file:0 [ 1892.315539] unevictable:20 dirty:0 writeback:31 unstable:0 [ 1892.315539] slab_reclaimable:2757 slab_unreclaimable:21743 [ 1892.315539] mapped:7385 shmem:2179 pagetables:887 bounce:0 [ 1892.315539] free:7 free_pcp:0 free_cma:0 [ 1892.315551] Node 0 DMA free:40kB min:376kB low:468kB high:560kB active_anon:28kB inactive_anon:96kB active_file:1408kB inactive_file:5712kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:456kB shmem:104kB slab_reclaimable:28kB slab_unreclaimable:7892kB kernel_stack:0kB pagetables:28kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:1191008 all_unreclaimable? yes [ 1892.315553] lowmem_reserve[]: 0 1836 1836 1836 1836 [ 1892.315560] Node 0 DMA32 free:0kB min:44676kB low:55844kB high:67012kB active_anon:6544kB inactive_anon:6652kB active_file:507164kB inactive_file:1245460kB unevictable:80kB isolated(anon):0kB isolated(file):0kB present:2080744kB managed:1900784kB mlocked:80kB dirty:0kB writeback:124kB mapped:29084kB shmem:8612kB slab_reclaimable:11000kB slab_unreclaimable:79080kB kernel_stack:2672kB pagetables:3520kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:16195004 all_unreclaimable? yes [ 1892.315562] lowmem_reserve[]: 0 0 0 0 0 [ 1892.315567] Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB [ 1892.315571] Node 0 DMA32: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB [ 1892.315582] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB [ 1892.315591] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB [ 1892.315591] 14900 total pagecache pages [ 1892.315593] 60 pages in swap cache [ 1892.315595] Swap cache stats: add 5726, delete 5666, find 66/70 [ 1892.315596] Free swap = 14315232kB [ 1892.315596] Total swap = 14338044kB [ 1892.315597] 524184 pages RAM [ 1892.315597] 0 pages HighMem/MovableOnly [ 1892.315598] 45012 pages reserved [ 1892.315598] 0 pages hwpoisoned
There are many recent recovery-mds-scale crashes, but the following have a similar stack trace as described above
https://testing.whamcloud.com/test_sets/7c353280-9882-11e8-b0aa-52540065bddc
https://testing.whamcloud.com/test_sets/739182d0-9645-11e8-a9f7-52540065bddc