Type: Bug
Resolution: Unresolved
Priority: Minor
Fix Version/s: None
Affects Version/s: Lustre 2.10.5, Lustre 2.12.8
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

recovery-mds-scale test_failover_mds crashes. The last lines in the test_log are:

==== Checking the clients loads AFTER failover -- failure NOT OK
14:03:26 (1533416606) waiting for trevis-13vm3 network 5 secs ...
14:03:26 (1533416606) network interface is UP
CMD: trevis-13vm3 rc=0;
			val=\$(/usr/sbin/lctl get_param -n catastrophe 2>&1);
			if [[ \$? -eq 0 && \$val -ne 0 ]]; then
				echo \$(hostname -s): \$val;
				rc=\$val;
			fi;
			exit \$rc
CMD: trevis-13vm3 ps auxwww | grep -v grep | grep -q run_dd.sh
14:03:26 (1533416606) waiting for trevis-13vm4 network 5 secs ...
14:03:26 (1533416606) network interface is UP
CMD: trevis-13vm4 rc=0;
			val=\$(/usr/sbin/lctl get_param -n catastrophe 2>&1);
			if [[ \$? -eq 0 && \$val -ne 0 ]]; then
				echo \$(hostname -s): \$val;
				rc=\$val;
			fi;
			exit \$rc
CMD: trevis-13vm4 ps auxwww | grep -v grep | grep -q run_tar.sh
mds1 has failed over 2 times, and counting...
sleeping 1100 seconds...

For the crash at https://testing.whamcloud.com/test_sets/9ce4d6de-9896-11e8-a9f7-52540065bddc, the client (vm3) console log shows the following stack trace:

[ 1852.682252] Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
[ 1852.962495] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 has failed over 2 times, and counting...
[ 1853.137368] Lustre: lustre-MDT0000-mdc-ffff88007a72e000: Connection restored to 10.9.4.152@tcp (at 10.9.4.152@tcp)
[ 1853.259454] Lustre: DEBUG MARKER: mds1 has failed over 2 times, and counting...
[ 1892.312595] socknal_sd00_01: page allocation failure: order:0, mode:0x1284020(GFP_ATOMIC|__GFP_COMP|__GFP_NOTRACK)
[ 1892.312602] CPU: 0 PID: 13369 Comm: socknal_sd00_01 Tainted: G           OE   N  4.4.132-94.33-default #1
[ 1892.312603] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 1892.312607]  0000000000000000 ffffffff813284a0 0000000000000000 ffff88007b713750
[ 1892.312608]  ffffffff8119b1e2 0128402000000004 0000000400000000 ffff88007fc1d828
[ 1892.312609]  ffff88007ffcf490 ffff88007fc1d828 0000000400000000 ffff88007b7137d8
[ 1892.312610] Call Trace:
[ 1892.312669]  [<ffffffff81019b59>] dump_trace+0x59/0x340
[ 1892.312673]  [<ffffffff81019f2a>] show_stack_log_lvl+0xea/0x170
[ 1892.312675]  [<ffffffff8101ad01>] show_stack+0x21/0x40
[ 1892.312686]  [<ffffffff813284a0>] dump_stack+0x5c/0x7c
[ 1892.312707]  [<ffffffff8119b1e2>] warn_alloc_failed+0xe2/0x150
[ 1892.312718]  [<ffffffff8119b659>] __alloc_pages_nodemask+0x409/0xb80
[ 1892.312730]  [<ffffffff811ea06d>] kmem_getpages+0x4d/0xf0
[ 1892.312737]  [<ffffffff811eb8d5>] fallback_alloc+0x205/0x260
[ 1892.312741]  [<ffffffff811ec3f6>] kmem_cache_alloc_trace+0x1f6/0x460
[ 1892.312748]  [<ffffffff812384fb>] wb_start_writeback+0x3b/0xe0
[ 1892.312756]  [<ffffffff81238a96>] wakeup_flusher_threads+0xc6/0x150
[ 1892.312759]  [<ffffffff811a9571>] do_try_to_free_pages+0x241/0x450
[ 1892.312765]  [<ffffffff811a983a>] try_to_free_pages+0xba/0x170
[ 1892.312767]  [<ffffffff8119b843>] __alloc_pages_nodemask+0x5f3/0xb80
[ 1892.312770]  [<ffffffff811ea06d>] kmem_getpages+0x4d/0xf0
[ 1892.312772]  [<ffffffff811eb869>] fallback_alloc+0x199/0x260
[ 1892.312775]  [<ffffffff811ec6e0>] kmem_cache_alloc_node_trace+0x80/0x490
[ 1892.312792]  [<ffffffff815075ce>] __kmalloc_reserve.isra.34+0x2e/0x80
[ 1892.312806]  [<ffffffff81508b13>] __alloc_skb+0x73/0x270
[ 1892.312815]  [<ffffffff81567194>] sk_stream_alloc_skb+0x44/0x170
[ 1892.312824]  [<ffffffff815676b8>] tcp_sendpage+0x3f8/0x610
[ 1892.312863]  [<ffffffffa0be7df9>] ksocknal_lib_send_kiov+0x99/0x240 [ksocklnd]
[ 1892.312885]  [<ffffffffa0be1da7>] ksocknal_process_transmit+0x2b7/0xb60 [ksocklnd]
[ 1892.312891]  [<ffffffffa0be64b1>] ksocknal_scheduler+0x231/0x660 [ksocklnd]
[ 1892.312901]  [<ffffffff8109ebc9>] kthread+0xc9/0xe0
[ 1892.312919]  [<ffffffff81617805>] ret_from_fork+0x55/0x80
[ 1892.315523] DWARF2 unwinder stuck at ret_from_fork+0x55/0x80
[ 1892.315524] 
[ 1892.315527] Leftover inexact backtrace:
[ 1892.315527] 
[ 1892.315534]  [<ffffffff8109eb00>] ? kthread_park+0x50/0x50
[ 1892.315535] Mem-Info:
[ 1892.315539] active_anon:1643 inactive_anon:1687 isolated_anon:0
[ 1892.315539]  active_file:127143 inactive_file:312793 isolated_file:0
[ 1892.315539]  unevictable:20 dirty:0 writeback:31 unstable:0
[ 1892.315539]  slab_reclaimable:2757 slab_unreclaimable:21743
[ 1892.315539]  mapped:7385 shmem:2179 pagetables:887 bounce:0
[ 1892.315539]  free:7 free_pcp:0 free_cma:0
[ 1892.315551] Node 0 DMA free:40kB min:376kB low:468kB high:560kB active_anon:28kB inactive_anon:96kB active_file:1408kB inactive_file:5712kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:456kB shmem:104kB slab_reclaimable:28kB slab_unreclaimable:7892kB kernel_stack:0kB pagetables:28kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:1191008 all_unreclaimable? yes
[ 1892.315553] lowmem_reserve[]: 0 1836 1836 1836 1836
[ 1892.315560] Node 0 DMA32 free:0kB min:44676kB low:55844kB high:67012kB active_anon:6544kB inactive_anon:6652kB active_file:507164kB inactive_file:1245460kB unevictable:80kB isolated(anon):0kB isolated(file):0kB present:2080744kB managed:1900784kB mlocked:80kB dirty:0kB writeback:124kB mapped:29084kB shmem:8612kB slab_reclaimable:11000kB slab_unreclaimable:79080kB kernel_stack:2672kB pagetables:3520kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:16195004 all_unreclaimable? yes
[ 1892.315562] lowmem_reserve[]: 0 0 0 0 0
[ 1892.315567] Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
[ 1892.315571] Node 0 DMA32: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
[ 1892.315582] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[ 1892.315591] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 1892.315591] 14900 total pagecache pages
[ 1892.315593] 60 pages in swap cache
[ 1892.315595] Swap cache stats: add 5726, delete 5666, find 66/70
[ 1892.315596] Free swap  = 14315232kB
[ 1892.315596] Total swap = 14338044kB
[ 1892.315597] 524184 pages RAM
[ 1892.315597] 0 pages HighMem/MovableOnly
[ 1892.315598] 45012 pages reserved
[ 1892.315598] 0 pages hwpoisoned
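
The Mem-Info dump above reports both zones far below their min watermark (DMA: free 40kB against min 376kB; DMA32: free 0kB against min 44676kB), which is consistent with the order:0 allocation failure in socknal_sd00_01. When comparing this console log against the other crashes, a small script can pull that comparison out automatically. The following is only a sketch: the function name `zones_below_min` and the regular expression are illustrative assumptions, not part of the Lustre test framework, and the sample lines are shortened copies of the zone lines above.

```python
import re

# Matches the per-zone summary lines of a kernel Mem-Info dump, e.g.
# "Node 0 DMA32 free:0kB min:44676kB low:55844kB ..."
# (assumed format, taken from the console log in this ticket)
ZONE_RE = re.compile(
    r"Node (?P<node>\d+) (?P<zone>\w+) free:(?P<free>\d+)kB min:(?P<min>\d+)kB"
)


def zones_below_min(log_text):
    """Return (zone, free_kB, min_kB) for each zone whose free memory
    is below its min watermark."""
    starved = []
    for m in ZONE_RE.finditer(log_text):
        free_kb = int(m.group("free"))
        min_kb = int(m.group("min"))
        if free_kb < min_kb:
            starved.append((m.group("zone"), free_kb, min_kb))
    return starved


# Shortened copies of the two zone lines from the console log above.
sample = """\
[ 1892.315551] Node 0 DMA free:40kB min:376kB low:468kB high:560kB
[ 1892.315560] Node 0 DMA32 free:0kB min:44676kB low:55844kB high:67012kB
"""

for zone, free_kb, min_kb in zones_below_min(sample):
    print(f"{zone}: free={free_kb}kB is below min={min_kb}kB")
```

Running the same function over the console logs of the other test sets would show whether they hit the same low-memory condition before the crash.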

There are many recent recovery-mds-scale crashes, but the following have a stack trace similar to the one above:
https://testing.whamcloud.com/test_sets/7c353280-9882-11e8-b0aa-52540065bddc
https://testing.whamcloud.com/test_sets/739182d0-9645-11e8-a9f7-52540065bddc
