Details
Type: Bug
Resolution: Cannot Reproduce
Priority: Minor
Fix Version/s: None
Affects Version/s: Lustre 2.12.0
Severity: 3
Description
recovery-mds-scale test_failover_mds crashes with 'trevis-16vm3 crashed during recovery-mds-scale test_failover_mds'
Looking at the kernel crash from https://testing.whamcloud.com/test_sets/4db0c228-fd1b-11e8-8512-52540065bddc, we see:
[ 733.722957] Lustre: DEBUG MARKER: mds1 has failed over 1 times, and counting...
[ 746.732703] Lustre: Evicted from MGS (at MGC10.9.4.191@tcp_1) after server handle changed from 0xd73c2b185ff898b4 to 0x7b37c6365056206
[ 746.734484] Lustre: MGC10.9.4.191@tcp: Connection restored to MGC10.9.4.191@tcp_1 (at 10.9.4.192@tcp)
[ 746.736017] LustreError: 13384:0:(client.c:3023:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@ffff88006542b980 x1619485946545808/t4294967305(4294967305) o101->lustre-MDT0000-mdc-ffff88007ae8a000@10.9.4.192@tcp:12/10 lens 712/560 e 0 to 0 dl 1544462286 ref 2 fl Interpret:RP/4/0 rc 301/301
[ 747.973965] Lustre: lustre-MDT0000-mdc-ffff88007ae8a000: Connection restored to 10.9.4.192@tcp (at 10.9.4.192@tcp)
[ 794.635505] Lustre: 13385:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1544462204/real 1544462204] req@ffff88003aa1acc0 x1619485946562912/t0(0) o400->lustre-MDT0000-mdc-ffff88007ae8a000@10.9.4.192@tcp:12/10 lens 224/224 e 0 to 1 dl 1544462212 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[ 794.635528] Lustre: 13385:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 1 previous similar message
[ 1438.045823] kworker/0:2H: page allocation failure: order:0, mode:0x1284020(GFP_ATOMIC|__GFP_COMP|__GFP_NOTRACK)
[ 1438.045839] CPU: 0 PID: 1516 Comm: kworker/0:2H Tainted: G OE N 4.4.162-94.69-default #1
[ 1438.045840] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 1438.045875] Workqueue: kblockd blk_mq_run_work_fn
[ 1438.045878]  0000000000000000 ffffffff8132cdc0 0000000000000000 ffff88007b6479a8
[ 1438.045879]  ffffffff8119ddc2 0128402000000030 0000000000000046 002c422000000000
[ 1438.045881]  00000001005d4200 0000000000000000 0000000000000000 0000000000000020
[ 1438.045881] Call Trace:
[ 1438.045928]  [<ffffffff81019b09>] dump_trace+0x59/0x340
[ 1438.045934]  [<ffffffff81019eda>] show_stack_log_lvl+0xea/0x170
[ 1438.045936]  [<ffffffff8101acb1>] show_stack+0x21/0x40
[ 1438.045944]  [<ffffffff8132cdc0>] dump_stack+0x5c/0x7c
[ 1438.045964]  [<ffffffff8119ddc2>] warn_alloc_failed+0xe2/0x150
[ 1438.045975]  [<ffffffff8119e23b>] __alloc_pages_nodemask+0x40b/0xb70
[ 1438.045988]  [<ffffffff811edbed>] kmem_getpages+0x4d/0xf0
[ 1438.045995]  [<ffffffff811ef3fb>] fallback_alloc+0x19b/0x240
[ 1438.045998]  [<ffffffff811f13fa>] __kmalloc+0x26a/0x4b0
[ 1438.046022]  [<ffffffffa0184708>] alloc_indirect.isra.4+0x18/0x50 [virtio_ring]
[ 1438.046031]  [<ffffffffa0184a02>] virtqueue_add_sgs+0x2c2/0x410 [virtio_ring]
[ 1438.046044]  [<ffffffffa0053417>] __virtblk_add_req+0xa7/0x170 [virtio_blk]
[ 1438.046048]  [<ffffffffa00535fa>] virtio_queue_rq+0x11a/0x270 [virtio_blk]
[ 1438.046051]  [<ffffffff813073e5>] blk_mq_dispatch_rq_list+0xd5/0x1e0
[ 1438.046057]  [<ffffffff8130761e>] blk_mq_process_rq_list+0x12e/0x150
[ 1438.046065]  [<ffffffff8109ad14>] process_one_work+0x154/0x420
[ 1438.046073]  [<ffffffff8109b906>] worker_thread+0x116/0x4a0
[ 1438.046079]  [<ffffffff810a0e29>] kthread+0xc9/0xe0
[ 1438.046099]  [<ffffffff8161e1f5>] ret_from_fork+0x55/0x80
[ 1438.048865] DWARF2 unwinder stuck at ret_from_fork+0x55/0x80
[ 1438.048865]
[ 1438.048868] Leftover inexact backtrace:
[ 1438.048877]  [<ffffffff810a0d60>] ? kthread_park+0x50/0x50
[ 1438.048878] Mem-Info:
[ 1438.048882] active_anon:961 inactive_anon:982 isolated_anon:64 active_file:179814 inactive_file:247899 isolated_file:32 unevictable:20 dirty:3366 writeback:60306 unstable:0 slab_reclaimable:3317 slab_unreclaimable:29811 mapped:7232 shmem:752 pagetables:890 bounce:0 free:2433 free_pcp:169 free_cma:0
[ 1438.048896] Node 0 DMA free:364kB min:376kB low:468kB high:560kB active_anon:76kB inactive_anon:80kB active_file:1836kB inactive_file:5360kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15904kB mlocked:0kB dirty:40kB writeback:920kB mapped:292kB shmem:80kB slab_reclaimable:68kB slab_unreclaimable:7712kB kernel_stack:16kB pagetables:16kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:45812 all_unreclaimable? yes
[ 1438.048898] lowmem_reserve[]: 0 1843 1843 1843 1843
[ 1438.048907] Node 0 DMA32 free:9368kB min:44676kB low:55844kB high:67012kB active_anon:3768kB inactive_anon:3848kB active_file:717420kB inactive_file:986236kB unevictable:80kB isolated(anon):256kB isolated(file):128kB present:2080744kB managed:1900752kB mlocked:80kB dirty:13424kB writeback:240304kB mapped:28636kB shmem:2928kB slab_reclaimable:13200kB slab_unreclaimable:111532kB kernel_stack:2608kB pagetables:3544kB unstable:0kB bounce:0kB free_pcp:676kB local_pcp:676kB free_cma:0kB writeback_tmp:0kB pages_scanned:103620 all_unreclaimable? no
[ 1438.048908] lowmem_reserve[]: 0 0 0 0 0
[ 1438.048915] Node 0 DMA: 2*4kB (M) 0*8kB 0*16kB 1*32kB (U) 1*64kB (U) 0*128kB 1*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 360kB
[ 1438.048921] Node 0 DMA32: 140*4kB (UME) 69*8kB (UM) 42*16kB (UM) 77*32kB (UM) 29*64kB (UM) 2*128kB (M) 4*256kB (M) 4*512kB (M) 0*1024kB 0*2048kB 0*4096kB = 9432kB
[ 1438.048932] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[ 1438.048939] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 1438.048940] 76857 total pagecache pages
[ 1438.048941] 344 pages in swap cache
[ 1438.048943] Swap cache stats: add 7967, delete 7623, find 157/240
[ 1438.048944] Free swap = 14306948kB
[ 1438.048944] Total swap = 14338044kB
[ 1438.048945] 524184 pages RAM
[ 1438.048945] 0 pages HighMem/MovableOnly
[ 1438.048945] 45020 pages reserved
[ 1438.048945] 0 pages hwpoisoned
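The allocation failure in the trace is raised from the guest's virtio block submit path rather than from Lustre code: virtqueue_add_sgs() tries to allocate an indirect descriptor table with kmalloc() using the GFP flags supplied by the caller, and from the blk-mq dispatch worker those are GFP_ATOMIC, so the allocation cannot sleep or wait for reclaim and fails once free memory drops below the zone watermarks (note DMA32 free:9368kB vs. min:44676kB above, with heavy writeback in flight). The sketch below is a simplified paraphrase of the 4.4-era alloc_indirect() in drivers/virtio/virtio_ring.c, for orientation only, not the exact SUSE 4.4.162-94.69 source:

/*
 * Paraphrased sketch of the allocation that fails in the backtrace
 * above (alloc_indirect() in drivers/virtio/virtio_ring.c, 4.4-era
 * code); simplified for illustration.
 */
#include <linux/slab.h>
#include <linux/virtio_ring.h>

static struct vring_desc *alloc_indirect(unsigned int total_sg, gfp_t gfp)
{
	struct vring_desc *desc;

	/*
	 * virtio_blk submits requests from the blk-mq dispatch path and
	 * passes GFP_ATOMIC here, so this order-0 kmalloc() may not sleep
	 * or trigger reclaim; under memory pressure it simply returns NULL
	 * and the kernel prints the "page allocation failure" warning seen
	 * in the log.
	 */
	desc = kmalloc(total_sg * sizeof(struct vring_desc), gfp);
	if (!desc)
		return NULL;	/* caller falls back to direct descriptors */

	return desc;
}

In other words, the warning appears to be an atomic, order-0 allocation failure in the VM's block I/O path while the page cache was full of dirty/writeback pages, not an allocation made by Lustre itself.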
There are no other recovery-mds-scale test_failover_mds crashes with the same stack trace in the past month.