Details
Type: Bug
Resolution: Cannot Reproduce
Priority: Minor
Fix Version/s: None
Affects Version/s: Lustre 2.12.0
Severity: 3
Description
recovery-mds-scale test_failover_mds crashes with 'trevis-16vm3 crashed during recovery-mds-scale test_failover_mds'
Looking at the kernel crash from https://testing.whamcloud.com/test_sets/4db0c228-fd1b-11e8-8512-52540065bddc, we see:
[ 733.722957] Lustre: DEBUG MARKER: mds1 has failed over 1 times, and counting...
[ 746.732703] Lustre: Evicted from MGS (at MGC10.9.4.191@tcp_1) after server handle changed from 0xd73c2b185ff898b4 to 0x7b37c6365056206
[ 746.734484] Lustre: MGC10.9.4.191@tcp: Connection restored to MGC10.9.4.191@tcp_1 (at 10.9.4.192@tcp)
[ 746.736017] LustreError: 13384:0:(client.c:3023:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@ffff88006542b980 x1619485946545808/t4294967305(4294967305) o101->lustre-MDT0000-mdc-ffff88007ae8a000@10.9.4.192@tcp:12/10 lens 712/560 e 0 to 0 dl 1544462286 ref 2 fl Interpret:RP/4/0 rc 301/301
[ 747.973965] Lustre: lustre-MDT0000-mdc-ffff88007ae8a000: Connection restored to 10.9.4.192@tcp (at 10.9.4.192@tcp)
[ 794.635505] Lustre: 13385:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1544462204/real 1544462204] req@ffff88003aa1acc0 x1619485946562912/t0(0) o400->lustre-MDT0000-mdc-ffff88007ae8a000@10.9.4.192@tcp:12/10 lens 224/224 e 0 to 1 dl 1544462212 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[ 794.635528] Lustre: 13385:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 1 previous similar message
[ 1438.045823] kworker/0:2H: page allocation failure: order:0, mode:0x1284020(GFP_ATOMIC|__GFP_COMP|__GFP_NOTRACK)
[ 1438.045839] CPU: 0 PID: 1516 Comm: kworker/0:2H Tainted: G OE N 4.4.162-94.69-default #1
[ 1438.045840] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 1438.045875] Workqueue: kblockd blk_mq_run_work_fn
[ 1438.045878]  0000000000000000 ffffffff8132cdc0 0000000000000000 ffff88007b6479a8
[ 1438.045879]  ffffffff8119ddc2 0128402000000030 0000000000000046 002c422000000000
[ 1438.045881]  00000001005d4200 0000000000000000 0000000000000000 0000000000000020
[ 1438.045881] Call Trace:
[ 1438.045928]  [<ffffffff81019b09>] dump_trace+0x59/0x340
[ 1438.045934]  [<ffffffff81019eda>] show_stack_log_lvl+0xea/0x170
[ 1438.045936]  [<ffffffff8101acb1>] show_stack+0x21/0x40
[ 1438.045944]  [<ffffffff8132cdc0>] dump_stack+0x5c/0x7c
[ 1438.045964]  [<ffffffff8119ddc2>] warn_alloc_failed+0xe2/0x150
[ 1438.045975]  [<ffffffff8119e23b>] __alloc_pages_nodemask+0x40b/0xb70
[ 1438.045988]  [<ffffffff811edbed>] kmem_getpages+0x4d/0xf0
[ 1438.045995]  [<ffffffff811ef3fb>] fallback_alloc+0x19b/0x240
[ 1438.045998]  [<ffffffff811f13fa>] __kmalloc+0x26a/0x4b0
[ 1438.046022]  [<ffffffffa0184708>] alloc_indirect.isra.4+0x18/0x50 [virtio_ring]
[ 1438.046031]  [<ffffffffa0184a02>] virtqueue_add_sgs+0x2c2/0x410 [virtio_ring]
[ 1438.046044]  [<ffffffffa0053417>] __virtblk_add_req+0xa7/0x170 [virtio_blk]
[ 1438.046048]  [<ffffffffa00535fa>] virtio_queue_rq+0x11a/0x270 [virtio_blk]
[ 1438.046051]  [<ffffffff813073e5>] blk_mq_dispatch_rq_list+0xd5/0x1e0
[ 1438.046057]  [<ffffffff8130761e>] blk_mq_process_rq_list+0x12e/0x150
[ 1438.046065]  [<ffffffff8109ad14>] process_one_work+0x154/0x420
[ 1438.046073]  [<ffffffff8109b906>] worker_thread+0x116/0x4a0
[ 1438.046079]  [<ffffffff810a0e29>] kthread+0xc9/0xe0
[ 1438.046099]  [<ffffffff8161e1f5>] ret_from_fork+0x55/0x80
[ 1438.048865] DWARF2 unwinder stuck at ret_from_fork+0x55/0x80
[ 1438.048865]
[ 1438.048868] Leftover inexact backtrace:
[ 1438.048877]  [<ffffffff810a0d60>] ? kthread_park+0x50/0x50
[ 1438.048878] Mem-Info:
[ 1438.048882] active_anon:961 inactive_anon:982 isolated_anon:64 active_file:179814 inactive_file:247899 isolated_file:32 unevictable:20 dirty:3366 writeback:60306 unstable:0 slab_reclaimable:3317 slab_unreclaimable:29811 mapped:7232 shmem:752 pagetables:890 bounce:0 free:2433 free_pcp:169 free_cma:0
[ 1438.048896] Node 0 DMA free:364kB min:376kB low:468kB high:560kB active_anon:76kB inactive_anon:80kB active_file:1836kB inactive_file:5360kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15904kB mlocked:0kB dirty:40kB writeback:920kB mapped:292kB shmem:80kB slab_reclaimable:68kB slab_unreclaimable:7712kB kernel_stack:16kB pagetables:16kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:45812 all_unreclaimable? yes
[ 1438.048898] lowmem_reserve[]: 0 1843 1843 1843 1843
[ 1438.048907] Node 0 DMA32 free:9368kB min:44676kB low:55844kB high:67012kB active_anon:3768kB inactive_anon:3848kB active_file:717420kB inactive_file:986236kB unevictable:80kB isolated(anon):256kB isolated(file):128kB present:2080744kB managed:1900752kB mlocked:80kB dirty:13424kB writeback:240304kB mapped:28636kB shmem:2928kB slab_reclaimable:13200kB slab_unreclaimable:111532kB kernel_stack:2608kB pagetables:3544kB unstable:0kB bounce:0kB free_pcp:676kB local_pcp:676kB free_cma:0kB writeback_tmp:0kB pages_scanned:103620 all_unreclaimable? no
[ 1438.048908] lowmem_reserve[]: 0 0 0 0 0
[ 1438.048915] Node 0 DMA: 2*4kB (M) 0*8kB 0*16kB 1*32kB (U) 1*64kB (U) 0*128kB 1*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 360kB
[ 1438.048921] Node 0 DMA32: 140*4kB (UME) 69*8kB (UM) 42*16kB (UM) 77*32kB (UM) 29*64kB (UM) 2*128kB (M) 4*256kB (M) 4*512kB (M) 0*1024kB 0*2048kB 0*4096kB = 9432kB
[ 1438.048932] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[ 1438.048939] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 1438.048940] 76857 total pagecache pages
[ 1438.048941] 344 pages in swap cache
[ 1438.048943] Swap cache stats: add 7967, delete 7623, find 157/240
[ 1438.048944] Free swap = 14306948kB
[ 1438.048944] Total swap = 14338044kB
[ 1438.048945] 524184 pages RAM
[ 1438.048945] 0 pages HighMem/MovableOnly
[ 1438.048945] 45020 pages reserved
[ 1438.048945] 0 pages hwpoisoned
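The allocation failure in the trace is raised from the guest's virtio block submit path rather than from Lustre code: virtqueue_add_sgs() tries to allocate an indirect descriptor table with kmalloc() using the GFP flags supplied by the caller, and from the blk-mq dispatch worker those are GFP_ATOMIC, so the allocation cannot sleep or wait for reclaim and fails once free memory drops below the zone watermarks (note DMA32 free:9368kB vs. min:44676kB above, with heavy writeback in flight). The sketch below is a simplified paraphrase of the 4.4-era alloc_indirect() in drivers/virtio/virtio_ring.c, for orientation only, not the exact SUSE 4.4.162-94.69 source:

/*
 * Paraphrased sketch of the allocation that fails in the backtrace
 * above (alloc_indirect() in drivers/virtio/virtio_ring.c, 4.4-era
 * code); simplified for illustration.
 */
#include <linux/slab.h>
#include <linux/virtio_ring.h>

static struct vring_desc *alloc_indirect(unsigned int total_sg, gfp_t gfp)
{
	struct vring_desc *desc;

	/*
	 * virtio_blk submits requests from the blk-mq dispatch path and
	 * passes GFP_ATOMIC here, so this order-0 kmalloc() may not sleep
	 * or trigger reclaim; under memory pressure it simply returns NULL
	 * and the kernel prints the "page allocation failure" warning seen
	 * in the log.
	 */
	desc = kmalloc(total_sg * sizeof(struct vring_desc), gfp);
	if (!desc)
		return NULL;	/* caller falls back to direct descriptors */

	return desc;
}

In other words, the warning appears to be an atomic, order-0 allocation failure in the VM's block I/O path while the page cache was full of dirty/writeback pages, not an allocation made by Lustre itself.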
There are no other recovery-mds-scale test_failover_mds crashes with the same stack trace in the past month.