Details
Description
A client crashes in recovery-mds-scale test_failover_mds. Looking at https://testing.whamcloud.com/test_sets/3e369a98-b8bf-11e8-a7de-52540065bddc, we see the following in the kernel crash log:
[ 766.998879] Lustre: DEBUG MARKER: mds1 has failed over 1 times, and counting...
[ 782.297602] Lustre: Evicted from MGS (at MGC10.9.6.25@tcp_1) after server handle changed from 0x66e2519c6be9cc2 to 0x89680f107ea4b814
[ 782.299262] Lustre: MGC10.9.6.25@tcp: Connection restored to MGC10.9.6.25@tcp_1 (at 10.9.6.26@tcp)
[ 782.362888] LustreError: 13367:0:(client.c:3000:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@ffff88006690c940 x1611617063142688/t4294967305(4294967305) o101->lustre-MDT0000-mdc-ffff88007bb5e800@10.9.6.26@tcp:12/10 lens 952/560 e 0 to 0 dl 1536957913 ref 2 fl Interpret:RP/4/0 rc 301/301
[ 845.630602] ntpd invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=0, order=0, oom_score_adj=0
[ 845.630613] ntpd cpuset=/ mems_allowed=0
[ 845.630628] CPU: 1 PID: 1461 Comm: ntpd Tainted: G OE N 4.4.143-94.47-default #1
[ 845.630629] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 845.630634] 0000000000000000 ffffffff8132ad80 ffff88007be939a0 0000000000000000
[ 845.630637] ffffffff8120935e 0000000000000000 0000000000000000 0000000000000000
[ 845.630639] 0000000000000000 ffffffff810a0927 ffffffff81e9aa20 0000000000000000
[ 845.630639] Call Trace:
[ 845.630702] [<ffffffff81019ac9>] dump_trace+0x59/0x340
[ 845.630711] [<ffffffff81019e9a>] show_stack_log_lvl+0xea/0x170
[ 845.630714] [<ffffffff8101ac71>] show_stack+0x21/0x40
[ 845.630727] [<ffffffff8132ad80>] dump_stack+0x5c/0x7c
[ 845.630748] [<ffffffff8120935e>] dump_header+0x82/0x215
[ 845.630762] [<ffffffff81198079>] check_panic_on_oom+0x29/0x50
[ 845.630770] [<ffffffff8119821a>] out_of_memory+0x17a/0x4a0
[ 845.630777] [<ffffffff8119cc48>] __alloc_pages_nodemask+0xaf8/0xb70
[ 845.630786] [<ffffffff811e6cc4>] alloc_pages_vma+0xa4/0x220
[ 845.630799] [<ffffffff811d70f0>] __read_swap_cache_async+0xf0/0x150
[ 845.630805] [<ffffffff811d7164>] read_swap_cache_async+0x14/0x30
[ 845.630808] [<ffffffff811d727d>] swapin_readahead+0xfd/0x190
[ 845.630814] [<ffffffff811c3771>] handle_pte_fault+0x12b1/0x1670
[ 845.630820] [<ffffffff811c56aa>] handle_mm_fault+0x2fa/0x640
[ 845.630828] [<ffffffff81067d7a>] __do_page_fault+0x23a/0x4b0
[ 845.630838] [<ffffffff8106809c>] trace_do_page_fault+0x3c/0x120
[ 845.630850] [<ffffffff8161da62>] async_page_fault+0x32/0x60
[ 845.633602] DWARF2 unwinder stuck at async_page_fault+0x32/0x60
[ 845.633602]
[ 845.633603] Leftover inexact backtrace:
[ 845.633621] [<ffffffff81338d61>] ? __clear_user+0x21/0x50
[ 845.633624] [<ffffffff810230f2>] ? copy_fpstate_to_sigframe+0x112/0x1a0
[ 845.633625] [<ffffffff810176d1>] ? do_signal+0x511/0x5b0
[ 845.633627] [<ffffffff81067d9a>] ? __do_page_fault+0x25a/0x4b0
[ 845.633634] [<ffffffff8107bf4e>] ? exit_to_usermode_loop+0x70/0xc2
[ 845.633638] [<ffffffff81003ae5>] ? syscall_return_slowpath+0x85/0xa0
[ 845.633644] [<ffffffff8161aa3a>] ? int_ret_from_sys_call+0x25/0xa3
[ 845.633661] Mem-Info:
[ 845.633668] active_anon:25 inactive_anon:41 isolated_anon:0 active_file:69559 inactive_file:371186 isolated_file:0 unevictable:20 dirty:67 writeback:850 unstable:0 slab_reclaimable:2733 slab_unreclaimable:8762 mapped:7090 shmem:26 pagetables:966 bounce:0 free:13103 free_pcp:0 free_cma:0
[ 845.633676] Node 0 DMA free:7736kB min:376kB low:468kB high:560kB active_anon:100kB inactive_anon:104kB active_file:744kB inactive_file:6464kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15904kB mlocked:0kB dirty:0kB writeback:96kB mapped:256kB shmem:104kB slab_reclaimable:20kB slab_unreclaimable:284kB kernel_stack:32kB pagetables:12kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:46916 all_unreclaimable? yes
[ 845.633678] lowmem_reserve[]: 0 1843 1843 1843 1843
[ 845.633683] Node 0 DMA32 free:44676kB min:44676kB low:55844kB high:67012kB active_anon:0kB inactive_anon:60kB active_file:277492kB inactive_file:1478280kB unevictable:80kB isolated(anon):0kB isolated(file):0kB present:2080744kB managed:1900772kB mlocked:80kB dirty:268kB writeback:3304kB mapped:28104kB shmem:0kB slab_reclaimable:10912kB slab_unreclaimable:34764kB kernel_stack:2608kB pagetables:3852kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:11784060 all_unreclaimable? yes
[ 845.633686] lowmem_reserve[]: 0 0 0 0 0
[ 845.633694] Node 0 DMA: 8*4kB (UME) 5*8kB (ME) 3*16kB (UE) 2*32kB (UE) 2*64kB (U) 2*128kB (UE) 2*256kB (ME) 3*512kB (UME) 1*1024kB (E) 2*2048kB (ME) 0*4096kB = 7736kB
[ 845.633700] Node 0 DMA32: 916*4kB (UME) 559*8kB (UME) 471*16kB (UME) 268*32kB (UME) 162*64kB (UE) 65*128kB (UM) 7*256kB (UM) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 44728kB
[ 845.633713] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[ 845.633720] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 845.633721] 13843 total pagecache pages
[ 845.633722] 0 pages in swap cache
[ 845.633722] Swap cache stats: add 10178, delete 10178, find 72/95
[ 845.633723] Free swap = 14297524kB
[ 845.633725] Total swap = 14338044kB
[ 845.633726] 524184 pages RAM
[ 845.633726] 0 pages HighMem/MovableOnly
[ 845.633726] 45015 pages reserved
[ 845.633727] 0 pages hwpoisoned
[ 845.633727] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
[ 845.633824] [ 361] 0 361 10933 686 24 3 1623 0 systemd-journal
[ 845.633827] [ 400] 495 400 13124 940 29 3 116 0 rpcbind
[ 845.633836] [ 404] 0 404 9268 710 21 3 218 -1000 systemd-udevd
[ 845.633839] [ 479] 0 479 4814 612 14 3 58 0 irqbalance
[ 845.633847] [ 484] 499 484 13452 876 26 3 146 -900 dbus-daemon
[ 845.633849] [ 528] 0 528 25126 1025 50 3 207 0 sssd
[ 845.633851] [ 533] 0 533 32270 1881 64 3 288 0 sssd_be
[ 845.633857] [ 538] 0 538 7447 1060 19 3 260 0 wickedd-dhcp6
[ 845.633860] [ 548] 0 548 25845 1531 54 4 180 0 sssd_nss
[ 845.633862] [ 549] 0 549 20713 1198 45 3 175 0 sssd_pam
[ 845.633863] [ 550] 0 550 19112 1161 43 3 166 0 sssd_ssh
[ 845.633870] [ 552] 0 552 7448 1028 18 3 256 0 wickedd-auto4
[ 845.633876] [ 553] 0 553 7448 1085 20 3 265 0 wickedd-dhcp4
[ 845.633880] [ 556] 0 556 84318 905 38 3 269 0 rsyslogd
[ 845.633927] [ 761] 0 761 7480 1056 20 3 287 0 wickedd
[ 845.633934] [ 764] 0 764 7455 1032 18 3 276 0 wickedd-nanny
[ 845.633939] [ 1418] 0 1418 2141 422 10 3 40 0 xinetd
[ 845.633944] [ 1461] 74 1461 8408 974 17 3 164 0 ntpd
[ 845.633953] [ 1473] 74 1473 9461 590 18 3 153 0 ntpd
[ 845.633957] [ 1477] 0 1477 16586 1539 35 3 180 -1000 sshd
[ 845.633964] [ 1492] 493 1492 55352 616 20 3 231 0 munged
[ 845.633972] [ 1517] 0 1517 1664 438 8 3 30 0 agetty
[ 845.633977] [ 1518] 0 1518 1664 419 9 3 29 0 agetty
[ 845.633981] [ 1534] 0 1534 147220 1574 59 3 347 0 automount
[ 845.633986] [ 1570] 0 1570 5513 629 16 3 64 0 systemd-logind
[ 845.633988] [ 1809] 0 1809 8861 822 20 3 109 0 master
[ 845.633990] [ 1820] 51 1820 12439 1046 25 3 108 0 pickup
[ 845.633992] [ 1823] 51 1823 12536 1354 26 3 174 0 qmgr
[ 845.633994] [ 1864] 0 1864 5197 532 18 3 150 0 cron
[ 845.634043] [15714] 0 15714 17465 669 35 3 174 0 in.mrshd
[ 845.634047] [15715] 0 15715 2894 572 10 3 77 0 bash
[ 845.634051] [15720] 0 15720 2894 427 10 3 78 0 bash
[ 845.634053] [15721] 0 15721 3034 585 12 3 219 0 run_dd.sh
[ 845.634057] [16387] 51 16387 12675 1323 25 3 335 0 trivial-rewrite
[ 845.634059] [16388] 51 16388 16918 1732 35 3 244 0 smtp
[ 845.634064] [16434] 0 16434 1062 182 8 3 26 0 dd
[ 845.634068] [16437] 51 16437 12447 1022 24 3 109 0 bounce
[ 845.634074] Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled
[ 845.634075] CPU: 1 PID: 1461 Comm: ntpd Tainted: G OE N 4.4.143-94.47-default #1
[ 845.634076] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 845.634078] 0000000000000000 ffffffff8132ad80 ffffffff81a28298 ffff88007be938c8
[ 845.634079] ffffffff81191f31 0000000000000010 ffff88007be938d8 ffff88007be93878
[ 845.634081] 000000000000309f ffffffff81a2c56b 0000000000000000 0000000000000000
[ 845.634081] Call Trace:
[ 845.634087] [<ffffffff81019ac9>] dump_trace+0x59/0x340
[ 845.634090] [<ffffffff81019e9a>] show_stack_log_lvl+0xea/0x170
[ 845.634092] [<ffffffff8101ac71>] show_stack+0x21/0x40
[ 845.634095] [<ffffffff8132ad80>] dump_stack+0x5c/0x7c
[ 845.634101] [<ffffffff81191f31>] panic+0xd2/0x232
[ 845.634104] [<ffffffff811980a0>] check_panic_on_oom+0x50/0x50
[ 845.634106] [<ffffffff8119821a>] out_of_memory+0x17a/0x4a0
[ 845.634112] [<ffffffff8119cc48>] __alloc_pages_nodemask+0xaf8/0xb70
[ 845.634116] [<ffffffff811e6cc4>] alloc_pages_vma+0xa4/0x220
[ 845.634119] [<ffffffff811d70f0>] __read_swap_cache_async+0xf0/0x150
[ 845.634123] [<ffffffff811d7164>] read_swap_cache_async+0x14/0x30
[ 845.634125] [<ffffffff811d727d>] swapin_readahead+0xfd/0x190
[ 845.634128] [<ffffffff811c3771>] handle_pte_fault+0x12b1/0x1670
[ 845.634132] [<ffffffff811c56aa>] handle_mm_fault+0x2fa/0x640
[ 845.634135] [<ffffffff81067d7a>] __do_page_fault+0x23a/0x4b0
[ 845.634139] [<ffffffff8106809c>] trace_do_page_fault+0x3c/0x120
[ 845.634141] [<ffffffff8161da62>] async_page_fault+0x32/0x60
[ 845.636233] DWARF2 unwinder stuck at async_page_fault+0x32/0x60
[ 845.636233]
[ 845.636234] Leftover inexact backtrace:
[ 845.636236] [<ffffffff81338d61>] ? __clear_user+0x21/0x50
[ 845.636238] [<ffffffff810230f2>] ? copy_fpstate_to_sigframe+0x112/0x1a0
[ 845.636239] [<ffffffff810176d1>] ? do_signal+0x511/0x5b0
[ 845.636241] [<ffffffff81067d9a>] ? __do_page_fault+0x25a/0x4b0
[ 845.636243] [<ffffffff8107bf4e>] ? exit_to_usermode_loop+0x70/0xc2
[ 845.636246] [<ffffffff81003ae5>] ? syscall_return_slowpath+0x85/0xa0
[ 845.636248] [<ffffffff8161aa3a>] ? int_ret_from_sys_call+0x25/0xa3
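The zone report above appears to show why the allocation failed: Node 0 DMA32 is sitting exactly at its min watermark (free:44676kB, min:44676kB) and is flagged all_unreclaimable, so the order-0 GFP_HIGHUSER_MOVABLE request could not be satisfied, and with panic_on_oom enabled the kernel panics rather than killing a task. For triaging similar logs, here is a minimal sketch in Python that scans a console log for zone reports at or below their watermark; the console.log file name is an assumption:

```python
#!/usr/bin/env python3
"""Minimal sketch: flag memory zones at or below their min watermark in an
OOM zone report. The 'console.log' file name is an assumption -- point it
at the client console log captured for the test run."""
import re

# Matches zone lines such as:
#   Node 0 DMA32 free:44676kB min:44676kB ... all_unreclaimable? yes
ZONE = re.compile(
    r"Node (\d+) (\S+) free:(\d+)kB min:(\d+)kB.*all_unreclaimable\? (yes|no)"
)

with open("console.log") as log:  # assumed file name
    for line in log:
        m = ZONE.search(line)
        if not m:
            continue
        node, zone, free_kb, min_kb, unreclaimable = m.groups()
        if int(free_kb) <= int(min_kb):
            print(f"Node {node} {zone}: free {free_kb}kB <= min {min_kb}kB "
                  f"(all_unreclaimable={unreclaimable})")
```

On the log above this flags only Node 0 DMA32 (the DMA zone still has free 7736kB against a 376kB watermark), which matches the panic that follows.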
In the client (vm3) console log, we see where the vmcore was saved:
[ 782.297602] Lustre: Evicted from MGS (at MGC10.9.6.25@tcp_1) after server handle changed from 0x66e2519c6be9cc2 to 0x89680f107ea4b814
[ 782.299262] Lustre: MGC10.9.6.25@tcp: Connection restored to MGC10.9.6.25@tcp_1 (at 10.9.6.26@tcp)
[ 782.362888] LustreError: 13367:0:(client.c:3000:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@ffff88006690c940 x1611617063142688/t4294967305(4294967305) o101->lustre-MDT0000-mdc-ffff88007bb5e800@10.9.6.26@tcp:12/10 lens 952/560 e 0 to 0 dl 1536957913 ref 2 fl Interpret:RP/4/0 rc 301/301
[ 845.630602] ntpd invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=0, order=0, oom_score_adj=0
[ 845.630613] ntpd cpuset=/ mems_allowed=0
[ 845.630628] CPU: 1 PID: 1461 Comm: ntpd Tainted: G OE N 4.4.143-94.47-default #1
[ 2.826867] RPC: Registered named UNIX socket transport module.
[ 2.826869] RPC: Registered udp transport module.
…
The dumpfile is saved to /mnt/trevis-2.trevis.whamcloud.com/export/scratch/dumps/trevis-45vm3.trevis.whamcloud.com/10.9.6.21-2018-09-14-13:48/vmcore.
makedumpfile Completed.
--------------------------------------------------------------------------------
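Since the node panicked with panic_on_oom, the full memory state at the time of the crash is preserved in that vmcore. A minimal sketch of pulling the memory summary and kernel ring buffer back out of it with the crash utility, driven from Python; the vmlinux debuginfo path is an assumption and must match kernel 4.4.143-94.47-default:

```python
#!/usr/bin/env python3
"""Minimal sketch: extract the memory summary and dmesg buffer from the
saved vmcore with the 'crash' utility. Assumes crash and the matching SLES
kernel debuginfo are installed; the vmlinux path below is an assumption."""
import subprocess

# The debug vmlinux must match the crashed kernel (4.4.143-94.47-default).
VMLINUX = "/usr/lib/debug/boot/vmlinux-4.4.143-94.47-default.debug"  # assumed path
VMCORE = ("/mnt/trevis-2.trevis.whamcloud.com/export/scratch/dumps/"
          "trevis-45vm3.trevis.whamcloud.com/10.9.6.21-2018-09-14-13:48/vmcore")

# 'kmem -i' summarizes overall memory usage, 'log' replays the kernel ring
# buffer, 'quit' ends the session; -s suppresses the crash banner.
result = subprocess.run(
    ["crash", "-s", VMLINUX, VMCORE],
    input="kmem -i\nlog\nquit\n",
    capture_output=True, text=True, check=True,
)
print(result.stdout)
```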
All failures are seen in testing with SLES12 SP3 servers and clients, and with SLES12 SP3 clients against CentOS 7 servers.
We’ve seen this crash a few times in the past:
https://testing.whamcloud.com/test_sets/9a10e716-b882-11e8-b86b-52540065bddc
https://testing.whamcloud.com/test_sets/74705770-b9f5-11e8-8c12-52540065bddc
https://testing.whamcloud.com/test_sets/6f5b3134-bc38-11e8-a7de-52540065bddc
Issue Links
- is related to:
  - LU-11724 recovery tests crash with ‘page allocation failure: order:0, mode:0x1080020(GFP_ATOMIC)’ (Open)
  - LU-12067 recovery-mds-scale test failover_mds crashes with OOM (Open)