[LU-11228] recovery-mds-scale test failover_mds crashes with ‘socknal_sd00_01: page allocation failure: order:0, mode:0x1284020(GFP_ATOMIC|__GFP_COMP|__GFP_NOTRACK)’ Created: 08/Aug/18 Updated: 07/Dec/21 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.5, Lustre 2.12.8 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Nunez (Inactive) | Assignee: | Zhenyu Xu |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
recovery-mds-scale test_failover_mds crashes. The following are the last lines seen in the test_log:

==== Checking the clients loads AFTER failover -- failure NOT OK
14:03:26 (1533416606) waiting for trevis-13vm3 network 5 secs ...
14:03:26 (1533416606) network interface is UP
CMD: trevis-13vm3 rc=0; val=\$(/usr/sbin/lctl get_param -n catastrophe 2>&1); if [[ \$? -eq 0 && \$val -ne 0 ]]; then echo \$(hostname -s): \$val; rc=\$val; fi; exit \$rc
CMD: trevis-13vm3 ps auxwww | grep -v grep | grep -q run_dd.sh
14:03:26 (1533416606) waiting for trevis-13vm4 network 5 secs ...
14:03:26 (1533416606) network interface is UP
CMD: trevis-13vm4 rc=0; val=\$(/usr/sbin/lctl get_param -n catastrophe 2>&1); if [[ \$? -eq 0 && \$val -ne 0 ]]; then echo \$(hostname -s): \$val; rc=\$val; fi; exit \$rc
CMD: trevis-13vm4 ps auxwww | grep -v grep | grep -q run_tar.sh
mds1 has failed over 2 times, and counting...
sleeping 1100 seconds...

For the crash at https://testing.whamcloud.com/test_sets/9ce4d6de-9896-11e8-a9f7-52540065bddc, we see the following stack trace in the client (vm3) console log:

[ 1852.682252] Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
[ 1852.962495] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 has failed over 2 times, and counting...
[ 1853.137368] Lustre: lustre-MDT0000-mdc-ffff88007a72e000: Connection restored to 10.9.4.152@tcp (at 10.9.4.152@tcp)
[ 1853.259454] Lustre: DEBUG MARKER: mds1 has failed over 2 times, and counting...
[ 1892.312595] socknal_sd00_01: page allocation failure: order:0, mode:0x1284020(GFP_ATOMIC|__GFP_COMP|__GFP_NOTRACK)
[ 1892.312602] CPU: 0 PID: 13369 Comm: socknal_sd00_01 Tainted: G OE N 4.4.132-94.33-default #1
[ 1892.312603] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 1892.312607] 0000000000000000 ffffffff813284a0 0000000000000000 ffff88007b713750
[ 1892.312608] ffffffff8119b1e2 0128402000000004 0000000400000000 ffff88007fc1d828
[ 1892.312609] ffff88007ffcf490 ffff88007fc1d828 0000000400000000 ffff88007b7137d8
[ 1892.312610] Call Trace:
[ 1892.312669] [<ffffffff81019b59>] dump_trace+0x59/0x340
[ 1892.312673] [<ffffffff81019f2a>] show_stack_log_lvl+0xea/0x170
[ 1892.312675] [<ffffffff8101ad01>] show_stack+0x21/0x40
[ 1892.312686] [<ffffffff813284a0>] dump_stack+0x5c/0x7c
[ 1892.312707] [<ffffffff8119b1e2>] warn_alloc_failed+0xe2/0x150
[ 1892.312718] [<ffffffff8119b659>] __alloc_pages_nodemask+0x409/0xb80
[ 1892.312730] [<ffffffff811ea06d>] kmem_getpages+0x4d/0xf0
[ 1892.312737] [<ffffffff811eb8d5>] fallback_alloc+0x205/0x260
[ 1892.312741] [<ffffffff811ec3f6>] kmem_cache_alloc_trace+0x1f6/0x460
[ 1892.312748] [<ffffffff812384fb>] wb_start_writeback+0x3b/0xe0
[ 1892.312756] [<ffffffff81238a96>] wakeup_flusher_threads+0xc6/0x150
[ 1892.312759] [<ffffffff811a9571>] do_try_to_free_pages+0x241/0x450
[ 1892.312765] [<ffffffff811a983a>] try_to_free_pages+0xba/0x170
[ 1892.312767] [<ffffffff8119b843>] __alloc_pages_nodemask+0x5f3/0xb80
[ 1892.312770] [<ffffffff811ea06d>] kmem_getpages+0x4d/0xf0
[ 1892.312772] [<ffffffff811eb869>] fallback_alloc+0x199/0x260
[ 1892.312775] [<ffffffff811ec6e0>] kmem_cache_alloc_node_trace+0x80/0x490
[ 1892.312792] [<ffffffff815075ce>] __kmalloc_reserve.isra.34+0x2e/0x80
[ 1892.312806] [<ffffffff81508b13>] __alloc_skb+0x73/0x270
[ 1892.312815] [<ffffffff81567194>] sk_stream_alloc_skb+0x44/0x170
[ 1892.312824] [<ffffffff815676b8>] tcp_sendpage+0x3f8/0x610
[ 1892.312863] [<ffffffffa0be7df9>] ksocknal_lib_send_kiov+0x99/0x240 [ksocklnd]
[ 1892.312885] [<ffffffffa0be1da7>] ksocknal_process_transmit+0x2b7/0xb60 [ksocklnd]
[ 1892.312891] [<ffffffffa0be64b1>] ksocknal_scheduler+0x231/0x660 [ksocklnd]
[ 1892.312901] [<ffffffff8109ebc9>] kthread+0xc9/0xe0
[ 1892.312919] [<ffffffff81617805>] ret_from_fork+0x55/0x80
[ 1892.315523] DWARF2 unwinder stuck at ret_from_fork+0x55/0x80
[ 1892.315524]
[ 1892.315527] Leftover inexact backtrace:
[ 1892.315527]
[ 1892.315534] [<ffffffff8109eb00>] ? kthread_park+0x50/0x50
[ 1892.315535] Mem-Info:
[ 1892.315539] active_anon:1643 inactive_anon:1687 isolated_anon:0
[ 1892.315539] active_file:127143 inactive_file:312793 isolated_file:0
[ 1892.315539] unevictable:20 dirty:0 writeback:31 unstable:0
[ 1892.315539] slab_reclaimable:2757 slab_unreclaimable:21743
[ 1892.315539] mapped:7385 shmem:2179 pagetables:887 bounce:0
[ 1892.315539] free:7 free_pcp:0 free_cma:0
[ 1892.315551] Node 0 DMA free:40kB min:376kB low:468kB high:560kB active_anon:28kB inactive_anon:96kB active_file:1408kB inactive_file:5712kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:456kB shmem:104kB slab_reclaimable:28kB slab_unreclaimable:7892kB kernel_stack:0kB pagetables:28kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:1191008 all_unreclaimable? yes
[ 1892.315553] lowmem_reserve[]: 0 1836 1836 1836 1836
[ 1892.315560] Node 0 DMA32 free:0kB min:44676kB low:55844kB high:67012kB active_anon:6544kB inactive_anon:6652kB active_file:507164kB inactive_file:1245460kB unevictable:80kB isolated(anon):0kB isolated(file):0kB present:2080744kB managed:1900784kB mlocked:80kB dirty:0kB writeback:124kB mapped:29084kB shmem:8612kB slab_reclaimable:11000kB slab_unreclaimable:79080kB kernel_stack:2672kB pagetables:3520kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:16195004 all_unreclaimable? yes
[ 1892.315562] lowmem_reserve[]: 0 0 0 0 0
[ 1892.315567] Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
[ 1892.315571] Node 0 DMA32: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
[ 1892.315582] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[ 1892.315591] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 1892.315591] 14900 total pagecache pages
[ 1892.315593] 60 pages in swap cache
[ 1892.315595] Swap cache stats: add 5726, delete 5666, find 66/70
[ 1892.315596] Free swap = 14315232kB
[ 1892.315596] Total swap = 14338044kB
[ 1892.315597] 524184 pages RAM
[ 1892.315597] 0 pages HighMem/MovableOnly
[ 1892.315598] 45012 pages reserved
[ 1892.315598] 0 pages hwpoisoned

There are many recent recovery-mds-scale crashes, but the following have a similar stack trace as described above
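Reading the trace bottom-up suggests the following interpretation (from the log itself, not confirmed elsewhere in the ticket): the sleepable skb allocation in tcp_sendpage entered direct reclaim, and the allocation that actually failed is the GFP_ATOMIC work item that wb_start_writeback allocates while reclaim tries to wake the flusher threads. With the zone free lists empty (Node 0 DMA32 shows free:0kB), an atomic allocation has nothing to fall back on. A minimal sketch of the two allocation contexts visible in the trace, using generic kernel APIs rather than the actual ksocklnd/writeback code:

    /* Sketch only, not the ksocklnd or writeback code; GFP_KERNEL here
     * stands in for the socket's sk_allocation mode.
     */
    #include <linux/skbuff.h>
    #include <linux/slab.h>

    /* Outer allocation (tcp_sendpage -> __alloc_skb): sleepable, so it
     * can enter direct reclaim (try_to_free_pages) and wait for memory.
     */
    static struct sk_buff *send_skb_alloc(unsigned int len)
    {
            return alloc_skb(len, GFP_KERNEL);
    }

    /* Inner allocation (wb_start_writeback work item): GFP_ATOMIC may
     * not sleep or reclaim, so with the free lists empty it returns
     * NULL immediately and logs exactly this "page allocation failure".
     */
    static void *flusher_work_alloc(size_t size)
    {
            return kmalloc(size, GFP_ATOMIC);
    }

The warning itself is therefore a symptom; the more interesting question is why nearly all of the 2GB of RAM is sitting in page cache that reclaim reports as all_unreclaimable at that moment. |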
| Comments |
| Comment by James Nunez (Inactive) [ 08/Aug/18 ] |
|
We have other crashes that may be related. For example, recovery-random-scale test fail_client_mds crashed at https://testing.whamcloud.com/test_sets/9ce99840-9896-11e8-a9f7-52540065bddc with the following stack trace:

[ 1009.370664] Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
[ 1009.479318] Lustre: DEBUG MARKER: /usr/sbin/lctl mark Number of failovers:
[ 1009.479318] mds1 failed over 1 times and counting...
[ 1010.865426] Lustre: DEBUG MARKER: Number of failovers:
[ 1308.013260] socknal_sd00_01: page allocation failure: order:0, mode:0x1080020(GFP_ATOMIC)
[ 1308.013278] CPU: 0 PID: 13373 Comm: socknal_sd00_01 Tainted: G OE N 4.4.132-94.33-default #1
[ 1308.013278] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 1308.013286] 0000000000000000 ffffffff813284a0 0000000000000000 ffff88007fc03d00
[ 1308.013288] ffffffff8119b1e2 0108002000000030 ffffffff81f1da40 ffffffff81f1da40
[ 1308.013291] ffffffff8155814b ffffffff815528a0 ffffffff8157b250 0000000000000002
[ 1308.013292] Call Trace:
[ 1308.013350] [<ffffffff81019b59>] dump_trace+0x59/0x340
[ 1308.013354] [<ffffffff81019f2a>] show_stack_log_lvl+0xea/0x170
[ 1308.013356] [<ffffffff8101ad01>] show_stack+0x21/0x40
[ 1308.013367] [<ffffffff813284a0>] dump_stack+0x5c/0x7c
[ 1308.013387] [<ffffffff8119b1e2>] warn_alloc_failed+0xe2/0x150
[ 1308.013397] [<ffffffff8119b659>] __alloc_pages_nodemask+0x409/0xb80
[ 1308.013400] [<ffffffff8119bf0a>] __alloc_page_frag+0x10a/0x120
[ 1308.013416] [<ffffffff8150e1b2>] __napi_alloc_skb+0x82/0xd0
[ 1308.013444] [<ffffffffa02a23a4>] cp_rx_poll+0x1b4/0x550 [8139cp]
[ 1308.013468] [<ffffffff8151cf9c>] net_rx_action+0x15c/0x370
[ 1308.013478] [<ffffffff810854bc>] __do_softirq+0xec/0x300
[ 1308.013487] [<ffffffff8108598a>] irq_exit+0xfa/0x110
[ 1308.013498] [<ffffffff8161b0f1>] do_IRQ+0x51/0xe0
[ 1308.013510] [<ffffffff81617fab>] common_interrupt+0xeb/0xeb
[ 1308.016119] DWARF2 unwinder stuck at ret_from_intr+0x0/0x1f
[ 1308.016119]
[ 1308.016123] Leftover inexact backtrace:
[ 1308.016123]
[ 1308.016134] <IRQ> <EOI> [<ffffffff811a7544>] ? shrink_page_list+0x644/0x7f0
[ 1308.016136] [<ffffffff811a7cb0>] ? shrink_inactive_list+0x1f0/0x4f0
[ 1308.016139] [<ffffffff811a8aeb>] ? shrink_zone_memcg+0x2bb/0x6a0
[ 1308.016141] [<ffffffff811a8f87>] ? shrink_zone+0xb7/0x260
[ 1308.016143] [<ffffffff811a948d>] ? do_try_to_free_pages+0x15d/0x450
[ 1308.016144] [<ffffffff811a983a>] ? try_to_free_pages+0xba/0x170
[ 1308.016146] [<ffffffff8119b843>] ? __alloc_pages_nodemask+0x5f3/0xb80
[ 1308.016155] [<ffffffff811ea06d>] ? kmem_getpages+0x4d/0xf0
[ 1308.016157] [<ffffffff811eb869>] ? fallback_alloc+0x199/0x260
[ 1308.016158] [<ffffffff811eaf5b>] ? kmem_cache_alloc_node+0x7b/0x480
[ 1308.016160] [<ffffffff81508ae8>] ? __alloc_skb+0x48/0x270
[ 1308.016168] [<ffffffff81567194>] ? sk_stream_alloc_skb+0x44/0x170
[ 1308.016170] [<ffffffff815676b8>] ? tcp_sendpage+0x3f8/0x610
[ 1308.016175] [<ffffffff810b6d9b>] ? set_next_entity+0x4b/0x720
[ 1308.016177] [<ffffffff810b9e95>] ? put_prev_entity+0x35/0x6b0
[ 1308.016207] [<ffffffffa0656df9>] ? ksocknal_lib_send_kiov+0x99/0x240 [ksocklnd]
[ 1308.016211] [<ffffffff810a7388>] ? finish_task_switch+0x78/0x2f0
[ 1308.016215] [<ffffffffa0650da7>] ? ksocknal_process_transmit+0x2b7/0xb60 [ksocklnd]
[ 1308.016219] [<ffffffffa06554b1>] ? ksocknal_scheduler+0x231/0x660 [ksocklnd]
[ 1308.016225] [<ffffffff810c44a0>] ? prepare_to_wait_event+0xf0/0xf0
[ 1308.016229] [<ffffffffa0655280>] ? ksocknal_recv+0x2a0/0x2a0 [ksocklnd]
[ 1308.016233] [<ffffffff8109ebc9>] ? kthread+0xc9/0xe0
[ 1308.016235] [<ffffffff8109eb00>] ? kthread_park+0x50/0x50
[ 1308.016237] [<ffffffff81617805>] ? ret_from_fork+0x55/0x80
[ 1308.016238] [<ffffffff8109eb00>] ? kthread_park+0x50/0x50
|
| Comment by Andreas Dilger [ 10/Aug/18 ] |
|
It looks like all of the pages are consumed by the page cache: 0.5GB of active file pages and 1.2GB of inactive file pages. There should be no problem dropping inactive file pages from memory, which again makes me think there is something wrong with how CLIO is interacting with the page cache. Some possibilities:

- Maybe the threads that are supposed to be doing background page cleaning are not being notified that there are dirty pages?
- Maybe our pages are not being added to the right lists?
- Maybe our pages are keeping too many references and cannot be freed when found?
- Maybe we have too much dirty in-flight data and the writer is not being throttled correctly (there was some work with the "NFS dirty inflight" accounting that was disabled)?
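For reference, a hypothetical minimal sketch of the wiring these questions are about, assuming a 4.4-era kernel (demo_aops and demo_writepages are illustrative names, not Lustre symbols): the flusher threads only find work if pages are dirtied through the generic accounting paths and the inode lands on the bdi writeback list, after which wb_workfn calls back into the filesystem's ->writepages (ll_writepages in the trace below):

    /* Hypothetical sketch, not Lustre code. */
    #include <linux/fs.h>
    #include <linux/mm.h>
    #include <linux/pagemap.h>
    #include <linux/writeback.h>

    static int demo_writepages(struct address_space *mapping,
                               struct writeback_control *wbc)
    {
            /* Invoked from wb_workfn via __writeback_single_inode;
             * walks pages tagged PAGECACHE_TAG_DIRTY and writes them.
             */
            return generic_writepages(mapping, wbc);
    }

    static const struct address_space_operations demo_aops = {
            /* __set_page_dirty_nobuffers does the accounting in
             * question: it tags the page dirty in the radix tree and
             * bumps NR_FILE_DIRTY, so the flusher knows work exists.
             */
            .set_page_dirty = __set_page_dirty_nobuffers,
            .writepages     = demo_writepages,
    };

If pages are dirtied outside these paths, or hold extra references, reclaim scans them (pages_scanned is in the millions in the Mem-Info above) but can neither free nor flush them. |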
| Comment by James Nunez (Inactive) [ 15/Aug/18 ] |
|
We have a very different stack trace at https://testing.whamcloud.com/test_sets/12c3e76e-a08f-11e8-a9f7-52540065bddc, but we do see the following in the kernel crash:

[ 789.847234] kworker/u4:1: page allocation failure: order:0, mode:0x1284020(GFP_ATOMIC|__GFP_COMP|__GFP_NOTRACK)

The stack trace in the kernel crash is:

[ 789.838046] active_anon:1604 inactive_anon:1945 isolated_anon:0
active_file:141338 inactive_file:292666 isolated_file:32
unevictable:20 dirty:0 writeback:32764 unstable:0
slab_reclaimable:2640 slab_unreclaimable:27603
mapped:7046 shmem:2177 pagetables:887 bounce:0
free:0 free_pcp:0 free_cma:0
[ 789.838051] Node 0 DMA free:0kB min:376kB low:468kB high:560kB active_anon:0kB inactive_anon:120kB active_file:1852kB inactive_file:5476kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15904kB mlocked:0kB dirty:0kB writeback:568kB mapped:216kB shmem:120kB slab_reclaimable:76kB slab_unreclaimable:7964kB kernel_stack:32kB pagetables:24kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:24464992 all_unreclaimable? yes
[ 789.838052] lowmem_reserve[]: 0 1843 1843 1843 1843
[ 789.838057] Node 0 DMA32 free:0kB min:44676kB low:55844kB high:67012kB active_anon:6416kB inactive_anon:7660kB active_file:563500kB inactive_file:1165188kB unevictable:80kB isolated(anon):0kB isolated(file):128kB present:2080744kB managed:1900784kB mlocked:80kB dirty:0kB writeback:130488kB mapped:27968kB shmem:8588kB slab_reclaimable:10484kB slab_unreclaimable:102448kB kernel_stack:2656kB pagetables:3524kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:1025743296 all_unreclaimable? yes
[ 789.838058] lowmem_reserve[]: 0 0 0 0 0
[ 789.838063] Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
[ 789.838067] Node 0 DMA32: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
[ 789.838068] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[ 789.838069] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 789.838069] 43168 total pagecache pages
[ 789.838070] 232 pages in swap cache
[ 789.838070] Swap cache stats: add 6239, delete 6007, find 269/355
[ 789.838071] Free swap = 14313664kB
[ 789.838071] Total swap = 14338044kB
[ 789.838072] 524184 pages RAM
[ 789.838072] 0 pages HighMem/MovableOnly
[ 789.838072] 45012 pages reserved
[ 789.838072] 0 pages hwpoisoned
[ 789.838087] kworker/u4:1: page allocation failure: order:0, mode:0x1284020(GFP_ATOMIC|__GFP_COMP|__GFP_NOTRACK)
[ 789.838088] CPU: 0 PID: 53 Comm: kworker/u4:1 Tainted: G OE N 4.4.132-94.33-default #1
[ 789.838089] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 789.838091] Workqueue: writeback wb_workfn (flush-lustre-ffff88007a)
[ 789.838092] 0000000000000000 ffffffff813284a0 0000000000000000 ffff88007a8a7100
[ 789.838094] ffffffff8119b1e2 0128402000000004 0000000400000000 ffff88007fc1d828
[ 789.838095] ffff88007ffcf490 ffff88007fc1d828 0000000400000000 ffff88007a8a7188
[ 789.838095] Call Trace:
[ 789.838100] [<ffffffff81019b59>] dump_trace+0x59/0x340
[ 789.838103] [<ffffffff81019f2a>] show_stack_log_lvl+0xea/0x170
[ 789.838105] [<ffffffff8101ad01>] show_stack+0x21/0x40
[ 789.838108] [<ffffffff813284a0>] dump_stack+0x5c/0x7c
[ 789.838110] [<ffffffff8119b1e2>] warn_alloc_failed+0xe2/0x150
[ 789.838113] [<ffffffff8119b659>] __alloc_pages_nodemask+0x409/0xb80
[ 789.838116] [<ffffffff811ea06d>] kmem_getpages+0x4d/0xf0
[ 789.838118] [<ffffffff811eb8d5>] fallback_alloc+0x205/0x260
[ 789.838122] [<ffffffff811ec3f6>] kmem_cache_alloc_trace+0x1f6/0x460
[ 789.838125] [<ffffffff812384fb>] wb_start_writeback+0x3b/0xe0
[ 789.838128] [<ffffffff81238a96>] wakeup_flusher_threads+0xc6/0x150
[ 789.838130] [<ffffffff811a9571>] do_try_to_free_pages+0x241/0x450
[ 789.838132] [<ffffffff811a983a>] try_to_free_pages+0xba/0x170
[ 789.838135] [<ffffffff8119b843>] __alloc_pages_nodemask+0x5f3/0xb80
[ 789.838137] [<ffffffff811ea06d>] kmem_getpages+0x4d/0xf0
[ 789.838139] [<ffffffff811eb869>] fallback_alloc+0x199/0x260
[ 789.838142] [<ffffffff811ebf99>] kmem_cache_alloc+0x1f9/0x460
[ 789.838166] [<ffffffffa0a789c6>] ptlrpc_request_cache_alloc+0x26/0x100 [ptlrpc]
[ 789.838190] [<ffffffffa0a78abe>] ptlrpc_request_alloc_internal+0x1e/0x420 [ptlrpc]
[ 789.838198] [<ffffffffa0dd4de7>] osc_brw_prep_request+0x217/0xf90 [osc]
[ 789.838206] [<ffffffffa0dd8524>] osc_build_rpc+0x4b4/0xef0 [osc]
[ 789.838214] [<ffffffffa0df0136>] osc_io_unplug0+0xf06/0x1a10 [osc]
[ 789.838222] [<ffffffffa0df927b>] osc_cache_writeback_range+0xc6b/0x12e0 [osc]
[ 789.838230] [<ffffffffa0de5f48>] osc_io_fsync_start+0x88/0x3a0 [osc]
[ 789.838252] [<ffffffffa0848e18>] cl_io_start+0x58/0x110 [obdclass]
[ 789.838259] [<ffffffffa0c281e7>] lov_io_call.isra.4+0x77/0x120 [lov]
[ 789.838279] [<ffffffffa0848e18>] cl_io_start+0x58/0x110 [obdclass]
[ 789.838299] [<ffffffffa084ae72>] cl_io_loop+0x102/0xc30 [obdclass]
[ 789.838313] [<ffffffffa0cd95ba>] cl_sync_file_range+0x28a/0x310 [lustre]
[ 789.838327] [<ffffffffa0cf9e83>] ll_writepages+0x73/0x1d0 [lustre]
[ 789.838330] [<ffffffff812373fd>] __writeback_single_inode+0x3d/0x340
[ 789.838333] [<ffffffff81237bce>] writeback_sb_inodes+0x21e/0x4c0
[ 789.838336] [<ffffffff81237ef1>] __writeback_inodes_wb+0x81/0xb0
[ 789.838339] [<ffffffff81238176>] wb_writeback+0x256/0x2e0
[ 789.838341] [<ffffffff81238725>] wb_workfn+0xa5/0x350
[ 789.838344] [<ffffffff81098ac4>] process_one_work+0x154/0x410
[ 789.838347] [<ffffffff810996a6>] worker_thread+0x116/0x4a0
[ 789.838349] [<ffffffff8109ebc9>] kthread+0xc9/0xe0
[ 789.838352] [<ffffffff81617805>] ret_from_fork+0x55/0x80
[ 789.840324] DWARF2 unwinder stuck at ret_from_fork+0x55/0x80
[ 789.840325]
[ 789.840325] Leftover inexact backtrace:
[ 789.840327] [<ffffffff8109eb00>] ? kthread_park+0x50/0x50
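
A notable detail in this trace: the failing thread is the writeback flusher itself (Workqueue: writeback wb_workfn, flush-lustre). While writing Lustre pages back it allocates a ptlrpc request, falls into direct reclaim, and reclaim in turn tries to queue more flusher work via wb_start_writeback, whose GFP_ATOMIC work-item allocation is what fails. One common pattern for allocations made on the writeback path is GFP_NOFS, sketched below; whether ptlrpc_request_cache_alloc already does this, and whether it would help here, is not shown in the ticket:

    /* Speculative sketch, not a landed fix: request_cache and
     * struct demo_request are illustrative names. GFP_NOFS still
     * allows direct reclaim of clean page cache, but prevents reclaim
     * from recursing back into filesystem writeback (->writepage) from
     * a thread that is already doing writeback. It does not by itself
     * avoid the failed GFP_ATOMIC allocation inside wb_start_writeback.
     */
    #include <linux/slab.h>

    struct demo_request {
            int placeholder;        /* stands in for real request fields */
    };

    static struct kmem_cache *request_cache;

    static struct demo_request *demo_request_alloc(void)
    {
            return kmem_cache_alloc(request_cache, GFP_NOFS);
    }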
|
| Comment by Alena Nikitenko [ 03/Dec/21 ] |
|
Found a somewhat similar case in 2.12.8 testing: https://testing.whamcloud.com/test_sets/a55bdc1c-0dac-4efc-97d4-8bbb13cfece1

[46646.539896] Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
[46647.719288] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 has failed over 39 times, and counting...
[46647.924413] Lustre: DEBUG MARKER: mds1 has failed over 39 times, and counting...
[46909.013358] socknal_sd00_01: page allocation failure: order:0, mode:0x20
[46909.016917] CPU: 0 PID: 11608 Comm: socknal_sd00_01 Kdump: loaded Tainted: G OE ------------ 3.10.0-1160.45.1.el7.x86_64 #1
[46909.019101] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[46909.020117] Call Trace:
[46909.020626] <IRQ> [<ffffffff81783539>] dump_stack+0x19/0x1b
[46909.021841] [<ffffffff811c4bc0>] warn_alloc_failed+0x110/0x180
[46909.022939] [<ffffffff810e0c46>] ? select_task_rq_fair+0x5a6/0x760
[46909.024075] [<ffffffff811c975f>] __alloc_pages_nodemask+0x9df/0xbe0
[46909.025176] [<ffffffff811c9b68>] page_frag_alloc+0x158/0x170
[46909.026216] [<ffffffff81644fa6>] __netdev_alloc_skb+0xa6/0x110
[46909.027530] [<ffffffffc01a91ee>] page_to_skb+0x4e/0x1f0 [virtio_net]
[46909.028719] [<ffffffffc01ab4b9>] virtnet_poll+0x2c9/0x750 [virtio_net]
[46909.029884] [<ffffffff816571cf>] net_rx_action+0x26f/0x390
[46909.030875] [<ffffffff810a4bf5>] __do_softirq+0xf5/0x280
[46909.031909] [<ffffffff817994ec>] call_softirq+0x1c/0x30
[46909.032860] [<ffffffff8102f715>] do_softirq+0x65/0xa0
[46909.033905] [<ffffffff810a4f75>] irq_exit+0x105/0x110
[46909.034845] [<ffffffff8179a8d6>] do_IRQ+0x56/0xf0
[46909.035695] [<ffffffff8178c36a>] common_interrupt+0x16a/0x16a
[46909.036722] <EOI> [<ffffffff811d3ff2>] ? shrink_inactive_list+0x142/0x5c0
[46909.038035] [<ffffffff811d4b45>] shrink_lruvec+0x375/0x730
[46909.039009] [<ffffffff811d4f76>] shrink_zone+0x76/0x1a0
[46909.040001] [<ffffffff811d5460>] do_try_to_free_pages+0xf0/0x520
[46909.041143] [<ffffffff811d598c>] try_to_free_pages+0xfc/0x180
[46909.042161] [<ffffffff811c95b1>] __alloc_pages_nodemask+0x831/0xbe0
[46909.043327] [<ffffffff812193b8>] alloc_pages_current+0x98/0x110
[46909.044503] [<ffffffff81227b5d>] new_slab+0x44d/0x4e0
[46909.045469] [<ffffffff81227fbc>] ___slab_alloc+0x3cc/0x520
[46909.046477] [<ffffffff816401ad>] ? __alloc_skb+0x8d/0x2d0
[46909.047437] [<ffffffff816401ad>] ? __alloc_skb+0x8d/0x2d0
[46909.048419] [<ffffffff8177fe65>] __slab_alloc+0x40/0x5c
[46909.049396] [<ffffffff8122b1c8>] __kmalloc_node_track_caller+0xb8/0x290
[46909.050703] [<ffffffff816401ad>] ? __alloc_skb+0x8d/0x2d0
[46909.051690] [<ffffffff8163f111>] __kmalloc_reserve.isra.32+0x31/0x90
[46909.052818] [<ffffffff8164017d>] ? __alloc_skb+0x5d/0x2d0
[46909.053783] [<ffffffff816401ad>] __alloc_skb+0x8d/0x2d0
[46909.054731] [<ffffffff816b2e62>] sk_stream_alloc_skb+0x52/0x1b0
[46909.055861] [<ffffffff816b33a1>] tcp_sendpage+0x3e1/0x5c0
[46909.057009] [<ffffffffc099077b>] ksocknal_lib_send_kiov+0xdb/0x2e0 [ksocklnd]
[46909.058418] [<ffffffffc0991232>] ? ksocknal_lib_send_iov+0xd2/0x140 [ksocklnd]
[46909.059700] [<ffffffff8106d39e>] ? kvm_clock_get_cycles+0x1e/0x20
[46909.060811] [<ffffffffc0989d1e>] ksocknal_process_transmit+0x39e/0xc10 [ksocklnd]
[46909.062319] [<ffffffffc098e750>] ksocknal_scheduler+0x320/0xd50 [ksocklnd]
[46909.063632] [<ffffffff810c6f50>] ? wake_up_atomic_t+0x30/0x30
[46909.064654] [<ffffffffc098e430>] ? ksocknal_recv+0x2a0/0x2a0 [ksocklnd]
[46909.065854] [<ffffffff810c5e61>] kthread+0xd1/0xe0
[46909.066875] [<ffffffff810c5d90>] ? insert_kthread_work+0x40/0x40
[46909.067966] [<ffffffff81795df7>] ret_from_fork_nospec_begin+0x21/0x21
[46909.069140] [<ffffffff810c5d90>] ? insert_kthread_work+0x40/0x40
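
Here the allocation that fails is the virtio_net receive buffer (order:0; on this 3.10 kernel, mode:0x20 is GFP_ATOMIC), taken in softirq context while the interrupted ksocknal send path was already in direct reclaim, much like the receive-path trace in the first comment. NAPI receive paths generally treat an atomic rx allocation failure as a dropped packet rather than a fatal error; an illustrative sketch of that pattern (not virtio_net's actual code, and demo_rx_one/pkt_len are hypothetical names):

    /* Illustrative sketch: drop the frame, count it, and let the next
     * NAPI poll retry. The console warning is noisy but not fatal by
     * itself.
     */
    #include <linux/netdevice.h>
    #include <linux/skbuff.h>

    static void demo_rx_one(struct net_device *dev, unsigned int pkt_len)
    {
            struct sk_buff *skb = netdev_alloc_skb(dev, pkt_len);

            if (unlikely(!skb)) {
                    dev->stats.rx_dropped++;        /* drop and move on */
                    return;
            }
            /* ... copy the frame into skb and hand it to the stack ... */
            dev_kfree_skb(skb);
    }

In other words, the individual warnings are likely survivable; the recurring element across all of these traces is the exhausted free lists while most of memory sits in file-backed page cache. |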