[LU-10812] recovery-random-scale test_fail_client_mds: test_fail_client_mds returned 5 Created: 13/Mar/18  Updated: 06/Aug/18

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0
Fix Version/s: None

Type: Bug
Priority: Minor
Reporter: Maloo
Assignee: WC Triage
Resolution: Unresolved
Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for sarah_lw <wei3.liu@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/85ed0988-278a-11e8-9e0e-52540065bddc

test_fail_client_mds failed with the following error:

test_fail_client_mds returned 5

client and server: lustre-master tag-2.10.59 SLES12SP3

Client 2 hit "ll_ping: page allocation failure".

Console log:

[  972.544170] Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK              ELAPSED=357 DURATION=86400 PERIOD=1200
[ 1005.987638] Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=357 DURATION=86400 PERIOD=1200
[ 1022.308773] Lustre: DEBUG MARKER: rc=0;
[ 1022.308773] 			val=$(/usr/sbin/lctl get_param -n catastrophe 2>&1);
[ 1022.308773] 			if [[ $? -eq 0 && $val -ne 0 ]]; then
[ 1022.308773] 				echo $(hostname -s): $val;
[ 1022.308773] 				rc=$val;
[ 1022.308773] 			fi;
[ 1022.308773] 			exit $rc
[ 1022.357893] Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
[ 1022.566140] Lustre: DEBUG MARKER: /usr/sbin/lctl mark FAIL CLIENT onyx-47vm4...
[ 1023.976580] Lustre: DEBUG MARKER: FAIL CLIENT onyx-47vm4...
[ 1033.244860] Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1
[ 1099.399100] Lustre: DEBUG MARKER: Starting failover on mds1
[ 1123.785219] Lustre: 1838:0:(client.c:2100:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1520989165/real 1520989165]  req@ffff880040416c40 x1594871608337664/t0(0) o400->MGC10.2.8.216@tcp@10.2.8.217@tcp:26/25 lens 224/224 e 0 to 1 dl 1520989172 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[ 1123.785238] LustreError: 166-1: MGC10.2.8.216@tcp: Connection to MGS (at 10.2.8.217@tcp) was lost; in progress operations using this service will fail
[ 1158.804380] Lustre: 1838:0:(client.c:2100:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1520989206/real 1520989211]  req@ffff880064c4d080 x1594871608390224/t0(0) o400->lustre-MDT0000-mdc-ffff88007a3ec800@10.2.8.217@tcp:12/10 lens 224/224 e 0 to 1 dl 1520989255 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
[ 1158.804393] Lustre: lustre-MDT0000-mdc-ffff88007a3ec800: Connection to lustre-MDT0000 (at 10.2.8.217@tcp) was lost; in progress operations using this service will wait for recovery to complete
[ 1178.804585] Lustre: Evicted from MGS (at MGC10.2.8.216@tcp_0) after server handle changed from 0x620469374b42c3ff to 0x3e7afd771be5d16d
[ 1178.806122] Lustre: MGC10.2.8.216@tcp: Connection restored to MGC10.2.8.216@tcp_0 (at 10.2.8.216@tcp)
[ 1178.809735] LustreError: 1836:0:(client.c:2972:ptlrpc_replay_interpret()) @@@ status 301, old was 0  req@ffff88005eb3bc40 x1594871608322400/t8590014200(8590014200) o101->lustre-MDT0000-mdc-ffff88007a3ec800@10.2.8.216@tcp:12/10 lens 952/544 e 0 to 0 dl 1520989279 ref 2 fl Interpret:RP/4/0 rc 301/301
[ 1179.747452] Lustre: 1838:0:(client.c:2100:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1520989176/real 1520989176]  req@ffff880069295cc0 x1594871608341840/t0(0) o400->lustre-MDT0000-mdc-ffff88007a3ec800@10.2.8.216@tcp:12/10 lens 224/224 e 0 to 1 dl 1520989225 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[ 1179.747464] Lustre: 1838:0:(client.c:2100:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
[ 1180.738657] Lustre: lustre-MDT0000-mdc-ffff88007a3ec800: Connection restored to 10.2.8.216@tcp (at 10.2.8.216@tcp)
[ 1214.574355] Lustre: 1837:0:(client.c:2100:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1520989191/real 1520989191]  req@ffff880060bc8380 x1594871608349568/t0(0) o400->lustre-MDT0000-mdc-ffff88007a3ec800@10.2.8.216@tcp:12/10 lens 224/224 e 0 to 1 dl 1520989240 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[ 1214.574362] Lustre: 1837:0:(client.c:2100:ptlrpc_expire_one_request()) Skipped 5 previous similar messages
[ 1332.924289] Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-47vm4: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
[ 1333.499978] Lustre: DEBUG MARKER: onyx-47vm4: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
[ 1366.096335] Lustre: DEBUG MARKER: /usr/sbin/lctl mark Started client load: tar on onyx-47vm4
[ 1366.593473] Lustre: DEBUG MARKER: Started client load: tar on onyx-47vm4
[ 1368.645975] Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failed client reintegrated -- failure NOT OK
[ 1369.798370] Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failed client reintegrated -- failure NOT OK
[ 1370.023893] Lustre: DEBUG MARKER: rc=0;
[ 1370.023893] 			val=$(/usr/sbin/lctl get_param -n catastrophe 2>&1);
[ 1370.023893] 			if [[ $? -eq 0 && $val -ne 0 ]]; then
[ 1370.023893] 				echo $(hostname -s): $val;
[ 1370.023893] 				rc=$val;
[ 1370.023893] 			fi;
[ 1370.023893] 			exit $rc
[ 1370.080186] Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
[ 1370.190671] Lustre: DEBUG MARKER: /usr/sbin/lctl mark Number of failovers:
[ 1370.190671] mds1 failed over 2 times                and counting...
[ 1370.405207] Lustre: DEBUG MARKER: Number of failovers:
[ 1859.772170] ll_ping: page allocation failure: order:0, mode:0x1080020(GFP_ATOMIC)
[ 1859.772182] CPU: 0 PID: 1839 Comm: ll_ping Tainted: G           OE   N  4.4.114-94.11-default #1
[ 1859.772183] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 1859.772189]  0000000000000000 ffffffff813274b0 0000000000000000 ffff88007fc03d00
[ 1859.772190]  ffffffff8119a732 0108002000000030 000000000000008a 0000000000000400
[ 1859.772191]  0000000000000194 0000000000000003 0000000000000001 0000000000000000
[ 1859.772192] Call Trace:
[ 1859.772207]  [<ffffffff81019b59>] dump_trace+0x59/0x340
[ 1859.772211]  [<ffffffff81019f2a>] show_stack_log_lvl+0xea/0x170
[ 1859.772213]  [<ffffffff8101acd1>] show_stack+0x21/0x40
[ 1859.772219]  [<ffffffff813274b0>] dump_stack+0x5c/0x7c
[ 1859.772223]  [<ffffffff8119a732>] warn_alloc_failed+0xe2/0x150
[ 1859.772226]  [<ffffffff8119aba9>] __alloc_pages_nodemask+0x409/0xb80
[ 1859.772229]  [<ffffffff8119b45a>] __alloc_page_frag+0x10a/0x120
[ 1859.772235]  [<ffffffff8150c0f2>] __napi_alloc_skb+0x82/0xd0
[ 1859.772242]  [<ffffffffa02db354>] cp_rx_poll+0x1b4/0x550 [8139cp]
[ 1859.772251]  [<ffffffff8151aebc>] net_rx_action+0x15c/0x370
[ 1859.772255]  [<ffffffff810852bc>] __do_softirq+0xec/0x300
[ 1859.772258]  [<ffffffff8108578a>] irq_exit+0xfa/0x110
[ 1859.772265]  [<ffffffff816181f1>] do_IRQ+0x51/0xe0
[ 1859.772268]  [<ffffffff816156c9>] common_interrupt+0xc9/0xc9
[ 1859.774527] DWARF2 unwinder stuck at ret_from_intr+0x0/0x1b
[ 1859.774528] 
[ 1859.774528] Leftover inexact backtrace:
[ 1859.774528] 
[ 1859.774534]  <IRQ>  <EOI>  [<ffffffff811a6820>] ? shrink_page_list+0x440/0x7f0
[ 1859.774536]  [<ffffffff811a7190>] ? shrink_inactive_list+0x1f0/0x4f0
[ 1859.774538]  [<ffffffff811a7fcb>] ? shrink_zone_memcg+0x2bb/0x6a0
[ 1859.774540]  [<ffffffff811a8467>] ? shrink_zone+0xb7/0x260
[ 1859.774542]  [<ffffffff811a896d>] ? do_try_to_free_pages+0x15d/0x450
[ 1859.774543]  [<ffffffff811a8d1a>] ? try_to_free_pages+0xba/0x170
[ 1859.774545]  [<ffffffff8119ad93>] ? __alloc_pages_nodemask+0x5f3/0xb80
[ 1859.774547]  [<ffffffff811e94cd>] ? kmem_getpages+0x4d/0xf0
[ 1859.774548]  [<ffffffff811eacc9>] ? fallback_alloc+0x199/0x260
[ 1859.774550]  [<ffffffff811eb3f9>] ? kmem_cache_alloc+0x1f9/0x460
[ 1859.774598]  [<ffffffffa0ad3dc6>] ? ptlrpc_request_cache_alloc+0x26/0x100 [ptlrpc]
[ 1859.774621]  [<ffffffffa0ad3ebe>] ? ptlrpc_request_alloc_internal+0x1e/0x420 [ptlrpc]
[ 1859.774644]  [<ffffffffa0adc408>] ? ptlrpc_request_alloc_pack+0x18/0x50 [ptlrpc]
[ 1859.774669]  [<ffffffffa0af7c8c>] ? ptlrpc_prep_ping+0x1c/0x40 [ptlrpc]
[ 1859.774693]  [<ffffffffa0af8105>] ? ptlrpc_pinger_main+0x335/0xa90 [ptlrpc]
[ 1859.774696]  [<ffffffff810ab000>] ? wake_up_q+0x80/0x80
[ 1859.774718]  [<ffffffffa0af7dd0>] ? ptlrpc_obd_ping+0x120/0x120 [ptlrpc]
[ 1859.774720]  [<ffffffff8109e8f9>] ? kthread+0xc9/0xe0
[ 1859.774722]  [<ffffffff8109e830>] ? kthread_park+0x50/0x50
[ 1859.774723]  [<ffffffff81614f45>] ? ret_from_fork+0x55/0x80
[ 1859.774725]  [<ffffffff8109e830>] ? kthread_park+0x50/0x50
[ 1860.067158] ll_ping: page allocation failure: order:0, mode:0x1080020(GFP_ATOMIC)
[ 1860.067163] CPU: 0 PID: 1839 Comm: ll_ping Tainted: G           OE   N  4.4.114-94.11-default #1
[ 1860.067170] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 1860.067174]  0000000000000000 ffffffff813274b0 0000000000000000 ffff88007fc03d00
[ 1860.067175]  ffffffff8119a732 0108002000000030 0000000000000005 0000000000000400
[ 1860.067177]  00000000000002f3 0000000000000001 0000000000000001 0000000000000000
[ 1860.067177] Call Trace:
[ 1860.067191]  [<ffffffff81019b59>] dump_trace+0x59/0x340
[ 1860.067195]  [<ffffffff81019f2a>] show_stack_log_lvl+0xea/0x170
[ 1860.067197]  [<ffffffff8101acd1>] show_stack+0x21/0x40
[ 1860.067201]  [<ffffffff813274b0>] dump_stack+0x5c/0x7c
[ 1860.067205]  [<ffffffff8119a732>] warn_alloc_failed+0xe2/0x150
[ 1860.067209]  [<ffffffff8119aba9>] __alloc_pages_nodemask+0x409/0xb80
[ 1860.067212]  [<ffffffff8119b45a>] __alloc_page_frag+0x10a/0x120
[ 1860.067216]  [<ffffffff8150c0f2>] __napi_alloc_skb+0x82/0xd0
[ 1860.067222]  [<ffffffffa02db354>] cp_rx_poll+0x1b4/0x550 [8139cp]
[ 1860.067232]  [<ffffffff8151aebc>] net_rx_action+0x15c/0x370
[ 1860.067236]  [<ffffffff810852bc>] __do_softirq+0xec/0x300
[ 1860.067238]  [<ffffffff8108578a>] irq_exit+0xfa/0x110
[ 1860.067245]  [<ffffffff816181f1>] do_IRQ+0x51/0xe0
[ 1860.067248]  [<ffffffff816156c9>] common_interrupt+0xc9/0xc9
[ 1860.069507] DWARF2 unwinder stuck at ret_from_intr+0x0/0x1b
[ 1860.069508] 
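
For triage, the allocation failures in a console log like the one above can be tallied per process. The following is a minimal sketch, not part of the Maloo output; the function name summarize_alloc_failures and the SAMPLE list are hypothetical, and the regular expression only assumes the "page allocation failure" line format seen in this ticket.

```python
import re
from collections import Counter

# Matches kernel lines of the form seen in this ticket, for example:
# [ 1859.772170] ll_ping: page allocation failure: order:0, mode:0x1080020(GFP_ATOMIC)
ALLOC_FAIL = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<comm>\S+): page allocation failure: "
    r"order:(?P<order>\d+), mode:(?P<mode>0x[0-9a-fA-F]+)\((?P<flags>[^)]*)\)"
)


def summarize_alloc_failures(lines):
    """Count allocation failures per (comm, order, gfp flags).

    Returns (counts, first_seen), where first_seen maps each key to the
    kernel timestamp of its first occurrence.
    """
    counts = Counter()
    first_seen = {}
    for line in lines:
        m = ALLOC_FAIL.search(line)
        if m is None:
            continue
        key = (m["comm"], int(m["order"]), m["flags"])
        counts[key] += 1
        first_seen.setdefault(key, float(m["ts"]))
    return counts, first_seen


# Lines copied from the console log in this ticket.
SAMPLE = [
    "[ 1370.405207] Lustre: DEBUG MARKER: Number of failovers:",
    "[ 1859.772170] ll_ping: page allocation failure: order:0, mode:0x1080020(GFP_ATOMIC)",
    "[ 1859.772192] Call Trace:",
    "[ 1860.067158] ll_ping: page allocation failure: order:0, mode:0x1080020(GFP_ATOMIC)",
]

if __name__ == "__main__":
    counts, first_seen = summarize_alloc_failures(SAMPLE)
    for key, n in counts.items():
        print(key, n, "first at", first_seen[key])
```

Grouping by the gfp flags as well as the process name keeps the atomic receive-path failures separate from failures in other allocation contexts.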

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
recovery-random-scale test_fail_client_mds - test_fail_client_mds returned 5



 Comments   
Comment by Sarah Liu [ 14/Mar/18 ]
[ 1369.798370] Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failed client reintegrated -- failure NOT OK
[ 1370.023893] Lustre: DEBUG MARKER: rc=0;
[ 1370.023893] val=$(/usr/sbin/lctl get_param -n catastrophe 2>&1);
[ 1370.023893] if [[ $? -eq 0 && $val -ne 0 ]]; then
[ 1370.023893] echo $(hostname -s): $val;
[ 1370.023893] rc=$val;
[ 1370.023893] fi;
[ 1370.023893] exit $rc
[ 1370.080186] Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
[ 1370.190671] Lustre: DEBUG MARKER: /usr/sbin/lctl mark Number of failovers:
[ 1370.190671] mds1 failed over 2 times and counting...
[ 1370.405207] Lustre: DEBUG MARKER: Number of failovers:
[ 1859.772170] ll_ping: page allocation failure: order:0, mode:0x1080020(GFP_ATOMIC)
[ 1859.772182] CPU: 0 PID: 1839 Comm: ll_ping Tainted: G OE N 4.4.114-94.11-default #1
[ 1859.772183] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 1859.772189] 0000000000000000 ffffffff813274b0 0000000000000000 ffff88007fc03d00
[ 1859.772190] ffffffff8119a732 0108002000000030 000000000000008a 0000000000000400
[ 1859.772191] 0000000000000194 0000000000000003 0000000000000001 0000000000000000
[ 1859.772192] Call Trace:
[ 1859.772207] [<ffffffff81019b59>] dump_trace+0x59/0x340
[ 1859.772211] [<ffffffff81019f2a>] show_stack_log_lvl+0xea/0x170
[ 1859.772213] [<ffffffff8101acd1>] show_stack+0x21/0x40
[ 1859.772219] [<ffffffff813274b0>] dump_stack+0x5c/0x7c
[ 1859.772223] [<ffffffff8119a732>] warn_alloc_failed+0xe2/0x150
[ 1859.772226] [<ffffffff8119aba9>] __alloc_pages_nodemask+0x409/0xb80
[ 1859.772229] [<ffffffff8119b45a>] __alloc_page_frag+0x10a/0x120
[ 1859.772235] [<ffffffff8150c0f2>] __napi_alloc_skb+0x82/0xd0
[ 1859.772242] [<ffffffffa02db354>] cp_rx_poll+0x1b4/0x550 [8139cp]
[ 1859.772251] [<ffffffff8151aebc>] net_rx_action+0x15c/0x370
[ 1859.772255] [<ffffffff810852bc>] __do_softirq+0xec/0x300
[ 1859.772258] [<ffffffff8108578a>] irq_exit+0xfa/0x110
[ 1859.772265] [<ffffffff816181f1>] do_IRQ+0x51/0xe0
[ 1859.772268] [<ffffffff816156c9>] common_interrupt+0xc9/0xc9
[ 1859.774527] DWARF2 unwinder stuck at ret_from_intr+0x0/0x1b
[ 1859.774528] 
[ 1859.774528] Leftover inexact backtrace:
[ 1859.774528] 
[ 1859.774534] <IRQ> <EOI> [<ffffffff811a6820>] ? shrink_page_list+0x440/0x7f0
[ 1859.774536] [<ffffffff811a7190>] ? shrink_inactive_list+0x1f0/0x4f0
[ 1859.774538] [<ffffffff811a7fcb>] ? shrink_zone_memcg+0x2bb/0x6a0
[ 1859.774540] [<ffffffff811a8467>] ? shrink_zone+0xb7/0x260
[ 1859.774542] [<ffffffff811a896d>] ? do_try_to_free_pages+0x15d/0x450
[ 1859.774543] [<ffffffff811a8d1a>] ? try_to_free_pages+0xba/0x170
[ 1859.774545] [<ffffffff8119ad93>] ? __alloc_pages_nodemask+0x5f3/0xb80
[ 1859.774547] [<ffffffff811e94cd>] ? kmem_getpages+0x4d/0xf0
[ 1859.774548] [<ffffffff811eacc9>] ? fallback_alloc+0x199/0x260
[ 1859.774550] [<ffffffff811eb3f9>] ? kmem_cache_alloc+0x1f9/0x460
[ 1859.774598] [<ffffffffa0ad3dc6>] ? ptlrpc_request_cache_alloc+0x26/0x100 [ptlrpc]
[ 1859.774621] [<ffffffffa0ad3ebe>] ? ptlrpc_request_alloc_internal+0x1e/0x420 [ptlrpc]
[ 1859.774644] [<ffffffffa0adc408>] ? ptlrpc_request_alloc_pack+0x18/0x50 [ptlrpc]
[ 1859.774669] [<ffffffffa0af7c8c>] ? ptlrpc_prep_ping+0x1c/0x40 [ptlrpc]
[ 1859.774693] [<ffffffffa0af8105>] ? ptlrpc_pinger_main+0x335/0xa90 [ptlrpc]
[ 1859.774696] [<ffffffff810ab000>] ? wake_up_q+0x80/0x80
[ 1859.774718] [<ffffffffa0af7dd0>] ? ptlrpc_obd_ping+0x120/0x120 [ptlrpc]
[ 1859.774720] [<ffffffff8109e8f9>] ? kthread+0xc9/0xe0
[ 1859.774722] [<ffffffff8109e830>] ? kthread_park+0x50/0x50
[ 1859.774723] [<ffffffff81614f45>] ? ret_from_fork+0x55/0x80
[ 1859.774725] [<ffffffff8109e830>] ? kthread_park+0x50/0x50
[ 1860.067158] ll_ping: page allocation failure: order:0, mode:0x1080020(GFP_ATOMIC)
[ 1860.067163] CPU: 0 PID: 1839 Comm: ll_ping Tainted: G OE N 4.4.114-94.11-default #1
[ 1860.067170] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 1860.067174] 0000000000000000 ffffffff813274b0 0000000000000000 ffff88007fc03d00
[ 1860.067175] ffffffff8119a732 0108002000000030 0000000000000005 0000000000000400
[ 1860.067177] 00000000000002f3 0000000000000001 0000000000000001 0000000000000000
[ 1860.067177] Call Trace:
[ 1860.067191] [<ffffffff81019b59>] dump_trace+0x59/0x340
[ 1860.067195] [<ffffffff81019f2a>] show_stack_log_lvl+0xea/0x170
[ 1860.067197] [<ffffffff8101acd1>] show_stack+0x21/0x40
[ 1860.067201] [<ffffffff813274b0>] dump_stack+0x5c/0x7c
[ 1860.067205] [<ffffffff8119a732>] warn_alloc_failed+0xe2/0x150
[ 1860.067209] [<ffffffff8119aba9>] __alloc_pages_nodemask+0x409/0xb80
[ 1860.067212] [<ffffffff8119b45a>] __alloc_page_frag+0x10a/0x120
[ 1860.067216] [<ffffffff8150c0f2>] __napi_alloc_skb+0x82/0xd0
[ 1860.067222] [<ffffffffa02db354>] cp_rx_poll+0x1b4/0x550 [8139cp]
[ 1860.067232] [<ffffffff8151aebc>] net_rx_action+0x15c/0x370
[ 1860.067236] [<ffffffff810852bc>] __do_softirq+0xec/0x300
[ 1860.067238] [<ffffffff8108578a>] irq_exit+0xfa/0x110
[ 1860.067245] [<ffffffff816181f1>] do_IRQ+0x51/0xe0
[ 1860.067248] [<ffffffff816156c9>] common_interrupt+0xc9/0xc9
[ 1860.069507] DWARF2 unwinder stuck at ret_from_intr+0x0/0x1b
[ 1860.069508] 
[ 1860.069508] Leftover inexact backtrace:
[ 1860.069508] 
[ 1860.069514] <IRQ> <EOI> [<ffffffff811a71d6>] ? shrink_inactive_list+0x236/0x4f0
[ 1860.069516] [<ffffffff811a71d2>] ? shrink_inactive_list+0x232/0x4f0
[ 1860.069518] [<ffffffff811a7fcb>] ? shrink_zone_memcg+0x2bb/0x6a0
[ 1860.069520] [<ffffffff811a8467>] ? shrink_zone+0xb7/0x260
[ 1860.069522] [<ffffffff811a896d>] ? do_try_to_free_pages+0x15d/0x450
[ 1860.069524] [<ffffffff811a8d1a>] ? try_to_free_pages+0xba/0x170
[ 1860.069525] [<ffffffff8119ad93>] ? __alloc_pages_nodemask+0x5f3/0xb80
[ 1860.069528] [<ffffffff811e94cd>] ? kmem_getpages+0x4d/0xf0
[ 1860.069529] [<ffffffff811eacc9>] ? fallback_alloc+0x199/0x260
[ 1860.069530] [<ffffffff811eb3f9>] ? kmem_cache_alloc+0x1f9/0x460
[ 1860.069581] [<ffffffffa0ad3dc6>] ? ptlrpc_request_cache_alloc+0x26/0x100 [ptlrpc]
[ 1860.069603] [<ffffffffa0ad3ebe>] ? ptlrpc_request_alloc_internal+0x1e/0x420 [ptlrpc]
[ 1860.069626] [<ffffffffa0adc408>] ? ptlrpc_request_alloc_pack+0x18/0x50 [ptlrpc]
[ 1860.069651] [<ffffffffa0af7c8c>] ? ptlrpc_prep_ping+0x1c/0x40 [ptlrpc]
[ 1860.069675] [<ffffffffa0af8105>] ? ptlrpc_pinger_main+0x335/0xa90 [ptlrpc]
[ 1860.069678] [<ffffffff810ab000>] ? wake_up_q+0x80/0x80
[ 1860.069700] [<ffffffffa0af7dd0>] ? ptlrpc_obd_ping+0x120/0x120 [ptlrpc]
[ 1860.069702] [<ffffffff8109e8f9>] ? kthread+0xc9/0xe0
[ 1860.069704] [<ffffffff8109e830>] ? kthread_park+0x50/0x50
[ 1860.069705] [<ffffffff81614f45>] ? ret_from_fork+0x55/0x80
[ 1860.069707] [<ffffffff8109e830>] ? kthread_park+0x50/0x50
[ 1860.325324] ll_ping: page allocation failure: order:0, mode:0x1284020(GFP_ATOMIC|__GFP_COMP|__GFP_NOTRACK)
[ 1860.325328] CPU: 0 PID: 1839 Comm: ll_ping Tainted: G OE N 4.4.114-94.11-default #1
[ 1860.325328] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 1860.325332] 0000000000000000 ffffffff813274b0 0000000000000000 ffff8800372638b0
[ 1860.325333] ffffffff8119a732 0128402000000004 0000000400000000 ffff88007fc19828
[ 1860.325335] ffff88007ffcf490 ffff88007fc19828 0000000400000000 ffff880037263938
[ 1860.325335] Call Trace:
[ 1860.325347] [<ffffffff81019b59>] dump_trace+0x59/0x340
[ 1860.325351] [<ffffffff81019f2a>] show_stack_log_lvl+0xea/0x170
[ 1860.325353] [<ffffffff8101acd1>] show_stack+0x21/0x40
[ 1860.325358] [<ffffffff813274b0>] dump_stack+0x5c/0x7c
[ 1860.325362] [<ffffffff8119a732>] warn_alloc_failed+0xe2/0x150
[ 1860.325365] [<ffffffff8119aba9>] __alloc_pages_nodemask+0x409/0xb80
[ 1860.325369] [<ffffffff811e94cd>] kmem_getpages+0x4d/0xf0
[ 1860.325372] [<ffffffff811ead35>] fallback_alloc+0x205/0x260
[ 1860.325376] [<ffffffff811eb856>] kmem_cache_alloc_trace+0x1f6/0x460
[ 1860.325381] [<ffffffff8123792b>] wb_start_writeback+0x3b/0xe0
[ 1860.325384] [<ffffffff81237ed6>] wakeup_flusher_threads+0xc6/0x150
[ 1860.325388] [<ffffffff811a8a51>] do_try_to_free_pages+0x241/0x450
[ 1860.325391] [<ffffffff811a8d1a>] try_to_free_pages+0xba/0x170
[ 1860.325394] [<ffffffff8119ad93>] __alloc_pages_nodemask+0x5f3/0xb80
[ 1860.325396] [<ffffffff811e94cd>] kmem_getpages+0x4d/0xf0
[ 1860.325398] [<ffffffff811eacc9>] fallback_alloc+0x199/0x260
[ 1860.325401] [<ffffffff811eb3f9>] kmem_cache_alloc+0x1f9/0x460
[ 1860.325452] [<ffffffffa0ad3dc6>] ptlrpc_request_cache_alloc+0x26/0x100 [ptlrpc]
[ 1860.325484] [<ffffffffa0ad3ebe>] ptlrpc_request_alloc_internal+0x1e/0x420 [ptlrpc]
[ 1860.325509] [<ffffffffa0adc408>] ptlrpc_request_alloc_pack+0x18/0x50 [ptlrpc]
[ 1860.325537] [<ffffffffa0af7c8c>] ptlrpc_prep_ping+0x1c/0x40 [ptlrpc]
[ 1860.325563] [<ffffffffa0af8105>] ptlrpc_pinger_main+0x335/0xa90 [ptlrpc]
[ 1860.325568] [<ffffffff8109e8f9>] kthread+0xc9/0xe0
[ 1860.325574] [<ffffffff81614f45>] ret_from_fork+0x55/0x80
[ 1860.327818] DWARF2 unwinder stuck at ret_from_fork+0x55/0x80
[ 1860.327818] 
[ 1860.327818] Leftover inexact backtrace:
[ 1860.327818] 
[ 1860.327821] [<ffffffff8109e830>] ? kthread_park+0x50/0x50
[ 1860.327822] Mem-Info:
[ 1860.327826] active_anon:1857 inactive_anon:1875 isolated_anon:0
[ 1860.327826] active_file:67798 inactive_file:369012 isolated_file:0
[ 1860.327826] unevictable:20 dirty:0 writeback:512 unstable:0
[ 1860.327826] slab_reclaimable:2671 slab_unreclaimable:22061
[ 1860.327826] mapped:4296 shmem:2177 pagetables:887 bounce:0
[ 1860.327826] free:0 free_pcp:4 free_cma:0
[ 1860.327831] Node 0 DMA free:0kB min:376kB low:468kB high:560kB active_anon:32kB inactive_anon:88kB active_file:708kB inactive_file:6696kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:72kB shmem:92kB slab_reclaimable:32kB slab_unreclaimable:7856kB kernel_stack:16kB pagetables:28kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:1176144 all_unreclaimable? yes
[ 1860.327833] lowmem_reserve[]: 0 1837 1837 1837 1837
[ 1860.327837] Node 0 DMA32 free:0kB min:44676kB low:55844kB high:67012kB active_anon:7396kB inactive_anon:7412kB active_file:270484kB inactive_file:1469352kB unevictable:80kB isolated(anon):0kB isolated(file):0kB present:2080744kB managed:1900872kB mlocked:80kB dirty:0kB writeback:2048kB mapped:17112kB shmem:8616kB slab_reclaimable:10652kB slab_unreclaimable:80388kB kernel_stack:2688kB pagetables:3520kB unstable:0kB bounce:0kB free_pcp:16kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:12966040 all_unreclaimable? yes
[ 1860.327839] lowmem_reserve[]: 0 0 0 0 0
[ 1860.327843] Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
[ 1860.327848] Node 0 DMA32: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
[ 1860.327849] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[ 1860.327850] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 1860.327850] 7606 total pagecache pages
[ 1860.327851] 169 pages in swap cache
[ 1860.327852] Swap cache stats: add 8260, delete 8091, find 594/907
[ 1860.327853] Free swap = 14307180kB
[ 1860.327853] Total swap = 14338044kB
[ 1860.327853] 524184 pages RAM
[ 1860.327854] 0 pages HighMem/MovableOnly
[ 1860.327854] 44990 pages reserved
[ 1860.327854] 0 pages hwpoisoned
[ 1860.327865] ll_ping: page allocation failure: order:0, mode:0x1284020(GFP_ATOMIC|__GFP_COMP|__GFP_NOTRACK)
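
The Mem-Info dump above reports free:0 while most pages sit on the file LRU lists. A small worked calculation of that split follows. It is only a sketch: the 4 kB page size is an assumption (it is consistent with the per-zone kB figures in the dump), the variable names are hypothetical, and the numbers say nothing by themselves about the root cause.

```python
# Page counts copied from the Mem-Info block in this comment.
PAGE_KB = 4  # assumption: 4 kB pages, consistent with the per-zone kB values

mem_info_pages = {
    "active_anon": 1857,
    "inactive_anon": 1875,
    "active_file": 67798,
    "inactive_file": 369012,
    "slab_reclaimable": 2671,
    "slab_unreclaimable": 22061,
    "writeback": 512,
    "free": 0,
}
total_ram_pages = 524184   # "524184 pages RAM"
reserved_pages = 44990     # "44990 pages reserved"

# File-backed LRU pages (active_file + inactive_file).
file_lru_pages = mem_info_pages["active_file"] + mem_info_pages["inactive_file"]
file_lru_kb = file_lru_pages * PAGE_KB

# Pages actually managed by the allocator.
managed_pages = total_ram_pages - reserved_pages
managed_kb = managed_pages * PAGE_KB

file_share = file_lru_pages / managed_pages

if __name__ == "__main__":
    print(f"file LRU: {file_lru_kb} kB")
    print(f"managed:  {managed_kb} kB")
    print(f"file LRU share of managed memory: {file_share:.1%}")
    print(f"free: {mem_info_pages['free'] * PAGE_KB} kB")
```

The two totals agree with the per-zone lines: the DMA and DMA32 active_file and inactive_file values sum to 1747240 kB, and the two managed values sum to 1916776 kB.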

Comment by Sarah Liu [ 14/Mar/18 ]

Same error in recovery-mds-scale:

https://testing.hpdd.intel.com/test_sets/85e9815a-278a-11e8-9e0e-52540065bddc

Comment by James Nunez (Inactive) [ 22/Mar/18 ]

Similar issue, but with dd hitting the page allocation failure: https://testing.hpdd.intel.com/test_sets/794b9dac-2cf3-11e8-9e0e-52540065bddc

[ 756.319053] Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1
[ 800.372031] dd: page allocation failure: order:0, mode:0x1080020(GFP_ATOMIC)
[ 800.372040] CPU: 0 PID: 19516 Comm: dd Tainted: G OE N 4.4.114-94.11-default #1
[ 800.372041] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
[ 800.372045] 0000000000000000 ffffffff813274b0 0000000000000000 ffff88007fc03d00
[ 800.372046] ffffffff8119a732 0108002000000030 0000000000016f00 0000000000000000
[ 800.372048] 0000000000000082 ffff88007fc16440 ffff88007fc03cf8 0000000000000000
[ 800.372048] Call Trace:
[ 800.372092] [<ffffffff81019b59>] dump_trace+0x59/0x340
[ 800.372098] [<ffffffff81019f2a>] show_stack_log_lvl+0xea/0x170
[ 800.372101] [<ffffffff8101acd1>] show_stack+0x21/0x40
[ 800.372113] [<ffffffff813274b0>] dump_stack+0x5c/0x7c
[ 800.372131] [<ffffffff8119a732>] warn_alloc_failed+0xe2/0x150
[ 800.372141] [<ffffffff8119aba9>] __alloc_pages_nodemask+0x409/0xb80
[ 800.372144] [<ffffffff8119b45a>] __alloc_page_frag+0x10a/0x120
[ 800.372156] [<ffffffff8150c0f2>] __napi_alloc_skb+0x82/0xd0
[ 800.372175] [<ffffffffa0357354>] cp_rx_poll+0x1b4/0x550 [8139cp]
[ 800.372192] [<ffffffff8151aebc>] net_rx_action+0x15c/0x370
[ 800.372202] [<ffffffff810852bc>] __do_softirq+0xec/0x300
[ 800.372210] [<ffffffff8108578a>] irq_exit+0xfa/0x110
[ 800.372224] [<ffffffff816181f1>] do_IRQ+0x51/0xe0
[ 800.372232] [<ffffffff816156c9>] common_interrupt+0xc9/0xc9
[ 800.374978] DWARF2 unwinder stuck at ret_from_intr+0x0/0x1b
[ 800.374978] 
[ 800.374979] Leftover inexact backtrace:
[ 800.374979] 
[ 800.374991] <IRQ> <EOI> [<ffffffff811a726e>] ? shrink_inactive_list+0x2ce/0x4f0
[ 800.374993] [<ffffffff811a742f>] ? shrink_inactive_list+0x48f/0x4f0
[ 800.374995] [<ffffffff811a7fcb>] ? shrink_zone_memcg+0x2bb/0x6a0
[ 800.374997] [<ffffffff811a8467>] ? shrink_zone+0xb7/0x260
[ 800.374999] [<ffffffff811a896d>] ? do_try_to_free_pages+0x15d/0x450
[ 800.375001] [<ffffffff811a8d1a>] ? try_to_free_pages+0xba/0x170
[ 800.375002] [<ffffffff8119ad93>] ? __alloc_pages_nodemask+0x5f3/0xb80
[ 800.375012] [<ffffffff811e27ff>] ? alloc_pages_current+0x7f/0x100
[ 800.375018] [<ffffffff8119368d>] ? pagecache_get_page+0x4d/0x1d0
[ 800.375056] [<ffffffffa0e86c1d>] ? ll_write_begin+0xed/0xbd0 [lustre]
[ 800.375058] [<ffffffff811927b5>] ? generic_perform_write+0xc5/0x1b0
[ 800.375063] [<ffffffff81223aeb>] ? file_update_time+0x3b/0xf0
[ 800.375065] [<ffffffff81194524>] ? __generic_file_write_iter+0x184/0x1c0
[ 800.375081] [<ffffffffa0d2c241>] ? lov_object_maxbytes+0x31/0x40 [lov]
[ 800.375094] [<ffffffffa0e976ad>] ? vvp_io_write_start+0x44d/0x740 [lustre]
[ 800.375140] [<ffffffffa092f4f2>] ? cl_lock_request+0x62/0x1d0 [obdclass]
[ 800.375146] [<ffffffffa0d14557>] ? lov_io_call.isra.5+0x77/0x120 [lov]
[ 800.375166] [<ffffffffa0931278>] ? cl_io_start+0x58/0x110 [obdclass]
[ 800.375184] [<ffffffffa09332e4>] ? cl_io_loop+0x104/0xc30 [obdclass]
[ 800.375195] [<ffffffffa0e4f160>] ? ll_file_io_generic+0x490/0xb50 [lustre]
[ 800.375205] [<ffffffffa0e4fab1>] ? ll_file_write_iter+0xe1/0x3d0 [lustre]
[ 800.375212] [<ffffffff8120a360>] ? __vfs_write+0xd0/0x140
[ 800.375213] [<ffffffff8120afbd>] ? vfs_write+0x9d/0x190
[ 800.375215] [<ffffffff81617b99>] ? stuff_rsb+0x59/0xf0
[ 800.375217] [<ffffffff8120c032>] ? SyS_write+0x42/0xa0
[ 800.375219] [<ffffffff81617b68>] ? stuff_rsb+0x28/0xf0
[ 800.375220] [<ffffffff81614b0a>] ? entry_SYSCALL_64_fastpath+0x1e/0xb6
[ 800.791280] kswapd0: page allocation failure: order:0, mode:0x1080020(GFP_ATOMIC)
[ 800.791287] CPU: 0 PID: 32 Comm: kswapd0 Tainted: G OE N 4.4.114-94.11-default #1
[ 800.791288] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
[ 800.791300] 0000000000000000 ffffffff813274b0 0000000000000000 ffff88007fc03d00
[ 800.791301] ffffffff8119a732 0108002000000030 00000000000003e0 0000000000000400
[ 800.791303] 00000000000003fa 0000000000000002 0000000200000001 0000000000000001
[ 800.791303] Call Trace:
[ 800.791321] [<ffffffff81019b59>] dump_trace+0x59/0x340
[ 800.791325] [<ffffffff81019f2a>] show_stack_log_lvl+0xea/0x170
[ 800.791328] [<ffffffff8101acd1>] show_stack+0x21/0x40
[ 800.791334] [<ffffffff813274b0>] dump_stack+0x5c/0x7c
[ 800.791340] [<ffffffff8119a732>] warn_alloc_failed+0xe2/0x150
[ 800.791343] [<ffffffff8119aba9>] __alloc_pages_nodemask+0x409/0xb80
[ 800.791346] [<ffffffff8119b45a>] __alloc_page_frag+0x10a/0x120
[ 800.791352] [<ffffffff8150c0f2>] __napi_alloc_skb+0x82/0xd0
[ 800.791362] [<ffffffffa0357354>] cp_rx_poll+0x1b4/0x550 [8139cp]
[ 800.791375] [<ffffffff8151aebc>] net_rx_action+0x15c/0x370
[ 800.791380] [<ffffffff810852bc>] __do_softirq+0xec/0x300
[ 800.791383] [<ffffffff8108578a>] irq_exit+0xfa/0x110
[ 800.791391] [<ffffffff816181f1>] do_IRQ+0x51/0xe0
[ 800.791395] [<ffffffff816156c9>] common_interrupt+0xc9/0xc9
[ 800.792004] DWARF2 unwinder stuck at ret_from_intr+0x0/0x1b
[ 800.792004] 
[ 800.792004] Leftover inexact backtrace:
[ 800.792004] 
[ 800.792004] <IRQ> <EOI> [<ffffffff811a71d6>] ? shrink_inactive_list+0x236/0x4f0
[ 800.792004] [<ffffffff811a71c6>] ? shrink_inactive_list+0x226/0x4f0
[ 800.792004] [<ffffffff811a7fcb>] ? shrink_zone_memcg+0x2bb/0x6a0
[ 800.792004] [<ffffffff811a8467>] ? shrink_zone+0xb7/0x260
[ 800.792004] [<ffffffff811a95be>] ? kswapd+0x48e/0x920
[ 800.792004] [<ffffffff811a9130>] ? mem_cgroup_shrink_node_zone+0x150/0x150
[ 800.792004] [<ffffffff8109e8f9>] ? kthread+0xc9/0xe0
[ 800.792004] [<ffffffff8109e830>] ? kthread_park+0x50/0x50
[ 800.792004] [<ffffffff81614f45>] ? ret_from_fork+0x55/0x80
[ 800.792004] [<ffffffff8109e830>] ? kthread_park+0x50/0x50

 
