Details
- Type: Bug
- Resolution: Unresolved
- Priority: Minor
- Fix Version/s: None
- Affects Version/s: Lustre 2.11.0
- Labels: None
- Severity: 3
- Rank: 9223372036854775807
Description
This issue was created by maloo for sarah_lw <wei3.liu@intel.com>
This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/85ed0988-278a-11e8-9e0e-52540065bddc
test_fail_client_mds failed with the following error:
test_fail_client_mds returned 5
Client and server: lustre-master, tag-2.10.59, SLES12SP3.
Client 2 hit an ll_ping page allocation failure.
Console log:
[ 972.544170] Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=357 DURATION=86400 PERIOD=1200
[ 1005.987638] Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=357 DURATION=86400 PERIOD=1200
[ 1022.308773] Lustre: DEBUG MARKER: rc=0;
[ 1022.308773] val=$(/usr/sbin/lctl get_param -n catastrophe 2>&1);
[ 1022.308773] if [[ $? -eq 0 && $val -ne 0 ]]; then
[ 1022.308773] echo $(hostname -s): $val;
[ 1022.308773] rc=$val;
[ 1022.308773] fi;
[ 1022.308773] exit $rc
[ 1022.357893] Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
[ 1022.566140] Lustre: DEBUG MARKER: /usr/sbin/lctl mark FAIL CLIENT onyx-47vm4...
[ 1023.976580] Lustre: DEBUG MARKER: FAIL CLIENT onyx-47vm4...
[ 1033.244860] Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1
[ 1099.399100] Lustre: DEBUG MARKER: Starting failover on mds1
[ 1123.785219] Lustre: 1838:0:(client.c:2100:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1520989165/real 1520989165] req@ffff880040416c40 x1594871608337664/t0(0) o400->MGC10.2.8.216@tcp@10.2.8.217@tcp:26/25 lens 224/224 e 0 to 1 dl 1520989172 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[ 1123.785238] LustreError: 166-1: MGC10.2.8.216@tcp: Connection to MGS (at 10.2.8.217@tcp) was lost; in progress operations using this service will fail
[ 1158.804380] Lustre: 1838:0:(client.c:2100:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1520989206/real 1520989211] req@ffff880064c4d080 x1594871608390224/t0(0) o400->lustre-MDT0000-mdc-ffff88007a3ec800@10.2.8.217@tcp:12/10 lens 224/224 e 0 to 1 dl 1520989255 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
[ 1158.804393] Lustre: lustre-MDT0000-mdc-ffff88007a3ec800: Connection to lustre-MDT0000 (at 10.2.8.217@tcp) was lost; in progress operations using this service will wait for recovery to complete
[ 1178.804585] Lustre: Evicted from MGS (at MGC10.2.8.216@tcp_0) after server handle changed from 0x620469374b42c3ff to 0x3e7afd771be5d16d
[ 1178.806122] Lustre: MGC10.2.8.216@tcp: Connection restored to MGC10.2.8.216@tcp_0 (at 10.2.8.216@tcp)
[ 1178.809735] LustreError: 1836:0:(client.c:2972:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@ffff88005eb3bc40 x1594871608322400/t8590014200(8590014200) o101->lustre-MDT0000-mdc-ffff88007a3ec800@10.2.8.216@tcp:12/10 lens 952/544 e 0 to 0 dl 1520989279 ref 2 fl Interpret:RP/4/0 rc 301/301
[ 1179.747452] Lustre: 1838:0:(client.c:2100:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1520989176/real 1520989176] req@ffff880069295cc0 x1594871608341840/t0(0) o400->lustre-MDT0000-mdc-ffff88007a3ec800@10.2.8.216@tcp:12/10 lens 224/224 e 0 to 1 dl 1520989225 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[ 1179.747464] Lustre: 1838:0:(client.c:2100:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
[ 1180.738657] Lustre: lustre-MDT0000-mdc-ffff88007a3ec800: Connection restored to 10.2.8.216@tcp (at 10.2.8.216@tcp)
[ 1214.574355] Lustre: 1837:0:(client.c:2100:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1520989191/real 1520989191] req@ffff880060bc8380 x1594871608349568/t0(0) o400->lustre-MDT0000-mdc-ffff88007a3ec800@10.2.8.216@tcp:12/10 lens 224/224 e 0 to 1 dl 1520989240 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[ 1214.574362] Lustre: 1837:0:(client.c:2100:ptlrpc_expire_one_request()) Skipped 5 previous similar messages
[ 1332.924289] Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-47vm4: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
[ 1333.499978] Lustre: DEBUG MARKER: onyx-47vm4: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
[ 1366.096335] Lustre: DEBUG MARKER: /usr/sbin/lctl mark Started client load: tar on onyx-47vm4
[ 1366.593473] Lustre: DEBUG MARKER: Started client load: tar on onyx-47vm4
[ 1368.645975] Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failed client reintegrated -- failure NOT OK
[ 1369.798370] Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failed client reintegrated -- failure NOT OK
[ 1370.023893] Lustre: DEBUG MARKER: rc=0;
[ 1370.023893] val=$(/usr/sbin/lctl get_param -n catastrophe 2>&1);
[ 1370.023893] if [[ $? -eq 0 && $val -ne 0 ]]; then
[ 1370.023893] echo $(hostname -s): $val;
[ 1370.023893] rc=$val;
[ 1370.023893] fi;
[ 1370.023893] exit $rc
[ 1370.080186] Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
[ 1370.190671] Lustre: DEBUG MARKER: /usr/sbin/lctl mark Number of failovers:
[ 1370.190671] mds1 failed over 2 times and counting...
[ 1370.405207] Lustre: DEBUG MARKER: Number of failovers:
[ 1859.772170] ll_ping: page allocation failure: order:0, mode:0x1080020(GFP_ATOMIC)
[ 1859.772182] CPU: 0 PID: 1839 Comm: ll_ping Tainted: G OE N 4.4.114-94.11-default #1
[ 1859.772183] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 1859.772189] 0000000000000000 ffffffff813274b0 0000000000000000 ffff88007fc03d00
[ 1859.772190] ffffffff8119a732 0108002000000030 000000000000008a 0000000000000400
[ 1859.772191] 0000000000000194 0000000000000003 0000000000000001 0000000000000000
[ 1859.772192] Call Trace:
[ 1859.772207] [<ffffffff81019b59>] dump_trace+0x59/0x340
[ 1859.772211] [<ffffffff81019f2a>] show_stack_log_lvl+0xea/0x170
[ 1859.772213] [<ffffffff8101acd1>] show_stack+0x21/0x40
[ 1859.772219] [<ffffffff813274b0>] dump_stack+0x5c/0x7c
[ 1859.772223] [<ffffffff8119a732>] warn_alloc_failed+0xe2/0x150
[ 1859.772226] [<ffffffff8119aba9>] __alloc_pages_nodemask+0x409/0xb80
[ 1859.772229] [<ffffffff8119b45a>] __alloc_page_frag+0x10a/0x120
[ 1859.772235] [<ffffffff8150c0f2>] __napi_alloc_skb+0x82/0xd0
[ 1859.772242] [<ffffffffa02db354>] cp_rx_poll+0x1b4/0x550 [8139cp]
[ 1859.772251] [<ffffffff8151aebc>] net_rx_action+0x15c/0x370
[ 1859.772255] [<ffffffff810852bc>] __do_softirq+0xec/0x300
[ 1859.772258] [<ffffffff8108578a>] irq_exit+0xfa/0x110
[ 1859.772265] [<ffffffff816181f1>] do_IRQ+0x51/0xe0
[ 1859.772268] [<ffffffff816156c9>] common_interrupt+0xc9/0xc9
[ 1859.774527] DWARF2 unwinder stuck at ret_from_intr+0x0/0x1b
[ 1859.774528]
[ 1859.774528] Leftover inexact backtrace:
[ 1859.774528]
[ 1859.774534] <IRQ> <EOI> [<ffffffff811a6820>] ? shrink_page_list+0x440/0x7f0
[ 1859.774536] [<ffffffff811a7190>] ? shrink_inactive_list+0x1f0/0x4f0
[ 1859.774538] [<ffffffff811a7fcb>] ? shrink_zone_memcg+0x2bb/0x6a0
[ 1859.774540] [<ffffffff811a8467>] ? shrink_zone+0xb7/0x260
[ 1859.774542] [<ffffffff811a896d>] ? do_try_to_free_pages+0x15d/0x450
[ 1859.774543] [<ffffffff811a8d1a>] ? try_to_free_pages+0xba/0x170
[ 1859.774545] [<ffffffff8119ad93>] ? __alloc_pages_nodemask+0x5f3/0xb80
[ 1859.774547] [<ffffffff811e94cd>] ? kmem_getpages+0x4d/0xf0
[ 1859.774548] [<ffffffff811eacc9>] ? fallback_alloc+0x199/0x260
[ 1859.774550] [<ffffffff811eb3f9>] ? kmem_cache_alloc+0x1f9/0x460
[ 1859.774598] [<ffffffffa0ad3dc6>] ? ptlrpc_request_cache_alloc+0x26/0x100 [ptlrpc]
[ 1859.774621] [<ffffffffa0ad3ebe>] ? ptlrpc_request_alloc_internal+0x1e/0x420 [ptlrpc]
[ 1859.774644] [<ffffffffa0adc408>] ? ptlrpc_request_alloc_pack+0x18/0x50 [ptlrpc]
[ 1859.774669] [<ffffffffa0af7c8c>] ? ptlrpc_prep_ping+0x1c/0x40 [ptlrpc]
[ 1859.774693] [<ffffffffa0af8105>] ? ptlrpc_pinger_main+0x335/0xa90 [ptlrpc]
[ 1859.774696] [<ffffffff810ab000>] ? wake_up_q+0x80/0x80
[ 1859.774718] [<ffffffffa0af7dd0>] ? ptlrpc_obd_ping+0x120/0x120 [ptlrpc]
[ 1859.774720] [<ffffffff8109e8f9>] ? kthread+0xc9/0xe0
[ 1859.774722] [<ffffffff8109e830>] ? kthread_park+0x50/0x50
[ 1859.774723] [<ffffffff81614f45>] ? ret_from_fork+0x55/0x80
[ 1859.774725] [<ffffffff8109e830>] ? kthread_park+0x50/0x50
[ 1860.067158] ll_ping: page allocation failure: order:0, mode:0x1080020(GFP_ATOMIC)
[ 1860.067163] CPU: 0 PID: 1839 Comm: ll_ping Tainted: G OE N 4.4.114-94.11-default #1
[ 1860.067170] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 1860.067174] 0000000000000000 ffffffff813274b0 0000000000000000 ffff88007fc03d00
[ 1860.067175] ffffffff8119a732 0108002000000030 0000000000000005 0000000000000400
[ 1860.067177] 00000000000002f3 0000000000000001 0000000000000001 0000000000000000
[ 1860.067177] Call Trace:
[ 1860.067191] [<ffffffff81019b59>] dump_trace+0x59/0x340
[ 1860.067195] [<ffffffff81019f2a>] show_stack_log_lvl+0xea/0x170
[ 1860.067197] [<ffffffff8101acd1>] show_stack+0x21/0x40
[ 1860.067201] [<ffffffff813274b0>] dump_stack+0x5c/0x7c
[ 1860.067205] [<ffffffff8119a732>] warn_alloc_failed+0xe2/0x150
[ 1860.067209] [<ffffffff8119aba9>] __alloc_pages_nodemask+0x409/0xb80
[ 1860.067212] [<ffffffff8119b45a>] __alloc_page_frag+0x10a/0x120
[ 1860.067216] [<ffffffff8150c0f2>] __napi_alloc_skb+0x82/0xd0
[ 1860.067222] [<ffffffffa02db354>] cp_rx_poll+0x1b4/0x550 [8139cp]
[ 1860.067232] [<ffffffff8151aebc>] net_rx_action+0x15c/0x370
[ 1860.067236] [<ffffffff810852bc>] __do_softirq+0xec/0x300
[ 1860.067238] [<ffffffff8108578a>] irq_exit+0xfa/0x110
[ 1860.067245] [<ffffffff816181f1>] do_IRQ+0x51/0xe0
[ 1860.067248] [<ffffffff816156c9>] common_interrupt+0xc9/0xc9
[ 1860.069507] DWARF2 unwinder stuck at ret_from_intr+0x0/0x1b
[ 1860.069508]
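For triage it helps to pull the allocation-failure events (task name, order, GFP flags) out of a console log like the one above. A minimal sketch of such a helper is below; it is a hypothetical script for illustration, not part of the Lustre test suite, and assumes the standard `comm: page allocation failure: order:N, mode:0xM(FLAGS)` line format emitted by the kernel's warn_alloc path:

```python
import re

# Matches kernel lines like:
#   [ 1859.772170] ll_ping: page allocation failure: order:0, mode:0x1080020(GFP_ATOMIC)
ALLOC_FAIL = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<comm>\S+): page allocation failure: "
    r"order:(?P<order>\d+), mode:(?P<mode>0x[0-9a-fA-F]+)\((?P<flags>[^)]+)\)"
)

def alloc_failures(log_text):
    """Return one dict per page-allocation-failure event found in the log."""
    return [m.groupdict() for m in ALLOC_FAIL.finditer(log_text)]

if __name__ == "__main__":
    sample = ("[ 1859.772170] ll_ping: page allocation failure: "
              "order:0, mode:0x1080020(GFP_ATOMIC)")
    for ev in alloc_failures(sample):
        # e.g. "1859.772170 ll_ping order=0 GFP_ATOMIC"
        print(ev["ts"], ev["comm"], "order=" + ev["order"], ev["flags"])
```

Run against the console log above, this would report both ll_ping events at 1859.772170 and 1860.067158 as order-0 GFP_ATOMIC failures, consistent with the allocation being attempted from softirq context where the allocator cannot sleep or reclaim.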
VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
recovery-random-scale test_fail_client_mds - test_fail_client_mds returned 5