Lustre / LU-10812

recovery-random-scale test_fail_client_mds: test_fail_client_mds returned 4


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.11.0
    • Components: None
    • Severity: 3

    Description

      This issue was created by maloo for sarah_lw <wei3.liu@intel.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/85ed0988-278a-11e8-9e0e-52540065bddc

      test_fail_client_mds failed with the following error:

      test_fail_client_mds returned 5
      

      Client and server: lustre-master, tag 2.10.59, on SLES12SP3.

      Client 2 hit "ll_ping: page allocation failure" (an order-0 GFP_ATOMIC allocation failed in the ptlrpc pinger thread).
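      As a triage aid (not part of the original report), allocation-failure bursts like the ones below can be tallied per process name from a saved console log. The function name and the log path are illustrative; the line format assumed is the standard kernel warn_alloc output quoted in the console log.

```shell
# Tally page-allocation failures per process name in a saved console log.
# Usage: alloc_failures <logfile>
# Assumes the kernel warn_alloc line format seen in this report:
#   [ <timestamp>] <comm>: page allocation failure: order:<n>, mode:<flags>
alloc_failures() {
    grep 'page allocation failure' "$1" \
      | sed 's/^\[ *[0-9.]*\] //' \
      | awk -F': ' '{count[$1]++} END {for (p in count) print p, count[p]}'
}
```

      Run against the console log of this failure, it would report the two ll_ping hits shown below.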

      console log

      [  972.544170] Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK              ELAPSED=357 DURATION=86400 PERIOD=1200
      [ 1005.987638] Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=357 DURATION=86400 PERIOD=1200
      [ 1022.308773] Lustre: DEBUG MARKER: rc=0;
      [ 1022.308773] 			val=$(/usr/sbin/lctl get_param -n catastrophe 2>&1);
      [ 1022.308773] 			if [[ $? -eq 0 && $val -ne 0 ]]; then
      [ 1022.308773] 				echo $(hostname -s): $val;
      [ 1022.308773] 				rc=$val;
      [ 1022.308773] 			fi;
      [ 1022.308773] 			exit $rc
      [ 1022.357893] Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
      [ 1022.566140] Lustre: DEBUG MARKER: /usr/sbin/lctl mark FAIL CLIENT onyx-47vm4...
      [ 1023.976580] Lustre: DEBUG MARKER: FAIL CLIENT onyx-47vm4...
      [ 1033.244860] Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1
      [ 1099.399100] Lustre: DEBUG MARKER: Starting failover on mds1
      [ 1123.785219] Lustre: 1838:0:(client.c:2100:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1520989165/real 1520989165]  req@ffff880040416c40 x1594871608337664/t0(0) o400->MGC10.2.8.216@tcp@10.2.8.217@tcp:26/25 lens 224/224 e 0 to 1 dl 1520989172 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      [ 1123.785238] LustreError: 166-1: MGC10.2.8.216@tcp: Connection to MGS (at 10.2.8.217@tcp) was lost; in progress operations using this service will fail
      [ 1158.804380] Lustre: 1838:0:(client.c:2100:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1520989206/real 1520989211]  req@ffff880064c4d080 x1594871608390224/t0(0) o400->lustre-MDT0000-mdc-ffff88007a3ec800@10.2.8.217@tcp:12/10 lens 224/224 e 0 to 1 dl 1520989255 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
      [ 1158.804393] Lustre: lustre-MDT0000-mdc-ffff88007a3ec800: Connection to lustre-MDT0000 (at 10.2.8.217@tcp) was lost; in progress operations using this service will wait for recovery to complete
      [ 1178.804585] Lustre: Evicted from MGS (at MGC10.2.8.216@tcp_0) after server handle changed from 0x620469374b42c3ff to 0x3e7afd771be5d16d
      [ 1178.806122] Lustre: MGC10.2.8.216@tcp: Connection restored to MGC10.2.8.216@tcp_0 (at 10.2.8.216@tcp)
      [ 1178.809735] LustreError: 1836:0:(client.c:2972:ptlrpc_replay_interpret()) @@@ status 301, old was 0  req@ffff88005eb3bc40 x1594871608322400/t8590014200(8590014200) o101->lustre-MDT0000-mdc-ffff88007a3ec800@10.2.8.216@tcp:12/10 lens 952/544 e 0 to 0 dl 1520989279 ref 2 fl Interpret:RP/4/0 rc 301/301
      [ 1179.747452] Lustre: 1838:0:(client.c:2100:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1520989176/real 1520989176]  req@ffff880069295cc0 x1594871608341840/t0(0) o400->lustre-MDT0000-mdc-ffff88007a3ec800@10.2.8.216@tcp:12/10 lens 224/224 e 0 to 1 dl 1520989225 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      [ 1179.747464] Lustre: 1838:0:(client.c:2100:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
      [ 1180.738657] Lustre: lustre-MDT0000-mdc-ffff88007a3ec800: Connection restored to 10.2.8.216@tcp (at 10.2.8.216@tcp)
      [ 1214.574355] Lustre: 1837:0:(client.c:2100:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1520989191/real 1520989191]  req@ffff880060bc8380 x1594871608349568/t0(0) o400->lustre-MDT0000-mdc-ffff88007a3ec800@10.2.8.216@tcp:12/10 lens 224/224 e 0 to 1 dl 1520989240 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      [ 1214.574362] Lustre: 1837:0:(client.c:2100:ptlrpc_expire_one_request()) Skipped 5 previous similar messages
      [ 1332.924289] Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-47vm4: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
      [ 1333.499978] Lustre: DEBUG MARKER: onyx-47vm4: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
      [ 1366.096335] Lustre: DEBUG MARKER: /usr/sbin/lctl mark Started client load: tar on onyx-47vm4
      [ 1366.593473] Lustre: DEBUG MARKER: Started client load: tar on onyx-47vm4
      [ 1368.645975] Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failed client reintegrated -- failure NOT OK
      [ 1369.798370] Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failed client reintegrated -- failure NOT OK
      [ 1370.023893] Lustre: DEBUG MARKER: rc=0;
      [ 1370.023893] 			val=$(/usr/sbin/lctl get_param -n catastrophe 2>&1);
      [ 1370.023893] 			if [[ $? -eq 0 && $val -ne 0 ]]; then
      [ 1370.023893] 				echo $(hostname -s): $val;
      [ 1370.023893] 				rc=$val;
      [ 1370.023893] 			fi;
      [ 1370.023893] 			exit $rc
      [ 1370.080186] Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
      [ 1370.190671] Lustre: DEBUG MARKER: /usr/sbin/lctl mark Number of failovers:
      [ 1370.190671] mds1 failed over 2 times                and counting...
      [ 1370.405207] Lustre: DEBUG MARKER: Number of failovers:
      [ 1859.772170] ll_ping: page allocation failure: order:0, mode:0x1080020(GFP_ATOMIC)
      [ 1859.772182] CPU: 0 PID: 1839 Comm: ll_ping Tainted: G           OE   N  4.4.114-94.11-default #1
      [ 1859.772183] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [ 1859.772189]  0000000000000000 ffffffff813274b0 0000000000000000 ffff88007fc03d00
      [ 1859.772190]  ffffffff8119a732 0108002000000030 000000000000008a 0000000000000400
      [ 1859.772191]  0000000000000194 0000000000000003 0000000000000001 0000000000000000
      [ 1859.772192] Call Trace:
      [ 1859.772207]  [<ffffffff81019b59>] dump_trace+0x59/0x340
      [ 1859.772211]  [<ffffffff81019f2a>] show_stack_log_lvl+0xea/0x170
      [ 1859.772213]  [<ffffffff8101acd1>] show_stack+0x21/0x40
      [ 1859.772219]  [<ffffffff813274b0>] dump_stack+0x5c/0x7c
      [ 1859.772223]  [<ffffffff8119a732>] warn_alloc_failed+0xe2/0x150
      [ 1859.772226]  [<ffffffff8119aba9>] __alloc_pages_nodemask+0x409/0xb80
      [ 1859.772229]  [<ffffffff8119b45a>] __alloc_page_frag+0x10a/0x120
      [ 1859.772235]  [<ffffffff8150c0f2>] __napi_alloc_skb+0x82/0xd0
      [ 1859.772242]  [<ffffffffa02db354>] cp_rx_poll+0x1b4/0x550 [8139cp]
      [ 1859.772251]  [<ffffffff8151aebc>] net_rx_action+0x15c/0x370
      [ 1859.772255]  [<ffffffff810852bc>] __do_softirq+0xec/0x300
      [ 1859.772258]  [<ffffffff8108578a>] irq_exit+0xfa/0x110
      [ 1859.772265]  [<ffffffff816181f1>] do_IRQ+0x51/0xe0
      [ 1859.772268]  [<ffffffff816156c9>] common_interrupt+0xc9/0xc9
      [ 1859.774527] DWARF2 unwinder stuck at ret_from_intr+0x0/0x1b
      [ 1859.774528] 
      [ 1859.774528] Leftover inexact backtrace:
      [ 1859.774528] 
      [ 1859.774534]  <IRQ>  <EOI>  [<ffffffff811a6820>] ? shrink_page_list+0x440/0x7f0
      [ 1859.774536]  [<ffffffff811a7190>] ? shrink_inactive_list+0x1f0/0x4f0
      [ 1859.774538]  [<ffffffff811a7fcb>] ? shrink_zone_memcg+0x2bb/0x6a0
      [ 1859.774540]  [<ffffffff811a8467>] ? shrink_zone+0xb7/0x260
      [ 1859.774542]  [<ffffffff811a896d>] ? do_try_to_free_pages+0x15d/0x450
      [ 1859.774543]  [<ffffffff811a8d1a>] ? try_to_free_pages+0xba/0x170
      [ 1859.774545]  [<ffffffff8119ad93>] ? __alloc_pages_nodemask+0x5f3/0xb80
      [ 1859.774547]  [<ffffffff811e94cd>] ? kmem_getpages+0x4d/0xf0
      [ 1859.774548]  [<ffffffff811eacc9>] ? fallback_alloc+0x199/0x260
      [ 1859.774550]  [<ffffffff811eb3f9>] ? kmem_cache_alloc+0x1f9/0x460
      [ 1859.774598]  [<ffffffffa0ad3dc6>] ? ptlrpc_request_cache_alloc+0x26/0x100 [ptlrpc]
      [ 1859.774621]  [<ffffffffa0ad3ebe>] ? ptlrpc_request_alloc_internal+0x1e/0x420 [ptlrpc]
      [ 1859.774644]  [<ffffffffa0adc408>] ? ptlrpc_request_alloc_pack+0x18/0x50 [ptlrpc]
      [ 1859.774669]  [<ffffffffa0af7c8c>] ? ptlrpc_prep_ping+0x1c/0x40 [ptlrpc]
      [ 1859.774693]  [<ffffffffa0af8105>] ? ptlrpc_pinger_main+0x335/0xa90 [ptlrpc]
      [ 1859.774696]  [<ffffffff810ab000>] ? wake_up_q+0x80/0x80
      [ 1859.774718]  [<ffffffffa0af7dd0>] ? ptlrpc_obd_ping+0x120/0x120 [ptlrpc]
      [ 1859.774720]  [<ffffffff8109e8f9>] ? kthread+0xc9/0xe0
      [ 1859.774722]  [<ffffffff8109e830>] ? kthread_park+0x50/0x50
      [ 1859.774723]  [<ffffffff81614f45>] ? ret_from_fork+0x55/0x80
      [ 1859.774725]  [<ffffffff8109e830>] ? kthread_park+0x50/0x50
      [ 1860.067158] ll_ping: page allocation failure: order:0, mode:0x1080020(GFP_ATOMIC)
      [ 1860.067163] CPU: 0 PID: 1839 Comm: ll_ping Tainted: G           OE   N  4.4.114-94.11-default #1
      [ 1860.067170] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [ 1860.067174]  0000000000000000 ffffffff813274b0 0000000000000000 ffff88007fc03d00
      [ 1860.067175]  ffffffff8119a732 0108002000000030 0000000000000005 0000000000000400
      [ 1860.067177]  00000000000002f3 0000000000000001 0000000000000001 0000000000000000
      [ 1860.067177] Call Trace:
      [ 1860.067191]  [<ffffffff81019b59>] dump_trace+0x59/0x340
      [ 1860.067195]  [<ffffffff81019f2a>] show_stack_log_lvl+0xea/0x170
      [ 1860.067197]  [<ffffffff8101acd1>] show_stack+0x21/0x40
      [ 1860.067201]  [<ffffffff813274b0>] dump_stack+0x5c/0x7c
      [ 1860.067205]  [<ffffffff8119a732>] warn_alloc_failed+0xe2/0x150
      [ 1860.067209]  [<ffffffff8119aba9>] __alloc_pages_nodemask+0x409/0xb80
      [ 1860.067212]  [<ffffffff8119b45a>] __alloc_page_frag+0x10a/0x120
      [ 1860.067216]  [<ffffffff8150c0f2>] __napi_alloc_skb+0x82/0xd0
      [ 1860.067222]  [<ffffffffa02db354>] cp_rx_poll+0x1b4/0x550 [8139cp]
      [ 1860.067232]  [<ffffffff8151aebc>] net_rx_action+0x15c/0x370
      [ 1860.067236]  [<ffffffff810852bc>] __do_softirq+0xec/0x300
      [ 1860.067238]  [<ffffffff8108578a>] irq_exit+0xfa/0x110
      [ 1860.067245]  [<ffffffff816181f1>] do_IRQ+0x51/0xe0
      [ 1860.067248]  [<ffffffff816156c9>] common_interrupt+0xc9/0xc9
      [ 1860.069507] DWARF2 unwinder stuck at ret_from_intr+0x0/0x1b
      [ 1860.069508] 
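      For reference (my reconstruction, not part of the report), the per-client health check embedded in the DEBUG MARKER lines above reads Lustre's `catastrophe` flag, which goes non-zero once the node has hit an LBUG, and propagates it as the exit status. Passing the value as an argument instead of calling `lctl` is an alteration here so the sketch runs without a Lustre mount.

```shell
# Reconstruction of the catastrophe check quoted in the console log above.
# On a real client, VAL comes from:
#   /usr/sbin/lctl get_param -n catastrophe
# Here it is passed as an argument so the logic can be exercised anywhere.
check_catastrophe() {
    rc=0
    val=$1   # original: val=$(/usr/sbin/lctl get_param -n catastrophe 2>&1)
    if [ "$val" -ne 0 ] 2>/dev/null; then
        echo "$(hostname -s): $val"   # report which node tripped the flag
        rc=$val
    fi
    return $rc
}
```

      A zero flag exits 0 (healthy); a non-zero flag echoes the hostname and value and exits non-zero, which is what makes the harness flag the client.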

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      recovery-random-scale test_fail_client_mds - test_fail_client_mds returned 5


            People

              Assignee: WC Triage (wc-triage)
              Reporter: Maloo (maloo)
              Votes: 0
              Watchers: 3
