Lustre / LU-10812

recovery-random-scale test_fail_client_mds: test_fail_client_mds returned 4


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.11.0
    • Components: None
    • Severity: 3

    Description

      This issue was created by maloo for sarah_lw <wei3.liu@intel.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/85ed0988-278a-11e8-9e0e-52540065bddc

      test_fail_client_mds failed with the following error:

      test_fail_client_mds returned 5
      

      Client and server: lustre-master, tag 2.10.59, on SLES12SP3.

      Client 2 hit "ll_ping: page allocation failure" (an order-0 GFP_ATOMIC allocation failed in the ptlrpc pinger thread).
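      As a triage aid (not part of the original report), allocation-failure bursts like the ones below can be tallied per process name from a saved console log. The function name and the log path are illustrative; the line format assumed is the standard kernel warn_alloc output quoted in the console log.

```shell
# Tally page-allocation failures per process name in a saved console log.
# Usage: alloc_failures <logfile>
# Assumes the kernel warn_alloc line format seen in this report:
#   [ <timestamp>] <comm>: page allocation failure: order:<n>, mode:<flags>
alloc_failures() {
    grep 'page allocation failure' "$1" \
      | sed 's/^\[ *[0-9.]*\] //' \
      | awk -F': ' '{count[$1]++} END {for (p in count) print p, count[p]}'
}
```

      Run against the console log of this failure, it would report the two ll_ping hits shown below.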

      console log

      [  972.544170] Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK              ELAPSED=357 DURATION=86400 PERIOD=1200
      [ 1005.987638] Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=357 DURATION=86400 PERIOD=1200
      [ 1022.308773] Lustre: DEBUG MARKER: rc=0;
      [ 1022.308773] 			val=$(/usr/sbin/lctl get_param -n catastrophe 2>&1);
      [ 1022.308773] 			if [[ $? -eq 0 && $val -ne 0 ]]; then
      [ 1022.308773] 				echo $(hostname -s): $val;
      [ 1022.308773] 				rc=$val;
      [ 1022.308773] 			fi;
      [ 1022.308773] 			exit $rc
      [ 1022.357893] Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
      [ 1022.566140] Lustre: DEBUG MARKER: /usr/sbin/lctl mark FAIL CLIENT onyx-47vm4...
      [ 1023.976580] Lustre: DEBUG MARKER: FAIL CLIENT onyx-47vm4...
      [ 1033.244860] Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1
      [ 1099.399100] Lustre: DEBUG MARKER: Starting failover on mds1
      [ 1123.785219] Lustre: 1838:0:(client.c:2100:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1520989165/real 1520989165]  req@ffff880040416c40 x1594871608337664/t0(0) o400->MGC10.2.8.216@tcp@10.2.8.217@tcp:26/25 lens 224/224 e 0 to 1 dl 1520989172 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      [ 1123.785238] LustreError: 166-1: MGC10.2.8.216@tcp: Connection to MGS (at 10.2.8.217@tcp) was lost; in progress operations using this service will fail
      [ 1158.804380] Lustre: 1838:0:(client.c:2100:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1520989206/real 1520989211]  req@ffff880064c4d080 x1594871608390224/t0(0) o400->lustre-MDT0000-mdc-ffff88007a3ec800@10.2.8.217@tcp:12/10 lens 224/224 e 0 to 1 dl 1520989255 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
      [ 1158.804393] Lustre: lustre-MDT0000-mdc-ffff88007a3ec800: Connection to lustre-MDT0000 (at 10.2.8.217@tcp) was lost; in progress operations using this service will wait for recovery to complete
      [ 1178.804585] Lustre: Evicted from MGS (at MGC10.2.8.216@tcp_0) after server handle changed from 0x620469374b42c3ff to 0x3e7afd771be5d16d
      [ 1178.806122] Lustre: MGC10.2.8.216@tcp: Connection restored to MGC10.2.8.216@tcp_0 (at 10.2.8.216@tcp)
      [ 1178.809735] LustreError: 1836:0:(client.c:2972:ptlrpc_replay_interpret()) @@@ status 301, old was 0  req@ffff88005eb3bc40 x1594871608322400/t8590014200(8590014200) o101->lustre-MDT0000-mdc-ffff88007a3ec800@10.2.8.216@tcp:12/10 lens 952/544 e 0 to 0 dl 1520989279 ref 2 fl Interpret:RP/4/0 rc 301/301
      [ 1179.747452] Lustre: 1838:0:(client.c:2100:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1520989176/real 1520989176]  req@ffff880069295cc0 x1594871608341840/t0(0) o400->lustre-MDT0000-mdc-ffff88007a3ec800@10.2.8.216@tcp:12/10 lens 224/224 e 0 to 1 dl 1520989225 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      [ 1179.747464] Lustre: 1838:0:(client.c:2100:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
      [ 1180.738657] Lustre: lustre-MDT0000-mdc-ffff88007a3ec800: Connection restored to 10.2.8.216@tcp (at 10.2.8.216@tcp)
      [ 1214.574355] Lustre: 1837:0:(client.c:2100:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1520989191/real 1520989191]  req@ffff880060bc8380 x1594871608349568/t0(0) o400->lustre-MDT0000-mdc-ffff88007a3ec800@10.2.8.216@tcp:12/10 lens 224/224 e 0 to 1 dl 1520989240 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      [ 1214.574362] Lustre: 1837:0:(client.c:2100:ptlrpc_expire_one_request()) Skipped 5 previous similar messages
      [ 1332.924289] Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-47vm4: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
      [ 1333.499978] Lustre: DEBUG MARKER: onyx-47vm4: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
      [ 1366.096335] Lustre: DEBUG MARKER: /usr/sbin/lctl mark Started client load: tar on onyx-47vm4
      [ 1366.593473] Lustre: DEBUG MARKER: Started client load: tar on onyx-47vm4
      [ 1368.645975] Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failed client reintegrated -- failure NOT OK
      [ 1369.798370] Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failed client reintegrated -- failure NOT OK
      [ 1370.023893] Lustre: DEBUG MARKER: rc=0;
      [ 1370.023893] 			val=$(/usr/sbin/lctl get_param -n catastrophe 2>&1);
      [ 1370.023893] 			if [[ $? -eq 0 && $val -ne 0 ]]; then
      [ 1370.023893] 				echo $(hostname -s): $val;
      [ 1370.023893] 				rc=$val;
      [ 1370.023893] 			fi;
      [ 1370.023893] 			exit $rc
      [ 1370.080186] Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
      [ 1370.190671] Lustre: DEBUG MARKER: /usr/sbin/lctl mark Number of failovers:
      [ 1370.190671] mds1 failed over 2 times                and counting...
      [ 1370.405207] Lustre: DEBUG MARKER: Number of failovers:
      [ 1859.772170] ll_ping: page allocation failure: order:0, mode:0x1080020(GFP_ATOMIC)
      [ 1859.772182] CPU: 0 PID: 1839 Comm: ll_ping Tainted: G           OE   N  4.4.114-94.11-default #1
      [ 1859.772183] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [ 1859.772189]  0000000000000000 ffffffff813274b0 0000000000000000 ffff88007fc03d00
      [ 1859.772190]  ffffffff8119a732 0108002000000030 000000000000008a 0000000000000400
      [ 1859.772191]  0000000000000194 0000000000000003 0000000000000001 0000000000000000
      [ 1859.772192] Call Trace:
      [ 1859.772207]  [<ffffffff81019b59>] dump_trace+0x59/0x340
      [ 1859.772211]  [<ffffffff81019f2a>] show_stack_log_lvl+0xea/0x170
      [ 1859.772213]  [<ffffffff8101acd1>] show_stack+0x21/0x40
      [ 1859.772219]  [<ffffffff813274b0>] dump_stack+0x5c/0x7c
      [ 1859.772223]  [<ffffffff8119a732>] warn_alloc_failed+0xe2/0x150
      [ 1859.772226]  [<ffffffff8119aba9>] __alloc_pages_nodemask+0x409/0xb80
      [ 1859.772229]  [<ffffffff8119b45a>] __alloc_page_frag+0x10a/0x120
      [ 1859.772235]  [<ffffffff8150c0f2>] __napi_alloc_skb+0x82/0xd0
      [ 1859.772242]  [<ffffffffa02db354>] cp_rx_poll+0x1b4/0x550 [8139cp]
      [ 1859.772251]  [<ffffffff8151aebc>] net_rx_action+0x15c/0x370
      [ 1859.772255]  [<ffffffff810852bc>] __do_softirq+0xec/0x300
      [ 1859.772258]  [<ffffffff8108578a>] irq_exit+0xfa/0x110
      [ 1859.772265]  [<ffffffff816181f1>] do_IRQ+0x51/0xe0
      [ 1859.772268]  [<ffffffff816156c9>] common_interrupt+0xc9/0xc9
      [ 1859.774527] DWARF2 unwinder stuck at ret_from_intr+0x0/0x1b
      [ 1859.774528] 
      [ 1859.774528] Leftover inexact backtrace:
      [ 1859.774528] 
      [ 1859.774534]  <IRQ>  <EOI>  [<ffffffff811a6820>] ? shrink_page_list+0x440/0x7f0
      [ 1859.774536]  [<ffffffff811a7190>] ? shrink_inactive_list+0x1f0/0x4f0
      [ 1859.774538]  [<ffffffff811a7fcb>] ? shrink_zone_memcg+0x2bb/0x6a0
      [ 1859.774540]  [<ffffffff811a8467>] ? shrink_zone+0xb7/0x260
      [ 1859.774542]  [<ffffffff811a896d>] ? do_try_to_free_pages+0x15d/0x450
      [ 1859.774543]  [<ffffffff811a8d1a>] ? try_to_free_pages+0xba/0x170
      [ 1859.774545]  [<ffffffff8119ad93>] ? __alloc_pages_nodemask+0x5f3/0xb80
      [ 1859.774547]  [<ffffffff811e94cd>] ? kmem_getpages+0x4d/0xf0
      [ 1859.774548]  [<ffffffff811eacc9>] ? fallback_alloc+0x199/0x260
      [ 1859.774550]  [<ffffffff811eb3f9>] ? kmem_cache_alloc+0x1f9/0x460
      [ 1859.774598]  [<ffffffffa0ad3dc6>] ? ptlrpc_request_cache_alloc+0x26/0x100 [ptlrpc]
      [ 1859.774621]  [<ffffffffa0ad3ebe>] ? ptlrpc_request_alloc_internal+0x1e/0x420 [ptlrpc]
      [ 1859.774644]  [<ffffffffa0adc408>] ? ptlrpc_request_alloc_pack+0x18/0x50 [ptlrpc]
      [ 1859.774669]  [<ffffffffa0af7c8c>] ? ptlrpc_prep_ping+0x1c/0x40 [ptlrpc]
      [ 1859.774693]  [<ffffffffa0af8105>] ? ptlrpc_pinger_main+0x335/0xa90 [ptlrpc]
      [ 1859.774696]  [<ffffffff810ab000>] ? wake_up_q+0x80/0x80
      [ 1859.774718]  [<ffffffffa0af7dd0>] ? ptlrpc_obd_ping+0x120/0x120 [ptlrpc]
      [ 1859.774720]  [<ffffffff8109e8f9>] ? kthread+0xc9/0xe0
      [ 1859.774722]  [<ffffffff8109e830>] ? kthread_park+0x50/0x50
      [ 1859.774723]  [<ffffffff81614f45>] ? ret_from_fork+0x55/0x80
      [ 1859.774725]  [<ffffffff8109e830>] ? kthread_park+0x50/0x50
      [ 1860.067158] ll_ping: page allocation failure: order:0, mode:0x1080020(GFP_ATOMIC)
      [ 1860.067163] CPU: 0 PID: 1839 Comm: ll_ping Tainted: G           OE   N  4.4.114-94.11-default #1
      [ 1860.067170] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [ 1860.067174]  0000000000000000 ffffffff813274b0 0000000000000000 ffff88007fc03d00
      [ 1860.067175]  ffffffff8119a732 0108002000000030 0000000000000005 0000000000000400
      [ 1860.067177]  00000000000002f3 0000000000000001 0000000000000001 0000000000000000
      [ 1860.067177] Call Trace:
      [ 1860.067191]  [<ffffffff81019b59>] dump_trace+0x59/0x340
      [ 1860.067195]  [<ffffffff81019f2a>] show_stack_log_lvl+0xea/0x170
      [ 1860.067197]  [<ffffffff8101acd1>] show_stack+0x21/0x40
      [ 1860.067201]  [<ffffffff813274b0>] dump_stack+0x5c/0x7c
      [ 1860.067205]  [<ffffffff8119a732>] warn_alloc_failed+0xe2/0x150
      [ 1860.067209]  [<ffffffff8119aba9>] __alloc_pages_nodemask+0x409/0xb80
      [ 1860.067212]  [<ffffffff8119b45a>] __alloc_page_frag+0x10a/0x120
      [ 1860.067216]  [<ffffffff8150c0f2>] __napi_alloc_skb+0x82/0xd0
      [ 1860.067222]  [<ffffffffa02db354>] cp_rx_poll+0x1b4/0x550 [8139cp]
      [ 1860.067232]  [<ffffffff8151aebc>] net_rx_action+0x15c/0x370
      [ 1860.067236]  [<ffffffff810852bc>] __do_softirq+0xec/0x300
      [ 1860.067238]  [<ffffffff8108578a>] irq_exit+0xfa/0x110
      [ 1860.067245]  [<ffffffff816181f1>] do_IRQ+0x51/0xe0
      [ 1860.067248]  [<ffffffff816156c9>] common_interrupt+0xc9/0xc9
      [ 1860.069507] DWARF2 unwinder stuck at ret_from_intr+0x0/0x1b
      [ 1860.069508] 
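      For reference (my reconstruction, not part of the report), the per-client health check embedded in the DEBUG MARKER lines above reads Lustre's `catastrophe` flag, which goes non-zero once the node has hit an LBUG, and propagates it as the exit status. Passing the value as an argument instead of calling `lctl` is an alteration here so the sketch runs without a Lustre mount.

```shell
# Reconstruction of the catastrophe check quoted in the console log above.
# On a real client, VAL comes from:
#   /usr/sbin/lctl get_param -n catastrophe
# Here it is passed as an argument so the logic can be exercised anywhere.
check_catastrophe() {
    rc=0
    val=$1   # original: val=$(/usr/sbin/lctl get_param -n catastrophe 2>&1)
    if [ "$val" -ne 0 ] 2>/dev/null; then
        echo "$(hostname -s): $val"   # report which node tripped the flag
        rc=$val
    fi
    return $rc
}
```

      A zero flag exits 0 (healthy); a non-zero flag echoes the hostname and value and exits non-zero, which is what makes the harness flag the client.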

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      recovery-random-scale test_fail_client_mds - test_fail_client_mds returned 5


            People

              Assignee: WC Triage (wc-triage)
              Reporter: Maloo (maloo)
              Votes: 0
              Watchers: 3
