LU-7788

replay-single test_70b timeout with MDS OOM


Details

    • Bug
    • Resolution: Cannot Reproduce
    • Minor
    • None
    • Lustre 2.9.0
    • None
    • autotest review-dne-part-2
    • 3
    • 9223372036854775807

    Description

      replay-single test_70b times out with an out-of-memory (OOM) condition on the MDS. The test_log contains:

      shadow-23vm5:    1  cleanup 1493 sec
      shadow-23vm1:    1  cleanup 1492 sec
      shadow-23vm1:  rpc : @@@@@@ FAIL: can't put import for mdc.lustre-MDT0000-mdc-*.mds_server_uuid into FULL state after 1475 sec, have REPLAY 
      shadow-23vm5:  rpc : @@@@@@ FAIL: can't put import for mdc.lustre-MDT0000-mdc-*.mds_server_uuid into FULL state after 1475 sec, have REPLAY_LOCKS 
      shadow-23vm5:    1  cleanup 1494 sec
      shadow-23vm1:   Trace dump:
      shadow-23vm1:   = /usr/lib64/lustre/tests/test-framework.sh:4670:error_noexit()
      shadow-23vm1:   = /usr/lib64/lustre/tests/test-framework.sh:4704:error()
      shadow-23vm1:   = /usr/lib64/lustre/tests/test-framework.sh:5766:_wait_import_state()
      shadow-23vm1:   = /usr/lib64/lustre/tests/test-framework.sh:5788:wait_import_state()
      shadow-23vm1:   = /usr/lib64/lustre/tests/test-framework.sh:5797:wait_import_state_mount()
      shadow-23vm1:   = rpc.sh:20:main()
      shadow-23vm5:   Trace dump:
      shadow-23vm5:   = /usr/lib64/lustre/tests/test-framework.sh:4670:error_noexit()
      shadow-23vm5:   = /usr/lib64/lustre/tests/test-framework.sh:4704:error()
      shadow-23vm5:   = /usr/lib64/lustre/tests/test-framework.sh:5766:_wait_import_state()
      shadow-23vm5:   = /usr/lib64/lustre/tests/test-framework.sh:5788:wait_import_state()
      shadow-23vm5:   = /usr/lib64/lustre/tests/test-framework.sh:5797:wait_import_state_mount()
      shadow-23vm5:   = rpc.sh:20:main()
      shadow-23vm1: CMD: shadow-23vm1.shadow.whamcloud.com,shadow-23vm3,shadow-23vm4,shadow-23vm5,shadow-23vm8 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:./../utils:/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/qt-3.3/bin:/usr/lib64/openmpi/bin:/usr/bin:/bin:/usr/sbin:/sbin::/sbin:/bin:/usr/sbin::/sbin:/bin:/usr/sbin: NAME=autotest_config sh rpc.sh check_logdir /tmp/test_logs/1455616347 
      shadow-23vm1:    1  cleanup 1493 sec
      shadow-23vm5: CMD: shadow-23vm3,shadow-23vm4,shadow-23vm5,shadow-23vm5.shadow.whamcloud.com,shadow-23vm8 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:./../utils:/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/qt-3.3/bin:/usr/lib64/openmpi/bin:/usr/bin:/bin:/usr/sbin:/sbin::/sbin:/bin:/usr/sbin::/sbin:/bin:/usr/sbin: NAME=autotest_config sh rpc.sh check_logdir /tmp/test_logs/1455616347 
      shadow-23vm5:    1  cleanup 1495 sec
      shadow-23vm1: CMD: shadow-23vm1.shadow.whamcloud.com uname -n
      shadow-23vm1: CMD: shadow-23vm3 uname -n
      shadow-23vm5: CMD: shadow-23vm3 uname -n
      shadow-23vm1:    1  cleanup 1494 sec
      shadow-23vm1: Dumping lctl log to /tmp/test_logs/1455616347/rpc..*.1455617834.log
      shadow-23vm1: CMD: shadow-23vm1.shadow.whamcloud.com,shadow-23vm3,shadow-23vm4,shadow-23vm5,shadow-23vm8 /usr/sbin/lctl dk > /tmp/test_logs/1455616347/rpc..debug_log.\$(hostname -s).1455617834.log;
      shadow-23vm1:          dmesg > /tmp/test_logs/1455616347/rpc..dmesg.\$(hostname -s).1455617834.log
      shadow-23vm5: Dumping lctl log to /tmp/test_logs/1455616347/rpc..*.1455617834.log
      shadow-23vm5: CMD: shadow-23vm3,shadow-23vm4,shadow-23vm5,shadow-23vm5.shadow.whamcloud.com,shadow-23vm8 /usr/sbin/lctl dk > /tmp/test_logs/1455616347/rpc..debug_log.\$(hostname -s).1455617834.log;
      shadow-23vm5:          dmesg > /tmp/test_logs/1455616347/rpc..dmesg.\$(hostname -s).1455617834.log
      shadow-23vm5:    1  cleanup 1496 sec
      shadow-23vm5: CMD: shadow-23vm3,shadow-23vm4,shadow-23vm5,shadow-23vm5.shadow.whamcloud.com,shadow-23vm8 rsync -az /tmp/test_logs/1455616347/rpc..*.1455617834.log shadow-23vm5.shadow.whamcloud.com:/tmp/test_logs/1455616347
      shadow-23vm1: CMD: shadow-23vm1.shadow.whamcloud.com,shadow-23vm3,shadow-23vm4,shadow-23vm5,shadow-23vm8 rsync -az /tmp/test_logs/1455616347/rpc..*.1455617834.log shadow-23vm1.shadow.whamcloud.com:/tmp/test_logs/1455616347
      shadow-23vm1:    1  cleanup 1495 sec
      shadow-23vm5:    1  cleanup 1497 sec
      shadow-23vm5: CMD: shadow-23vm3,shadow-23vm4,shadow-23vm5,shadow-23vm5.shadow.whamcloud.com,shadow-23vm8 lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null || true
      shadow-23vm1: CMD: shadow-23vm1.shadow.whamcloud.com,shadow-23vm3,shadow-23vm4,shadow-23vm5,shadow-23vm8 lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null || true
      shadow-23vm1:    1  cleanup 1496 sec
      shadow-23vm1: Resetting fail_loc on all nodes...done.
      shadow-23vm5: Resetting fail_loc on all nodes...done.
       replay-single test_70b: @@@@@@ FAIL: import is not in FULL state 
      shadow-23vm5:    1  cleanup 1498 sec
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:4670:error_noexit()
        = /usr/lib64/lustre/tests/test-framework.sh:4704:error()
        = /usr/lib64/lustre/tests/test-framework.sh:6020:wait_clients_import_state()
        = /usr/lib64/lustre/tests/test-framework.sh:2547:fail()
        = /usr/lib64/lustre/tests/replay-single.sh:2088:test_70b()
        = /usr/lib64/lustre/tests/test-framework.sh:4951:run_one()
        = /usr/lib64/lustre/tests/test-framework.sh:4988:run_one_logged()
        = /usr/lib64/lustre/tests/test-framework.sh:4853:run_test()
        = /usr/lib64/lustre/tests/replay-single.sh:2099:main()
      Dumping lctl log to /logdir/test_logs/2016-02-16/lustre-reviews-el6_7-x86_64--review-dne-part-2--1_7_1__37414__-70019700416340-012040/replay-single.test_70b.*.1455617837.log
      CMD: shadow-23vm1.shadow.whamcloud.com,shadow-23vm3,shadow-23vm4,shadow-23vm5,shadow-23vm8 /usr/sbin/lctl dk > /logdir/test_logs/2016-02-16/lustre-reviews-el6_7-x86_64--review-dne-part-2--1_7_1__37414__-70019700416340-012040/replay-single.test_70b.debug_log.\$(hostname -s).1455617837.log;
               dmesg > /logdir/test_logs/2016-02-16/lustre-reviews-el6_7-x86_64--review-dne-part-2--1_7_1__37414__-70019700416340-012040/replay-single.test_70b.dmesg.\$(hostname -s).1455617837.log
      shadow-23vm1:    1  cleanup 1497 sec
      Resetting fail_loc on all nodes...CMD: shadow-23vm1.shadow.whamcloud.com,shadow-23vm3,shadow-23vm4,shadow-23vm5,shadow-23vm8 lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null || true
      done.
      shadow-23vm5:    1  cleanup 1499 sec
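
      The FAIL above comes from the framework polling the client-side MDC import until it reaches FULL. A minimal standalone sketch of the same check (parameter name taken from the FAIL message; the 300-second timeout and 5-second interval are arbitrary, and it assumes the *_server_uuid parameter prints "<uuid> <state>"):

      #!/bin/bash
      # Poll the MDC import state on a client until it reaches FULL,
      # i.e. the condition wait_import_state_mount() never saw above.
      param="mdc.lustre-MDT0000-mdc-*.mds_server_uuid"
      deadline=$((SECONDS + 300))
      while [ "$SECONDS" -lt "$deadline" ]; do
          # Expected output is e.g. "lustre-MDT0000_UUID FULL"; field 2 is the import state.
          state=$(lctl get_param -n "$param" 2>/dev/null | awk '{print $2}' | sort -u)
          if [ "$state" = "FULL" ]; then
              echo "import is FULL"
              exit 0
          fi
          echo "import state: ${state:-unknown}, still waiting..."
          sleep 5
      done
      echo "import never reached FULL (last state: ${state:-unknown})" >&2
      exit 1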
      

      From the console log on MDS 1

      03:31:50:LustreError: 32643:0:(ldlm_lib.c:2751:target_queue_recovery_request()) @@@ dropping resent queued req  req@ffff88007bd94680 x1526321429797936/t0(279172874722) o36->28d5e62d-c7a9-49de-9892-e9e751880a22@10.1.5.22@tcp:282/0 lens 760/0 e 0 to 0 dl 1455618387 ref 2 fl Interpret:/6/ffffffff rc 0/-1
      03:31:50:LustreError: 32643:0:(ldlm_lib.c:2751:target_queue_recovery_request()) Skipped 90 previous similar messages
      03:31:50:Lustre: lustre-MDT0000: Recovery already passed deadline 37:38, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.
      03:31:50:Lustre: Skipped 490 previous similar messages
      03:31:50:mdt_out00_001 invoked oom-killer: gfp_mask=0x80d2, order=0, oom_adj=0, oom_score_adj=0
      03:31:50:mdt_out00_001 cpuset=/ mems_allowed=0
      03:31:50:Pid: 32649, comm: mdt_out00_001 Not tainted 2.6.32-573.12.1.el6_lustre.g93f956d.x86_64 #1
      03:31:50:Call Trace:
      03:31:50: [<ffffffff810d71a1>] ? cpuset_print_task_mems_allowed+0x91/0xb0
      03:31:50: [<ffffffff8112a9a0>] ? dump_header+0x90/0x1b0
      03:31:50: [<ffffffff8112ab0e>] ? check_panic_on_oom+0x4e/0x80
      03:31:50: [<ffffffff8112b1fb>] ? out_of_memory+0x1bb/0x3c0
      03:31:50: [<ffffffff81137c3c>] ? __alloc_pages_nodemask+0x93c/0x950
      03:31:50: [<ffffffff81161449>] ? __vmalloc_area_node+0xb9/0x190
      03:31:50: [<ffffffffa07ab7ae>] ? ptlrpc_grow_req_bufs+0x33e/0x710 [ptlrpc]
      03:31:50: [<ffffffff8116131d>] ? __vmalloc_node+0xad/0x120
      03:31:50: [<ffffffffa07ab7ae>] ? ptlrpc_grow_req_bufs+0x33e/0x710 [ptlrpc]
      03:31:50: [<ffffffff81161739>] ? vzalloc_node+0x29/0x30
      03:31:50: [<ffffffffa07ab7ae>] ? ptlrpc_grow_req_bufs+0x33e/0x710 [ptlrpc]
      03:31:50: [<ffffffffa07ae08d>] ? ptlrpc_main+0x114d/0x1800 [ptlrpc]
      03:31:50: [<ffffffffa07acf40>] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
      03:31:50: [<ffffffff810a0fce>] ? kthread+0x9e/0xc0
      03:31:50: [<ffffffff8100c28a>] ? child_rip+0xa/0x20
      03:31:50: [<ffffffff810a0f30>] ? kthread+0x0/0xc0
      03:31:50: [<ffffffff8100c280>] ? child_rip+0x0/0x20
      03:31:50:Mem-Info:
      03:31:50:Node 0 DMA per-cpu:
      03:31:50:CPU    0: hi:    0, btch:   1 usd:   0
      03:31:50:Node 0 DMA32 per-cpu:
      03:31:50:CPU    0: hi:  186, btch:  31 usd:  30
      03:31:50:active_anon:1793 inactive_anon:1851 isolated_anon:0
      03:31:50: active_file:49 inactive_file:65 isolated_file:0
      03:31:50: unevictable:0 dirty:2 writeback:1856 unstable:0
      03:31:50: free:13242 slab_reclaimable:1623 slab_unreclaimable:336645
      03:31:50: mapped:7 shmem:0 pagetables:566 bounce:0
      03:31:50:Node 0 DMA free:8348kB min:332kB low:412kB high:496kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15348kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:7168kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
      03:31:50:lowmem_reserve[]: 0 2004 2004 2004
      03:31:50:Node 0 DMA32 free:44620kB min:44720kB low:55900kB high:67080kB active_anon:7172kB inactive_anon:7404kB active_file:196kB inactive_file:260kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2052308kB mlocked:0kB dirty:8kB writeback:7424kB mapped:28kB shmem:0kB slab_reclaimable:6492kB slab_unreclaimable:1339412kB kernel_stack:3104kB pagetables:2264kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:3904 all_unreclaimable? no
      03:31:50:lowmem_reserve[]: 0 0 0 0
      03:31:50:Node 0 DMA: 1*4kB 1*8kB 1*16kB 0*32kB 0*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 2*2048kB 1*4096kB = 8348kB
      03:31:50:Node 0 DMA32: 1659*4kB 974*8kB 457*16kB 265*32kB 125*64kB 44*128kB 1*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 44620kB
      03:31:50:1969 total pagecache pages
      03:31:50:1861 pages in swap cache
      03:31:50:Swap cache stats: add 1861, delete 0, find 0/0
      03:31:50:Free swap  = 4121320kB
      03:31:50:Total swap = 4128764kB
      03:31:50:524284 pages RAM
      03:31:50:43736 pages reserved
      03:31:50:138 pages shared
      03:31:50:139521 pages non-shared
      03:31:50:[ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
      03:31:50:[  421]     0   421     2663       19   0     -17         -1000 udevd
      03:31:50:[  708]     0   708     2664       13   0     -17         -1000 udevd
      03:31:50:[ 1107]     0  1107     6902       37   0     -17         -1000 auditd
      03:31:50:[ 1141]     0  1141    62272      249   0       0             0 rsyslogd
      03:31:50:[ 1192]    32  1192     4744       15   0       0             0 rpcbind
      03:31:50:[ 1214]    29  1214     5837        1   0       0             0 rpc.statd
      03:31:50:[ 1248]    81  1248     6418        1   0       0             0 dbus-daemon
      03:31:50:[ 1270]     0  1270    53919       26   0       0             0 ypbind
      03:31:50:[ 1302]     0  1302    47233        1   0       0             0 cupsd
      03:31:50:[ 1358]     0  1358     1020        0   0       0             0 acpid
      03:31:50:[ 1370]    68  1370    10520       81   0       0             0 hald
      03:31:50:[ 1371]     0  1371     5099        1   0       0             0 hald-runner
      03:31:50:[ 1403]     0  1403     5629        1   0       0             0 hald-addon-inpu
      03:31:50:[ 1413]    68  1413     4501        1   0       0             0 hald-addon-acpi
      03:31:50:[ 1459]     0  1459    26827        0   0       0             0 rpc.rquotad
      03:31:50:[ 1464]     0  1464     5417        0   0       0             0 rpc.mountd
      03:31:50:[ 1508]     0  1508     6292        1   0       0             0 rpc.idmapd
      03:31:50:[ 1545]   498  1545    57325      156   0       0             0 munged
      03:31:50:[ 8121]     0  8121    16555        0   0     -17         -1000 sshd
      03:31:50:[ 8132]     0  8132     5429       24   0       0             0 xinetd
      03:31:50:[ 8164]     0  8164    22210      291   0       0             0 sendmail
      03:31:50:[ 8173]    51  8173    20073      364   0       0             0 sendmail
      03:31:50:[ 8201]     0  8201    29216      152   0       0             0 crond
      03:31:50:[ 8219]     0  8219     5276       44   0       0             0 atd
      03:31:50:[ 8232]     0  8232     1020       22   0       0             0 agetty
      03:31:50:[ 8234]     0  8234     1016       22   0       0             0 mingetty
      03:31:50:[ 8236]     0  8236     1016       22   0       0             0 mingetty
      03:31:50:[ 8238]     0  8238     1016       22   0       0             0 mingetty
      03:31:50:[ 8240]     0  8240     2662       19   0     -17         -1000 udevd
      03:31:50:[ 8241]     0  8241     1016       21   0       0             0 mingetty
      03:31:50:[ 8243]     0  8243     1016       21   0       0             0 mingetty
      03:31:50:[ 8245]     0  8245     1016       21   0       0             0 mingetty
      03:31:50:[ 9665]    38  9665     7689      157   0       0             0 ntpd
      03:31:50:Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled
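
      In the Mem-Info dump above, slab_unreclaimable is about 1.3 GB (1339412 kB) out of roughly 2 GB of RAM, and the allocation that trips the OOM killer comes from ptlrpc_grow_req_bufs(). A simple way to watch memory on the MDS while recovery runs, to see which cache keeps growing, is to sample /proc/meminfo and the largest slab caches (standard tools only; the 10-second interval is arbitrary):

      # Sample free memory, unreclaimable slab, and vmalloc usage, plus the
      # largest slab caches, every 10 seconds while recovery is in progress.
      while :; do
          date
          grep -E '^(MemFree|SUnreclaim|VmallocUsed)' /proc/meminfo
          # Top slab caches by cache size (requires procps slabtop).
          slabtop -o -s c | head -n 12
          echo "----"
          sleep 10
      done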
      

      From the other MDS’s console

      03:01:01:LustreError: 11053:0:(import.c:675:ptlrpc_connect_import()) already connecting
      03:01:01:LustreError: 11053:0:(import.c:675:ptlrpc_connect_import()) Skipped 5 previous similar messages
      03:01:01:INFO: task mdt_out00_004:7712 blocked for more than 120 seconds.
      03:01:01:      Not tainted 2.6.32-573.12.1.el6_lustre.g93f956d.x86_64 #1
      03:01:01:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      03:01:01:mdt_out00_004 D 0000000000000000     0  7712      2 0x00000080
      03:01:01: ffff880059c4bb50 0000000000000046 0000000000000000 0000000000000000
      03:01:01: ffff88004abd30c0 ffff880050b00118 ffff880052841000 ffff880050b00118
      03:01:01: ffff880059c4bb50 ffffffffa05804cf ffff88004df4f068 ffff880059c4bfd8
      03:01:01:Call Trace:
      03:01:01: [<ffffffffa05804cf>] ? lu_object_find_try+0x9f/0x260 [obdclass]
      03:01:01: [<ffffffffa05806cd>] lu_object_find_at+0x3d/0xe0 [obdclass]
      03:01:01: [<ffffffffa07c3352>] ? __req_capsule_get+0x162/0x6e0 [ptlrpc]
      03:01:01: [<ffffffff810672b0>] ? default_wake_function+0x0/0x20
      03:01:01: [<ffffffffa0798dd0>] ? lustre_swab_object_update_reply+0x0/0xc0 [ptlrpc]
      03:01:01: [<ffffffffa05814dc>] dt_locate_at+0x1c/0xa0 [obdclass]
      03:01:01: [<ffffffffa0809980>] out_handle+0x1030/0x1880 [ptlrpc]
      03:01:01: [<ffffffff8153a65e>] ? mutex_lock+0x1e/0x50
      03:01:01: [<ffffffffa07f91da>] ? req_can_reconstruct+0x6a/0x120 [ptlrpc]
      03:01:01: [<ffffffffa0800c3c>] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
      03:01:01: [<ffffffffa07adc61>] ptlrpc_main+0xd21/0x1800 [ptlrpc]
      03:01:01: [<ffffffffa07acf40>] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
      03:01:01: [<ffffffff810a0fce>] kthread+0x9e/0xc0
      03:01:01: [<ffffffff8100c28a>] child_rip+0xa/0x20
      03:01:01: [<ffffffff810a0f30>] ? kthread+0x0/0xc0
      03:01:01: [<ffffffff8100c280>] ? child_rip+0x0/0x20
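
      The panic at the end of the MDS 1 log is the result of the system-wide panic_on_oom setting, while the second MDS only gets as far as hung-task warnings for its mdt_out threads. Both are ordinary sysctls, so their values on a test node can be inspected (and, for interactive debugging, relaxed) along these lines; the values shown are examples, not what autotest configures:

      # Show the current OOM-panic and hung-task settings on the MDS node.
      sysctl vm.panic_on_oom kernel.hung_task_timeout_secs

      # Example only: disable the panic so the node survives the OOM and the
      # killer's choice of victim can be inspected in dmesg afterwards.
      sysctl -w vm.panic_on_oom=0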
      

      There are two recent failures at
      2016-02-16 09:12:14 - https://testing.hpdd.intel.com/test_sets/2485a3b0-d4d2-11e5-9e3f-5254006e85c2
      2016-02-16 16:13:20 - https://testing.hpdd.intel.com/test_sets/428f6ebc-d50f-11e5-8530-5254006e85c2
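
      The "Recovery already passed deadline" message on MDS 1 suggests either waiting longer or aborting the recovery. If the DNE recovery really is stuck, aborting it on the affected target is done with lctl, roughly as follows (device name taken from the logs above):

      # Check how far recovery has progressed on the stuck target ...
      lctl get_param mdt.lustre-MDT0000.recovery_status
      # ... and, if it is never going to complete, abort it so the remaining
      # clients are evicted and the target becomes available again.
      lctl --device lustre-MDT0000 abort_recovery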

            People

              Assignee: WC Triage (wc-triage)
              Reporter: James Nunez (Inactive) (jamesanunez)
