Details
- Type: Bug
- Resolution: Cannot Reproduce
- Priority: Minor
- Fix Version/s: None
- Affects Version/s: Lustre 2.9.0
- Component/s: None
- Labels: autotest review-dne-part-2
- Severity: 3
Description
replay-single test_70b times out and the MDT hits an out-of-memory panic. The test_log contains:
shadow-23vm5: 1 cleanup 1493 sec shadow-23vm1: 1 cleanup 1492 sec shadow-23vm1: rpc : @@@@@@ FAIL: can't put import for mdc.lustre-MDT0000-mdc-*.mds_server_uuid into FULL state after 1475 sec, have REPLAY shadow-23vm5: rpc : @@@@@@ FAIL: can't put import for mdc.lustre-MDT0000-mdc-*.mds_server_uuid into FULL state after 1475 sec, have REPLAY_LOCKS shadow-23vm5: 1 cleanup 1494 sec shadow-23vm1: Trace dump: shadow-23vm1: = /usr/lib64/lustre/tests/test-framework.sh:4670:error_noexit() shadow-23vm1: = /usr/lib64/lustre/tests/test-framework.sh:4704:error() shadow-23vm1: = /usr/lib64/lustre/tests/test-framework.sh:5766:_wait_import_state() shadow-23vm1: = /usr/lib64/lustre/tests/test-framework.sh:5788:wait_import_state() shadow-23vm1: = /usr/lib64/lustre/tests/test-framework.sh:5797:wait_import_state_mount() shadow-23vm1: = rpc.sh:20:main() shadow-23vm5: Trace dump: shadow-23vm5: = /usr/lib64/lustre/tests/test-framework.sh:4670:error_noexit() shadow-23vm5: = /usr/lib64/lustre/tests/test-framework.sh:4704:error() shadow-23vm5: = /usr/lib64/lustre/tests/test-framework.sh:5766:_wait_import_state() shadow-23vm5: = /usr/lib64/lustre/tests/test-framework.sh:5788:wait_import_state() shadow-23vm5: = /usr/lib64/lustre/tests/test-framework.sh:5797:wait_import_state_mount() shadow-23vm5: = rpc.sh:20:main() shadow-23vm1: CMD: shadow-23vm1.shadow.whamcloud.com,shadow-23vm3,shadow-23vm4,shadow-23vm5,shadow-23vm8 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:./../utils:/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/qt-3.3/bin:/usr/lib64/openmpi/bin:/usr/bin:/bin:/usr/sbin:/sbin::/sbin:/bin:/usr/sbin::/sbin:/bin:/usr/sbin: NAME=autotest_config sh rpc.sh check_logdir /tmp/test_logs/1455616347 shadow-23vm1: 1 cleanup 1493 sec shadow-23vm5: CMD: shadow-23vm3,shadow-23vm4,shadow-23vm5,shadow-23vm5.shadow.whamcloud.com,shadow-23vm8 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:./../utils:/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/qt-3.3/bin:/usr/lib64/openmpi/bin:/usr/bin:/bin:/usr/sbin:/sbin::/sbin:/bin:/usr/sbin::/sbin:/bin:/usr/sbin: NAME=autotest_config sh rpc.sh check_logdir /tmp/test_logs/1455616347 shadow-23vm5: 1 cleanup 1495 sec shadow-23vm1: CMD: shadow-23vm1.shadow.whamcloud.com uname -n shadow-23vm1: CMD: shadow-23vm3 uname -n shadow-23vm5: CMD: shadow-23vm3 uname -n shadow-23vm1: 1 cleanup 1494 sec shadow-23vm1: Dumping lctl log to /tmp/test_logs/1455616347/rpc..*.1455617834.log shadow-23vm1: CMD: shadow-23vm1.shadow.whamcloud.com,shadow-23vm3,shadow-23vm4,shadow-23vm5,shadow-23vm8 /usr/sbin/lctl dk > /tmp/test_logs/1455616347/rpc..debug_log.\$(hostname -s).1455617834.log; shadow-23vm1: dmesg > /tmp/test_logs/1455616347/rpc..dmesg.\$(hostname -s).1455617834.log shadow-23vm5: Dumping lctl log to /tmp/test_logs/1455616347/rpc..*.1455617834.log shadow-23vm5: CMD: shadow-23vm3,shadow-23vm4,shadow-23vm5,shadow-23vm5.shadow.whamcloud.com,shadow-23vm8 /usr/sbin/lctl dk > /tmp/test_logs/1455616347/rpc..debug_log.\$(hostname 
-s).1455617834.log; shadow-23vm5: dmesg > /tmp/test_logs/1455616347/rpc..dmesg.\$(hostname -s).1455617834.log shadow-23vm5: 1 cleanup 1496 sec shadow-23vm5: CMD: shadow-23vm3,shadow-23vm4,shadow-23vm5,shadow-23vm5.shadow.whamcloud.com,shadow-23vm8 rsync -az /tmp/test_logs/1455616347/rpc..*.1455617834.log shadow-23vm5.shadow.whamcloud.com:/tmp/test_logs/1455616347 shadow-23vm1: CMD: shadow-23vm1.shadow.whamcloud.com,shadow-23vm3,shadow-23vm4,shadow-23vm5,shadow-23vm8 rsync -az /tmp/test_logs/1455616347/rpc..*.1455617834.log shadow-23vm1.shadow.whamcloud.com:/tmp/test_logs/1455616347 shadow-23vm1: 1 cleanup 1495 sec shadow-23vm5: 1 cleanup 1497 sec shadow-23vm5: CMD: shadow-23vm3,shadow-23vm4,shadow-23vm5,shadow-23vm5.shadow.whamcloud.com,shadow-23vm8 lctl set_param -n fail_loc=0 fail_val=0 2>/dev/null || true shadow-23vm1: CMD: shadow-23vm1.shadow.whamcloud.com,shadow-23vm3,shadow-23vm4,shadow-23vm5,shadow-23vm8 lctl set_param -n fail_loc=0 fail_val=0 2>/dev/null || true shadow-23vm1: 1 cleanup 1496 sec shadow-23vm1: Resetting fail_loc on all nodes...done. shadow-23vm5: Resetting fail_loc on all nodes...done. replay-single test_70b: @@@@@@ FAIL: import is not in FULL state shadow-23vm5: 1 cleanup 1498 sec Trace dump: = /usr/lib64/lustre/tests/test-framework.sh:4670:error_noexit() = /usr/lib64/lustre/tests/test-framework.sh:4704:error() = /usr/lib64/lustre/tests/test-framework.sh:6020:wait_clients_import_state() = /usr/lib64/lustre/tests/test-framework.sh:2547:fail() = /usr/lib64/lustre/tests/replay-single.sh:2088:test_70b() = /usr/lib64/lustre/tests/test-framework.sh:4951:run_one() = /usr/lib64/lustre/tests/test-framework.sh:4988:run_one_logged() = /usr/lib64/lustre/tests/test-framework.sh:4853:run_test() = /usr/lib64/lustre/tests/replay-single.sh:2099:main() Dumping lctl log to /logdir/test_logs/2016-02-16/lustre-reviews-el6_7-x86_64--review-dne-part-2--1_7_1__37414__-70019700416340-012040/replay-single.test_70b.*.1455617837.log CMD: shadow-23vm1.shadow.whamcloud.com,shadow-23vm3,shadow-23vm4,shadow-23vm5,shadow-23vm8 /usr/sbin/lctl dk > /logdir/test_logs/2016-02-16/lustre-reviews-el6_7-x86_64--review-dne-part-2--1_7_1__37414__-70019700416340-012040/replay-single.test_70b.debug_log.\$(hostname -s).1455617837.log; dmesg > /logdir/test_logs/2016-02-16/lustre-reviews-el6_7-x86_64--review-dne-part-2--1_7_1__37414__-70019700416340-012040/replay-single.test_70b.dmesg.\$(hostname -s).1455617837.log shadow-23vm1: 1 cleanup 1497 sec Resetting fail_loc on all nodes...CMD: shadow-23vm1.shadow.whamcloud.com,shadow-23vm3,shadow-23vm4,shadow-23vm5,shadow-23vm8 lctl set_param -n fail_loc=0 fail_val=0 2>/dev/null || true done. shadow-23vm5: 1 cleanup 1499 sec
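The check that fails here is _wait_import_state() in test-framework.sh (reached via wait_clients_import_state()), which polls each client's MDC import until it reports FULL; in this run the imports are still in REPLAY / REPLAY_LOCKS after about 1475 seconds. For reference, a minimal sketch of the same check done by hand on a client (same parameter names as in the log above; the exact output format may differ between Lustre versions):

# Connection state the test framework polls (stuck in REPLAY here):
lctl get_param "mdc.lustre-MDT0000-mdc-*.mds_server_uuid"

# The full import state machine (FULL, REPLAY, REPLAY_LOCKS, ...) is also
# visible in the import file:
lctl get_param "mdc.lustre-MDT0000-mdc-*.import" | grep -E 'state|target'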
From the console log on MDS 1:
03:31:50:LustreError: 32643:0:(ldlm_lib.c:2751:target_queue_recovery_request()) @@@ dropping resent queued req req@ffff88007bd94680 x1526321429797936/t0(279172874722) o36->28d5e62d-c7a9-49de-9892-e9e751880a22@10.1.5.22@tcp:282/0 lens 760/0 e 0 to 0 dl 1455618387 ref 2 fl Interpret:/6/ffffffff rc 0/-1 03:31:50:LustreError: 32643:0:(ldlm_lib.c:2751:target_queue_recovery_request()) Skipped 90 previous similar messages 03:31:50:Lustre: lustre-MDT0000: Recovery already passed deadline 37:38, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery. 03:31:50:Lustre: Skipped 490 previous similar messages 03:31:50:mdt_out00_001 invoked oom-killer: gfp_mask=0x80d2, order=0, oom_adj=0, oom_score_adj=0 03:31:50:mdt_out00_001 cpuset=/ mems_allowed=0 03:31:50:Pid: 32649, comm: mdt_out00_001 Not tainted 2.6.32-573.12.1.el6_lustre.g93f956d.x86_64 #1 03:31:50:Call Trace: 03:31:50: [<ffffffff810d71a1>] ? cpuset_print_task_mems_allowed+0x91/0xb0 03:31:50: [<ffffffff8112a9a0>] ? dump_header+0x90/0x1b0 03:31:50: [<ffffffff8112ab0e>] ? check_panic_on_oom+0x4e/0x80 03:31:50: [<ffffffff8112b1fb>] ? out_of_memory+0x1bb/0x3c0 03:31:50: [<ffffffff81137c3c>] ? __alloc_pages_nodemask+0x93c/0x950 03:31:50: [<ffffffff81161449>] ? __vmalloc_area_node+0xb9/0x190 03:31:50: [<ffffffffa07ab7ae>] ? ptlrpc_grow_req_bufs+0x33e/0x710 [ptlrpc] 03:31:50: [<ffffffff8116131d>] ? __vmalloc_node+0xad/0x120 03:31:50: [<ffffffffa07ab7ae>] ? ptlrpc_grow_req_bufs+0x33e/0x710 [ptlrpc] 03:31:50: [<ffffffff81161739>] ? vzalloc_node+0x29/0x30 03:31:50: [<ffffffffa07ab7ae>] ? ptlrpc_grow_req_bufs+0x33e/0x710 [ptlrpc] 03:31:50: [<ffffffffa07ae08d>] ? ptlrpc_main+0x114d/0x1800 [ptlrpc] 03:31:50: [<ffffffffa07acf40>] ? ptlrpc_main+0x0/0x1800 [ptlrpc] 03:31:50: [<ffffffff810a0fce>] ? kthread+0x9e/0xc0 03:31:50: [<ffffffff8100c28a>] ? child_rip+0xa/0x20 03:31:50: [<ffffffff810a0f30>] ? kthread+0x0/0xc0 03:31:50: [<ffffffff8100c280>] ? child_rip+0x0/0x20 03:31:50:Mem-Info: 03:31:50:Node 0 DMA per-cpu: 03:31:50:CPU 0: hi: 0, btch: 1 usd: 0 03:31:50:Node 0 DMA32 per-cpu: 03:31:50:CPU 0: hi: 186, btch: 31 usd: 30 03:31:50:active_anon:1793 inactive_anon:1851 isolated_anon:0 03:31:50: active_file:49 inactive_file:65 isolated_file:0 03:31:50: unevictable:0 dirty:2 writeback:1856 unstable:0 03:31:50: free:13242 slab_reclaimable:1623 slab_unreclaimable:336645 03:31:50: mapped:7 shmem:0 pagetables:566 bounce:0 03:31:50:Node 0 DMA free:8348kB min:332kB low:412kB high:496kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15348kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:7168kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes 03:31:50:lowmem_reserve[]: 0 2004 2004 2004 03:31:50:Node 0 DMA32 free:44620kB min:44720kB low:55900kB high:67080kB active_anon:7172kB inactive_anon:7404kB active_file:196kB inactive_file:260kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2052308kB mlocked:0kB dirty:8kB writeback:7424kB mapped:28kB shmem:0kB slab_reclaimable:6492kB slab_unreclaimable:1339412kB kernel_stack:3104kB pagetables:2264kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:3904 all_unreclaimable? 
no 03:31:50:lowmem_reserve[]: 0 0 0 0 03:31:50:Node 0 DMA: 1*4kB 1*8kB 1*16kB 0*32kB 0*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 2*2048kB 1*4096kB = 8348kB 03:31:50:Node 0 DMA32: 1659*4kB 974*8kB 457*16kB 265*32kB 125*64kB 44*128kB 1*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 44620kB 03:31:50:1969 total pagecache pages 03:31:50:1861 pages in swap cache 03:31:50:Swap cache stats: add 1861, delete 0, find 0/0 03:31:50:Free swap = 4121320kB 03:31:50:Total swap = 4128764kB 03:31:50:524284 pages RAM 03:31:50:43736 pages reserved 03:31:50:138 pages shared 03:31:50:139521 pages non-shared 03:31:50:[ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name 03:31:50:[ 421] 0 421 2663 19 0 -17 -1000 udevd 03:31:50:[ 708] 0 708 2664 13 0 -17 -1000 udevd 03:31:50:[ 1107] 0 1107 6902 37 0 -17 -1000 auditd 03:31:50:[ 1141] 0 1141 62272 249 0 0 0 rsyslogd 03:31:50:[ 1192] 32 1192 4744 15 0 0 0 rpcbind 03:31:50:[ 1214] 29 1214 5837 1 0 0 0 rpc.statd 03:31:50:[ 1248] 81 1248 6418 1 0 0 0 dbus-daemon 03:31:50:[ 1270] 0 1270 53919 26 0 0 0 ypbind 03:31:50:[ 1302] 0 1302 47233 1 0 0 0 cupsd 03:31:50:[ 1358] 0 1358 1020 0 0 0 0 acpid 03:31:50:[ 1370] 68 1370 10520 81 0 0 0 hald 03:31:50:[ 1371] 0 1371 5099 1 0 0 0 hald-runner 03:31:50:[ 1403] 0 1403 5629 1 0 0 0 hald-addon-inpu 03:31:50:[ 1413] 68 1413 4501 1 0 0 0 hald-addon-acpi 03:31:50:[ 1459] 0 1459 26827 0 0 0 0 rpc.rquotad 03:31:50:[ 1464] 0 1464 5417 0 0 0 0 rpc.mountd 03:31:50:[ 1508] 0 1508 6292 1 0 0 0 rpc.idmapd 03:31:50:[ 1545] 498 1545 57325 156 0 0 0 munged 03:31:50:[ 8121] 0 8121 16555 0 0 -17 -1000 sshd 03:31:50:[ 8132] 0 8132 5429 24 0 0 0 xinetd 03:31:50:[ 8164] 0 8164 22210 291 0 0 0 sendmail 03:31:50:[ 8173] 51 8173 20073 364 0 0 0 sendmail 03:31:50:[ 8201] 0 8201 29216 152 0 0 0 crond 03:31:50:[ 8219] 0 8219 5276 44 0 0 0 atd 03:31:50:[ 8232] 0 8232 1020 22 0 0 0 agetty 03:31:50:[ 8234] 0 8234 1016 22 0 0 0 mingetty 03:31:50:[ 8236] 0 8236 1016 22 0 0 0 mingetty 03:31:50:[ 8238] 0 8238 1016 22 0 0 0 mingetty 03:31:50:[ 8240] 0 8240 2662 19 0 -17 -1000 udevd 03:31:50:[ 8241] 0 8241 1016 21 0 0 0 mingetty 03:31:50:[ 8243] 0 8243 1016 21 0 0 0 mingetty 03:31:50:[ 8245] 0 8245 1016 21 0 0 0 mingetty 03:31:50:[ 9665] 38 9665 7689 157 0 0 0 ntpd 03:31:50:Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled
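The panic itself comes from vm.panic_on_oom being enabled on the test nodes ("system-wide panic_on_oom is enabled" above), and the Mem-Info dump shows nearly all of the node's ~2 GB of RAM tied up in unreclaimable slab (slab_unreclaimable: 336645 pages, roughly 1.3 GiB), with the failed allocation coming from ptlrpc_grow_req_bufs(). A quick way to confirm both on a live MDS, as a sketch rather than anything the test framework does:

# Is the OOM killer configured to panic the node?
sysctl vm.panic_on_oom

# Watch unreclaimable slab while recovery is stuck; the dump above shows
# 336645 pages, i.e. roughly 1.3 GiB:
grep SUnreclaim /proc/meminfo
echo $((336645 * 4096 / 1024 / 1024)) MiB    # prints 1315 MiB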
From the other MDS's console:
03:01:01:LustreError: 11053:0:(import.c:675:ptlrpc_connect_import()) already connecting 03:01:01:LustreError: 11053:0:(import.c:675:ptlrpc_connect_import()) Skipped 5 previous similar messages 03:01:01:INFO: task mdt_out00_004:7712 blocked for more than 120 seconds. 03:01:01: Not tainted 2.6.32-573.12.1.el6_lustre.g93f956d.x86_64 #1 03:01:01:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 03:01:01:mdt_out00_004 D 0000000000000000 0 7712 2 0x00000080 03:01:01: ffff880059c4bb50 0000000000000046 0000000000000000 0000000000000000 03:01:01: ffff88004abd30c0 ffff880050b00118 ffff880052841000 ffff880050b00118 03:01:01: ffff880059c4bb50 ffffffffa05804cf ffff88004df4f068 ffff880059c4bfd8 03:01:01:Call Trace: 03:01:01: [<ffffffffa05804cf>] ? lu_object_find_try+0x9f/0x260 [obdclass] 03:01:01: [<ffffffffa05806cd>] lu_object_find_at+0x3d/0xe0 [obdclass] 03:01:01: [<ffffffffa07c3352>] ? __req_capsule_get+0x162/0x6e0 [ptlrpc] 03:01:01: [<ffffffff810672b0>] ? default_wake_function+0x0/0x20 03:01:01: [<ffffffffa0798dd0>] ? lustre_swab_object_update_reply+0x0/0xc0 [ptlrpc] 03:01:01: [<ffffffffa05814dc>] dt_locate_at+0x1c/0xa0 [obdclass] 03:01:01: [<ffffffffa0809980>] out_handle+0x1030/0x1880 [ptlrpc] 03:01:01: [<ffffffff8153a65e>] ? mutex_lock+0x1e/0x50 03:01:01: [<ffffffffa07f91da>] ? req_can_reconstruct+0x6a/0x120 [ptlrpc] 03:01:01: [<ffffffffa0800c3c>] tgt_request_handle+0x8ec/0x1440 [ptlrpc] 03:01:01: [<ffffffffa07adc61>] ptlrpc_main+0xd21/0x1800 [ptlrpc] 03:01:01: [<ffffffffa07acf40>] ? ptlrpc_main+0x0/0x1800 [ptlrpc] 03:01:01: [<ffffffff810a0fce>] kthread+0x9e/0xc0 03:01:01: [<ffffffff8100c28a>] child_rip+0xa/0x20 03:01:01: [<ffffffff810a0f30>] ? kthread+0x0/0xc0 03:01:01: [<ffffffff8100c280>] ? child_rip+0x0/0x20
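This MDS is not out of memory; instead an mdt_out00 service thread has been blocked for more than 120 seconds in lu_object_find_at() while handling an OUT request (out_handle()), which is what the hung-task watchdog reports. If this state is caught live, a sketch of how one might capture it (standard kernel and Lustre facilities, not specific to this test; the log path is just an example):

# Watchdog threshold in effect (the message above fired at 120 s):
cat /proc/sys/kernel/hung_task_timeout_secs

# Dump stacks of all blocked (D-state) tasks to the console, then save
# the Lustre debug log:
echo w > /proc/sysrq-trigger
lctl dk > /tmp/hung_task_debug.log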
There are two recent failures:
2016-02-16 09:12:14 - https://testing.hpdd.intel.com/test_sets/2485a3b0-d4d2-11e5-9e3f-5254006e85c2
2016-02-16 16:13:20 - https://testing.hpdd.intel.com/test_sets/428f6ebc-d50f-11e5-8530-5254006e85c2