Details
Bug
Resolution: Duplicate
Minor
None
Lustre 2.7.0
None
lustre-master build # 2733 RHEL6
3
16600
Description
This issue was created by maloo for sarah <sarah@whamcloud.com>
This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/ba0b8798-6902-11e4-9d25-5254006e85c2.
The sub-test test_failover_mds failed with the following error:
test_failover_mds returned 4
client console
03:26:42:Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
03:26:42:Lustre: DEBUG MARKER: rc=$(lctl get_param -n catastrophe);
03:26:42: if [ $rc -ne 0 ]; then echo $(hostname): $rc; fi
03:26:42: exit $rc
03:26:42:Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
03:26:42:Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 has failed over 4 times, and counting...
03:26:42:Lustre: DEBUG MARKER: mds1 has failed over 4 times, and counting...
03:26:42:Lustre: 2188:0:(client.c:1947:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1415445951/real 1415445967] req@ffff880032a59800 x1484198832385960/t0(0) o400->lustre-MDT0000-mdc-ffff88007a10bc00@10.2.4.189@tcp:12/10 lens 224/224 e 0 to 1 dl 1415445958 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
03:26:42:Lustre: 2188:0:(client.c:1947:ptlrpc_expire_one_request()) Skipped 1 previous similar message
03:31:34:Lustre: Evicted from MGS (at 10.2.4.185@tcp) after server handle changed from 0xd9f20fa7646c5b6b to 0x11b90fceb8d34e70
03:31:34:Lustre: MGC10.2.4.185@tcp: Connection restored to MGS (at 10.2.4.185@tcp)
03:31:34:LustreError: 2187:0:(client.c:2817:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@ffff88005bc46c00 x1484198830088128/t17179982789(17179982789) o101->lustre-MDT0000-mdc-ffff88007a10bc00@10.2.4.185@tcp:12/10 lens 704/544 e 0 to 0 dl 1415446038 ref 2 fl Interpret:RP/4/0 rc 301/301
03:31:34:Lustre: lustre-MDT0000-mdc-ffff88007a10bc00: Connection restored to lustre-MDT0000 (at 10.2.4.185@tcp)
03:31:34:ntpd invoked oom-killer: gfp_mask=0x200da, order=0, oom_adj=0, oom_score_adj=0
03:31:34:ntpd cpuset=/ mems_allowed=0
03:31:34:Pid: 1849, comm: ntpd Not tainted 2.6.32-431.29.2.el6.x86_64 #1
03:31:34:Call Trace:
03:31:34: [<ffffffff810d0361>] ? cpuset_print_task_mems_allowed+0x91/0xb0
03:31:34: [<ffffffff81122730>] ? dump_header+0x90/0x1b0
03:31:34: [<ffffffff8112289e>] ? check_panic_on_oom+0x4e/0x80
03:31:34: [<ffffffff81122f8b>] ? out_of_memory+0x1bb/0x3c0
03:31:34: [<ffffffff8112f90f>] ? __alloc_pages_nodemask+0x89f/0x8d0
03:31:34: [<ffffffff8116799a>] ? alloc_pages_vma+0x9a/0x150
03:31:34: [<ffffffff8115b622>] ? read_swap_cache_async+0xf2/0x160
03:31:34: [<ffffffff8115c149>] ? valid_swaphandles+0x69/0x150
03:31:34: [<ffffffff8115b717>] ? swapin_readahead+0x87/0xc0
03:31:34: [<ffffffff8114a9bd>] ? handle_pte_fault+0x6dd/0xb00
03:31:34: [<ffffffff81060aa3>] ? perf_event_task_sched_out+0x33/0x70
03:31:34: [<ffffffff8114b00a>] ? handle_mm_fault+0x22a/0x300
03:31:34: [<ffffffff8104a8d8>] ? __do_page_fault+0x138/0x480
03:31:34: [<ffffffff811a0870>] ? pollwake+0x0/0x60
03:31:34: [<ffffffff811a0870>] ? pollwake+0x0/0x60
03:31:34: [<ffffffff811a0870>] ? pollwake+0x0/0x60
03:31:34: [<ffffffff8152e99e>] ? do_page_fault+0x3e/0xa0
03:31:34: [<ffffffff8152bd55>] ? page_fault+0x25/0x30
03:31:34: [<ffffffff8128dea6>] ? copy_user_generic_unrolled+0x86/0xb0
03:31:34: [<ffffffff810129de>] ? copy_user_generic+0xe/0x20
03:31:34: [<ffffffff811a0589>] ? set_fd_set+0x49/0x60
03:31:34: [<ffffffff811a1a4c>] ? core_sys_select+0x1bc/0x2c0
03:31:34: [<ffffffff8103f9d8>] ? pvclock_clocksource_read+0x58/0xd0
03:31:34: [<ffffffff8103f9d8>] ? pvclock_clocksource_read+0x58/0xd0
03:31:34: [<ffffffff8103ea6c>] ? kvm_clock_read+0x1c/0x20
03:31:34: [<ffffffff8103ea79>] ? kvm_clock_get_cycles+0x9/0x10
03:31:34: [<ffffffff810a6d31>] ? ktime_get_ts+0xb1/0xf0
03:31:34: [<ffffffff811a1da7>] ? sys_select+0x47/0x110
03:31:34: [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
03:31:34:Mem-Info:
03:31:34:Node 0 DMA per-cpu:
03:31:34:CPU 0: hi: 0, btch: 1 usd: 0
03:31:34:CPU 1: hi: 0, btch: 1 usd: 0
03:31:34:Node 0 DMA32 per-cpu:
03:31:34:CPU 0: hi: 186, btch: 31 usd: 91
03:31:34:CPU 1: hi: 186, btch: 31 usd: 41
03:31:34:active_anon:17 inactive_anon:0 isolated_anon:0
03:31:34: active_file:203782 inactive_file:205446 isolated_file:0
03:31:34: unevictable:0 dirty:440 writeback:30040 unstable:0
03:31:34: free:12464 slab_reclaimable:2779 slab_unreclaimable:42062
03:31:34: mapped:1935 shmem:0 pagetables:1109 bounce:0
03:31:34:Node 0 DMA free:8352kB min:332kB low:412kB high:496kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:6552kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15348kB mlocked:0kB dirty:0kB writeback:6552kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:840kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:11072 all_unreclaimable? yes
03:31:34:lowmem_reserve[]: 0 2004 2004 2004
03:31:34:Node 0 DMA32 free:41504kB min:44720kB low:55900kB high:67080kB active_anon:68kB inactive_anon:0kB active_file:815128kB inactive_file:815232kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2052308kB mlocked:0kB dirty:1760kB writeback:113608kB mapped:7740kB shmem:0kB slab_reclaimable:11116kB slab_unreclaimable:167408kB kernel_stack:1464kB pagetables:4436kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:2961178 all_unreclaimable? yes
03:31:34:lowmem_reserve[]: 0 0 0 0
03:31:34:Node 0 DMA: 8*4kB 60*8kB 40*16kB 25*32kB 12*64kB 0*128kB 0*256kB 1*512kB 1*1024kB 2*2048kB 0*4096kB = 8352kB
03:31:34:Node 0 DMA32: 322*4kB 77*8kB 51*16kB 34*32kB 15*64kB 5*128kB 5*256kB 4*512kB 4*1024kB 0*2048kB 7*4096kB = 41504kB
03:31:34:227493 total pagecache pages
03:31:34:0 pages in swap cache
03:31:34:Swap cache stats: add 6524, delete 6524, find 2262/2343
03:31:34:Free swap = 2703532kB
03:31:34:Total swap = 2725884kB
03:31:34:524284 pages RAM
03:31:34:43694 pages reserved
03:31:34:448398 pages shared
03:31:34:234817 pages non-shared
03:31:34:[ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
03:31:34:[ 373] 0 373 2721 80 1 -17 -1000 udevd
03:31:34:[ 1019] 0 1019 2280 35 0 0 0 dhclient
03:31:34:[ 1071] 0 1071 6916 132 1 -17 -1000 auditd
03:31:34:[ 1087] 0 1087 63854 236 1 0 0 rsyslogd
03:31:34:[ 1116] 0 1116 2705 103 1 0 0 irqbalance
03:31:34:[ 1130] 32 1130 4744 157 1 0 0 rpcbind
03:31:34:[ 1142] 0 1142 49913 365 1 0 0 sssd
03:31:34:[ 1143] 0 1143 64328 866 1 0 0 sssd_be
03:31:34:[ 1144] 0 1144 50478 506 1 0 0 sssd_nss
03:31:34:[ 1145] 0 1145 48029 495 0 0 0 sssd_pam
03:31:34:[ 1146] 0 1146 47507 503 0 0 0 sssd_ssh
03:31:34:[ 1147] 0 1147 52608 486 0 0 0 sssd_pac
03:31:34:[ 1164] 29 1164 6357 215 1 0 0 rpc.statd
03:31:34:[ 1278] 81 1278 5871 139 0 0 0 dbus-daemon
03:31:34:[ 1316] 0 1316 1020 125 1 0 0 acpid
03:31:34:[ 1325] 68 1325 9921 331 1 0 0 hald
03:31:34:[ 1326] 0 1326 5081 248 1 0 0 hald-runner
03:31:34:[ 1358] 0 1358 5611 238 1 0 0 hald-addon-inpu
03:31:34:[ 1368] 68 1368 4483 236 1 0 0 hald-addon-acpi
03:31:34:[ 1388] 0 1388 168326 554 1 0 0 automount
03:31:34:[ 1434] 0 1434 26827 29 0 0 0 rpc.rquotad
03:31:34:[ 1438] 0 1438 5414 87 0 0 0 rpc.mountd
03:31:34:[ 1474] 0 1474 5773 86 1 0 0 rpc.idmapd
03:31:34:[ 1505] 496 1505 56785 294 1 0 0 munged
03:31:34:[ 1520] 0 1520 16656 100 0 -17 -1000 sshd
03:31:34:[ 1528] 0 1528 5545 179 1 0 0 xinetd
03:31:34:[ 1612] 0 1612 20846 610 1 0 0 master
03:31:34:[ 1620] 89 1620 20866 568 1 0 0 pickup
03:31:34:[ 1622] 89 1622 20909 569 1 0 0 qmgr
03:31:34:[ 1635] 0 1635 29324 153 1 0 0 crond
03:31:34:[ 1646] 0 1646 5385 76 0 0 0 atd
03:31:34:[ 1672] 0 1672 15585 146 0 0 0 certmonger
03:31:34:[ 1686] 0 1686 1020 133 1 0 0 agetty
03:31:34:[ 1687] 0 1687 1016 121 1 0 0 mingetty
03:31:34:[ 1689] 0 1689 1016 121 1 0 0 mingetty
03:31:34:[ 1691] 0 1691 1016 121 1 0 0 mingetty
03:31:34:[ 1693] 0 1693 1016 121 1 0 0 mingetty
03:31:34:[ 1695] 0 1695 1016 121 1 0 0 mingetty
03:31:34:[ 1697] 0 1697 1016 121 1 0 0 mingetty
03:31:34:[ 1701] 0 1701 2720 80 1 -17 -1000 udevd
03:31:34:[ 1702] 0 1702 2720 76 0 -17 -1000 udevd
03:31:34:[ 1849] 38 1849 8205 376 1 0 0 ntpd
03:31:34:[ 3864] 0 3864 15919 354 0 0 0 in.mrshd
03:31:34:[ 3870] 0 3870 26515 292 1 0 0 bash
03:31:34:[ 3892] 0 3892 26515 115 0 0 0 bash
03:31:34:[ 3893] 0 3893 26839 292 0 0 0 run_dd.sh
03:31:34:[ 6142] 0 6142 4346 97 0 0 0 anacron
03:31:34:[ 8144] 0 8144 26295 138 1 0 0 dd
03:31:34:Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled
03:31:34:
03:31:34:Pid: 1849, comm: ntpd Not tainted 2.6.32-431.29.2.el6.x86_64 #1
03:31:34:Call Trace:
03:31:34: [<ffffffff8152873c>] ? panic+0xa7/0x16f
03:31:34: [<ffffffff81122831>] ? dump_header+0x191/0x1b0
03:31:34: [<ffffffff811228cc>] ? check_panic_on_oom+0x7c/0x80
03:31:34: [<ffffffff81122f8b>] ? out_of_memory+0x1bb/0x3c0
03:31:34: [<ffffffff8112f90f>] ? __alloc_pages_nodemask+0x89f/0x8d0
03:31:34: [<ffffffff8116799a>] ? alloc_pages_vma+0x9a/0x150
03:31:34: [<ffffffff8115b622>] ? read_swap_cache_async+0xf2/0x160
03:31:34: [<ffffffff8115c149>] ? valid_swaphandles+0x69/0x150
03:31:34: [<ffffffff8115b717>] ? swapin_readahead+0x87/0xc0
03:31:34: [<ffffffff8114a9bd>] ? handle_pte_fault+0x6dd/0xb00
03:31:34: [<ffffffff81060aa3>] ? perf_event_task_sched_out+0x33/0x70
03:31:34: [<ffffffff8114b00a>] ? handle_mm_fault+0x22a/0x300
03:31:34: [<ffffffff8104a8d8>] ? __do_page_fault+0x138/0x480
03:31:34: [<ffffffff811a0870>] ? pollwake+0x0/0x60
03:31:34: [<ffffffff811a0870>] ? pollwake+0x0/0x60
03:31:34: [<ffffffff811a0870>] ? pollwake+0x0/0x60
03:31:34: [<ffffffff8152e99e>] ? do_page_fault+0x3e/0xa0
03:31:34: [<ffffffff8152bd55>] ? page_fault+0x25/0x30
03:31:34: [<ffffffff8128dea6>] ? copy_user_generic_unrolled+0x86/0xb0
03:31:34: [<ffffffff810129de>] ? copy_user_generic+0xe/0x20
03:31:34: [<ffffffff811a0589>] ? set_fd_set+0x49/0x60
03:31:34: [<ffffffff811a1a4c>] ? core_sys_select+0x1bc/0x2c0
03:31:34: [<ffffffff8103f9d8>] ? pvclock_clocksource_read+0x58/0xd0
03:31:34: [<ffffffff8103f9d8>] ? pvclock_clocksource_read+0x58/0xd0
03:31:34: [<ffffffff8103ea6c>] ? kvm_clock_read+0x1c/0x20
03:31:34: [<ffffffff8103ea79>] ? kvm_clock_get_cycles+0x9/0x10
03:31:34: [<ffffffff810a6d31>] ? ktime_get_ts+0xb1/0xf0
03:31:34: [<ffffffff811a1da7>] ? sys_select+0x47/0x110
03:31:34: [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
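For triage, a console dump like the one above can be split back into individual entries and scanned for the panic line. The following is a minimal sketch, not part of the test framework: the helper names split_console and first_panic are hypothetical, and it assumes every entry starts with an HH:MM:SS: prefix that never occurs inside a message body.

```python
import re

# Assumption: each console entry begins with "HH:MM:SS:" and that pattern
# does not appear inside the message text itself.
STAMP = r"(?<!\d)\d{2}:\d{2}:\d{2}:"
ENTRY = re.compile(
    r"(?<!\d)(\d{2}:\d{2}:\d{2}):(.*?)(?=\s*" + STAMP + r"|\Z)",
    re.S,
)


def split_console(text):
    """Return a list of (timestamp, message) pairs from a flattened dump."""
    return [(m.group(1), m.group(2).strip()) for m in ENTRY.finditer(text)]


def first_panic(entries):
    """Return the first entry whose message contains 'Kernel panic', or None."""
    for stamp, message in entries:
        if "Kernel panic" in message:
            return stamp, message
    return None


if __name__ == "__main__":
    sample = (
        "03:31:34:ntpd invoked oom-killer: gfp_mask=0x200da, order=0, "
        "oom_adj=0, oom_score_adj=0 03:31:34:ntpd cpuset=/ mems_allowed=0 "
        "03:31:34:Kernel panic - not syncing: Out of memory: system-wide "
        "panic_on_oom is enabled 03:31:34: 03:31:34:Pid: 1849, comm: ntpd "
        "Not tainted 2.6.32-431.29.2.el6.x86_64 #1"
    )
    for stamp, message in split_console(sample):
        print(stamp, message)
    print(first_panic(split_console(sample)))
```

Entries with an empty message (such as the blank line after the panic) are kept, so the entry count still matches the original console output.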
Info required for matching: recovery-mds-scale failover_mds