Details
- Type: Bug
- Resolution: Duplicate
- Priority: Minor
- None
- Affects Version/s: Lustre 2.7.0
- None
- Environment: client and server: lustre-master build # 2639 ldiskfs
- Severity: 3
- Rank: 15553
Description
This issue was created by maloo for sarah <sarah@whamcloud.com>
This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/038bb078-3082-11e4-9e60-5254006e85c2.
The sub-test test_failover_mds failed with the following error:
test_failover_mds returned 254
This failure causes all of the following tests to fail.
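For reference, a minimal reproduction sketch, assuming the standard in-tree lustre/tests layout and the auster wrapper used by these autotest runs; the DURATION/PERIOD values are taken from the log below, and the cluster/node configuration is whatever the local setup provides:

  # Hedged sketch only; adjust paths and cluster config for the actual test nodes.
  cd /usr/lib64/lustre/tests
  DURATION=86400 PERIOD=1200 ./auster -v recovery-mds-scale --only failover_mds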
06:07:55:Lustre: DEBUG MARKER: == recovery-mds-scale test failover_mds: failover MDS == 06:07:50 (1409404070) 06:07:55:Lustre: DEBUG MARKER: PATH=/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/openmpi/bin:/usr/bin:/bin: 06:09:06:Lustre: DEBUG MARKER: /usr/sbin/lctl mark Started client load: dd on onyx-53vm5 06:09:06:Lustre: DEBUG MARKER: Started client load: dd on onyx-53vm5 06:09:06:Lustre: DEBUG MARKER: /usr/sbin/lctl mark Started client load: tar on onyx-53vm6 06:09:06:Lustre: DEBUG MARKER: Started client load: tar on onyx-53vm6 06:09:06:Lustre: DEBUG MARKER: cat /tmp/client-load.pid 06:09:06:Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=0 DURATION=86400 PERIOD=1200 06:09:06:Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=0 DURATION=86400 PERIOD=1200 06:09:06:Lustre: DEBUG MARKER: rc=$([ -f /proc/sys/lnet/catastrophe ] && 06:09:06: echo $(< /proc/sys/lnet/catastrophe) || echo 0); 06:09:06: if [ $rc -ne 0 ]; then echo $(hostname): $rc; fi 06:09:06: exit $rc 06:09:06:Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh 06:09:06:Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before doing next failover... 06:09:06:Lustre: DEBUG MARKER: Wait mds1 recovery complete before doing next failover... 06:09:06:Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL state before doing next failover... 06:09:06:Lustre: DEBUG MARKER: Checking clients are in FULL state before doing next failover... 06:09:06:Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u 06:09:06:Lustre: DEBUG MARKER: lctl get_param -n at_max 06:09:06:Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec 06:09:06:Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec 06:09:06:Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec 06:09:06:Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec 06:09:06:Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec 06:09:06:Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec 06:09:06:Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1 06:09:06:Lustre: DEBUG MARKER: Starting failover on mds1 06:09:06:Lustre: 2189:0:(client.c:1932:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1409404104/real 1409404104] req@ffff88006cc99c00 x1477867250128216/t0(0) o400->lustre-MDT0000-mdc-ffff88007c826800@10.2.5.48@tcp:12/10 lens 224/224 e 0 to 1 dl 1409404111 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 06:09:06:Lustre: lustre-MDT0000-mdc-ffff88007c826800: Connection to lustre-MDT0000 (at 10.2.5.48@tcp) was lost; in progress operations using this service will wait for recovery to complete 06:09:06:Lustre: 2188:0:(client.c:1932:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1409404111/real 1409404111] 
req@ffff88005fdba400 x1477867250236752/t0(0) o38->lustre-MDT0000-mdc-ffff88007c826800@10.2.5.48@tcp:12/10 lens 400/544 e 0 to 1 dl 1409404117 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 06:09:06:Lustre: 2188:0:(client.c:1932:ptlrpc_expire_one_request()) Skipped 4 previous similar messages 06:09:06:Lustre: 2188:0:(client.c:1932:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1409404121/real 1409404121] req@ffff88005c229000 x1477867250357216/t0(0) o38->lustre-MDT0000-mdc-ffff88007c826800@10.2.5.52@tcp:12/10 lens 400/544 e 0 to 1 dl 1409404127 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 06:09:06:Lustre: 2189:0:(client.c:1932:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1409404109/real 1409404109] req@ffff880064907000 x1477867250198736/t0(0) o400->lustre-MDT0000-mdc-ffff88007c826800@10.2.5.48@tcp:12/10 lens 224/224 e 0 to 1 dl 1409404116 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 06:09:06:LustreError: 166-1: MGC10.2.5.48@tcp: Connection to MGS (at 10.2.5.48@tcp) was lost; in progress operations using this service will fail 06:18:40:Lustre: 2188:0:(client.c:1932:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1409404136/real 1409404136] req@ffff88005380f000 x1477867250510460/t0(0) o250->MGC10.2.5.48@tcp@10.2.5.48@tcp:26/25 lens 400/544 e 0 to 1 dl 1409404142 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 06:18:40:Lustre: 2188:0:(client.c:1932:ptlrpc_expire_one_request()) Skipped 2 previous similar messages 06:18:40:Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK 06:18:40:Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK 06:18:40:Lustre: Evicted from MGS (at 10.2.5.52@tcp) after server handle changed from 0x19171eaf73903e33 to 0x58720d9268131580 06:18:40:Lustre: MGC10.2.5.48@tcp: Connection restored to MGS (at 10.2.5.52@tcp) 06:18:40:LustreError: 2188:0:(client.c:2802:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@ffff880037ab1400 x1477867249598884/t4294967302(4294967302) o101->lustre-MDT0000-mdc-ffff88007c826800@10.2.5.52@tcp:12/10 lens 704/544 e 0 to 0 dl 1409404158 ref 2 fl Interpret:RP/4/0 rc 301/301 06:18:40:Lustre: DEBUG MARKER: rc=$([ -f /proc/sys/lnet/catastrophe ] && 06:18:40: echo $(< /proc/sys/lnet/catastrophe) || echo 0); 06:18:40: if [ $rc -ne 0 ]; then echo $(hostname): $rc; fi 06:18:40: exit $rc 06:18:40:Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh 06:18:40:Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 has failed over 1 times, and counting... 06:18:40:Lustre: DEBUG MARKER: mds1 has failed over 1 times, and counting... 06:18:40:Lustre: lustre-MDT0000-mdc-ffff88007c826800: Connection restored to lustre-MDT0000 (at 10.2.5.52@tcp) 06:18:40:ntpd invoked oom-killer: gfp_mask=0x200da, order=0, oom_adj=0, oom_score_adj=0 06:18:40:ntpd cpuset=/ mems_allowed=0 06:18:40:Pid: 1850, comm: ntpd Not tainted 2.6.32-431.23.3.el6.x86_64 #1 06:18:40:Call Trace: 06:18:40: [<ffffffff810d0431>] ? cpuset_print_task_mems_allowed+0x91/0xb0 06:18:40: [<ffffffff81122810>] ? dump_header+0x90/0x1b0 06:18:40: [<ffffffff8122833c>] ? security_real_capable_noaudit+0x3c/0x70 06:18:40: [<ffffffff81122c92>] ? oom_kill_process+0x82/0x2a0 06:18:40: [<ffffffff81122bd1>] ? select_bad_process+0xe1/0x120 06:18:40: [<ffffffff811230d0>] ? out_of_memory+0x220/0x3c0 06:18:40: [<ffffffff8112f9ef>] ? __alloc_pages_nodemask+0x89f/0x8d0 06:18:40: [<ffffffff811679ea>] ? alloc_pages_vma+0x9a/0x150 06:18:40: [<ffffffff8115b6c2>] ? 
read_swap_cache_async+0xf2/0x160 06:18:40: [<ffffffff8115c1e9>] ? valid_swaphandles+0x69/0x150 06:18:40: [<ffffffff8115b7b7>] ? swapin_readahead+0x87/0xc0 06:18:40: [<ffffffff8114aa8d>] ? handle_pte_fault+0x6dd/0xb00 06:18:40: [<ffffffff8114b0da>] ? handle_mm_fault+0x22a/0x300 06:18:40: [<ffffffff8104a8d8>] ? __do_page_fault+0x138/0x480 06:18:40: [<ffffffff811a08c0>] ? pollwake+0x0/0x60 06:18:40: [<ffffffff811a08c0>] ? pollwake+0x0/0x60 06:18:40: [<ffffffff811a08c0>] ? pollwake+0x0/0x60 06:18:40: [<ffffffff8152e74e>] ? do_page_fault+0x3e/0xa0 06:18:40: [<ffffffff8152bb05>] ? page_fault+0x25/0x30 06:18:40: [<ffffffff8128dee6>] ? copy_user_generic_unrolled+0x86/0xb0 06:18:40: [<ffffffff810129de>] ? copy_user_generic+0xe/0x20 06:18:40: [<ffffffff811a05d9>] ? set_fd_set+0x49/0x60 06:18:40: [<ffffffff811a1a9c>] ? core_sys_select+0x1bc/0x2c0 06:18:40: [<ffffffff8103ea6c>] ? kvm_clock_read+0x1c/0x20 06:18:40: [<ffffffff8103ea79>] ? kvm_clock_get_cycles+0x9/0x10 06:18:40: [<ffffffff8109530f>] ? queue_work+0x1f/0x30 06:18:40: [<ffffffff8103f9d8>] ? pvclock_clocksource_read+0x58/0xd0 06:18:40: [<ffffffff8103ea6c>] ? kvm_clock_read+0x1c/0x20 06:18:40: [<ffffffff8103ea79>] ? kvm_clock_get_cycles+0x9/0x10 06:18:40: [<ffffffff810a6d31>] ? ktime_get_ts+0xb1/0xf0 06:18:40: [<ffffffff811a1df7>] ? sys_select+0x47/0x110 06:18:40: [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b 06:18:40:Mem-Info: 06:18:40:Node 0 DMA per-cpu: 06:18:40:CPU 0: hi: 0, btch: 1 usd: 0 06:18:40:CPU 1: hi: 0, btch: 1 usd: 0 06:18:40:Node 0 DMA32 per-cpu: 06:18:40:CPU 0: hi: 186, btch: 31 usd: 30 06:18:40:CPU 1: hi: 186, btch: 31 usd: 0 06:18:40:active_anon:0 inactive_anon:0 isolated_anon:0 06:18:40: active_file:204319 inactive_file:204250 isolated_file:128 06:18:40: unevictable:0 dirty:305 writeback:31149 unstable:0 06:18:40: free:13260 slab_reclaimable:3544 slab_unreclaimable:41144 06:18:40: mapped:2169 shmem:0 pagetables:1113 bounce:0 06:18:40:Node 0 DMA free:8344kB min:332kB low:412kB high:496kB active_anon:0kB inactive_anon:0kB active_file:3472kB inactive_file:3236kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15348kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:692kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:10105 all_unreclaimable? yes 06:18:40:lowmem_reserve[]: 0 2004 2004 2004 06:18:40:Node 0 DMA32 free:44696kB min:44720kB low:55900kB high:67080kB active_anon:0kB inactive_anon:0kB active_file:813804kB inactive_file:813848kB unevictable:0kB isolated(anon):0kB isolated(file):384kB present:2052308kB mlocked:0kB dirty:1220kB writeback:124596kB mapped:8676kB shmem:0kB slab_reclaimable:14176kB slab_unreclaimable:163884kB kernel_stack:1432kB pagetables:4452kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:2683104 all_unreclaimable? 
no 06:18:40:lowmem_reserve[]: 0 0 0 0 06:18:40:Node 0 DMA: 2*4kB 30*8kB 18*16kB 20*32kB 8*64kB 2*128kB 1*256kB 0*512kB 2*1024kB 2*2048kB 0*4096kB = 8344kB 06:18:40:Node 0 DMA32: 802*4kB 422*8kB 236*16kB 125*32kB 60*64kB 33*128kB 19*256kB 8*512kB 5*1024kB 2*2048kB 1*4096kB = 44696kB 06:18:40:226130 total pagecache pages 06:18:40:1 pages in swap cache 06:18:40:Swap cache stats: add 6051, delete 6050, find 19/21 06:18:40:Free swap = 2701716kB 06:18:40:Total swap = 2725880kB 06:18:40:524284 pages RAM 06:18:40:43694 pages reserved 06:18:40:444123 pages shared 06:18:40:235407 pages non-shared 06:18:40:[ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name 06:18:40:[ 374] 0 374 2729 80 0 -17 -1000 udevd 06:18:40:[ 1017] 0 1017 2280 35 0 0 0 dhclient 06:18:40:[ 1067] 0 1067 6916 125 1 -17 -1000 auditd 06:18:40:[ 1083] 0 1083 63855 207 1 0 0 rsyslogd 06:18:40:[ 1112] 0 1112 2705 103 1 0 0 irqbalance 06:18:40:[ 1126] 32 1126 4744 157 1 0 0 rpcbind 06:18:40:[ 1138] 0 1138 49883 385 1 0 0 sssd 06:18:40:[ 1139] 0 1139 64845 871 1 0 0 sssd_be 06:18:40:[ 1140] 0 1140 50453 494 1 0 0 sssd_nss 06:18:40:[ 1141] 0 1141 48029 485 1 0 0 sssd_pam 06:18:40:[ 1142] 0 1142 47518 491 1 0 0 sssd_ssh 06:18:40:[ 1143] 0 1143 52608 613 1 0 0 sssd_pac 06:18:40:[ 1160] 29 1160 5837 206 1 0 0 rpc.statd 06:18:40:[ 1271] 81 1271 5871 98 1 0 0 dbus-daemon 06:18:40:[ 1309] 0 1309 1020 125 1 0 0 acpid 06:18:40:[ 1318] 68 1318 9921 387 1 0 0 hald 06:18:40:[ 1319] 0 1319 5081 274 1 0 0 hald-runner 06:18:40:[ 1351] 0 1351 5611 263 1 0 0 hald-addon-inpu 06:18:40:[ 1361] 68 1361 4483 239 1 0 0 hald-addon-acpi 06:18:40:[ 1381] 0 1381 168325 584 1 0 0 automount 06:18:40:[ 1427] 0 1427 26827 32 0 0 0 rpc.rquotad 06:18:40:[ 1431] 0 1431 5414 88 0 0 0 rpc.mountd 06:18:40:[ 1467] 0 1467 5773 79 1 0 0 rpc.idmapd 06:18:40:[ 1498] 496 1498 56782 288 0 0 0 munged 06:18:40:[ 1513] 0 1513 16653 99 0 -17 -1000 sshd 06:18:40:[ 1521] 0 1521 5545 175 0 0 0 xinetd 06:18:40:[ 1605] 0 1605 20843 605 1 0 0 master 06:18:40:[ 1614] 89 1614 20863 576 1 0 0 pickup 06:18:40:[ 1615] 89 1615 20906 564 0 0 0 qmgr 06:18:40:[ 1628] 0 1628 29325 154 1 0 0 crond 06:18:40:[ 1639] 0 1639 5385 72 0 0 0 atd 06:18:40:[ 1653] 0 1653 15582 145 1 0 0 certmonger 06:18:40:[ 1667] 0 1667 1020 137 1 0 0 agetty 06:18:40:[ 1668] 0 1668 1016 124 1 0 0 mingetty 06:18:40:[ 1670] 0 1670 1016 124 1 0 0 mingetty 06:18:40:[ 1672] 0 1672 1016 124 1 0 0 mingetty 06:18:40:[ 1674] 0 1674 1016 124 1 0 0 mingetty 06:18:40:[ 1676] 0 1676 2728 80 0 -17 -1000 udevd 06:18:40:[ 1677] 0 1677 2728 76 0 -17 -1000 udevd 06:18:40:[ 1678] 0 1678 1016 124 0 0 0 mingetty 06:18:40:[ 1680] 0 1680 1016 124 1 0 0 mingetty 06:18:40:[ 1780] 0 1780 4872 176 0 0 0 anacron 06:18:40:[ 1850] 38 1850 8202 374 0 0 0 ntpd 06:18:40:[ 3848] 0 3848 15919 353 1 0 0 in.mrshd 06:18:40:[ 3850] 0 3850 26515 293 1 0 0 bash 06:18:40:[ 3892] 0 3892 26515 117 1 0 0 bash 06:18:40:[ 3893] 0 3893 26839 293 0 0 0 run_dd.sh 06:18:40:[ 4981] 0 4981 26295 142 1 0 0 dd 06:18:40:Out of memory: Kill process 1017 (dhclient) score 1 or sacrifice child 06:18:40:Killed process 1017, UID 0, (dhclient) total-vm:9120kB, anon-rss:0kB, file-rss:140kB 06:18:40:ntpd invoked oom-killer: gfp_mask=0x200da, order=0, oom_adj=0, oom_score_adj=0 06:18:40:ntpd cpuset=/ mems_allowed=0 06:18:40:Pid: 1850, comm: ntpd Not tainted 2.6.32-431.23.3.el6.x86_64 #1
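The tail of the log shows the client running out of memory during the failover window (ntpd invokes the oom-killer and dhclient is killed) while a large amount of writeback data is outstanding. A small, hedged monitoring sketch (a hypothetical helper, not part of the test suite; the output path is an arbitrary choice) that could be left running on a client to capture the memory state leading up to such an OOM:

  # Hypothetical watcher: sample memory and LNet catastrophe state every 5 seconds.
  while sleep 5; do
      echo "== $(date +%T)"
      grep -E 'MemFree|Dirty|Writeback|Slab' /proc/meminfo
      [ -f /proc/sys/lnet/catastrophe ] && cat /proc/sys/lnet/catastrophe
  done >> /tmp/client-mem-watch.log 2>&1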
Attachments
Issue Links
- duplicates: LU-5483 recovery-mds-scale test failover_mds: oom failure on client (Reopened)