Lustre / LU-5574

Hard Failover failure on test suite recovery-mds-scale test_failover_mds: client OOM


Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Minor
    • None
    • Affects Version/s: Lustre 2.7.0
    • None
    • Environment: client and server: lustre-master build # 2639 ldiskfs
    • Severity: 3
    • 15553

    Description

      This issue was created by maloo for sarah <sarah@whamcloud.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/038bb078-3082-11e4-9e60-5254006e85c2.

      The sub-test test_failover_mds failed with the following error:

      test_failover_mds returned 254
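
      To reproduce this outside of autotest, a minimal sketch along the following lines should work, assuming a test config that already defines the MDS failover node and power control needed for hard failover. FAILURE_MODE, ONLY, DURATION and SERVER_FAILOVER_PERIOD are standard test-framework variables, but their names and defaults should be double-checked against the framework shipped in the build under test; 86400 and 1200 mirror the DURATION/PERIOD values visible in the log below.

          # Run only the failover_mds sub-test of recovery-mds-scale with hard failover.
          cd /usr/lib64/lustre/tests
          FAILURE_MODE=HARD DURATION=86400 SERVER_FAILOVER_PERIOD=1200 \
              ONLY=failover_mds bash ./recovery-mds-scale.sh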

      This failure causes all of the following tests to fail. Client console log:

      06:07:55:Lustre: DEBUG MARKER: == recovery-mds-scale test failover_mds: failover MDS == 06:07:50 (1409404070)
      06:07:55:Lustre: DEBUG MARKER: PATH=/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/openmpi/bin:/usr/bin:/bin:
      06:09:06:Lustre: DEBUG MARKER: /usr/sbin/lctl mark Started client load: dd on onyx-53vm5
      06:09:06:Lustre: DEBUG MARKER: Started client load: dd on onyx-53vm5
      06:09:06:Lustre: DEBUG MARKER: /usr/sbin/lctl mark Started client load: tar on onyx-53vm6
      06:09:06:Lustre: DEBUG MARKER: Started client load: tar on onyx-53vm6
      06:09:06:Lustre: DEBUG MARKER: cat /tmp/client-load.pid
      06:09:06:Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads BEFORE failover -- failure NOT OK              ELAPSED=0 DURATION=86400 PERIOD=1200
      06:09:06:Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=0 DURATION=86400 PERIOD=1200
      06:09:06:Lustre: DEBUG MARKER: rc=$([ -f /proc/sys/lnet/catastrophe ] &&
      06:09:06:		echo $(< /proc/sys/lnet/catastrophe) || echo 0);
      06:09:06:		if [ $rc -ne 0 ]; then echo $(hostname): $rc; fi
      06:09:06:		exit $rc
      06:09:06:Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
      06:09:06:Lustre: DEBUG MARKER: /usr/sbin/lctl mark Wait mds1 recovery complete before doing next failover...
      06:09:06:Lustre: DEBUG MARKER: Wait mds1 recovery complete before doing next failover...
      06:09:06:Lustre: DEBUG MARKER: /usr/sbin/lctl mark Checking clients are in FULL state before doing next failover...
      06:09:06:Lustre: DEBUG MARKER: Checking clients are in FULL state before doing next failover...
      06:09:06:Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u
      06:09:06:Lustre: DEBUG MARKER: lctl get_param -n at_max
      06:09:06:Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
      06:09:06:Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
      06:09:06:Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
      06:09:06:Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
      06:09:06:Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
      06:09:06:Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
      06:09:06:Lustre: DEBUG MARKER: /usr/sbin/lctl mark Starting failover on mds1
      06:09:06:Lustre: DEBUG MARKER: Starting failover on mds1
      06:09:06:Lustre: 2189:0:(client.c:1932:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1409404104/real 1409404104]  req@ffff88006cc99c00 x1477867250128216/t0(0) o400->lustre-MDT0000-mdc-ffff88007c826800@10.2.5.48@tcp:12/10 lens 224/224 e 0 to 1 dl 1409404111 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      06:09:06:Lustre: lustre-MDT0000-mdc-ffff88007c826800: Connection to lustre-MDT0000 (at 10.2.5.48@tcp) was lost; in progress operations using this service will wait for recovery to complete
      06:09:06:Lustre: 2188:0:(client.c:1932:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1409404111/real 1409404111]  req@ffff88005fdba400 x1477867250236752/t0(0) o38->lustre-MDT0000-mdc-ffff88007c826800@10.2.5.48@tcp:12/10 lens 400/544 e 0 to 1 dl 1409404117 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      06:09:06:Lustre: 2188:0:(client.c:1932:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
      06:09:06:Lustre: 2188:0:(client.c:1932:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1409404121/real 1409404121]  req@ffff88005c229000 x1477867250357216/t0(0) o38->lustre-MDT0000-mdc-ffff88007c826800@10.2.5.52@tcp:12/10 lens 400/544 e 0 to 1 dl 1409404127 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      06:09:06:Lustre: 2189:0:(client.c:1932:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1409404109/real 1409404109]  req@ffff880064907000 x1477867250198736/t0(0) o400->lustre-MDT0000-mdc-ffff88007c826800@10.2.5.48@tcp:12/10 lens 224/224 e 0 to 1 dl 1409404116 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      06:09:06:LustreError: 166-1: MGC10.2.5.48@tcp: Connection to MGS (at 10.2.5.48@tcp) was lost; in progress operations using this service will fail
      06:18:40:Lustre: 2188:0:(client.c:1932:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1409404136/real 1409404136]  req@ffff88005380f000 x1477867250510460/t0(0) o250->MGC10.2.5.48@tcp@10.2.5.48@tcp:26/25 lens 400/544 e 0 to 1 dl 1409404142 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      06:18:40:Lustre: 2188:0:(client.c:1932:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
      06:18:40:Lustre: DEBUG MARKER: /usr/sbin/lctl mark ==== Checking the clients loads AFTER failover -- failure NOT OK
      06:18:40:Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
      06:18:40:Lustre: Evicted from MGS (at 10.2.5.52@tcp) after server handle changed from 0x19171eaf73903e33 to 0x58720d9268131580
      06:18:40:Lustre: MGC10.2.5.48@tcp: Connection restored to MGS (at 10.2.5.52@tcp)
      06:18:40:LustreError: 2188:0:(client.c:2802:ptlrpc_replay_interpret()) @@@ status 301, old was 0  req@ffff880037ab1400 x1477867249598884/t4294967302(4294967302) o101->lustre-MDT0000-mdc-ffff88007c826800@10.2.5.52@tcp:12/10 lens 704/544 e 0 to 0 dl 1409404158 ref 2 fl Interpret:RP/4/0 rc 301/301
      06:18:40:Lustre: DEBUG MARKER: rc=$([ -f /proc/sys/lnet/catastrophe ] &&
      06:18:40:		echo $(< /proc/sys/lnet/catastrophe) || echo 0);
      06:18:40:		if [ $rc -ne 0 ]; then echo $(hostname): $rc; fi
      06:18:40:		exit $rc
      06:18:40:Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
      06:18:40:Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 has failed over 1 times, and counting...
      06:18:40:Lustre: DEBUG MARKER: mds1 has failed over 1 times, and counting...
      06:18:40:Lustre: lustre-MDT0000-mdc-ffff88007c826800: Connection restored to lustre-MDT0000 (at 10.2.5.52@tcp)
      06:18:40:ntpd invoked oom-killer: gfp_mask=0x200da, order=0, oom_adj=0, oom_score_adj=0
      06:18:40:ntpd cpuset=/ mems_allowed=0
      06:18:40:Pid: 1850, comm: ntpd Not tainted 2.6.32-431.23.3.el6.x86_64 #1
      06:18:40:Call Trace:
      06:18:40: [<ffffffff810d0431>] ? cpuset_print_task_mems_allowed+0x91/0xb0
      06:18:40: [<ffffffff81122810>] ? dump_header+0x90/0x1b0
      06:18:40: [<ffffffff8122833c>] ? security_real_capable_noaudit+0x3c/0x70
      06:18:40: [<ffffffff81122c92>] ? oom_kill_process+0x82/0x2a0
      06:18:40: [<ffffffff81122bd1>] ? select_bad_process+0xe1/0x120
      06:18:40: [<ffffffff811230d0>] ? out_of_memory+0x220/0x3c0
      06:18:40: [<ffffffff8112f9ef>] ? __alloc_pages_nodemask+0x89f/0x8d0
      06:18:40: [<ffffffff811679ea>] ? alloc_pages_vma+0x9a/0x150
      06:18:40: [<ffffffff8115b6c2>] ? read_swap_cache_async+0xf2/0x160
      06:18:40: [<ffffffff8115c1e9>] ? valid_swaphandles+0x69/0x150
      06:18:40: [<ffffffff8115b7b7>] ? swapin_readahead+0x87/0xc0
      06:18:40: [<ffffffff8114aa8d>] ? handle_pte_fault+0x6dd/0xb00
      06:18:40: [<ffffffff8114b0da>] ? handle_mm_fault+0x22a/0x300
      06:18:40: [<ffffffff8104a8d8>] ? __do_page_fault+0x138/0x480
      06:18:40: [<ffffffff811a08c0>] ? pollwake+0x0/0x60
      06:18:40: [<ffffffff811a08c0>] ? pollwake+0x0/0x60
      06:18:40: [<ffffffff811a08c0>] ? pollwake+0x0/0x60
      06:18:40: [<ffffffff8152e74e>] ? do_page_fault+0x3e/0xa0
      06:18:40: [<ffffffff8152bb05>] ? page_fault+0x25/0x30
      06:18:40: [<ffffffff8128dee6>] ? copy_user_generic_unrolled+0x86/0xb0
      06:18:40: [<ffffffff810129de>] ? copy_user_generic+0xe/0x20
      06:18:40: [<ffffffff811a05d9>] ? set_fd_set+0x49/0x60
      06:18:40: [<ffffffff811a1a9c>] ? core_sys_select+0x1bc/0x2c0
      06:18:40: [<ffffffff8103ea6c>] ? kvm_clock_read+0x1c/0x20
      06:18:40: [<ffffffff8103ea79>] ? kvm_clock_get_cycles+0x9/0x10
      06:18:40: [<ffffffff8109530f>] ? queue_work+0x1f/0x30
      06:18:40: [<ffffffff8103f9d8>] ? pvclock_clocksource_read+0x58/0xd0
      06:18:40: [<ffffffff8103ea6c>] ? kvm_clock_read+0x1c/0x20
      06:18:40: [<ffffffff8103ea79>] ? kvm_clock_get_cycles+0x9/0x10
      06:18:40: [<ffffffff810a6d31>] ? ktime_get_ts+0xb1/0xf0
      06:18:40: [<ffffffff811a1df7>] ? sys_select+0x47/0x110
      06:18:40: [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
      06:18:40:Mem-Info:
      06:18:40:Node 0 DMA per-cpu:
      06:18:40:CPU    0: hi:    0, btch:   1 usd:   0
      06:18:40:CPU    1: hi:    0, btch:   1 usd:   0
      06:18:40:Node 0 DMA32 per-cpu:
      06:18:40:CPU    0: hi:  186, btch:  31 usd:  30
      06:18:40:CPU    1: hi:  186, btch:  31 usd:   0
      06:18:40:active_anon:0 inactive_anon:0 isolated_anon:0
      06:18:40: active_file:204319 inactive_file:204250 isolated_file:128
      06:18:40: unevictable:0 dirty:305 writeback:31149 unstable:0
      06:18:40: free:13260 slab_reclaimable:3544 slab_unreclaimable:41144
      06:18:40: mapped:2169 shmem:0 pagetables:1113 bounce:0
      06:18:40:Node 0 DMA free:8344kB min:332kB low:412kB high:496kB active_anon:0kB inactive_anon:0kB active_file:3472kB inactive_file:3236kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15348kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:692kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:10105 all_unreclaimable? yes
      06:18:40:lowmem_reserve[]: 0 2004 2004 2004
      06:18:40:Node 0 DMA32 free:44696kB min:44720kB low:55900kB high:67080kB active_anon:0kB inactive_anon:0kB active_file:813804kB inactive_file:813848kB unevictable:0kB isolated(anon):0kB isolated(file):384kB present:2052308kB mlocked:0kB dirty:1220kB writeback:124596kB mapped:8676kB shmem:0kB slab_reclaimable:14176kB slab_unreclaimable:163884kB kernel_stack:1432kB pagetables:4452kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:2683104 all_unreclaimable? no
      06:18:40:lowmem_reserve[]: 0 0 0 0
      06:18:40:Node 0 DMA: 2*4kB 30*8kB 18*16kB 20*32kB 8*64kB 2*128kB 1*256kB 0*512kB 2*1024kB 2*2048kB 0*4096kB = 8344kB
      06:18:40:Node 0 DMA32: 802*4kB 422*8kB 236*16kB 125*32kB 60*64kB 33*128kB 19*256kB 8*512kB 5*1024kB 2*2048kB 1*4096kB = 44696kB
      06:18:40:226130 total pagecache pages
      06:18:40:1 pages in swap cache
      06:18:40:Swap cache stats: add 6051, delete 6050, find 19/21
      06:18:40:Free swap  = 2701716kB
      06:18:40:Total swap = 2725880kB
      06:18:40:524284 pages RAM
      06:18:40:43694 pages reserved
      06:18:40:444123 pages shared
      06:18:40:235407 pages non-shared
      06:18:40:[ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
      06:18:40:[  374]     0   374     2729       80   0     -17         -1000 udevd
      06:18:40:[ 1017]     0  1017     2280       35   0       0             0 dhclient
      06:18:40:[ 1067]     0  1067     6916      125   1     -17         -1000 auditd
      06:18:40:[ 1083]     0  1083    63855      207   1       0             0 rsyslogd
      06:18:40:[ 1112]     0  1112     2705      103   1       0             0 irqbalance
      06:18:40:[ 1126]    32  1126     4744      157   1       0             0 rpcbind
      06:18:40:[ 1138]     0  1138    49883      385   1       0             0 sssd
      06:18:40:[ 1139]     0  1139    64845      871   1       0             0 sssd_be
      06:18:40:[ 1140]     0  1140    50453      494   1       0             0 sssd_nss
      06:18:40:[ 1141]     0  1141    48029      485   1       0             0 sssd_pam
      06:18:40:[ 1142]     0  1142    47518      491   1       0             0 sssd_ssh
      06:18:40:[ 1143]     0  1143    52608      613   1       0             0 sssd_pac
      06:18:40:[ 1160]    29  1160     5837      206   1       0             0 rpc.statd
      06:18:40:[ 1271]    81  1271     5871       98   1       0             0 dbus-daemon
      06:18:40:[ 1309]     0  1309     1020      125   1       0             0 acpid
      06:18:40:[ 1318]    68  1318     9921      387   1       0             0 hald
      06:18:40:[ 1319]     0  1319     5081      274   1       0             0 hald-runner
      06:18:40:[ 1351]     0  1351     5611      263   1       0             0 hald-addon-inpu
      06:18:40:[ 1361]    68  1361     4483      239   1       0             0 hald-addon-acpi
      06:18:40:[ 1381]     0  1381   168325      584   1       0             0 automount
      06:18:40:[ 1427]     0  1427    26827       32   0       0             0 rpc.rquotad
      06:18:40:[ 1431]     0  1431     5414       88   0       0             0 rpc.mountd
      06:18:40:[ 1467]     0  1467     5773       79   1       0             0 rpc.idmapd
      06:18:40:[ 1498]   496  1498    56782      288   0       0             0 munged
      06:18:40:[ 1513]     0  1513    16653       99   0     -17         -1000 sshd
      06:18:40:[ 1521]     0  1521     5545      175   0       0             0 xinetd
      06:18:40:[ 1605]     0  1605    20843      605   1       0             0 master
      06:18:40:[ 1614]    89  1614    20863      576   1       0             0 pickup
      06:18:40:[ 1615]    89  1615    20906      564   0       0             0 qmgr
      06:18:40:[ 1628]     0  1628    29325      154   1       0             0 crond
      06:18:40:[ 1639]     0  1639     5385       72   0       0             0 atd
      06:18:40:[ 1653]     0  1653    15582      145   1       0             0 certmonger
      06:18:40:[ 1667]     0  1667     1020      137   1       0             0 agetty
      06:18:40:[ 1668]     0  1668     1016      124   1       0             0 mingetty
      06:18:40:[ 1670]     0  1670     1016      124   1       0             0 mingetty
      06:18:40:[ 1672]     0  1672     1016      124   1       0             0 mingetty
      06:18:40:[ 1674]     0  1674     1016      124   1       0             0 mingetty
      06:18:40:[ 1676]     0  1676     2728       80   0     -17         -1000 udevd
      06:18:40:[ 1677]     0  1677     2728       76   0     -17         -1000 udevd
      06:18:40:[ 1678]     0  1678     1016      124   0       0             0 mingetty
      06:18:40:[ 1680]     0  1680     1016      124   1       0             0 mingetty
      06:18:40:[ 1780]     0  1780     4872      176   0       0             0 anacron
      06:18:40:[ 1850]    38  1850     8202      374   0       0             0 ntpd
      06:18:40:[ 3848]     0  3848    15919      353   1       0             0 in.mrshd
      06:18:40:[ 3850]     0  3850    26515      293   1       0             0 bash
      06:18:40:[ 3892]     0  3892    26515      117   1       0             0 bash
      06:18:40:[ 3893]     0  3893    26839      293   0       0             0 run_dd.sh
      06:18:40:[ 4981]     0  4981    26295      142   1       0             0 dd
      06:18:40:Out of memory: Kill process 1017 (dhclient) score 1 or sacrifice child
      06:18:40:Killed process 1017, UID 0, (dhclient) total-vm:9120kB, anon-rss:0kB, file-rss:140kB
      06:18:40:ntpd invoked oom-killer: gfp_mask=0x200da, order=0, oom_adj=0, oom_score_adj=0
      06:18:40:ntpd cpuset=/ mems_allowed=0
      06:18:40:Pid: 1850, comm: ntpd Not tainted 2.6.32-431.23.3.el6.x86_64 #1
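
      The memory dump above is consistent with page-cache/writeback pressure rather than anonymous-memory growth: Node 0 DMA32 free (44696kB) is just below its min watermark (44720kB), 124596kB is under writeback, and active_anon/inactive_anon are both 0 while the dd and tar client loads keep dirtying data during MDS recovery. A small sketch for pulling those figures out of a saved copy of the client console log; "client-console.log" is only a placeholder for wherever the capture is stored:

          # List OOM-killer invocations plus the zone free/min watermark and
          # writeback figures that accompany each Mem-Info dump.
          LOG=client-console.log        # placeholder path to the saved console log
          grep -n 'invoked oom-killer' "$LOG"
          grep -no 'free:[0-9]*kB min:[0-9]*kB' "$LOG"
          grep -no 'writeback:[0-9]*kB' "$LOG"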
      

            People

              Assignee: WC Triage (wc-triage)
              Reporter: Maloo (maloo)
              Votes: 0
              Watchers: 2
