Lustre / LU-5944

Failover recovery-mds-scale test_failover_mds: client OOM


Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.7.0
    • Components: None
    • Environment: lustre-master build # 2733 RHEL6
    • Severity: 3
    • Rank: 16600

    Description

      This issue was created by maloo for sarah <sarah@whamcloud.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/ba0b8798-6902-11e4-9d25-5254006e85c2.

      The sub-test test_failover_mds failed with the following error:

      test_failover_mds returned 4
      

      client console

      03:26:42:Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
      03:26:42:Lustre: DEBUG MARKER: rc=$(lctl get_param -n catastrophe);
      03:26:42:		if [ $rc -ne 0 ]; then echo $(hostname): $rc; fi
      03:26:42:		exit $rc
      03:26:42:Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_dd.sh
      03:26:42:Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 has failed over 4 times, and counting...
      03:26:42:Lustre: DEBUG MARKER: mds1 has failed over 4 times, and counting...
      03:26:42:Lustre: 2188:0:(client.c:1947:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1415445951/real 1415445967]  req@ffff880032a59800 x1484198832385960/t0(0) o400->lustre-MDT0000-mdc-ffff88007a10bc00@10.2.4.189@tcp:12/10 lens 224/224 e 0 to 1 dl 1415445958 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      03:26:42:Lustre: 2188:0:(client.c:1947:ptlrpc_expire_one_request()) Skipped 1 previous similar message
      03:31:34:Lustre: Evicted from MGS (at 10.2.4.185@tcp) after server handle changed from 0xd9f20fa7646c5b6b to 0x11b90fceb8d34e70
      03:31:34:Lustre: MGC10.2.4.185@tcp: Connection restored to MGS (at 10.2.4.185@tcp)
      03:31:34:LustreError: 2187:0:(client.c:2817:ptlrpc_replay_interpret()) @@@ status 301, old was 0  req@ffff88005bc46c00 x1484198830088128/t17179982789(17179982789) o101->lustre-MDT0000-mdc-ffff88007a10bc00@10.2.4.185@tcp:12/10 lens 704/544 e 0 to 0 dl 1415446038 ref 2 fl Interpret:RP/4/0 rc 301/301
      03:31:34:Lustre: lustre-MDT0000-mdc-ffff88007a10bc00: Connection restored to lustre-MDT0000 (at 10.2.4.185@tcp)
      03:31:34:ntpd invoked oom-killer: gfp_mask=0x200da, order=0, oom_adj=0, oom_score_adj=0
      03:31:34:ntpd cpuset=/ mems_allowed=0
      03:31:34:Pid: 1849, comm: ntpd Not tainted 2.6.32-431.29.2.el6.x86_64 #1
      03:31:34:Call Trace:
      03:31:34: [<ffffffff810d0361>] ? cpuset_print_task_mems_allowed+0x91/0xb0
      03:31:34: [<ffffffff81122730>] ? dump_header+0x90/0x1b0
      03:31:34: [<ffffffff8112289e>] ? check_panic_on_oom+0x4e/0x80
      03:31:34: [<ffffffff81122f8b>] ? out_of_memory+0x1bb/0x3c0
      03:31:34: [<ffffffff8112f90f>] ? __alloc_pages_nodemask+0x89f/0x8d0
      03:31:34: [<ffffffff8116799a>] ? alloc_pages_vma+0x9a/0x150
      03:31:34: [<ffffffff8115b622>] ? read_swap_cache_async+0xf2/0x160
      03:31:34: [<ffffffff8115c149>] ? valid_swaphandles+0x69/0x150
      03:31:34: [<ffffffff8115b717>] ? swapin_readahead+0x87/0xc0
      03:31:34: [<ffffffff8114a9bd>] ? handle_pte_fault+0x6dd/0xb00
      03:31:34: [<ffffffff81060aa3>] ? perf_event_task_sched_out+0x33/0x70
      03:31:34: [<ffffffff8114b00a>] ? handle_mm_fault+0x22a/0x300
      03:31:34: [<ffffffff8104a8d8>] ? __do_page_fault+0x138/0x480
      03:31:34: [<ffffffff811a0870>] ? pollwake+0x0/0x60
      03:31:34: [<ffffffff811a0870>] ? pollwake+0x0/0x60
      03:31:34: [<ffffffff811a0870>] ? pollwake+0x0/0x60
      03:31:34: [<ffffffff8152e99e>] ? do_page_fault+0x3e/0xa0
      03:31:34: [<ffffffff8152bd55>] ? page_fault+0x25/0x30
      03:31:34: [<ffffffff8128dea6>] ? copy_user_generic_unrolled+0x86/0xb0
      03:31:34: [<ffffffff810129de>] ? copy_user_generic+0xe/0x20
      03:31:34: [<ffffffff811a0589>] ? set_fd_set+0x49/0x60
      03:31:34: [<ffffffff811a1a4c>] ? core_sys_select+0x1bc/0x2c0
      03:31:34: [<ffffffff8103f9d8>] ? pvclock_clocksource_read+0x58/0xd0
      03:31:34: [<ffffffff8103f9d8>] ? pvclock_clocksource_read+0x58/0xd0
      03:31:34: [<ffffffff8103ea6c>] ? kvm_clock_read+0x1c/0x20
      03:31:34: [<ffffffff8103ea79>] ? kvm_clock_get_cycles+0x9/0x10
      03:31:34: [<ffffffff810a6d31>] ? ktime_get_ts+0xb1/0xf0
      03:31:34: [<ffffffff811a1da7>] ? sys_select+0x47/0x110
      03:31:34: [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
      03:31:34:Mem-Info:
      03:31:34:Node 0 DMA per-cpu:
      03:31:34:CPU    0: hi:    0, btch:   1 usd:   0
      03:31:34:CPU    1: hi:    0, btch:   1 usd:   0
      03:31:34:Node 0 DMA32 per-cpu:
      03:31:34:CPU    0: hi:  186, btch:  31 usd:  91
      03:31:34:CPU    1: hi:  186, btch:  31 usd:  41
      03:31:34:active_anon:17 inactive_anon:0 isolated_anon:0
      03:31:34: active_file:203782 inactive_file:205446 isolated_file:0
      03:31:34: unevictable:0 dirty:440 writeback:30040 unstable:0
      03:31:34: free:12464 slab_reclaimable:2779 slab_unreclaimable:42062
      03:31:34: mapped:1935 shmem:0 pagetables:1109 bounce:0
      03:31:34:Node 0 DMA free:8352kB min:332kB low:412kB high:496kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:6552kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15348kB mlocked:0kB dirty:0kB writeback:6552kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:840kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:11072 all_unreclaimable? yes
      03:31:34:lowmem_reserve[]: 0 2004 2004 2004
      03:31:34:Node 0 DMA32 free:41504kB min:44720kB low:55900kB high:67080kB active_anon:68kB inactive_anon:0kB active_file:815128kB inactive_file:815232kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2052308kB mlocked:0kB dirty:1760kB writeback:113608kB mapped:7740kB shmem:0kB slab_reclaimable:11116kB slab_unreclaimable:167408kB kernel_stack:1464kB pagetables:4436kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:2961178 all_unreclaimable? yes
      03:31:34:lowmem_reserve[]: 0 0 0 0
      03:31:34:Node 0 DMA: 8*4kB 60*8kB 40*16kB 25*32kB 12*64kB 0*128kB 0*256kB 1*512kB 1*1024kB 2*2048kB 0*4096kB = 8352kB
      03:31:34:Node 0 DMA32: 322*4kB 77*8kB 51*16kB 34*32kB 15*64kB 5*128kB 5*256kB 4*512kB 4*1024kB 0*2048kB 7*4096kB = 41504kB
      03:31:34:227493 total pagecache pages
      03:31:34:0 pages in swap cache
      03:31:34:Swap cache stats: add 6524, delete 6524, find 2262/2343
      03:31:34:Free swap  = 2703532kB
      03:31:34:Total swap = 2725884kB
      03:31:34:524284 pages RAM
      03:31:34:43694 pages reserved
      03:31:34:448398 pages shared
      03:31:34:234817 pages non-shared
      03:31:34:[ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
      03:31:34:[  373]     0   373     2721       80   1     -17         -1000 udevd
      03:31:34:[ 1019]     0  1019     2280       35   0       0             0 dhclient
      03:31:34:[ 1071]     0  1071     6916      132   1     -17         -1000 auditd
      03:31:34:[ 1087]     0  1087    63854      236   1       0             0 rsyslogd
      03:31:34:[ 1116]     0  1116     2705      103   1       0             0 irqbalance
      03:31:34:[ 1130]    32  1130     4744      157   1       0             0 rpcbind
      03:31:34:[ 1142]     0  1142    49913      365   1       0             0 sssd
      03:31:34:[ 1143]     0  1143    64328      866   1       0             0 sssd_be
      03:31:34:[ 1144]     0  1144    50478      506   1       0             0 sssd_nss
      03:31:34:[ 1145]     0  1145    48029      495   0       0             0 sssd_pam
      03:31:34:[ 1146]     0  1146    47507      503   0       0             0 sssd_ssh
      03:31:34:[ 1147]     0  1147    52608      486   0       0             0 sssd_pac
      03:31:34:[ 1164]    29  1164     6357      215   1       0             0 rpc.statd
      03:31:34:[ 1278]    81  1278     5871      139   0       0             0 dbus-daemon
      03:31:34:[ 1316]     0  1316     1020      125   1       0             0 acpid
      03:31:34:[ 1325]    68  1325     9921      331   1       0             0 hald
      03:31:34:[ 1326]     0  1326     5081      248   1       0             0 hald-runner
      03:31:34:[ 1358]     0  1358     5611      238   1       0             0 hald-addon-inpu
      03:31:34:[ 1368]    68  1368     4483      236   1       0             0 hald-addon-acpi
      03:31:34:[ 1388]     0  1388   168326      554   1       0             0 automount
      03:31:34:[ 1434]     0  1434    26827       29   0       0             0 rpc.rquotad
      03:31:34:[ 1438]     0  1438     5414       87   0       0             0 rpc.mountd
      03:31:34:[ 1474]     0  1474     5773       86   1       0             0 rpc.idmapd
      03:31:34:[ 1505]   496  1505    56785      294   1       0             0 munged
      03:31:34:[ 1520]     0  1520    16656      100   0     -17         -1000 sshd
      03:31:34:[ 1528]     0  1528     5545      179   1       0             0 xinetd
      03:31:34:[ 1612]     0  1612    20846      610   1       0             0 master
      03:31:34:[ 1620]    89  1620    20866      568   1       0             0 pickup
      03:31:34:[ 1622]    89  1622    20909      569   1       0             0 qmgr
      03:31:34:[ 1635]     0  1635    29324      153   1       0             0 crond
      03:31:34:[ 1646]     0  1646     5385       76   0       0             0 atd
      03:31:34:[ 1672]     0  1672    15585      146   0       0             0 certmonger
      03:31:34:[ 1686]     0  1686     1020      133   1       0             0 agetty
      03:31:34:[ 1687]     0  1687     1016      121   1       0             0 mingetty
      03:31:34:[ 1689]     0  1689     1016      121   1       0             0 mingetty
      03:31:34:[ 1691]     0  1691     1016      121   1       0             0 mingetty
      03:31:34:[ 1693]     0  1693     1016      121   1       0             0 mingetty
      03:31:34:[ 1695]     0  1695     1016      121   1       0             0 mingetty
      03:31:34:[ 1697]     0  1697     1016      121   1       0             0 mingetty
      03:31:34:[ 1701]     0  1701     2720       80   1     -17         -1000 udevd
      03:31:34:[ 1702]     0  1702     2720       76   0     -17         -1000 udevd
      03:31:34:[ 1849]    38  1849     8205      376   1       0             0 ntpd
      03:31:34:[ 3864]     0  3864    15919      354   0       0             0 in.mrshd
      03:31:34:[ 3870]     0  3870    26515      292   1       0             0 bash
      03:31:34:[ 3892]     0  3892    26515      115   0       0             0 bash
      03:31:34:[ 3893]     0  3893    26839      292   0       0             0 run_dd.sh
      03:31:34:[ 6142]     0  6142     4346       97   0       0             0 anacron
      03:31:34:[ 8144]     0  8144    26295      138   1       0             0 dd
      03:31:34:Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled
      03:31:34:
      03:31:34:Pid: 1849, comm: ntpd Not tainted 2.6.32-431.29.2.el6.x86_64 #1
      03:31:34:Call Trace:
      03:31:34: [<ffffffff8152873c>] ? panic+0xa7/0x16f
      03:31:34: [<ffffffff81122831>] ? dump_header+0x191/0x1b0
      03:31:34: [<ffffffff811228cc>] ? check_panic_on_oom+0x7c/0x80
      03:31:34: [<ffffffff81122f8b>] ? out_of_memory+0x1bb/0x3c0
      03:31:34: [<ffffffff8112f90f>] ? __alloc_pages_nodemask+0x89f/0x8d0
      03:31:34: [<ffffffff8116799a>] ? alloc_pages_vma+0x9a/0x150
      03:31:34: [<ffffffff8115b622>] ? read_swap_cache_async+0xf2/0x160
      03:31:34: [<ffffffff8115c149>] ? valid_swaphandles+0x69/0x150
      03:31:34: [<ffffffff8115b717>] ? swapin_readahead+0x87/0xc0
      03:31:34: [<ffffffff8114a9bd>] ? handle_pte_fault+0x6dd/0xb00
      03:31:34: [<ffffffff81060aa3>] ? perf_event_task_sched_out+0x33/0x70
      03:31:34: [<ffffffff8114b00a>] ? handle_mm_fault+0x22a/0x300
      03:31:34: [<ffffffff8104a8d8>] ? __do_page_fault+0x138/0x480
      03:31:34: [<ffffffff811a0870>] ? pollwake+0x0/0x60
      03:31:34: [<ffffffff811a0870>] ? pollwake+0x0/0x60
      03:31:34: [<ffffffff811a0870>] ? pollwake+0x0/0x60
      03:31:34: [<ffffffff8152e99e>] ? do_page_fault+0x3e/0xa0
      03:31:34: [<ffffffff8152bd55>] ? page_fault+0x25/0x30
      03:31:34: [<ffffffff8128dea6>] ? copy_user_generic_unrolled+0x86/0xb0
      03:31:34: [<ffffffff810129de>] ? copy_user_generic+0xe/0x20
      03:31:34: [<ffffffff811a0589>] ? set_fd_set+0x49/0x60
      03:31:34: [<ffffffff811a1a4c>] ? core_sys_select+0x1bc/0x2c0
      03:31:34: [<ffffffff8103f9d8>] ? pvclock_clocksource_read+0x58/0xd0
      03:31:34: [<ffffffff8103f9d8>] ? pvclock_clocksource_read+0x58/0xd0
      03:31:34: [<ffffffff8103ea6c>] ? kvm_clock_read+0x1c/0x20
      03:31:34: [<ffffffff8103ea79>] ? kvm_clock_get_cycles+0x9/0x10
      03:31:34: [<ffffffff810a6d31>] ? ktime_get_ts+0xb1/0xf0
      03:31:34: [<ffffffff811a1da7>] ? sys_select+0x47/0x110
      03:31:34: [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
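The fatal line above ("Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled") shows the panic comes from the kernel's `vm.panic_on_oom` policy rather than from Lustre itself: the test node is configured to panic on a system-wide OOM instead of letting the OOM killer reap a task. As a generic reference (standard Linux procfs path, not specific to this test setup), the setting can be inspected with:

```shell
# vm.panic_on_oom controls what the kernel does when memory is exhausted:
#   0 = OOM-kill a selected task and continue (default)
#   1 = panic on a system-wide OOM (the behaviour seen in this log)
#   2 = panic even on cpuset-/mempolicy-constrained OOMs
cat /proc/sys/vm/panic_on_oom
```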
      

      Info required for matching: recovery-mds-scale failover_mds
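For context, the "Checking the clients loads" step visible in the DEBUG MARKER lines runs a small per-client health check on the `catastrophe` flag. A minimal standalone sketch of that check, assuming `lctl get_param -n catastrophe` prints the flag as an integer (the fallback to 0 when `lctl` is absent is added here purely for illustration):

```shell
# Report whether the Lustre "catastrophe" flag is set on this client,
# mirroring the check shown in the console log above.
check_catastrophe() {
    # Treat a missing lctl (non-Lustre node) as healthy for this sketch.
    rc=$(lctl get_param -n catastrophe 2>/dev/null) || rc=0
    if [ "$rc" -ne 0 ]; then
        echo "$(hostname): $rc"
    fi
    return "$rc"
}
check_catastrophe
```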

    People

        Assignee: WC Triage
        Reporter: Maloo