LU-6914: replay-single test 80b hangs with MDS OOM


Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.8.0
    • Labels: None
    • Environment: review-dne-part-2 in autotest
    • Severity: 3

    Description

      Replay-single test 80b hangs and times out in review-dne-part-2. Logs are at https://testing.hpdd.intel.com/test_sets/1791cbe4-32b5-11e5-8214-5254006e85c2.

      From the test_log, we see there’s a problem with MDT0001:

      onyx-43vm6: CMD: onyx-43vm6.onyx.hpdd.intel.com lctl get_param -n at_max
      onyx-43vm5: CMD: onyx-43vm5.onyx.hpdd.intel.com lctl get_param -n at_max
      onyx-43vm6:  rpc : @@@@@@ FAIL: can't put import for mdc.lustre-MDT0001-mdc-*.mds_server_uuid into FULL state after 1475 sec, have REPLAY_WAIT 
      onyx-43vm5:  rpc : @@@@@@ FAIL: can't put import for mdc.lustre-MDT0001-mdc-*.mds_server_uuid into FULL state after 1475 sec, have REPLAY_WAIT 
      onyx-43vm5:   Trace dump:
      onyx-43vm5:   = /usr/lib64/lustre/tests/test-framework.sh:4727:error_noexit()
      onyx-43vm5:   = /usr/lib64/lustre/tests/test-framework.sh:4758:error()
      onyx-43vm5:   = /usr/lib64/lustre/tests/test-framework.sh:5830:_wait_import_state()
      onyx-43vm5:   = /usr/lib64/lustre/tests/test-framework.sh:5849:wait_import_state()
      onyx-43vm5:   = /usr/lib64/lustre/tests/test-framework.sh:5858:wait_import_state_mount()
      onyx-43vm5:   = rpc.sh:20:main()
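
      For context, wait_import_state_mount() polls the named import parameter until it reports the expected state or the deadline (1475 seconds here, derived from at_max) expires; the import above never left REPLAY_WAIT. A minimal sketch of that polling loop — a hypothetical helper, not the actual test-framework.sh code, and it assumes the *_server_uuid parameter prints the import state as its last field:

      # Hypothetical sketch of the wait that failed above; the real logic
      # lives in _wait_import_state() in test-framework.sh.
      wait_import_full() {
          local param="mdc.lustre-MDT0001-mdc-*.mds_server_uuid"
          local deadline=$((SECONDS + 1475))
          local state=""
          while ((SECONDS < deadline)); do
              # Assumption: the parameter prints the peer UUID followed by
              # the import state (FULL, REPLAY_WAIT, ...).
              state=$(lctl get_param -n $param | awk '{print $NF}')
              [[ $state == FULL ]] && return 0
              sleep 1
          done
          echo "import still in ${state:-unknown} after 1475s, wanted FULL" >&2
          return 1
      }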
      

      Looking at the MDT0001 console log, we see the oom-killer was invoked:

      20:37:02:mdt_out00_000 invoked oom-killer: gfp_mask=0x80d2, order=0, oom_adj=0, oom_score_adj=0
      20:37:02:mdt_out00_000 cpuset=/ mems_allowed=0
      20:37:03:Pid: 17922, comm: mdt_out00_000 Not tainted 2.6.32-504.23.4.el6_lustre.ge09a9ac.x86_64 #1
      20:37:03:Call Trace:
      20:37:03: [<ffffffff810d4671>] ? cpuset_print_task_mems_allowed+0x91/0xb0
      20:37:03: [<ffffffff81127930>] ? dump_header+0x90/0x1b0
      20:37:05: [<ffffffff81127a9e>] ? check_panic_on_oom+0x4e/0x80
      20:37:06: [<ffffffff8112818b>] ? out_of_memory+0x1bb/0x3c0
      20:37:07: [<ffffffff81134b2f>] ? __alloc_pages_nodemask+0x89f/0x8d0
      20:37:07: [<ffffffff8115e239>] ? __vmalloc_area_node+0xb9/0x190
      20:37:07: [<ffffffffa08306ae>] ? ptlrpc_grow_req_bufs+0x33e/0x7e0 [ptlrpc]
      20:37:07: [<ffffffff8115e10d>] ? __vmalloc_node+0xad/0x120
      20:37:07: [<ffffffffa08306ae>] ? ptlrpc_grow_req_bufs+0x33e/0x7e0 [ptlrpc]
      20:37:07: [<ffffffff8115e529>] ? vzalloc_node+0x29/0x30
      20:37:07: [<ffffffffa08306ae>] ? ptlrpc_grow_req_bufs+0x33e/0x7e0 [ptlrpc]
      20:37:08: [<ffffffffa08334cd>] ? ptlrpc_main+0x126d/0x1920 [ptlrpc]
      20:37:08: [<ffffffffa0832260>] ? ptlrpc_main+0x0/0x1920 [ptlrpc]
      20:37:08: [<ffffffff8109e78e>] ? kthread+0x9e/0xc0
      20:37:08: [<ffffffff8100c28a>] ? child_rip+0xa/0x20
      20:37:08: [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
      20:37:08: [<ffffffff8100c280>] ? child_rip+0x0/0x20
      20:37:09:Mem-Info:
      20:37:09:Node 0 DMA per-cpu:
      20:37:09:CPU    0: hi:    0, btch:   1 usd:   0
      20:37:09:CPU    1: hi:    0, btch:   1 usd:   0
      20:37:09:Node 0 DMA32 per-cpu:
      20:37:10:CPU    0: hi:  186, btch:  31 usd:  30
      20:37:10:CPU    1: hi:  186, btch:  31 usd:   0
      20:37:10:active_anon:925 inactive_anon:986 isolated_anon:0
      20:37:11: active_file:134 inactive_file:165 isolated_file:0
      20:37:11: unevictable:0 dirty:1 writeback:992 unstable:0
      20:37:11: free:13246 slab_reclaimable:1952 slab_unreclaimable:389562
      20:37:11: mapped:0 shmem:1 pagetables:1394 bounce:0
      20:37:11:Node 0 DMA free:8344kB min:332kB low:412kB high:496kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:4kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15348kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:7220kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
      20:37:11:lowmem_reserve[]: 0 2004 2004 2004
      20:37:11:Node 0 DMA32 free:44640kB min:44720kB low:55900kB high:67080kB active_anon:3700kB inactive_anon:3944kB active_file:536kB inactive_file:656kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2052308kB mlocked:0kB dirty:4kB writeback:3968kB mapped:0kB shmem:4kB slab_reclaimable:7808kB slab_unreclaimable:1551028kB kernel_stack:2216kB pagetables:5576kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:13568 all_unreclaimable? yes
      20:37:11:lowmem_reserve[]: 0 0 0 0
      20:37:12:Node 0 DMA: 1*4kB 1*8kB 1*16kB 0*32kB 0*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 2*2048kB 1*4096kB = 8348kB
      20:37:12:Node 0 DMA32: 1042*4kB 627*8kB 428*16kB 228*32kB 97*64kB 46*128kB 20*256kB 6*512kB 1*1024kB 0*2048kB 0*4096kB = 44640kB
      20:37:12:1345 total pagecache pages
      20:37:12:1043 pages in swap cache
      20:37:12:Swap cache stats: add 5108, delete 4065, find 79/118
      20:37:13:Free swap  = 4109748kB
      20:37:13:Total swap = 4128764kB
      20:37:15:524284 pages RAM
      20:37:16:43706 pages reserved
      20:37:16:375 pages shared
      20:37:16:91700 pages non-shared
      20:37:17:[ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
      20:37:17:[  420]     0   420     2729        1   0     -17         -1000 udevd
      20:37:17:[  703]     0   703     2728        0   0     -17         -1000 udevd
      20:37:17:[  833]     0   833     2662        0   0     -17         -1000 udevd
      20:37:18:[ 1072]     0  1072     2280        6   1       0             0 dhclient
      20:37:18:[ 1125]     0  1125     6905       15   0     -17         -1000 auditd
      20:37:18:[ 1155]     0  1155    63855      133   1       0             0 rsyslogd
      20:37:18:[ 1185]     0  1185     4560       21   0       0             0 irqbalance
      20:37:18:[ 1201]    32  1201     4744       15   0       0             0 rpcbind
      20:37:19:[ 1215]     0  1215    52788       43   0       0             0 sssd
      20:37:19:[ 1216]     0  1216    70940       62   0       0             0 sssd_be
      20:37:19:[ 1217]     0  1217    53458       36   0       0             0 sssd_nss
      20:37:19:[ 1218]     0  1218    50504       35   0       0             0 sssd_pam
      20:37:19:[ 1219]     0  1219    49990       34   0       0             0 sssd_ssh
      20:37:19:[ 1220]     0  1220    55084       33   0       0             0 sssd_pac
      20:37:19:[ 1239]    29  1239     6357        1   1       0             0 rpc.statd
      20:37:19:[ 1357]    81  1357     5878        1   1       0             0 dbus-daemon
      20:37:20:[ 1374]     0  1374    47233        1   1       0             0 cupsd
      20:37:20:[ 1412]     0  1412     1020        0   0       0             0 acpid
      20:37:20:[ 1422]    68  1422    10484       76   0       0             0 hald
      20:37:20:[ 1423]     0  1423     5100        1   1       0             0 hald-runner
      20:37:20:[ 1455]     0  1455     5630        1   1       0             0 hald-addon-inpu
      20:37:20:[ 1465]    68  1465     4502        1   1       0             0 hald-addon-acpi
      20:37:20:[ 1485]     0  1485   169286       55   0       0             0 automount
      20:37:20:[ 1533]     0  1533    26827        0   1       0             0 rpc.rquotad
      20:37:21:[ 1538]     0  1538     5417        0   1       0             0 rpc.mountd
      20:37:21:[ 1579]     0  1579     5773        1   0       0             0 rpc.idmapd
      20:37:21:[ 1612]   496  1612    56787       71   0       0             0 munged
      20:37:21:[ 1630]     0  1630    16553        0   1     -17         -1000 sshd
      20:37:22:[ 1639]     0  1639     5429       18   1       0             0 xinetd
      20:37:22:[ 1725]     0  1725    20734       21   0       0             0 master
      20:37:22:[ 1743]    89  1743    20797       16   0       0             0 qmgr
      20:37:22:[ 1750]     0  1750    29215       20   0       0             0 crond
      20:37:22:[ 1763]     0  1763     5276        0   1       0             0 atd
      20:37:22:[ 1791]     0  1791    16058        3   1       0             0 certmonger
      20:37:22:[ 1829]     0  1829     1020        2   1       0             0 agetty
      20:37:22:[ 1831]     0  1831     1016        1   0       0             0 mingetty
      20:37:22:[ 1833]     0  1833     1016        1   0       0             0 mingetty
      20:37:22:[ 1835]     0  1835     1016        1   0       0             0 mingetty
      20:37:22:[ 1837]     0  1837     1016        1   0       0             0 mingetty
      20:37:22:[ 1839]     0  1839     1016        3   0       0             0 mingetty
      20:37:22:[ 1841]     0  1841     1016        3   1       0             0 mingetty
      20:37:22:[ 2874]    38  2874     8205        6   0       0             0 ntpd
      20:37:22:[  900]    89   900    20754      218   0       0             0 pickup
      20:37:22:Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled
      20:37:23:
      20:37:23:Pid: 17922, comm: mdt_out00_000 Not tainted 2.6.32-504.23.4.el6_lustre.ge09a9ac.x86_64 #1
      20:37:23:Call Trace:
      20:37:23: [<ffffffff81529bbc>] ? panic+0xa7/0x16f
      20:37:23: [<ffffffff811279a1>] ? dump_header+0x101/0x1b0
      20:37:23: [<ffffffff81127acc>] ? check_panic_on_oom+0x7c/0x80
      20:37:23: [<ffffffff8112818b>] ? out_of_memory+0x1bb/0x3c0
      20:37:24: [<ffffffff81134b2f>] ? __alloc_pages_nodemask+0x89f/0x8d0
      20:37:24: [<ffffffff8115e239>] ? __vmalloc_area_node+0xb9/0x190
      20:37:24: [<ffffffffa08306ae>] ? ptlrpc_grow_req_bufs+0x33e/0x7e0 [ptlrpc]
      20:37:24: [<ffffffff8115e10d>] ? __vmalloc_node+0xad/0x120
      20:37:24: [<ffffffffa08306ae>] ? ptlrpc_grow_req_bufs+0x33e/0x7e0 [ptlrpc]
      20:37:24: [<ffffffff8115e529>] ? vzalloc_node+0x29/0x30
      20:37:24: [<ffffffffa08306ae>] ? ptlrpc_grow_req_bufs+0x33e/0x7e0 [ptlrpc]
      20:37:24: [<ffffffffa08334cd>] ? ptlrpc_main+0x126d/0x1920 [ptlrpc]
      20:37:24: [<ffffffffa0832260>] ? ptlrpc_main+0x0/0x1920 [ptlrpc]
      20:37:25: [<ffffffff8109e78e>] ? kthread+0x9e/0xc0
      20:37:25: [<ffffffff8100c28a>] ? child_rip+0xa/0x20
      20:37:25: [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
      20:37:25: [<ffffffff8100c280>] ? child_rip+0x0/0x20
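
      The failing allocation came from ptlrpc_grow_req_bufs() growing service request buffers via vzalloc_node(), and the Mem-Info dump shows where the memory went: slab_unreclaimable is 1551028kB, roughly 1.5GB of this 2GB VM, while every user process in the task list is tiny. Because the node runs with system-wide panic_on_oom enabled (the "system-wide" wording in the panic message corresponds to vm.panic_on_oom=2), the failed allocation escalated straight to a kernel panic instead of an ordinary OOM kill. Two standard kernel interfaces (not Lustre-specific) that would confirm both points on a live node:

      # Check whether the node panics on any OOM; 2 means system-wide panic,
      # matching the "system-wide panic_on_oom is enabled" message above.
      sysctl vm.panic_on_oom

      # Watch unreclaimable slab while the test runs; in the dump above
      # SUnreclaim had grown to ~1.5GB on a 2GB VM.
      grep -E '^(SUnreclaim|SReclaimable):' /proc/meminfo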
      


          People

            Assignee: WC Triage
            Reporter: James Nunez (Inactive)
            Votes: 0
            Watchers: 1
