
[LU-7883] sanity test_60a invokes oom-killer in subtest 10b and times out

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Affects Version/s: Lustre 2.9.0
    • Fix Version/s: Lustre 2.9.0
    • Labels: None
    • Environment: review-dne-part-1
    • Severity: 3

    Description

      sanity test 60a invokes the oom-killer and times out in subtest 10b. From a recent failure, the MDS console log shows:

      02:35:42:Lustre: 6822:0:(llog_test.c:1468:llog_test_10()) 10a: create a catalog log with name: 851d315f
      02:35:42:Lustre: 6822:0:(llog_test.c:1487:llog_test_10()) 10b: write 65536 log records
      02:35:42:ntpd invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
      02:35:42:ntpd cpuset=/ mems_allowed=0
      02:35:42:Pid: 10244, comm: ntpd Not tainted 2.6.32-573.18.1.el6_lustre.ge5f28dc.x86_64 #1
      02:35:42:Call Trace:
      02:35:42: [<ffffffff810d71a1>] ? cpuset_print_task_mems_allowed+0x91/0xb0
      02:35:42: [<ffffffff8112a9a0>] ? dump_header+0x90/0x1b0
      02:35:42: [<ffffffff8112ab0e>] ? check_panic_on_oom+0x4e/0x80
      02:35:42: [<ffffffff8112b1fb>] ? out_of_memory+0x1bb/0x3c0
      02:35:42: [<ffffffff81137c3c>] ? __alloc_pages_nodemask+0x93c/0x950
      02:35:42: [<ffffffff811709ba>] ? alloc_pages_current+0xaa/0x110
      02:35:42: [<ffffffff81127d97>] ? __page_cache_alloc+0x87/0x90
      02:35:42: [<ffffffff8112777e>] ? find_get_page+0x1e/0xa0
      02:35:42: [<ffffffff81128d37>] ? filemap_fault+0x1a7/0x500
      02:35:42: [<ffffffff81152314>] ? __do_fault+0x54/0x530
      02:35:42: [<ffffffff811a9f30>] ? pollwake+0x0/0x60
      02:35:42: [<ffffffff811528e7>] ? handle_pte_fault+0xf7/0xb20
      02:35:42: [<ffffffff8153cd75>] ? _read_unlock_bh+0x15/0x20
      02:35:42: [<ffffffff8145b227>] ? sock_i_uid+0x47/0x60
      02:35:42: [<ffffffff81012bbe>] ? copy_user_generic+0xe/0x20
      02:35:42: [<ffffffff811535a9>] ? handle_mm_fault+0x299/0x3d0
      02:35:42: [<ffffffff8104f156>] ? __do_page_fault+0x146/0x500
      02:35:42: [<ffffffff81042fc9>] ? kvm_clock_get_cycles+0x9/0x10
      02:35:42: [<ffffffff810ad40f>] ? ktime_get_ts+0xbf/0x100
      02:35:42: [<ffffffff811a9b78>] ? poll_select_copy_remaining+0xf8/0x150
      02:35:42: [<ffffffff8153fd1e>] ? do_page_fault+0x3e/0xa0
      02:35:42: [<ffffffff8153d0c5>] ? page_fault+0x25/0x30
      02:35:42:Mem-Info:
      02:35:42:Node 0 DMA per-cpu:
      02:35:42:CPU    0: hi:    0, btch:   1 usd:   0
      02:35:42:CPU    1: hi:    0, btch:   1 usd:   0
      02:35:42:Node 0 DMA32 per-cpu:
      02:35:42:CPU    0: hi:  186, btch:  31 usd:  75
      02:35:42:CPU    1: hi:  186, btch:  31 usd:   0
      02:35:42:active_anon:3065 inactive_anon:3126 isolated_anon:0
      02:35:42: active_file:1149 inactive_file:1114 isolated_file:0
      02:35:42: unevictable:0 dirty:0 writeback:3125 unstable:0
      02:35:42: free:13241 slab_reclaimable:2692 slab_unreclaimable:420617
      02:35:42: mapped:1 shmem:3 pagetables:1219 bounce:0
      02:35:42:Node 0 DMA free:8356kB min:332kB low:412kB high:496kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:4kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15348kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:4kB slab_unreclaimable:7360kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1 all_unreclaimable? yes
      02:35:42:lowmem_reserve[]: 0 2004 2004 2004
      02:35:42:Node 0 DMA32 free:44608kB min:44720kB low:55900kB high:67080kB active_anon:12260kB inactive_anon:12504kB active_file:4596kB inactive_file:4452kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2052308kB mlocked:0kB dirty:0kB writeback:3324kB mapped:4kB shmem:12kB slab_reclaimable:10764kB slab_unreclaimable:1674856kB kernel_stack:4592kB pagetables:4876kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:279133 all_unreclaimable? yes
      02:35:42:lowmem_reserve[]: 0 0 0 0
      02:35:42:Node 0 DMA: 3*4kB 1*8kB 3*16kB 3*32kB 2*64kB 3*128kB 2*256kB 2*512kB 2*1024kB 2*2048kB 0*4096kB = 8356kB
      02:35:42:Node 0 DMA32: 1325*4kB 753*8kB 501*16kB 426*32kB 90*64kB 2*128kB 1*256kB 1*512kB 1*1024kB 0*2048kB 1*4096kB = 44876kB
      02:35:42:5204 total pagecache pages
      02:35:42:2638 pages in swap cache
      02:35:42:Swap cache stats: add 3431, delete 825, find 6/6
      02:35:42:Free swap  = 4115064kB
      02:35:42:Total swap = 4128764kB
      02:35:42:524284 pages RAM
      02:35:42:43737 pages reserved
      02:35:42:2420 pages shared
      02:35:42:437717 pages non-shared
      02:35:42:[ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
      02:35:42:[  456]     0   456     2695        4   0     -17         -1000 udevd
      02:35:42:[  761]     0   761     2695        9   0     -17         -1000 udevd
      02:35:42:[  915]     0   915     2694       12   0     -17         -1000 udevd
      02:35:42:[ 1175]     0  1175     2280        1   0       0             0 dhclient
      02:35:42:[ 1238]     0  1238     6905       13   1     -17         -1000 auditd
      02:35:42:[ 1272]     0  1272    63854      276   0       0             0 rsyslogd
      02:35:42:[ 1306]     0  1306     4561        1   0       0             0 irqbalance
      02:35:42:[ 1324]    32  1324     4744        1   0       0             0 rpcbind
      02:35:42:[ 1340]     0  1340    52816        5   0       0             0 sssd
      02:35:42:[ 1341]     0  1341    74242       55   1       0             0 sssd_be
      02:35:42:[ 1342]     0  1342    54172        6   1       0             0 sssd_nss
      02:35:42:[ 1343]     0  1343    50533        2   1       0             0 sssd_pam
      02:35:42:[ 1344]     0  1344    50020       26   0       0             0 sssd_ssh
      02:35:42:[ 1345]     0  1345    55112        2   1       0             0 sssd_pac
      02:35:42:[ 1367]    29  1367     6357        1   1       0             0 rpc.statd
      02:35:42:[ 1401]    81  1401     5878        3   0       0             0 dbus-daemon
      02:35:42:[ 1422]     0  1422    47234        1   1       0             0 cupsd
      02:35:42:[ 1466]     0  1466     1020        0   0       0             0 acpid
      02:35:42:[ 1478]    68  1478    10493        1   0       0             0 hald
      02:35:42:[ 1479]     0  1479     5099        1   1       0             0 hald-runner
      02:35:42:[ 1511]     0  1511     5629        1   0       0             0 hald-addon-inpu
      02:35:42:[ 1523]    68  1523     4501        1   0       0             0 hald-addon-acpi
      02:35:42:[ 1547]     0  1547   169290      182   1       0             0 automount
      02:35:42:[ 1601]     0  1601    26827        0   0       0             0 rpc.rquotad
      02:35:42:[ 1606]     0  1606     5417        0   0       0             0 rpc.mountd
      02:35:42:[ 1651]     0  1651     5774        1   0       0             0 rpc.idmapd
      02:35:42:[ 1688]   496  1688    56786      138   0       0             0 munged
      02:35:42:[ 8272]     0  8272    16556        0   1     -17         -1000 sshd
      02:35:42:[ 8283]     0  8283     5429       36   1       0             0 xinetd
      02:35:42:[ 8373]     0  8373    20737      227   0       0             0 master
      02:35:42:[ 8387]    89  8387    20757      219   0       0             0 pickup
      02:35:42:[ 8388]    89  8388    20800      221   0       0             0 qmgr
      02:35:42:[ 8402]     0  8402    29215      151   0       0             0 crond
      02:35:42:[ 8421]     0  8421     5276       46   1       0             0 atd
      02:35:42:[ 8453]     0  8453    16112      187   1       0             0 certmonger
      02:35:42:[ 8482]     0  8482     1020       23   1       0             0 agetty
      02:35:42:[ 8486]     0  8486     1016       21   1       0             0 mingetty
      02:35:42:[ 8491]     0  8491     1016       22   1       0             0 mingetty
      02:35:42:[ 8494]     0  8494     1016       22   0       0             0 mingetty
      02:35:42:[ 8499]     0  8499     1016       21   1       0             0 mingetty
      02:35:42:[ 8504]     0  8504     1016       21   1       0             0 mingetty
      02:35:42:[ 8507]     0  8507     1016       21   0       0             0 mingetty
      02:35:42:[10244]    38 10244     8207      161   1       0             0 ntpd
      02:35:42:[30737]     0 30737     4237       52   1       0             0 anacron
      02:35:42:[ 6720]     0  6720    15806      178   0       0             0 in.mrshd
      02:35:42:[ 6721]     0  6721    26515       66   0       0             0 bash
      02:35:42:[ 6746]     0  6746    26515       66   1       0             0 bash
      02:35:42:[ 6747]     0  6747    27593      618   0       0             0 sh
      02:35:42:[ 6822]     0  6822    28024       73   0       0             0 lctl
      02:35:42:Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled
      02:35:42:
      02:35:42:Pid: 10244, comm: ntpd Not tainted 2.6.32-573.18.1.el6_lustre.ge5f28dc.x86_64 #1
      

      Here are recent logs for this type of failure:
      2016-03-10
      https://testing.hpdd.intel.com/test_sets/2e6d4b9a-e700-11e5-bdde-5254006e85c2
      https://testing.hpdd.intel.com/test_sets/6c42ab98-e6d0-11e5-8590-5254006e85c2

      2016-03-14
      https://testing.hpdd.intel.com/test_sets/8f4538da-ea7d-11e5-8779-5254006e85c2
      https://testing.hpdd.intel.com/test_sets/bc22e04c-ea4f-11e5-8606-5254006e85c2
      https://testing.hpdd.intel.com/test_sets/7dc47e5e-ea41-11e5-8606-5254006e85c2
      https://testing.hpdd.intel.com/test_sets/a332f46a-e9fe-11e5-8186-5254006e85c2


          Activity


            jgmitter Joseph Gmitter (Inactive) added a comment - Landed to master for 2.9.0

            gerrit Gerrit Updater added a comment -
            Bob Glossman (bob.glossman@intel.com) uploaded a new patch: http://review.whamcloud.com/19321
            Subject: LU-7883 llog: Flush out journal callback data
            Project: fs/lustre-release
            Branch: b2_8
            Current Patch Set: 1
            Commit: 780379da359be080612489d13a01bf48e2007f3c
            bogl Bob Glossman (Inactive) added a comment - another, seen in b2_8: https://testing.hpdd.intel.com/test_sets/1d8daa2a-fa9e-11e5-8951-5254006e85c2

            gerrit Gerrit Updater added a comment -
            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/19005/
            Subject: LU-7883 llog: Flush out journal callback data
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 97ef9af439181510cc4abce355f1833399aa9445
            rhenwood Richard Henwood (Inactive) added a comment - Another failure on Master/review-dne-part-1: https://testing.hpdd.intel.com/test_sets/2733848a-eeb1-11e5-8047-5254006e85c2

            gerrit Gerrit Updater added a comment -
            Yang Sheng (yang.sheng@intel.com) uploaded a new patch: http://review.whamcloud.com/19005
            Subject: LU-7883 llog: Flush out journal callback data
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: c2237e6d1b4729a84a547a25f31ee91eb07a8b68
            ys Yang Sheng added a comment -

            Hi, Bruno,

            I cannot reproduce it even with a single-core VM, so I think some other cause may be behind this problem. Anyway, I'll make a patch as you suggested and see what happens with it.

            Thanks,
            YangSheng

            bfaccini Bruno Faccini (Inactive) added a comment -

            Andreas, I agree with using a low-memory VM profile to stress Lustre, but in doing so we expose ourselves to this kind of OOM situation once we cross some threshold!

            Yang Sheng, I wonder whether having an additional core available would let the journaling/jbd2 threads handle the backlog of commit callbacks more efficiently, since it is their data that consumes all the memory.

            So again, my 2 cents: simply adding a dt_sync() at the very beginning of llog_test_10() (as was already done inside the sub-test itself by the patch for LU-7329) should help/fix this by allowing the callbacks from previous sub-tests' llog operations to be handled.
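
            A minimal sketch of that suggestion, assuming llog_test_10() can reach the backing dt_device the way the rest of llog_test.c does; the field access, error message, and placement here are illustrative only, not the wording of the actual patch:

                static int llog_test_10(const struct lu_env *env, struct obd_device *obd)
                {
                        /* illustrative: the backing dt_device as used elsewhere in llog_test.c */
                        struct dt_device *dt = obd->obd_lvfs_ctxt.dt;
                        int rc;

                        /* Flush transactions left over from the earlier sub-tests so their
                         * journal commit callbacks run now and release the memory they pin,
                         * before 10b queues another 64k log records. */
                        rc = dt_sync(env, dt);
                        if (rc) {
                                CERROR("10_sync: can't sync device: rc = %d\n", rc);
                                return rc;
                        }

                        /* ... existing 10a-10h sub-test body continues unchanged ... */
                        return rc;
                }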
            ys Yang Sheng added a comment -

            It really is the same issue as LU-7329. The size-512 slab usage gets very high while llog_test is running, but I cannot trigger an OOM on my test box (2G RAM & 2 cores).

            adilger Andreas Dilger added a comment -

            Definitely we need to be running the VMs in SMP mode, so at least 2 processors, to ensure that there is scheduling between threads. However, I also think that part of the purpose of the VM test environment is to stress Lustre, so having less memory is good. Otherwise, there may be leaks that never cause problems simply because we don't run our tests long enough for the VM to run out of memory.

            People

              Assignee: ys Yang Sheng
              Reporter: jamesanunez James Nunez (Inactive)
              Votes: 0
              Watchers: 10
