LU-9514

sanity test 51f times out due to out of memory


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.10.1, Lustre 2.11.0
    • Affects Version/s: Lustre 2.10.0
    • Component/s: None
    • Labels: autotest
    • Severity: 3

    Description

      sanity test_51f times out with the usual message:

      'test failed to respond and timed out' 
      

      The only information in the test_log is:

      == sanity test 51f: check many open files limit ====================================================== 08:37:34 (1494923854)
      MDT0 numfree=711956, max=100000
      changed ulimit from 1024 to 100020
      
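      For context, test_51f ("check many open files limit") first checks how many inodes are free on the MDT (the "MDT0 numfree=711956, max=100000" line, which caps the file count at 100000), raises the per-process open-file limit to match (the "changed ulimit from 1024 to 100020" line), and then holds all of those files open at once via the createmany utility. Every open file pins kernel memory on the client, which is what pushes these small test VMs into OOM. A minimal C sketch of that open-many-files pattern (an illustration of the pattern only, not the actual createmany source; NFILES and the /mnt/lustre default are assumptions):

      /*
       * Minimal sketch (assumption: not the actual createmany source) of
       * what test_51f exercises: raise the open-file limit, then hold many
       * file descriptors open simultaneously.  Raising the hard
       * RLIMIT_NOFILE requires root, as in the autotest environment.
       */
      #include <stdio.h>
      #include <fcntl.h>
      #include <sys/resource.h>

      #define NFILES 100000  /* matches the test's MDT-derived cap */

      int main(int argc, char **argv)
      {
              const char *dir = argc > 1 ? argv[1] : "/mnt/lustre";
              struct rlimit rl = { NFILES + 20, NFILES + 20 };
              char path[4096];
              int i;

              /* equivalent of the test's "ulimit -n 100020" */
              if (setrlimit(RLIMIT_NOFILE, &rl) != 0) {
                      perror("setrlimit");
                      return 1;
              }

              /* keep every descriptor open, as createmany -o does; each
               * one pins client-side kernel memory (file structs, Lustre
               * open handles) until the process exits */
              for (i = 0; i < NFILES; i++) {
                      snprintf(path, sizeof(path), "%s/f51f.%d", dir, i);
                      if (open(path, O_CREAT | O_RDWR, 0644) < 0) {
                              perror(path);
                              break;
                      }
              }
              printf("opened %d files\n", i);
              return 0;
      }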

      There are two cases of sanity test 51f timing out due to OOM; in both, the following stack trace is found in the client console:

      08:38:17:[ 1565.371695] Lustre: DEBUG MARKER: == sanity test 51f: check many open files limit ====================================================== 08:37:34 (1494923854)
      08:38:17:[ 1602.443283] automount invoked oom-killer: gfp_mask=0x200da, order=0, oom_score_adj=0
      08:38:17:[ 1602.446098] automount cpuset=/ mems_allowed=0
      08:38:17:[ 1602.448513] CPU: 0 PID: 1010 Comm: automount Tainted: G           OE  ------------   3.10.0-514.16.1.el7.x86_64 #1
      08:38:17:[ 1602.451566] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
      08:38:17:
      08:38:17:[ 1602.454058]  ffff8800794eaf10 000000003457cee2 ffff88007b9e79f8 ffffffff81686ac3
      08:38:17:[ 1602.459021]  ffff88007b9e7a88 ffffffff81681a6e ffff88007b9e7af0 ffff88007b9e7a48
      08:38:17:[ 1602.461680]  ffffffff81692b6c ffffffff81a33cc0 0000000000000000 0000000000000000
      08:38:17:[ 1602.464276] Call Trace:
      08:38:17:[ 1602.466395]  [<ffffffff81686ac3>] dump_stack+0x19/0x1b
      08:38:17:[ 1602.468694]  [<ffffffff81681a6e>] dump_header+0x8e/0x225
      08:38:17:[ 1602.470982]  [<ffffffff81692b6c>] ? notifier_call_chain+0x4c/0x70
      08:38:17:[ 1602.473299]  [<ffffffff810b6d98>] ? __blocking_notifier_call_chain+0x58/0x70
      08:38:17:[ 1602.475696]  [<ffffffff81184e1e>] check_panic_on_oom+0x2e/0x60
      08:38:17:[ 1602.477970]  [<ffffffff8118523b>] out_of_memory+0x23b/0x4f0
      08:38:17:[ 1602.480183]  [<ffffffff81682577>] __alloc_pages_slowpath+0x5d7/0x725
      08:38:17:[ 1602.482651]  [<ffffffff8118b5c5>] __alloc_pages_nodemask+0x405/0x420
      08:38:17:[ 1602.484867]  [<ffffffff811d272a>] alloc_pages_vma+0x9a/0x150
      08:38:17:[ 1602.487039]  [<ffffffff811c351b>] read_swap_cache_async+0xeb/0x160
      08:38:17:[ 1602.489192]  [<ffffffff811c3638>] swapin_readahead+0xa8/0x110
      08:38:17:[ 1602.491293]  [<ffffffff811b189c>] handle_mm_fault+0xb1c/0xfe0
      08:38:17:[ 1602.493385]  [<ffffffff81692644>] __do_page_fault+0x154/0x450
      08:38:17:[ 1602.495536]  [<ffffffff81692975>] do_page_fault+0x35/0x90
      08:38:17:[ 1602.497619]  [<ffffffff8168eb88>] page_fault+0x28/0x30
      08:38:17:[ 1602.499598] Mem-Info:
      08:38:17:[ 1602.501313] active_anon:350 inactive_anon:385 isolated_anon:0
      08:38:17:[ 1602.501313]  active_file:0 inactive_file:525 isolated_file:0
      08:38:17:[ 1602.501313]  unevictable:0 dirty:0 writeback:142 unstable:0
      08:38:17:[ 1602.501313]  slab_reclaimable:4058 slab_unreclaimable:36202
      08:38:17:[ 1602.501313]  mapped:0 shmem:52 pagetables:1613 bounce:0
      08:38:17:[ 1602.501313]  free:12916 free_pcp:7 free_cma:0
      08:38:17:[ 1602.512345] Node 0 DMA free:7020kB min:416kB low:520kB high:624kB active_anon:400kB inactive_anon:640kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15984kB managed:15892kB mlocked:0kB dirty:0kB writeback:36kB mapped:0kB shmem:228kB slab_reclaimable:108kB slab_unreclaimable:712kB kernel_stack:48kB pagetables:20kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:15736 all_unreclaimable? yes
      08:38:17:[ 1602.521429] lowmem_reserve[]: 0 1663 1663 1663
      08:38:17:[ 1602.523300] Node 0 DMA32 free:44644kB min:44636kB low:55792kB high:66952kB active_anon:1000kB inactive_anon:900kB active_file:0kB inactive_file:2100kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080756kB managed:1705000kB mlocked:0kB dirty:0kB writeback:532kB mapped:0kB shmem:0kB slab_reclaimable:16124kB slab_unreclaimable:144096kB kernel_stack:2656kB pagetables:6432kB unstable:0kB bounce:0kB free_pcp:28kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
      08:38:17:[ 1602.534026] lowmem_reserve[]: 0 0 0 0
      08:38:17:[ 1602.535888] Node 0 DMA: 39*4kB (EM) 32*8kB (UEM) 25*16kB (UM) 12*32kB (UEM) 3*64kB (EM) 10*128kB (UM) 3*256kB (EM) 3*512kB (UEM) 2*1024kB (UE) 0*2048kB 0*4096kB = 7020kB
      08:38:17:[ 1602.540391] Node 0 DMA32: 995*4kB (UEM) 623*8kB (UEM) 319*16kB (UEM) 861*32kB (UEM) 46*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 44564kB
      08:38:17:[ 1602.544824] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
      08:38:17:[ 1602.547006] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
      08:38:17:[ 1602.549124] 975 total pagecache pages
      08:38:17:[ 1602.550875] 375 pages in swap cache
      08:38:17:[ 1602.552582] Swap cache stats: add 26733, delete 26358, find 248/330
      08:38:17:[ 1602.554487] Free swap  = 3566956kB
      08:38:17:[ 1602.556164] Total swap = 3671036kB
      08:38:17:[ 1602.557826] 524185 pages RAM
      08:38:17:[ 1602.559435] 0 pages HighMem/MovableOnly
      08:38:17:[ 1602.561132] 93962 pages reserved
      08:38:17:[ 1602.562747] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
      08:38:17:[ 1602.564748] [  343]     0   343     9303        1      20       92             0 systemd-journal
      08:38:17:[ 1602.566845] [  365]     0   365    29723        0      26       85             0 lvmetad
      08:38:17:[ 1602.568879] [  382]     0   382    10895        1      21      134         -1000 systemd-udevd
      08:38:17:[ 1602.570936] [  460]     0   460    13854        0      28      108         -1000 auditd
      08:38:17:[ 1602.572986] [  492]     0   492     4793       18      14       37             0 irqbalance
      08:38:17:[ 1602.575027] [  493]   998   493   134941        0      53     2131             0 polkitd
      08:38:17:[ 1602.577041] [  496]    81   496     9197        0      20      129          -900 dbus-daemon
      08:38:17:[ 1602.579085] [  509]   997   509    28962        0      28       98             0 chronyd
      08:38:17:[ 1602.581105] [  559]     0   559    67227        0      43      657             0 gssproxy
      08:38:17:[ 1602.583118] [  603]     0   603    64343        0      74      266             0 sssd
      08:38:17:[ 1602.585096] [  604]     0   604    90858        0     118      570             0 sssd_be
      08:38:17:[ 1602.587095] [  605]     0   605   127883        0      68     1074             0 NetworkManager
      08:38:17:[ 1602.589144] [  612]     0   612    30625        1      30      207             0 rpc.gssd
      08:38:17:[ 1602.591150] [  619]     0   619    65752        5      84      227             0 sssd_nss
      08:38:17:[ 1602.593151] [  620]     0   620    60653        0      73      238             0 sssd_pam
      08:38:17:[ 1602.595417] [  621]     0   621    59047        5      71      221             0 sssd_ssh
      08:38:17:[ 1602.598098] [  622]     0   622    67553        0      83      330             0 sssd_pac
      08:38:17:[ 1602.600145] [  635]     0   635     6048        1      16       77             0 systemd-logind
      08:38:17:[ 1602.602201] [  641]     0   641    28220        1      58     3120             0 dhclient
      08:38:17:[ 1602.604259] [  859]     0   859   138287       62      87     2611             0 tuned
      08:38:17:[ 1602.606279] [  860]     0   860    77477        0      46      285             0 rsyslogd
      08:38:17:[ 1602.608319] [  861]     0   861    26370        1      54      248         -1000 sshd
      08:38:17:[ 1602.610259] [  868]     0   868     6776        1      17       62             0 xinetd
      08:38:17:[ 1602.612235] [  875]   995   875    56858        0      23      263             0 munged
      08:38:17:[ 1602.614120] [  989]     0   989    22780        0      43      262             0 master
      08:38:17:[ 1602.615980] [  991]    89   991    25347        0      45      253             0 pickup
      08:38:17:[ 1602.617764] [  992]    89   992    25364        0      47      254             0 qmgr
      08:38:17:[ 1602.619536] [ 1007]     0  1007    31555        0      18      155             0 crond
      08:38:17:[ 1602.621255] [ 1009]     0  1009   149996        0      63      498             0 automount
      08:38:17:[ 1602.623021] [ 1012]     0  1012     6461        0      17       51             0 atd
      08:38:17:[ 1602.624633] [ 1043]     0  1043    27509        1      10       32             0 agetty
      08:38:17:[ 1602.626296] [ 1045]     0  1045    27509        1      12       32             0 agetty
      08:38:17:[ 1602.627878] [ 2247]     0  2247    40560        0      79      347             0 sshd
      08:38:17:[ 1602.629439] [ 2249]     0  2249    28283        0      14       58             0 run_test.sh
      08:38:17:[ 1602.630986] [ 2461]     0  2461    29424       28      15      671             0 bash
      08:38:17:[ 1602.632479] [22429]     0 22429    29424       23      13      675             0 bash
      08:38:17:[ 1602.633907] [22430]     0 22430    26975        0       9       27             0 tee
      08:38:17:[ 1602.635313] [22600]     0 22600    29793       39      14     1016             0 bash
      08:38:17:[ 1602.636706] [17680]     0 17680    29793       59      14      999             0 bash
      08:38:17:[ 1602.638107] [17681]     0 17681    26975        1       9       26             0 tee
      08:38:17:[ 1602.639460] [17855]     0 17855     2076        9       9       19             0 createmany
      08:38:17:[ 1602.640891] Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled
      
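      The final "Kernel panic" line explains why autotest only ever sees 'test failed to respond and timed out': the test nodes run with the vm.panic_on_oom sysctl enabled, so instead of killing a single process the OOM killer panics the whole client, which then stops responding. A minimal sketch for confirming that setting on a node (it reads the same sysctl the panic message names):

      #include <stdio.h>

      int main(void)
      {
              /* same knob as `sysctl vm.panic_on_oom` */
              FILE *f = fopen("/proc/sys/vm/panic_on_oom", "r");
              int val = -1;

              if (f) {
                      if (fscanf(f, "%d", &val) != 1)
                              val = -1;
                      fclose(f);
              }
              /* 0 = kill a task on OOM; 1 or 2 = panic the whole node */
              printf("vm.panic_on_oom = %d\n", val);
              return 0;
      }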

      Logs for these two cases are at:
      2017-05-16 - https://testing.hpdd.intel.com/test_sets/4f3d47bc-3a42-11e7-b81e-5254006e85c2
      2017-05-16 - https://testing.hpdd.intel.com/test_sets/6d1be1a2-3a43-11e7-b861-5254006e85c2


            People

              Assignee: yong.fan nasf (Inactive)
              Reporter: jamesanunez James Nunez (Inactive)
              Votes: 0
              Watchers: 10

              Dates

                Created:
                Updated:
                Resolved: