Lustre / LU-11878

sanity test 103b: OOM because of too many bash processes: page allocation stalls for 18420ms


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.13.0, Lustre 2.12.1
    • Affects Version/s: Lustre 2.12.0, Lustre 2.13.0
    • Severity: 3

    Description

      This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

      This issue relates to the following test suite run:
      https://testing.whamcloud.com/test_sets/adfb4bd4-1978-11e9-8388-52540065bddc

      The test_103b code runs 512 parallel bash processes to verify that different umask values work properly. On the x86 clients there is either not as much kernel debugging enabled, or the smaller pages (== smaller stacks) don't cause as much grief. On ARM the client crashes because of slow allocation and OOM, with the following stack trace:

      [ 5945.554571] bash: page allocation stalls for 18420ms, order:0, mode:0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null)
      [ 5945.562347] bash cpuset=/ mems_allowed=0
      [ 5945.564625] CPU: 1 PID: 20442 Comm: bash Kdump: loaded Tainted: G           OE  ------------   4.14.0-115.2.2.el7a.aarch64 #1
      [ 5945.578547] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
      [ 5945.586497] Call trace:
      [ 5945.588107] [<ffff000008089e14>] dump_backtrace+0x0/0x23c
      [ 5945.599468] [<ffff00000808a074>] show_stack+0x24/0x2c
      [ 5945.603148] [<ffff000008855c28>] dump_stack+0x84/0xa8
      [ 5945.606676] [<ffff000008216e34>] warn_alloc+0x11c/0x1ac
      [ 5945.614536] [<ffff000008217ddc>] __alloc_pages_nodemask+0xe90/0xec0
      [ 5945.624463] [<ffff00000827bca4>] alloc_pages_vma+0x90/0x1c0
      [ 5945.628873] [<ffff00000824b574>] wp_page_copy+0x94/0x670
      [ 5945.633271] [<ffff00000824ea40>] do_wp_page+0xbc/0x63c
      [ 5945.639748] [<ffff000008251868>] __handle_mm_fault+0x4d0/0x560
      [ 5945.650364] [<ffff0000082519d8>] handle_mm_fault+0xe0/0x178
      [ 5945.655960] [<ffff000008872dc4>] do_page_fault+0x1c4/0x3cc
      [ 5945.663762] [<ffff0000080813e8>] do_mem_abort+0x64/0xe4
      [ 5945.756137] Mem-Info:
      [ 5945.759687] active_anon:4916 inactive_anon:4896 isolated_anon:584
       active_file:65 inactive_file:50 isolated_file:0
       unevictable:0 dirty:0 writeback:58 unstable:0
       slab_reclaimable:353 slab_unreclaimable:2005
       mapped:86 shmem:5 pagetables:4117 bounce:0
       free:2810 free_pcp:10 free_cma:0
      [ 5945.783426] Node 0 active_anon:307392kB inactive_anon:307648kB active_file:2752kB inactive_file:3200kB unevictable:0kB isolated(anon):37376kB isolated(file):0kB mapped:5504kB dirty:0kB writeback:2368kB shmem:320kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
      [ 5945.800403] Node 0 DMA free:195968kB min:75520kB low:94400kB high:113280kB active_anon:309184kB inactive_anon:311360kB active_file:4928kB inactive_file:4608kB unevictable:0kB writepending:0kB present:2097152kB managed:1537088kB mlocked:0kB kernel_stack:76544kB pagetables:263488kB bounce:0kB free_pcp:640kB local_pcp:320kB free_cma:0kB
      [ 5945.817944] lowmem_reserve[]: 0 0 0
      [ 5945.820200] Node 0 DMA: 1794*64kB (U) 236*128kB (U) 36*256kB (U) 1*512kB (U) 0*1024kB 1*2048kB (U) 0*4096kB 0*8192kB 1*16384kB (U) 1*32768kB (U) 0*65536kB 0*131072kB 0*262144kB 0*524288kB = 205952kB
      [ 5945.830444] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
      [ 5945.835293] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=524288kB
      [ 5945.845101] 1568 total pagecache pages
      [ 5945.850497] 1505 pages in swap cache
      [ 5945.854516] Swap cache stats: add 131425, delete 129953, find 93983/140484
      [ 5945.861924] Free swap  = 208256kB
      [ 5945.864822] Total swap = 2098112kB
      [ 5945.867189] 32768 pages RAM
      [ 5945.869354] 0 pages HighMem/MovableOnly
      [ 5945.873040] 8751 pages reserved
      [ 5945.876243] 0 pages hwpoisoned
      [ 5979.408965] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
      [ 5979.414229] [ 1334]     0  1334      237        3       4       2       37             0 systemd-journal
      [ 5979.419778] [ 1354]     0  1354     1282        0       4       2       43             0 lvmetad
      [ 5979.425682] [ 1364]     0  1364      243        2       4       2       42         -1000 systemd-udevd
      :
      :
      [ 5979.569754] [11382]     0 11382     1739        0       4       2       15             0 run_test.sh
      [ 5979.575016] [11652]     0 11652     1785        2       3       2       62             0 bash
      [ 5979.579985] [19821]     0 19821     1785        1       3       2       62             0 bash
      [ 5979.584878] [19822]     0 19822     1715        1       3       2        8             0 tee
      [ 5979.589861] [20003]     0 20003     1828        2       4       2      104             0 bash
      [ 5979.594729] [15391]     0 15391     1743        1       5       2       23             0 anacron
      [ 5979.599854] [17647]     0 17647     1834        4       4       2      108             0 bash
      [ 5979.604748] [17648]     0 17648     1715        1       4       2        9             0 tee
      [ 5979.609712] [17832]     0 17832     1831       30       4       2       76             0 bash
      [ 5979.614600] [17834]     0 17834     1831        0       4       2      109             0 bash
      [ 5979.619561] [17835]     0 17835     1828        9       4       2       97             0 bash
      [ 5979.624770] [17836]     0 17836     1831       23       4       2       89             0 bash
      [ 5979.629739] [17841]     0 17841     1828        0       4       2      109             0 bash
      :
      :
      [ 5986.229441] [22230]     0 22230     1831       24       4       2       83             0 bash
      [ 5986.234602] [22231]     0 22231     1828       26       4       2       79             0 bash
      [ 5986.239474] [22232]     0 22232     1834       24       4       2       86             0 bash
      [ 5986.244709] [22233]     0 22233     1831       15       4       2       92             0 bash
      [ 5986.249630] [22234]     0 22234     1831       20       4       2       86             0 bash
      [ 5986.254535] [22235]     0 22235     1834       22       4       2       88             0 bash
      [ 5986.259377] Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled
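
      The pattern test_103b exercises can be sketched roughly as follows. This is a hypothetical illustration, not the actual sanity.sh test code; the job count is kept small here, whereas the real test spawns 512 jobs, which is what exhausted memory on the ARM VM:

      ```shell
      #!/bin/bash
      # Hypothetical sketch of the test_103b pattern: spawn many bash
      # subshells in parallel, each with its own umask, and verify the
      # mode of a file created under that umask. NUM_JOBS is small here;
      # the real test uses 512, enough to OOM a small ARM client.
      NUM_JOBS=8
      dir=$(mktemp -d)
      for ((u = 0; u < NUM_JOBS; u++)); do
          (
              mask=$(printf '%03o' "$u")          # e.g. 000, 001, ... 007
              umask "$mask"
              touch "$dir/file.$u"
              got=$(stat -c '%a' "$dir/file.$u")  # actual mode, in octal
              want=$(printf '%o' $((0666 & ~u)))  # default 0666 minus mask
              [ "$got" = "$want" ] || echo "umask $mask: got $got, want $want"
          ) &
      done
      wait            # all subshells run concurrently, as in the real test
      echo "checked $NUM_JOBS umask values"
      rm -rf "$dir"
      ```

      Each backgrounded subshell carries its own bash process, stack, and page tables, which is why scaling this to 512 jobs on a 64K-page ARM kernel puts so much pressure on the allocator.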
      

      It was initially a bit of a surprise that there was any swap in use, since Lustre runs in the kernel and cannot be swapped out; this space is instead consumed by the many (nearly 1000) bash processes running on the node, along with many lfs and rm processes.
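
      Where the swap is going can be confirmed with a generic Linux /proc walk (a sketch, not something from this ticket; it relies on the per-process VmSwap field in /proc/<pid>/status, available since kernel 2.6.34):

      ```shell
      #!/bin/bash
      # Sum the swapped-out memory of all running bash processes by
      # reading the VmSwap line from /proc/<pid>/status (Linux-specific).
      total=0
      for status in /proc/[0-9]*/status; do
          name=$(awk '/^Name:/ {print $2}' "$status" 2>/dev/null)
          [ "$name" = "bash" ] || continue
          kb=$(awk '/^VmSwap:/ {print $2}' "$status" 2>/dev/null)
          total=$((total + ${kb:-0}))   # kb is empty if nothing is swapped
      done
      echo "bash processes are using ${total} kB of swap"
      ```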

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      sanity test_103b - onyx-90vm17 crashed during sanity test_103b

People

    Assignee: adilger Andreas Dilger
    Reporter: maloo Maloo
    Votes: 0
    Watchers: 4
