Lustre / LU-12864

sanity-benchmark test_iozone crashes with OOM on clients

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.13.0, Lustre 2.12.3, Lustre 2.14.0, Lustre 2.12.4, Lustre 2.12.6, Lustre 2.12.8, Lustre 2.16.0
    • Labels: None
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      sanity-benchmark test_iozone crashes with OOM on the client. This crash has been seen eight times in total, in both ARM and x86_64 client testing. The first occurrence was on 30 July 2019 for Lustre 2.12.2.101, and on 09 August 2019 for Lustre 2.12.56.87.

      Looking at the kernel crash for https://testing.whamcloud.com/test_sets/93a9c704-eb70-11e9-b62b-52540065bddc, we see ext4_filemap_fault in the call stack, which seems to set this crash apart from the other OOM crashes we have seen:

      [23529.881894] Lustre: DEBUG MARKER: == sanity-benchmark test iozone: iozone ============================================================== 22:48:27 (1570574907)
      [23532.537981] Lustre: DEBUG MARKER: /usr/sbin/lctl mark min OST has 1785584kB available, using 3074176kB file size
      [23533.130811] Lustre: DEBUG MARKER: min OST has 1785584kB available, using 3074176kB file size
      [23702.824787] in:imjournal invoked oom-killer: gfp_mask=0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null),  order=0, oom_score_adj=0
      [23702.841803] in:imjournal cpuset=/ mems_allowed=0
      [23702.844436] CPU: 0 PID: 937 Comm: in:imjournal Kdump: loaded Tainted: G           OE  ------------   4.14.0-115.2.2.el7a.aarch64 #1
      [23702.851192] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
      [23702.855331] Call trace:
      [23702.856924] [<ffff000008089e14>] dump_backtrace+0x0/0x23c
      [23702.860189] [<ffff00000808a074>] show_stack+0x24/0x2c
      [23702.863232] [<ffff000008855c28>] dump_stack+0x84/0xa8
      [23702.866282] [<ffff000008211fc0>] dump_header+0x94/0x1ec
      [23702.869476] [<ffff000008211e4c>] out_of_memory+0x430/0x484
      [23702.872747] [<ffff0000082179c4>] __alloc_pages_nodemask+0xa78/0xec0
      [23702.876522] [<ffff00000827a89c>] alloc_pages_current+0x8c/0xd8
      [23702.880039] [<ffff000008209eb8>] __page_cache_alloc+0x9c/0xd8
      [23702.883499] [<ffff00000820dc40>] filemap_fault+0x340/0x550
      [23702.887580] [<ffff000001405608>] ext4_filemap_fault+0x38/0x54 [ext4]
      [23702.891420] [<ffff00000824b364>] __do_fault+0x30/0xf4
      [23702.894459] [<ffff000008250130>] do_fault+0x3ec/0x4b8
      [23702.897517] [<ffff00000825178c>] __handle_mm_fault+0x3f4/0x560
      [23702.900998] [<ffff0000082519d8>] handle_mm_fault+0xe0/0x178
      [23702.904324] [<ffff000008872dc4>] do_page_fault+0x1c4/0x3cc
      [23702.907608] [<ffff00000887301c>] do_translation_fault+0x50/0x5c
      [23702.911152] [<ffff0000080813e8>] do_mem_abort+0x64/0xe4
      [23702.914390] [<ffff000008081568>] do_el0_ia_bp_hardening+0x94/0xb4
      [23702.918206] Exception stack(0xffff00000be2fec0 to 0xffff00000be30000)
      [23702.922205] fec0: 0000000000000000 0000000000000000 0000000000000000 0000ffff9768e6a0
      [23702.927072] fee0: 0000000000000002 0000000000000000 00000000ffffffbb 0000000000000000
      [23702.931975] ff00: 0000000000000049 003b9aca00000000 0000000000005c93 0000000028da3176
      [23702.936883] ff20: 0000000000000018 000000005d9d12e6 001d34ce80000000 0000a26c46000000
      [23702.941748] ff40: 0000ffff987ffae0 0000ffff98974ef0 0000000000000012 0000ffff987ff000
      [23702.946622] ff60: 00000000000dbba0 0000ffff987ff000 0000ffff900be4d0 0000ffff98830000
      [23702.951483] ff80: 000000000000b712 0000ffff900acef0 0000ffff9768e8a0 0000ffff98830000
      [23702.956372] ffa0: 0000000000000000 0000ffff9768e700 0000ffff987ca4e0 0000ffff9768e700
      [23702.961255] ffc0: 0000ffff987ca4e0 0000000080000000 0000ffff9768e720 00000000ffffffff
      [23702.966136] ffe0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
      [23702.971019] [<ffff0000080832a4>] el0_ia+0x1c/0x20
      [23702.974052] Mem-Info:
      [23702.975375] active_anon:0 inactive_anon:0 isolated_anon:0
       active_file:3693 inactive_file:15860 isolated_file:64
       unevictable:0 dirty:256 writeback:3416 unstable:0
       slab_reclaimable:351 slab_unreclaimable:1716
       mapped:4 shmem:0 pagetables:145 bounce:0
       free:1170 free_pcp:4 free_cma:0
      [23702.994215] Node 0 active_anon:0kB inactive_anon:0kB active_file:236352kB inactive_file:1014336kB unevictable:0kB isolated(anon):0kB isolated(file):4096kB mapped:256kB dirty:16384kB writeback:218624kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? yes
      [23703.010485] Node 0 DMA free:74880kB min:75328kB low:94144kB high:112960kB active_anon:0kB inactive_anon:0kB active_file:236352kB inactive_file:1012992kB unevictable:0kB writepending:235008kB present:2097152kB managed:1537088kB mlocked:0kB kernel_stack:10624kB pagetables:9280kB bounce:0kB free_pcp:256kB local_pcp:128kB free_cma:0kB
      [23703.028015] lowmem_reserve[]: 0 0 0
      [23703.030000] Node 0 DMA: 92*64kB (U) 5*128kB (U) 1*256kB (U) 3*512kB (U) 1*1024kB (U) 0*2048kB 0*4096kB 2*8192kB (U) 1*16384kB (U) 1*32768kB (U) 0*65536kB 0*131072kB 0*262144kB 0*524288kB = 74880kB
      [23703.040433] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
      [23703.045443] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=524288kB
      [23703.050373] 3700 total pagecache pages
      [23703.052649] 0 pages in swap cache
      [23703.054534] Swap cache stats: add 4640, delete 4640, find 319/523
      [23703.057997] Free swap  = 1826560kB
      [23703.059928] Total swap = 2098112kB
      [23703.061952] 32768 pages RAM
      [23703.063537] 0 pages HighMem/MovableOnly
      [23703.065741] 8751 pages reserved
      [23703.067542] 0 pages hwpoisoned
      [23703.069277] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
      [23703.074369] [  417]     0   417      237        0       3       2       39             0 systemd-journal
      [23703.079784] [  439]     0   439     1282        0       4       2       43             0 lvmetad
      [23703.085004] [  453]     0   453      243        2       4       2       41         -1000 systemd-udevd
      [23703.090284] [  541]     0   541      267        0       4       2       48         -1000 auditd
      [23703.095514] [  588]    81   588      160        2       3       2       59          -900 dbus-daemon
      [23703.100750] [  589]    32   589      185        0       4       2       75             0 rpcbind
      [23703.105767] [  592]     0   592     2458        0       4       2       55             0 gssproxy
      [23703.110859] [  600]     0   600     6767        0       5       2      172             0 NetworkManager
      [23703.116188] [  601]   999   601     8492        0       6       3      145             0 polkitd
      [23703.121264] [  602]     0   602       88        0       3       2       31             0 systemd-logind
      [23703.126603] [  603]     0   603       92        2       3       2       24             0 irqbalance
      [23703.131843] [  610]    38   610      160        2       4       2       47             0 ntpd
      [23703.136684] [  683]     0   683      359        2       3       2      113             0 dhclient
      [23703.141833] [  917]     0   917      323        2       4       2      104         -1000 sshd
      [23703.146688] [  921]     0   921     6888        1       5       2      318             0 tuned
      [23703.151673] [  923]     0   923       91        2       3       2       29             0 xinetd
      [23703.156625] [  924]     0   924     3762        1       4       2       99             0 rsyslogd
      [23703.161769] [  930]   997   930     3200        0       3       2       51             0 munged
      [23703.166723] [  938]    29   938      130        2       3       2       51             0 rpc.statd
      [23703.171907] [  980]     0   980     9711        0       5       2      156             0 automount
      [23703.177064] [  986]     0   986     1756        0       4       2       39             0 crond
      [23703.182076] [  988]     0   988       78        0       4       2       29             0 atd
      [23703.186875] [ 1003]     0  1003     1718        2       3       2       10             0 agetty
      [23703.191913] [ 1005]     0  1005     1718        2       3       2       10             0 agetty
      [23703.196881] [ 1521]     0  1521      344        0       4       2       85             0 master
      [23703.201935] [ 1565]    89  1565      347        2       4       2       81             0 qmgr
      [23703.206802] [ 9394]     0  9394      392        0       4       2      149             0 sshd
      [23703.211760] [ 9396]     0  9396     1739        0       3       2       14             0 run_test.sh
      [23703.216936] [ 9702]     0  9702     1788        2       3       2       63             0 bash
      [23703.221846] [22149]    89 22149      346        0       4       2       81             0 pickup
      [23703.226825] [26394]     0 26394     1788        1       3       2       63             0 bash
      [23703.231749] [26395]     0 26395     1715        1       4       2        8             0 tee
      [23703.236525] [26592]     0 26592     1785        2       3       2       61             0 bash
      [23703.241464] [31996]     0 31996     1788        1       3       2       63             0 bash
      [23703.246327] [31997]     0 31997     1715        1       4       2        9             0 tee
      [23703.251181] [32459]   500 32459      685        0       4       2      284             0 iozone
      [23703.256109] [32460]     0 32460     1715        1       4       2        8             0 tee
      [23703.260934] Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled
      
      [23703.266467] CPU: 0 PID: 937 Comm: in:imjournal Kdump: loaded Tainted: G           OE  ------------   4.14.0-115.2.2.el7a.aarch64 #1
      [23703.273347] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
      [23703.277474] Call trace:
      [23703.278946] [<ffff000008089e14>] dump_backtrace+0x0/0x23c
      [23703.282220] [<ffff00000808a074>] show_stack+0x24/0x2c
      [23703.285224] [<ffff000008855c28>] dump_stack+0x84/0xa8
      [23703.288434] [<ffff0000080d4e5c>] panic+0x138/0x2a0
      [23703.291316] [<ffff000008211e70>] out_of_memory+0x454/0x484
      [23703.294619] [<ffff0000082179c4>] __alloc_pages_nodemask+0xa78/0xec0
      [23703.298371] [<ffff00000827a89c>] alloc_pages_current+0x8c/0xd8
      [23703.301874] [<ffff000008209eb8>] __page_cache_alloc+0x9c/0xd8
      [23703.305324] [<ffff00000820dc40>] filemap_fault+0x340/0x550
      [23703.308897] [<ffff000001405608>] ext4_filemap_fault+0x38/0x54 [ext4]
      [23703.312710] [<ffff00000824b364>] __do_fault+0x30/0xf4
      [23703.315715] [<ffff000008250130>] do_fault+0x3ec/0x4b8
      [23703.318783] [<ffff00000825178c>] __handle_mm_fault+0x3f4/0x560
      [23703.322271] [<ffff0000082519d8>] handle_mm_fault+0xe0/0x178
      [23703.325625] [<ffff000008872dc4>] do_page_fault+0x1c4/0x3cc
      [23703.328906] [<ffff00000887301c>] do_translation_fault+0x50/0x5c
      [23703.332421] [<ffff0000080813e8>] do_mem_abort+0x64/0xe4
      [23703.335530] [<ffff000008081568>] do_el0_ia_bp_hardening+0x94/0xb4
      [23703.339191] Exception stack(0xffff00000be2fec0 to 0xffff00000be30000)
      [23703.343081] fec0: 0000000000000000 0000000000000000 0000000000000000 0000ffff9768e6a0
      [23703.347762] fee0: 0000000000000002 0000000000000000 00000000ffffffbb 0000000000000000
      [23703.352444] ff00: 0000000000000049 003b9aca00000000 0000000000005c93 0000000028da3176
      [23703.357175] ff20: 0000000000000018 000000005d9d12e6 001d34ce80000000 0000a26c46000000
      [23703.361853] ff40: 0000ffff987ffae0 0000ffff98974ef0 0000000000000012 0000ffff987ff000
      [23703.366551] ff60: 00000000000dbba0 0000ffff987ff000 0000ffff900be4d0 0000ffff98830000
      [23703.371220] ff80: 000000000000b712 0000ffff900acef0 0000ffff9768e8a0 0000ffff98830000
      [23703.375942] ffa0: 0000000000000000 0000ffff9768e700 0000ffff987ca4e0 0000ffff9768e700
      [23703.380639] ffc0: 0000ffff987ca4e0 0000000080000000 0000ffff9768e720 00000000ffffffff
      [23703.385333] ffe0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
      [23703.390039] [<ffff0000080832a4>] el0_ia+0x1c/0x20
      [23703.392919] SMP: stopping secondary CPUs
      [23703.398529] Starting crashdump kernel...
      [23703.400804] Bye!
      

      Logs for other crashes are at:
      https://testing.whamcloud.com/test_sets/a2546caa-d315-11e9-9fc9-52540065bddc
      https://testing.whamcloud.com/test_sets/6bc5196c-bb4d-11e9-a25b-52540065bddc
      https://testing.whamcloud.com/test_sets/d4d9a03c-c046-11e9-97d5-52540065bddc
      https://testing.whamcloud.com/test_sets/eecba4da-e577-11e9-a197-52540065bddc
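
The Mem-Info numbers in the crash log above can be cross-checked with a few lines of Python. This is only a rough sketch for triage, not part of the test suite: the helper names (`first_int`, `summarize`) are made up for illustration, the `LOG` string is an abridged copy of lines from this ticket ("..." marks fields left out), and the page size is inferred from `present` divided by the "pages RAM" count, which assumes the single memory zone shown in this log.

```python
import re

# Lines copied from the crash log in this ticket; "..." marks fields
# omitted here for brevity (an abridgement, not part of the real log).
LOG = """\
[23532.537981] Lustre: DEBUG MARKER: /usr/sbin/lctl mark min OST has 1785584kB available, using 3074176kB file size
[23702.975375] active_anon:0 inactive_anon:0 isolated_anon:0
 free:1170 free_pcp:4 free_cma:0
[23703.010485] Node 0 DMA free:74880kB min:75328kB low:94144kB high:112960kB ... present:2097152kB managed:1537088kB
[23703.061952] 32768 pages RAM
"""


def first_int(pattern, text):
    """Return the first integer captured by `pattern` in `text`."""
    match = re.search(pattern, text)
    if match is None:
        raise ValueError(f"pattern not found: {pattern}")
    return int(match.group(1))


def summarize(log):
    """Derive page size, watermark deficit and file/memory ratio from the log."""
    free_pages = first_int(r"\bfree:(\d+)\b", log)    # Mem-Info count, in pages
    free_kb = first_int(r"\bfree:(\d+)kB", log)       # per-zone value, in kB
    min_kb = first_int(r"\bmin:(\d+)kB", log)         # zone min watermark
    present_kb = first_int(r"\bpresent:(\d+)kB", log)
    managed_kb = first_int(r"\bmanaged:(\d+)kB", log)
    ram_pages = first_int(r"(\d+) pages RAM", log)
    file_kb = first_int(r"using (\d+)kB file size", log)

    # Assumption: one zone, so present memory / RAM pages gives the page size.
    page_kb = present_kb // ram_pages
    return {
        "page_kb": page_kb,
        "free_matches": free_pages * page_kb == free_kb,
        "below_min": free_kb < min_kb,
        "deficit_kb": min_kb - free_kb,
        "file_to_managed": file_kb / managed_kb,
    }


if __name__ == "__main__":
    for key, value in summarize(LOG).items():
        print(f"{key}: {value}")
```

With the values from this log, the sketch reports 64 kB pages, free memory 448 kB below the zone's min watermark at the time of the OOM, and an iozone file size exactly twice the client's managed memory. It does not by itself explain why the page cache could not be reclaimed.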

            People

              bobijam Zhenyu Xu
              jamesanunez James Nunez (Inactive)
              Votes: 0
              Watchers: 13