
[LU-12864] sanity-benchmark test_iozone crashes with OOM on clients

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Affects Version/s: Lustre 2.13.0, Lustre 2.12.3, Lustre 2.14.0, Lustre 2.12.4, Lustre 2.12.6, Lustre 2.12.8, Lustre 2.16.0
    • Fix Version/s: None
    • Labels: None
    • Severity: 3

    Description

      sanity-benchmark test_iozone crashes with an OOM. This crash has been seen a total of eight times in ARM and x86_64 client testing. It first occurred on 30 July 2019 for Lustre 2.12.2.101 and on 9 August 2019 for Lustre 2.12.56.87.

      Looking at the kernel crash for https://testing.whamcloud.com/test_sets/93a9c704-eb70-11e9-b62b-52540065bddc, we see ext4_filemap_fault in the call stack, which distinguishes it from the other OOM crashes we've seen:
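
      To check which of the reported crashes share this signature, the console logs from each test_sets URL can be scanned for the distinguishing frame. A minimal sketch, assuming the logs have been downloaded locally as plain text (the local file names here are illustrative, not part of the autotest output):

      # list which downloaded crash logs contain the distinguishing ext4 frame
      grep -l 'ext4_filemap_fault' *console*.log
      # show the surrounding call stack for context
      grep -B 2 -A 4 'ext4_filemap_fault' *console*.log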

      [23529.881894] Lustre: DEBUG MARKER: == sanity-benchmark test iozone: iozone ============================================================== 22:48:27 (1570574907)
      [23532.537981] Lustre: DEBUG MARKER: /usr/sbin/lctl mark min OST has 1785584kB available, using 3074176kB file size
      [23533.130811] Lustre: DEBUG MARKER: min OST has 1785584kB available, using 3074176kB file size
      [23702.824787] in:imjournal invoked oom-killer: gfp_mask=0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null),  order=0, oom_score_adj=0
      [23702.841803] in:imjournal cpuset=/ mems_allowed=0
      [23702.844436] CPU: 0 PID: 937 Comm: in:imjournal Kdump: loaded Tainted: G           OE  ------------   4.14.0-115.2.2.el7a.aarch64 #1
      [23702.851192] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
      [23702.855331] Call trace:
      [23702.856924] [<ffff000008089e14>] dump_backtrace+0x0/0x23c
      [23702.860189] [<ffff00000808a074>] show_stack+0x24/0x2c
      [23702.863232] [<ffff000008855c28>] dump_stack+0x84/0xa8
      [23702.866282] [<ffff000008211fc0>] dump_header+0x94/0x1ec
      [23702.869476] [<ffff000008211e4c>] out_of_memory+0x430/0x484
      [23702.872747] [<ffff0000082179c4>] __alloc_pages_nodemask+0xa78/0xec0
      [23702.876522] [<ffff00000827a89c>] alloc_pages_current+0x8c/0xd8
      [23702.880039] [<ffff000008209eb8>] __page_cache_alloc+0x9c/0xd8
      [23702.883499] [<ffff00000820dc40>] filemap_fault+0x340/0x550
      [23702.887580] [<ffff000001405608>] ext4_filemap_fault+0x38/0x54 [ext4]
      [23702.891420] [<ffff00000824b364>] __do_fault+0x30/0xf4
      [23702.894459] [<ffff000008250130>] do_fault+0x3ec/0x4b8
      [23702.897517] [<ffff00000825178c>] __handle_mm_fault+0x3f4/0x560
      [23702.900998] [<ffff0000082519d8>] handle_mm_fault+0xe0/0x178
      [23702.904324] [<ffff000008872dc4>] do_page_fault+0x1c4/0x3cc
      [23702.907608] [<ffff00000887301c>] do_translation_fault+0x50/0x5c
      [23702.911152] [<ffff0000080813e8>] do_mem_abort+0x64/0xe4
      [23702.914390] [<ffff000008081568>] do_el0_ia_bp_hardening+0x94/0xb4
      [23702.918206] Exception stack(0xffff00000be2fec0 to 0xffff00000be30000)
      [23702.922205] fec0: 0000000000000000 0000000000000000 0000000000000000 0000ffff9768e6a0
      [23702.927072] fee0: 0000000000000002 0000000000000000 00000000ffffffbb 0000000000000000
      [23702.931975] ff00: 0000000000000049 003b9aca00000000 0000000000005c93 0000000028da3176
      [23702.936883] ff20: 0000000000000018 000000005d9d12e6 001d34ce80000000 0000a26c46000000
      [23702.941748] ff40: 0000ffff987ffae0 0000ffff98974ef0 0000000000000012 0000ffff987ff000
      [23702.946622] ff60: 00000000000dbba0 0000ffff987ff000 0000ffff900be4d0 0000ffff98830000
      [23702.951483] ff80: 000000000000b712 0000ffff900acef0 0000ffff9768e8a0 0000ffff98830000
      [23702.956372] ffa0: 0000000000000000 0000ffff9768e700 0000ffff987ca4e0 0000ffff9768e700
      [23702.961255] ffc0: 0000ffff987ca4e0 0000000080000000 0000ffff9768e720 00000000ffffffff
      [23702.966136] ffe0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
      [23702.971019] [<ffff0000080832a4>] el0_ia+0x1c/0x20
      [23702.974052] Mem-Info:
      [23702.975375] active_anon:0 inactive_anon:0 isolated_anon:0
       active_file:3693 inactive_file:15860 isolated_file:64
       unevictable:0 dirty:256 writeback:3416 unstable:0
       slab_reclaimable:351 slab_unreclaimable:1716
       mapped:4 shmem:0 pagetables:145 bounce:0
       free:1170 free_pcp:4 free_cma:0
      [23702.994215] Node 0 active_anon:0kB inactive_anon:0kB active_file:236352kB inactive_file:1014336kB unevictable:0kB isolated(anon):0kB isolated(file):4096kB mapped:256kB dirty:16384kB writeback:218624kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? yes
      [23703.010485] Node 0 DMA free:74880kB min:75328kB low:94144kB high:112960kB active_anon:0kB inactive_anon:0kB active_file:236352kB inactive_file:1012992kB unevictable:0kB writepending:235008kB present:2097152kB managed:1537088kB mlocked:0kB kernel_stack:10624kB pagetables:9280kB bounce:0kB free_pcp:256kB local_pcp:128kB free_cma:0kB
      [23703.028015] lowmem_reserve[]: 0 0 0
      [23703.030000] Node 0 DMA: 92*64kB (U) 5*128kB (U) 1*256kB (U) 3*512kB (U) 1*1024kB (U) 0*2048kB 0*4096kB 2*8192kB (U) 1*16384kB (U) 1*32768kB (U) 0*65536kB 0*131072kB 0*262144kB 0*524288kB = 74880kB
      [23703.040433] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
      [23703.045443] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=524288kB
      [23703.050373] 3700 total pagecache pages
      [23703.052649] 0 pages in swap cache
      [23703.054534] Swap cache stats: add 4640, delete 4640, find 319/523
      [23703.057997] Free swap  = 1826560kB
      [23703.059928] Total swap = 2098112kB
      [23703.061952] 32768 pages RAM
      [23703.063537] 0 pages HighMem/MovableOnly
      [23703.065741] 8751 pages reserved
      [23703.067542] 0 pages hwpoisoned
      [23703.069277] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
      [23703.074369] [  417]     0   417      237        0       3       2       39             0 systemd-journal
      [23703.079784] [  439]     0   439     1282        0       4       2       43             0 lvmetad
      [23703.085004] [  453]     0   453      243        2       4       2       41         -1000 systemd-udevd
      [23703.090284] [  541]     0   541      267        0       4       2       48         -1000 auditd
      [23703.095514] [  588]    81   588      160        2       3       2       59          -900 dbus-daemon
      [23703.100750] [  589]    32   589      185        0       4       2       75             0 rpcbind
      [23703.105767] [  592]     0   592     2458        0       4       2       55             0 gssproxy
      [23703.110859] [  600]     0   600     6767        0       5       2      172             0 NetworkManager
      [23703.116188] [  601]   999   601     8492        0       6       3      145             0 polkitd
      [23703.121264] [  602]     0   602       88        0       3       2       31             0 systemd-logind
      [23703.126603] [  603]     0   603       92        2       3       2       24             0 irqbalance
      [23703.131843] [  610]    38   610      160        2       4       2       47             0 ntpd
      [23703.136684] [  683]     0   683      359        2       3       2      113             0 dhclient
      [23703.141833] [  917]     0   917      323        2       4       2      104         -1000 sshd
      [23703.146688] [  921]     0   921     6888        1       5       2      318             0 tuned
      [23703.151673] [  923]     0   923       91        2       3       2       29             0 xinetd
      [23703.156625] [  924]     0   924     3762        1       4       2       99             0 rsyslogd
      [23703.161769] [  930]   997   930     3200        0       3       2       51             0 munged
      [23703.166723] [  938]    29   938      130        2       3       2       51             0 rpc.statd
      [23703.171907] [  980]     0   980     9711        0       5       2      156             0 automount
      [23703.177064] [  986]     0   986     1756        0       4       2       39             0 crond
      [23703.182076] [  988]     0   988       78        0       4       2       29             0 atd
      [23703.186875] [ 1003]     0  1003     1718        2       3       2       10             0 agetty
      [23703.191913] [ 1005]     0  1005     1718        2       3       2       10             0 agetty
      [23703.196881] [ 1521]     0  1521      344        0       4       2       85             0 master
      [23703.201935] [ 1565]    89  1565      347        2       4       2       81             0 qmgr
      [23703.206802] [ 9394]     0  9394      392        0       4       2      149             0 sshd
      [23703.211760] [ 9396]     0  9396     1739        0       3       2       14             0 run_test.sh
      [23703.216936] [ 9702]     0  9702     1788        2       3       2       63             0 bash
      [23703.221846] [22149]    89 22149      346        0       4       2       81             0 pickup
      [23703.226825] [26394]     0 26394     1788        1       3       2       63             0 bash
      [23703.231749] [26395]     0 26395     1715        1       4       2        8             0 tee
      [23703.236525] [26592]     0 26592     1785        2       3       2       61             0 bash
      [23703.241464] [31996]     0 31996     1788        1       3       2       63             0 bash
      [23703.246327] [31997]     0 31997     1715        1       4       2        9             0 tee
      [23703.251181] [32459]   500 32459      685        0       4       2      284             0 iozone
      [23703.256109] [32460]     0 32460     1715        1       4       2        8             0 tee
      [23703.260934] Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled
      
      [23703.266467] CPU: 0 PID: 937 Comm: in:imjournal Kdump: loaded Tainted: G           OE  ------------   4.14.0-115.2.2.el7a.aarch64 #1
      [23703.273347] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
      [23703.277474] Call trace:
      [23703.278946] [<ffff000008089e14>] dump_backtrace+0x0/0x23c
      [23703.282220] [<ffff00000808a074>] show_stack+0x24/0x2c
      [23703.285224] [<ffff000008855c28>] dump_stack+0x84/0xa8
      [23703.288434] [<ffff0000080d4e5c>] panic+0x138/0x2a0
      [23703.291316] [<ffff000008211e70>] out_of_memory+0x454/0x484
      [23703.294619] [<ffff0000082179c4>] __alloc_pages_nodemask+0xa78/0xec0
      [23703.298371] [<ffff00000827a89c>] alloc_pages_current+0x8c/0xd8
      [23703.301874] [<ffff000008209eb8>] __page_cache_alloc+0x9c/0xd8
      [23703.305324] [<ffff00000820dc40>] filemap_fault+0x340/0x550
      [23703.308897] [<ffff000001405608>] ext4_filemap_fault+0x38/0x54 [ext4]
      [23703.312710] [<ffff00000824b364>] __do_fault+0x30/0xf4
      [23703.315715] [<ffff000008250130>] do_fault+0x3ec/0x4b8
      [23703.318783] [<ffff00000825178c>] __handle_mm_fault+0x3f4/0x560
      [23703.322271] [<ffff0000082519d8>] handle_mm_fault+0xe0/0x178
      [23703.325625] [<ffff000008872dc4>] do_page_fault+0x1c4/0x3cc
      [23703.328906] [<ffff00000887301c>] do_translation_fault+0x50/0x5c
      [23703.332421] [<ffff0000080813e8>] do_mem_abort+0x64/0xe4
      [23703.335530] [<ffff000008081568>] do_el0_ia_bp_hardening+0x94/0xb4
      [23703.339191] Exception stack(0xffff00000be2fec0 to 0xffff00000be30000)
      [23703.343081] fec0: 0000000000000000 0000000000000000 0000000000000000 0000ffff9768e6a0
      [23703.347762] fee0: 0000000000000002 0000000000000000 00000000ffffffbb 0000000000000000
      [23703.352444] ff00: 0000000000000049 003b9aca00000000 0000000000005c93 0000000028da3176
      [23703.357175] ff20: 0000000000000018 000000005d9d12e6 001d34ce80000000 0000a26c46000000
      [23703.361853] ff40: 0000ffff987ffae0 0000ffff98974ef0 0000000000000012 0000ffff987ff000
      [23703.366551] ff60: 00000000000dbba0 0000ffff987ff000 0000ffff900be4d0 0000ffff98830000
      [23703.371220] ff80: 000000000000b712 0000ffff900acef0 0000ffff9768e8a0 0000ffff98830000
      [23703.375942] ffa0: 0000000000000000 0000ffff9768e700 0000ffff987ca4e0 0000ffff9768e700
      [23703.380639] ffc0: 0000ffff987ca4e0 0000000080000000 0000ffff9768e720 00000000ffffffff
      [23703.385333] ffe0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
      [23703.390039] [<ffff0000080832a4>] el0_ia+0x1c/0x20
      [23703.392919] SMP: stopping secondary CPUs
      [23703.398529] Starting crashdump kernel...
      [23703.400804] Bye!
      

      Logs for other crashes are at:
      https://testing.whamcloud.com/test_sets/a2546caa-d315-11e9-9fc9-52540065bddc
      https://testing.whamcloud.com/test_sets/6bc5196c-bb4d-11e9-a25b-52540065bddc
      https://testing.whamcloud.com/test_sets/d4d9a03c-c046-11e9-97d5-52540065bddc
      https://testing.whamcloud.com/test_sets/eecba4da-e577-11e9-a197-52540065bddc

      Attachments

      Issue Links

      Activity

            [LU-12864] sanity-benchmark test_iozone crashes with OOM on clients

            A client running 2.15.51.45 against 2.12.9 servers crashed with the following allocation stats, so the memory is mostly in inactive_file:
            https://testing.whamcloud.com/test_sets/150d06df-aeae-436a-ba86-fc5082d70706

            [ 8984.082842] Mem-Info:
            [ 8984.083352] active_anon:0 inactive_anon:0 isolated_anon:0
            [ 8984.083352]  active_file:108414 inactive_file:517320 isolated_file:576
            [ 8984.083352]  unevictable:0 dirty:73728 writeback:65408
            [ 8984.083352]  slab_reclaimable:5600 slab_unreclaimable:22668
            [ 8984.088978] Node 0 active_anon:0kB inactive_anon:0kB active_file:448656kB inactive_file:2054532kB unevictable:0kB isolated(anon):0kB isolated(file):2048kB mapped:1036kB dirty:294912kB writeback:261632kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB kernel_stack:2768kB pagetables:8488kB all_unreclaimable? yes
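
            As a rough sanity check on those numbers (assuming the standard 4kB x86_64 page size), the page counts in the summary line line up with the per-node kB figures:

            # inactive_file is reported in pages above and in kB below
            echo $((517320 * 4))   # => 2069280kB, vs. inactive_file:2054532kB
            # i.e. nearly all of the client's ~2GB of RAM was sitting in
            # unreclaimed page cache at the time of the OOM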
            
            adilger Andreas Dilger added a comment

            This crash only happens with EL7.9 kernels (tested only on b2_12, with master in interop mode) and not with EL8.5, SLES15, etc. kernels, so it is unlikely to be a prime concern going forward and could eventually be closed.

            adilger Andreas Dilger added a comment

            It looks like the memory is at least accounted for in this client-side OOM. The vast majority of memory usage is in pagecache:

            active_anon:37356kB inactive_anon:37596kB active_file:47132kB inactive_file:2391964kB
            

            and the client is trying to write a 5502664kB file with only 3145312kB of RAM. That by itself shouldn't cause the client to run out of memory, since it should have llite.*.max_cached_mb=1535 (1/2 of RAM). However, osc.*.max_dirty_mb may be as large as 1/8 of RAM per OSC, and there are 7 OSTs, so dirty data could consume up to 7/8 of RAM (per client_adjust_max_dirty()) if it is not also limited by max_cached_mb?
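
            A minimal sketch of checking that theory on a live client (the lctl parameter names are the real tunables discussed above; the awk aggregation is just illustrative):

            # per-filesystem cap on cached pages, in MB
            lctl get_param llite.*.max_cached_mb
            # per-OSC cap on dirty pages, in MB; the worst case across
            # 7 OSTs is the sum of all of these
            lctl get_param -n osc.*.max_dirty_mb |
                awk '{sum += $1} END {print sum, "MB aggregate dirty limit"}'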

            adilger Andreas Dilger added a comment
            sarah Sarah Liu added a comment - Hit a similar one: https://testing.whamcloud.com/test_sets/390f3fdd-df3e-4938-9dd3-ed327f57076f
            yujian Jian Yu added a comment - More failures on master branch in ZFS test sessions:
            https://testing.whamcloud.com/test_sets/df51b436-412f-11ea-9630-52540065bddc
            https://testing.whamcloud.com/test_sets/a8e429aa-4130-11ea-9847-52540065bddc
            https://testing.whamcloud.com/test_sets/8359ad34-415a-11ea-9847-52540065bddc
            https://testing.whamcloud.com/test_sets/ae637af0-4164-11ea-9b1e-52540065bddc

            Here's another sanity-benchmark test_iozone OOM crash with a different trace: https://testing.whamcloud.com/test_sets/cecc9558-081e-11ea-bbc3-52540065bddc . Possibly a different root cause?

            [24986.500955] Lustre: DEBUG MARKER: == sanity-benchmark test iozone: iozone ============================================================== 10:05:17 (1573725917)
            [24989.359411] Lustre: DEBUG MARKER: /usr/sbin/lctl mark min OST has 1782612kB available, using 3438440kB file size
            [24989.610198] Lustre: DEBUG MARKER: min OST has 1782612kB available, using 3438440kB file size
            [25048.708540] LNetError: 18994:0:(peer.c:3724:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.5.216@tcp added to recovery queue. Health = 900
            [25049.117039] LNetError: 18996:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.9.5.214@tcp added to recovery queue. Health = 900
            [25049.119238] LustreError: 18996:0:(events.c:200:client_bulk_callback()) event type 1, status -5, desc ffff95ae7ad2bc00
            [25049.127483] Lustre: 19001:0:(client.c:2219:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1573725975/real 1573725975]  req@ffff95ae001c9f80 x1650168582431232/t0(0) o4->lustre-OST0005-osc-ffff95ae65a62000@10.9.5.216@tcp:6/4 lens 488/448 e 0 to 1 dl 1573726020 ref 2 fl Rpc:eXQr/0/ffffffff rc 0/-1 job:'iozone.500'
            [25049.132801] Lustre: lustre-OST0005-osc-ffff95ae65a62000: Connection to lustre-OST0005 (at 10.9.5.216@tcp) was lost; in progress operations using this service will wait for recovery to complete
            [25049.265069] Lustre: lustre-OST0001-osc-ffff95ae65a62000: Connection restored to 10.9.5.216@tcp (at 10.9.5.216@tcp)
            [25086.049686] Lustre: 19004:0:(client.c:2219:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1573725972/real 1573725973]  req@ffff95ae2087db00 x1650168582429888/t0(0) o4->lustre-OST0007-osc-ffff95ae65a62000@10.9.5.216@tcp:6/4 lens 488/448 e 0 to 1 dl 1573726017 ref 2 fl Rpc:XQr/0/ffffffff rc 0/-1 job:'iozone.500'
            [25086.054856] Lustre: 19004:0:(client.c:2219:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
            [25086.056499] Lustre: lustre-OST0007-osc-ffff95ae65a62000: Connection to lustre-OST0007 (at 10.9.5.216@tcp) was lost; in progress operations using this service will wait for recovery to complete
            [25086.059313] Lustre: Skipped 5 previous similar messages
            [25086.062664] Lustre: lustre-OST0007-osc-ffff95ae65a62000: Connection restored to 10.9.5.216@tcp (at 10.9.5.216@tcp)
            [25086.064514] Lustre: Skipped 5 previous similar messages
            [25189.704358] LNetError: 18994:0:(peer.c:3724:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.5.216@tcp added to recovery queue. Health = 900
            [25190.148392] LNetError: 18996:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 10.9.5.214@tcp added to recovery queue. Health = 900
            [25190.150543] LustreError: 18996:0:(events.c:200:client_bulk_callback()) event type 1, status -5, desc ffff95ae20af0c00
            [25190.152380] LustreError: 18996:0:(events.c:200:client_bulk_callback()) event type 1, status -5, desc ffff95ae66b3c400
            [25190.154242] LustreError: 18996:0:(events.c:200:client_bulk_callback()) event type 1, status -5, desc ffff95ae66b3c400
            [25190.156036] LustreError: 18996:0:(events.c:200:client_bulk_callback()) event type 1, status -5, desc ffff95ae66b3c400
            [25190.158312] Lustre: 19004:0:(client.c:2219:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1573726114/real 1573726115]  req@ffff95ae4d98cd80 x1650168582704448/t0(0) o4->lustre-OST0000-osc-ffff95ae65a62000@10.9.5.216@tcp:6/4 lens 488/448 e 0 to 1 dl 1573726160 ref 2 fl Rpc:eXQr/0/ffffffff rc 0/-1 job:'iozone.500'
            [25190.163356] Lustre: 19004:0:(client.c:2219:ptlrpc_expire_one_request()) Skipped 5 previous similar messages
            [25190.165017] Lustre: lustre-OST0000-osc-ffff95ae65a62000: Connection to lustre-OST0000 (at 10.9.5.216@tcp) was lost; in progress operations using this service will wait for recovery to complete
            [25190.167837] Lustre: Skipped 1 previous similar message
            [25190.202183] LustreError: 18996:0:(events.c:200:client_bulk_callback()) event type 1, status -5, desc ffff95ae66b3c400
            [25190.205003] LustreError: 18996:0:(events.c:200:client_bulk_callback()) event type 1, status -5, desc ffff95ae66b3c800
            [25190.206859] LustreError: 18996:0:(events.c:200:client_bulk_callback()) event type 1, status -5, desc ffff95ae66b3c800
            [25190.208647] LustreError: 18996:0:(events.c:200:client_bulk_callback()) event type 1, status -5, desc ffff95ae66b3d800
            [25190.210440] LustreError: 18996:0:(events.c:200:client_bulk_callback()) event type 1, status -5, desc ffff95ae66b3c800
            [25190.232707] LustreError: 18996:0:(events.c:200:client_bulk_callback()) event type 1, status -5, desc ffff95ae66b3d800
            [25190.234622] LustreError: 18996:0:(events.c:200:client_bulk_callback()) event type 1, status -5, desc ffff95ae66b3d800
            [25190.236420] LustreError: 18996:0:(events.c:200:client_bulk_callback()) event type 1, status -5, desc ffff95ae66b3c800
            [25190.238549] LustreError: 18996:0:(events.c:200:client_bulk_callback()) event type 1, status -5, desc ffff95ae66b3d800
            [25190.240339] LustreError: 18996:0:(events.c:200:client_bulk_callback()) event type 1, status -5, desc ffff95ae66b3e000
            [25190.242117] LustreError: 18996:0:(events.c:200:client_bulk_callback()) event type 1, status -5, desc ffff95ae66b3e400
            [25190.287268] LustreError: 18996:0:(events.c:200:client_bulk_callback()) event type 1, status -5, desc ffff95ae66b3e000
            [25190.289123] LustreError: 18996:0:(events.c:200:client_bulk_callback()) event type 1, status -5, desc ffff95ae66b3e400
            [25190.290920] LustreError: 18996:0:(events.c:200:client_bulk_callback()) event type 1, status -5, desc ffff95ae66b3e400
            [25190.292837] LustreError: 18996:0:(events.c:200:client_bulk_callback()) event type 1, status -5, desc ffff95ae66b3e000
            [25190.294970] LustreError: 18996:0:(events.c:200:client_bulk_callback()) event type 1, status -5, desc ffff95ae66b3e800
            [25190.296757] LustreError: 18996:0:(events.c:200:client_bulk_callback()) event type 1, status -5, desc ffff95ae66b3e800
            [25190.331303] LustreError: 18996:0:(events.c:200:client_bulk_callback()) event type 1, status -5, desc ffff95ae66b3e400
            [25190.333147] LustreError: 18996:0:(events.c:200:client_bulk_callback()) event type 1, status -5, desc ffff95ae66b3e000
            [25190.334915] LustreError: 18996:0:(events.c:200:client_bulk_callback()) event type 1, status -5, desc ffff95ae66b3e800
            [25190.336707] LustreError: 18996:0:(events.c:200:client_bulk_callback()) event type 1, status -5, desc ffff95ae66b3f800
            [25190.338642] LustreError: 18996:0:(events.c:200:client_bulk_callback()) event type 1, status -5, desc ffff95ae66b3f800
            [25190.341149] LustreError: 18996:0:(events.c:200:client_bulk_callback()) event type 1, status -5, desc ffff95ae66b3e800
            [25190.388268] LustreError: 18996:0:(events.c:200:client_bulk_callback()) event type 1, status -5, desc ffff95ae66b3f800
            [25190.390387] LustreError: 18996:0:(events.c:200:client_bulk_callback()) event type 1, status -5, desc ffff95ae66b3f800
            [25190.484817] Lustre: lustre-OST0000-osc-ffff95ae65a62000: Connection restored to 10.9.5.216@tcp (at 10.9.5.216@tcp)
            [25190.486621] Lustre: Skipped 1 previous similar message
            [25258.668206] tuned invoked oom-killer: gfp_mask=0x200da, order=0, oom_score_adj=0
            [25258.669787] tuned cpuset=/ mems_allowed=0
            [25258.670483] CPU: 1 PID: 1369 Comm: tuned Kdump: loaded Tainted: G           OE  ------------   3.10.0-1062.1.1.el7.x86_64 #1
            [25258.672347] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
            [25258.673340] Call Trace:
            [25258.673874]  [<ffffffff8f7792c2>] dump_stack+0x19/0x1b
            [25258.674763]  [<ffffffff8f773c64>] dump_header+0x90/0x229
            [25258.675696]  [<ffffffff8f787b6f>] ? notifier_call_chain+0x4f/0x70
            [25258.676775]  [<ffffffff8f0cb498>] ? __blocking_notifier_call_chain+0x58/0x70
            [25258.677999]  [<ffffffff8f1bff2e>] check_panic_on_oom+0x2e/0x60
            [25258.679002]  [<ffffffff8f1c034b>] out_of_memory+0x23b/0x4f0
            [25258.679967]  [<ffffffff8f77477c>] __alloc_pages_slowpath+0x5d6/0x724
            [25258.681053]  [<ffffffff8f1c6b84>] __alloc_pages_nodemask+0x404/0x420
            [25258.682149]  [<ffffffff8f218105>] alloc_pages_vma+0xb5/0x200
            [25258.683122]  [<ffffffff8f205ac5>] __read_swap_cache_async+0x115/0x190
            [25258.684227]  [<ffffffff8f205b66>] read_swap_cache_async+0x26/0x60
            [25258.685285]  [<ffffffff8f205d4b>] swapin_readahead+0x1ab/0x210
            [25258.686306]  [<ffffffff8f386722>] ? radix_tree_lookup_slot+0x22/0x50
            [25258.687391]  [<ffffffff8f1bb4ee>] ? __find_get_page+0x1e/0xa0
            [25258.688406]  [<ffffffff8f1efe36>] handle_pte_fault+0xd66/0xe20
            [25258.689423]  [<ffffffff8f25f4dd>] ? core_sys_select+0x26d/0x340
            [25258.690435]  [<ffffffff8f1f200d>] handle_mm_fault+0x39d/0x9b0
            [25258.691422]  [<ffffffff8f787653>] __do_page_fault+0x213/0x500
            [25258.692425]  [<ffffffff8f25e7f3>] ? poll_select_copy_remaining+0x113/0x180
            [25258.693589]  [<ffffffff8f787a26>] trace_do_page_fault+0x56/0x150
            [25258.694607]  [<ffffffff8f786fa2>] do_async_page_fault+0x22/0xf0
            [25258.695641]  [<ffffffff8f7837a8>] async_page_fault+0x28/0x30
            [25258.696613] Mem-Info:
            [25258.697032] active_anon:17275 inactive_anon:17292 isolated_anon:0
             active_file:9706 inactive_file:333050 isolated_file:32
             unevictable:0 dirty:3641 writeback:42567 unstable:0
             slab_reclaimable:4776 slab_unreclaimable:13653
             mapped:6755 shmem:21419 pagetables:1312 bounce:0
             free:12854 free_pcp:46 free_cma:0
            [25258.702574] Node 0 DMA free:7048kB min:416kB low:520kB high:624kB active_anon:0kB inactive_anon:12kB active_file:204kB inactive_file:6872kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:0kB dirty:144kB writeback:560kB mapped:172kB shmem:12kB slab_reclaimable:52kB slab_unreclaimable:348kB kernel_stack:32kB pagetables:56kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:10917 all_unreclaimable? yes
            [25258.709666] lowmem_reserve[]: 0 1660 1660 1660
            [25258.710606] Node 0 DMA32 free:44120kB min:44636kB low:55792kB high:66952kB active_anon:69100kB inactive_anon:69156kB active_file:38620kB inactive_file:1325244kB unevictable:0kB isolated(anon):0kB isolated(file):256kB present:2080744kB managed:1703312kB mlocked:0kB dirty:14420kB writeback:169708kB mapped:26848kB shmem:85664kB slab_reclaimable:19052kB slab_unreclaimable:54516kB kernel_stack:2752kB pagetables:5192kB unstable:0kB bounce:0kB free_pcp:204kB local_pcp:128kB free_cma:0kB writeback_tmp:0kB pages_scanned:2791068 all_unreclaimable? yes
            [25258.718385] lowmem_reserve[]: 0 0 0 0
            [25258.719197] Node 0 DMA: 12*4kB (UE) 19*8kB (UE) 14*16kB (UEM) 7*32kB (UE) 8*64kB (UEM) 2*128kB (UE) 10*256kB (UEM) 4*512kB (UEM) 1*1024kB (M) 0*2048kB 0*4096kB = 7048kB
            [25258.722455] Node 0 DMA32: 73*4kB (UEM) 1577*8kB (UEM) 955*16kB (UEM) 352*32kB (UEM) 71*64kB (UM) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 43996kB
            [25258.725528] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
            [25258.727002] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
            [25258.728414] 147236 total pagecache pages
            [25258.729119] 146 pages in swap cache
            [25258.729735] Swap cache stats: add 8528, delete 8382, find 310/572
            [25258.730769] Free swap  = 3638780kB
            [25258.731356] Total swap = 3671036kB
            [25258.731949] 524184 pages RAM
            [25258.732448] 0 pages HighMem/MovableOnly
            [25258.733109] 94379 pages reserved
            [25258.733688] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
            [25258.735073] [  382]     0   382     9866     1342      24       77             0 systemd-journal
            [25258.736554] [  402]     0   402    29160      239      28       78             0 lvmetad
            [25258.737928] [  409]     0   409    11238      452      23       60         -1000 systemd-udevd
            [25258.739369] [  534]     0   534    13882      157      27      110         -1000 auditd
            [25258.740717] [  559]     0   559     5384      267      16       40             0 irqbalance
            [25258.742125] [  560]    81   560    14563      456      34      165          -900 dbus-daemon
            [25258.743538] [  561]     0   561     6595      376      18       66             0 systemd-logind
            [25258.745007] [  563]     0   563   136973     1685      87      502             0 NetworkManager
            [25258.746479] [  564]   999   564   153061     1172      63     2396             0 polkitd
            [25258.747848] [  568]     0   568    48777      333      37      130             0 gssproxy
            [25258.749245] [  575]    32   575    17319      236      38      141             0 rpcbind
            [25258.750613] [  578]    38   578    11824      455      28      152             0 ntpd
            [25258.751965] [  606]     0   606    25724      871      53      508             0 dhclient
            [25258.753363] [  952]     0   952   143550     3953     101      895             0 tuned
            [25258.754701] [  953]     0   953    74587     2165      72      141             0 rsyslogd
            [25258.756069] [  959]     0   959    28230      852      58      242         -1000 sshd
            [25258.757380] [  964]   997   964    56472      344      22      124             0 munged
            [25258.758737] [  966]    29   966    10609      223      26      209             0 rpc.statd
            [25258.760131] [  968]     0   968     6792      195      17       63             0 xinetd
            [25258.761481] [ 1024]     0  1024   172948     1594      81        0             0 automount
            [25258.762871] [ 1046]     0  1046     6477      235      19        0             0 atd
            [25258.764177] [ 1054]     0  1054    31573      396      19        0             0 crond
            [25258.765505] [ 1063]     0  1063    27527      215      12        0             0 agetty
            [25258.766852] [ 1065]     0  1065    27527      212      10        0             0 agetty
            [25258.768190] [ 1290]     0  1290    22425      532      43        0             0 master
            [25258.769539] [ 1310]    89  1310    22468      996      43        0             0 qmgr
            [25258.770874] [ 9706]     0  9706    39196     1381      80        0             0 sshd
            [25258.772191] [ 9708]     0  9708    28300      374      12        0             0 run_test.sh
            [25258.773620] [10014]     0 10014    29609     1225      17        2             0 bash
            [25258.774952] [18916]     0 18916     4844      164      14        1             0 rpc.idmapd
            [25258.776348] [18922]     0 18922    10942      703      27        0             0 rpc.mountd
            [25258.777819] [26115]     0 26115    29609     1011      14        0             0 bash
            [25258.779136] [26116]     0 26116    26991      165       8        0             0 tee
            [25258.780449] [26313]     0 26313    29580     1198      13        0             0 bash
            [25258.781773] [31794]    89 31794    22451      990      48        0             0 pickup
            [25258.783156] [32386]     0 32386    29613     1020      13        0             0 bash
            [25258.784483] [32387]     0 32387    26991      167       9        0             0 tee
            [25258.785821] [  377]   500   377    12351     4815      22      543             0 iozone
            [25258.787194] [  378]     0   378    26991      166       8        0             0 tee
            [25258.788484] Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled
            
            [25258.790111] CPU: 1 PID: 1369 Comm: tuned Kdump: loaded Tainted: G           OE  ------------   3.10.0-1062.1.1.el7.x86_64 #1
            [25258.791945] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
            [25258.792900] Call Trace:
            [25258.793333]  [<ffffffff8f7792c2>] dump_stack+0x19/0x1b
            [25258.794195]  [<ffffffff8f772941>] panic+0xe8/0x21f
            [25258.795006]  [<ffffffff8f1bff55>] check_panic_on_oom+0x55/0x60
            [25258.796021]  [<ffffffff8f1c034b>] out_of_memory+0x23b/0x4f0
            [25258.796957]  [<ffffffff8f77477c>] __alloc_pages_slowpath+0x5d6/0x724
            [25258.798015]  [<ffffffff8f1c6b84>] __alloc_pages_nodemask+0x404/0x420
            [25258.799073]  [<ffffffff8f218105>] alloc_pages_vma+0xb5/0x200
            [25258.800020]  [<ffffffff8f205ac5>] __read_swap_cache_async+0x115/0x190
            [25258.801088]  [<ffffffff8f205b66>] read_swap_cache_async+0x26/0x60
            [25258.802102]  [<ffffffff8f205d4b>] swapin_readahead+0x1ab/0x210
            [25258.803079]  [<ffffffff8f386722>] ? radix_tree_lookup_slot+0x22/0x50
            [25258.804137]  [<ffffffff8f1bb4ee>] ? __find_get_page+0x1e/0xa0
            [25258.805101]  [<ffffffff8f1efe36>] handle_pte_fault+0xd66/0xe20
            [25258.806080]  [<ffffffff8f25f4dd>] ? core_sys_select+0x26d/0x340
            [25258.807063]  [<ffffffff8f1f200d>] handle_mm_fault+0x39d/0x9b0
            [25258.808022]  [<ffffffff8f787653>] __do_page_fault+0x213/0x500
            [25258.808979]  [<ffffffff8f25e7f3>] ? poll_select_copy_remaining+0x113/0x180
            [25258.810122]  [<ffffffff8f787a26>] trace_do_page_fault+0x56/0x150
            [25258.811124]  [<ffffffff8f786fa2>] do_async_page_fault+0x22/0xf0
            [25258.812118]  [<ffffffff8f7837a8>] async_page_fault+0x28/0x30
            
            jamesanunez James Nunez (Inactive) added a comment

            Here's another sanity-benchmark test_iozone OOM crash, but there is no ext4 in the call stack: https://testing.whamcloud.com/test_sets/b90b44f2-036a-11ea-a9d7-52540065bddc .

            [16314.428102] Lustre: DEBUG MARKER: == sanity-benchmark test iozone: iozone ============================================================== 20:10:07 (1573243807)
            [16326.606181] Lustre: DEBUG MARKER: /usr/sbin/lctl mark min OST has 1786000kB available, using 3438440kB file size
            [16326.840073] Lustre: DEBUG MARKER: min OST has 1786000kB available, using 3438440kB file size
            [16366.210817] irqbalance invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
            [16366.211778] irqbalance cpuset=/ mems_allowed=0
            [16366.212273] CPU: 1 PID: 563 Comm: irqbalance Kdump: loaded Tainted: G           OE  ------------   3.10.0-1062.1.1.el7.x86_64 #1
            [16366.213410] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
            [16366.213972] Call Trace:
            [16366.214280]  [<ffffffff867792c2>] dump_stack+0x19/0x1b
            [16366.214811]  [<ffffffff86773c64>] dump_header+0x90/0x229
            [16366.215346]  [<ffffffff86787b6f>] ? notifier_call_chain+0x4f/0x70
            [16366.215979]  [<ffffffff860cb498>] ? __blocking_notifier_call_chain+0x58/0x70
            [16366.216697]  [<ffffffff861bff2e>] check_panic_on_oom+0x2e/0x60
            [16366.217275]  [<ffffffff861c034b>] out_of_memory+0x23b/0x4f0
            [16366.217824]  [<ffffffff8677477c>] __alloc_pages_slowpath+0x5d6/0x724
            [16366.218462]  [<ffffffff861c6b84>] __alloc_pages_nodemask+0x404/0x420
            [16366.219079]  [<ffffffff86218105>] alloc_pages_vma+0xb5/0x200
            [16366.219657]  [<ffffffff861efa54>] handle_pte_fault+0x984/0xe20
            [16366.220238]  [<ffffffff862237cd>] ? kmem_cache_alloc_trace+0x17d/0x200
            [16366.220904]  [<ffffffff8626f456>] ? seq_open+0x106/0x170
            [16366.221438]  [<ffffffff861f200d>] handle_mm_fault+0x39d/0x9b0
            [16366.222005]  [<ffffffff86787653>] __do_page_fault+0x213/0x500
            [16366.222605]  [<ffffffff8638c984>] ? vsnprintf+0x234/0x6a0
            [16366.223144]  [<ffffffff86787a26>] trace_do_page_fault+0x56/0x150
            [16366.223756]  [<ffffffff86786fa2>] do_async_page_fault+0x22/0xf0
            [16366.224334]  [<ffffffff867837a8>] async_page_fault+0x28/0x30
            [16366.224898]  [<ffffffff8638e240>] ? copy_user_generic_string+0x30/0x40
            [16366.225543]  [<ffffffff862707dc>] ? seq_read+0x31c/0x440
            [16366.226069]  [<ffffffff862c09b0>] proc_reg_read+0x40/0x80
            [16366.226642]  [<ffffffff862484ff>] vfs_read+0x9f/0x170
            [16366.227130]  [<ffffffff862493bf>] SyS_read+0x7f/0xf0
            [16366.227634]  [<ffffffff8678ce21>] ? system_call_after_swapgs+0xae/0x146
            [16366.228282]  [<ffffffff8678cede>] system_call_fastpath+0x25/0x2a
            [16366.228875]  [<ffffffff8678ce21>] ? system_call_after_swapgs+0xae/0x146
            [16366.229534] Mem-Info:
            [16366.229780] active_anon:18502 inactive_anon:18729 isolated_anon:0
             active_file:11174 inactive_file:330695 isolated_file:32
             unevictable:0 dirty:3314 writeback:50958 unstable:0
             slab_reclaimable:4453 slab_unreclaimable:19547
             mapped:6477 shmem:22606 pagetables:1309 bounce:0
             free:6303 free_pcp:51 free_cma:0
            [16366.233041] Node 0 DMA free:7056kB min:416kB low:520kB high:624kB active_anon:128kB inactive_anon:924kB active_file:584kB inactive_file:5440kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:0kB dirty:480kB writeback:3276kB mapped:332kB shmem:940kB slab_reclaimable:68kB slab_unreclaimable:412kB kernel_stack:16kB pagetables:88kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:10635 all_unreclaimable? yes
            [16366.237187] lowmem_reserve[]: 0 1660 1660 1660
            [16366.237765] Node 0 DMA32 free:18032kB min:44636kB low:55792kB high:66952kB active_anon:73880kB inactive_anon:73992kB active_file:44112kB inactive_file:1317256kB unevictable:0kB isolated(anon):0kB isolated(file):256kB present:2080744kB managed:1703312kB mlocked:0kB dirty:12776kB writeback:200556kB mapped:25576kB shmem:89484kB slab_reclaimable:17744kB slab_unreclaimable:77944kB kernel_stack:2768kB pagetables:5148kB unstable:0kB bounce:0kB free_pcp:196kB local_pcp:84kB free_cma:0kB writeback_tmp:0kB pages_scanned:6901848 all_unreclaimable? yes
            [16366.242350] lowmem_reserve[]: 0 0 0 0
            [16366.242830] Node 0 DMA: 8*4kB (UEM) 10*8kB (UEM) 6*16kB (UEM) 4*32kB (UEM) 5*64kB (U) 4*128kB (UM) 3*256kB (UEM) 2*512kB (EM) 2*1024kB (UE) 1*2048kB (U) 0*4096kB = 7056kB
            [16366.244795] Node 0 DMA32: 79*4kB (EM) 107*8kB (EM) 14*16kB (E) 158*32kB (UEM) 129*64kB (UEM) 15*128kB (U) 1*256kB (M) 0*512kB 1*1024kB (M) 0*2048kB 0*4096kB = 17908kB
            [16366.246696] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
            [16366.247561] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
            [16366.248393] 152591 total pagecache pages
            [16366.248796] 26 pages in swap cache
            [16366.249140] Swap cache stats: add 4377, delete 4351, find 71/112
            [16366.249732] Free swap  = 3653628kB
            [16366.250087] Total swap = 3671036kB
            [16366.250440] 524184 pages RAM
            [16366.250735] 0 pages HighMem/MovableOnly
            [16366.251118] 94379 pages reserved
            [16366.251470] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
            [16366.252265] [  383]     0   383     9867     1116      23       45             0 systemd-journal
            [16366.253127] [  412]     0   412    29160      239      26       78             0 lvmetad
            [16366.253927] [  417]     0   417    11244      475      24       42         -1000 systemd-udevd
            [16366.254791] [  533]     0   533    13882      186      28       80         -1000 auditd
            [16366.255597] [  561]   999   561   153061     2255      62     1287             0 polkitd
            [16366.256373] [  562]     0   562     6595      409      18       33             0 systemd-logind
            [16366.257230] [  563]     0   563     5385      268      17       39             0 irqbalance
            [16366.258052] [  564]    32   564    17319      277      38      100             0 rpcbind
            [16366.258849] [  567]    81   567    14566      536      32       83          -900 dbus-daemon
            [16366.259695] [  570]     0   570    48777      416      37       46             0 gssproxy
            [16366.260495] [  578]     0   578   136972     2448      84      248             0 NetworkManager
            [16366.261333] [  585]    38   585    11824      463      27      144             0 ntpd
            [16366.262093] [  611]     0   611    25724      910      55      470             0 dhclient
            [16366.262923] [  952]     0   952   143550     4027      97      312             0 tuned
            [16366.263713] [  953]     0   953    28230      851      59      243         -1000 sshd
            [16366.264500] [  957]     0   957    74586     1590      76        0             0 rsyslogd
            [16366.265302] [  962]     0   962     6792      205      18       53             0 xinetd
            [16366.266091] [  966]    29   966    10609      309      25      123             0 rpc.statd
            [16366.266908] [  967]   997   967    56472      458      22       75             0 munged
            [16366.267700] [ 1026]     0  1026   172947     1592      81        0             0 automount
            [16366.268528] [ 1031]     0  1031     6477      234      18        0             0 atd
            [16366.269282] [ 1033]     0  1033    31572      394      21        0             0 crond
            [16366.270068] [ 1042]     0  1042    27527      212      10        0             0 agetty
            [16366.270860] [ 1044]     0  1044    27527      216      11        0             0 agetty
            [16366.271651] [ 1330]     0  1330    22425      532      42        0             0 master
            [16366.272433] [ 1347]    89  1347    22468      995      45        0             0 qmgr
            [16366.273170] [ 9701]     0  9701    39196     1351      80       30             0 sshd
            [16366.273942] [ 9703]     0  9703    28300      373      12        0             0 run_test.sh
            [16366.274788] [ 9973]     0  9973    29615     1181      16       40             0 bash
            [16366.275573] [23461]    89 23461    22451      970      45       20             0 pickup
            [16366.276357] [30809]     0 30809     4844      163      13        0             0 rpc.idmapd
            [16366.277161] [30810]     0 30810    10942      703      25        0             0 rpc.mountd
            [16366.277995] [18940]     0 18940    29615      972      14       33             0 bash
            [16366.278775] [18941]     0 18941    26990      166       9        0             0 tee
            [16366.279554] [19128]     0 19128    29577     1194      15        0             0 bash
            [16366.280336] [24468]     0 24468    29610     1016      14        0             0 bash
            [16366.281122] [24469]     0 24469    26991      167       9        0             0 tee
            [16366.281904] [25190]   500 25190    12351     4336      23      512             0 iozone
            [16366.282700] [25191]     0 25191    26990      166       8        0             0 tee
            [16366.283455] Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled
            
            [16366.284384] CPU: 1 PID: 563 Comm: irqbalance Kdump: loaded Tainted: G           OE  ------------   3.10.0-1062.1.1.el7.x86_64 #1
            [16366.285507] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
            [16366.286057] Call Trace:
            [16366.286317]  [<ffffffff867792c2>] dump_stack+0x19/0x1b
            [16366.286825]  [<ffffffff86772941>] panic+0xe8/0x21f
            [16366.287315]  [<ffffffff861bff55>] check_panic_on_oom+0x55/0x60
            [16366.287881]  [<ffffffff861c034b>] out_of_memory+0x23b/0x4f0
            [16366.288435]  [<ffffffff8677477c>] __alloc_pages_slowpath+0x5d6/0x724
            [16366.289051]  [<ffffffff861c6b84>] __alloc_pages_nodemask+0x404/0x420
            [16366.289680]  [<ffffffff86218105>] alloc_pages_vma+0xb5/0x200
            [16366.290229]  [<ffffffff861efa54>] handle_pte_fault+0x984/0xe20
            [16366.290804]  [<ffffffff862237cd>] ? kmem_cache_alloc_trace+0x17d/0x200
            [16366.291449]  [<ffffffff8626f456>] ? seq_open+0x106/0x170
            [16366.291959]  [<ffffffff861f200d>] handle_mm_fault+0x39d/0x9b0
            [16366.292531]  [<ffffffff86787653>] __do_page_fault+0x213/0x500
            [16366.293090]  [<ffffffff8638c984>] ? vsnprintf+0x234/0x6a0
            [16366.293622]  [<ffffffff86787a26>] trace_do_page_fault+0x56/0x150
            [16366.294209]  [<ffffffff86786fa2>] do_async_page_fault+0x22/0xf0
            [16366.294793]  [<ffffffff867837a8>] async_page_fault+0x28/0x30
            [16366.295348]  [<ffffffff8638e240>] ? copy_user_generic_string+0x30/0x40
            [16366.295991]  [<ffffffff862707dc>] ? seq_read+0x31c/0x440
            [16366.296525]  [<ffffffff862c09b0>] proc_reg_read+0x40/0x80
            [16366.297050]  [<ffffffff862484ff>] vfs_read+0x9f/0x170
            [16366.297548]  [<ffffffff862493bf>] SyS_read+0x7f/0xf0
            [16366.298034]  [<ffffffff8678ce21>] ? system_call_after_swapgs+0xae/0x146
            [16366.298692]  [<ffffffff8678cede>] system_call_fastpath+0x25/0x2a
            [16366.299289]  [<ffffffff8678ce21>] ? system_call_after_swapgs+0xae/0x146
            
            jamesanunez James Nunez (Inactive) added a comment - Here's another sanity-benchmark test_iozone OOM crash, but there is no ext4 in the call stack; https://testing.whamcloud.com/test_sets/b90b44f2-036a-11ea-a9d7-52540065bddc . ============================================================== 20:10:07 \(1573243807\) [16314.428102] Lustre: DEBUG MARKER: == sanity-benchmark test iozone: iozone ============================================================== 20:10:07 (1573243807) [16326.606181] Lustre: DEBUG MARKER: /usr/sbin/lctl mark min OST has 1786000kB available, using 3438440kB file size [16326.840073] Lustre: DEBUG MARKER: min OST has 1786000kB available, using 3438440kB file size [16366.210817] irqbalance invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0 [16366.211778] irqbalance cpuset=/ mems_allowed=0 [16366.212273] CPU: 1 PID: 563 Comm: irqbalance Kdump: loaded Tainted: G OE ------------ 3.10.0-1062.1.1.el7.x86_64 #1 [16366.213410] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 [16366.213972] Call Trace: [16366.214280] [<ffffffff867792c2>] dump_stack+0x19/0x1b [16366.214811] [<ffffffff86773c64>] dump_header+0x90/0x229 [16366.215346] [<ffffffff86787b6f>] ? notifier_call_chain+0x4f/0x70 [16366.215979] [<ffffffff860cb498>] ? __blocking_notifier_call_chain+0x58/0x70 [16366.216697] [<ffffffff861bff2e>] check_panic_on_oom+0x2e/0x60 [16366.217275] [<ffffffff861c034b>] out_of_memory+0x23b/0x4f0 [16366.217824] [<ffffffff8677477c>] __alloc_pages_slowpath+0x5d6/0x724 [16366.218462] [<ffffffff861c6b84>] __alloc_pages_nodemask+0x404/0x420 [16366.219079] [<ffffffff86218105>] alloc_pages_vma+0xb5/0x200 [16366.219657] [<ffffffff861efa54>] handle_pte_fault+0x984/0xe20 [16366.220238] [<ffffffff862237cd>] ? kmem_cache_alloc_trace+0x17d/0x200 [16366.220904] [<ffffffff8626f456>] ? seq_open+0x106/0x170 [16366.221438] [<ffffffff861f200d>] handle_mm_fault+0x39d/0x9b0 [16366.222005] [<ffffffff86787653>] __do_page_fault+0x213/0x500 [16366.222605] [<ffffffff8638c984>] ? vsnprintf+0x234/0x6a0 [16366.223144] [<ffffffff86787a26>] trace_do_page_fault+0x56/0x150 [16366.223756] [<ffffffff86786fa2>] do_async_page_fault+0x22/0xf0 [16366.224334] [<ffffffff867837a8>] async_page_fault+0x28/0x30 [16366.224898] [<ffffffff8638e240>] ? copy_user_generic_string+0x30/0x40 [16366.225543] [<ffffffff862707dc>] ? seq_read+0x31c/0x440 [16366.226069] [<ffffffff862c09b0>] proc_reg_read+0x40/0x80 [16366.226642] [<ffffffff862484ff>] vfs_read+0x9f/0x170 [16366.227130] [<ffffffff862493bf>] SyS_read+0x7f/0xf0 [16366.227634] [<ffffffff8678ce21>] ? system_call_after_swapgs+0xae/0x146 [16366.228282] [<ffffffff8678cede>] system_call_fastpath+0x25/0x2a [16366.228875] [<ffffffff8678ce21>] ? 
system_call_after_swapgs+0xae/0x146 [16366.229534] Mem-Info: [16366.229780] active_anon:18502 inactive_anon:18729 isolated_anon:0 active_file:11174 inactive_file:330695 isolated_file:32 unevictable:0 dirty:3314 writeback:50958 unstable:0 slab_reclaimable:4453 slab_unreclaimable:19547 mapped:6477 shmem:22606 pagetables:1309 bounce:0 free:6303 free_pcp:51 free_cma:0 [16366.233041] Node 0 DMA free:7056kB min:416kB low:520kB high:624kB active_anon:128kB inactive_anon:924kB active_file:584kB inactive_file:5440kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:0kB dirty:480kB writeback:3276kB mapped:332kB shmem:940kB slab_reclaimable:68kB slab_unreclaimable:412kB kernel_stack:16kB pagetables:88kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:10635 all_unreclaimable? yes [16366.237187] lowmem_reserve[]: 0 1660 1660 1660 [16366.237765] Node 0 DMA32 free:18032kB min:44636kB low:55792kB high:66952kB active_anon:73880kB inactive_anon:73992kB active_file:44112kB inactive_file:1317256kB unevictable:0kB isolated(anon):0kB isolated(file):256kB present:2080744kB managed:1703312kB mlocked:0kB dirty:12776kB writeback:200556kB mapped:25576kB shmem:89484kB slab_reclaimable:17744kB slab_unreclaimable:77944kB kernel_stack:2768kB pagetables:5148kB unstable:0kB bounce:0kB free_pcp:196kB local_pcp:84kB free_cma:0kB writeback_tmp:0kB pages_scanned:6901848 all_unreclaimable? yes [16366.242350] lowmem_reserve[]: 0 0 0 0 [16366.242830] Node 0 DMA: 8*4kB (UEM) 10*8kB (UEM) 6*16kB (UEM) 4*32kB (UEM) 5*64kB (U) 4*128kB (UM) 3*256kB (UEM) 2*512kB (EM) 2*1024kB (UE) 1*2048kB (U) 0*4096kB = 7056kB [16366.244795] Node 0 DMA32: 79*4kB (EM) 107*8kB (EM) 14*16kB (E) 158*32kB (UEM) 129*64kB (UEM) 15*128kB (U) 1*256kB (M) 0*512kB 1*1024kB (M) 0*2048kB 0*4096kB = 17908kB [16366.246696] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB [16366.247561] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB [16366.248393] 152591 total pagecache pages [16366.248796] 26 pages in swap cache [16366.249140] Swap cache stats: add 4377, delete 4351, find 71/112 [16366.249732] Free swap = 3653628kB [16366.250087] Total swap = 3671036kB [16366.250440] 524184 pages RAM [16366.250735] 0 pages HighMem/MovableOnly [16366.251118] 94379 pages reserved [16366.251470] [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name [16366.252265] [ 383] 0 383 9867 1116 23 45 0 systemd-journal [16366.253127] [ 412] 0 412 29160 239 26 78 0 lvmetad [16366.253927] [ 417] 0 417 11244 475 24 42 -1000 systemd-udevd [16366.254791] [ 533] 0 533 13882 186 28 80 -1000 auditd [16366.255597] [ 561] 999 561 153061 2255 62 1287 0 polkitd [16366.256373] [ 562] 0 562 6595 409 18 33 0 systemd-logind [16366.257230] [ 563] 0 563 5385 268 17 39 0 irqbalance [16366.258052] [ 564] 32 564 17319 277 38 100 0 rpcbind [16366.258849] [ 567] 81 567 14566 536 32 83 -900 dbus-daemon [16366.259695] [ 570] 0 570 48777 416 37 46 0 gssproxy [16366.260495] [ 578] 0 578 136972 2448 84 248 0 NetworkManager [16366.261333] [ 585] 38 585 11824 463 27 144 0 ntpd [16366.262093] [ 611] 0 611 25724 910 55 470 0 dhclient [16366.262923] [ 952] 0 952 143550 4027 97 312 0 tuned [16366.263713] [ 953] 0 953 28230 851 59 243 -1000 sshd [16366.264500] [ 957] 0 957 74586 1590 76 0 0 rsyslogd [16366.265302] [ 962] 0 962 6792 205 18 53 0 xinetd [16366.266091] [ 966] 29 966 10609 309 25 123 0 rpc.statd [16366.266908] [ 967] 997 967 56472 458 22 75 0 
munged [16366.267700] [ 1026] 0 1026 172947 1592 81 0 0 automount [16366.268528] [ 1031] 0 1031 6477 234 18 0 0 atd [16366.269282] [ 1033] 0 1033 31572 394 21 0 0 crond [16366.270068] [ 1042] 0 1042 27527 212 10 0 0 agetty [16366.270860] [ 1044] 0 1044 27527 216 11 0 0 agetty [16366.271651] [ 1330] 0 1330 22425 532 42 0 0 master [16366.272433] [ 1347] 89 1347 22468 995 45 0 0 qmgr [16366.273170] [ 9701] 0 9701 39196 1351 80 30 0 sshd [16366.273942] [ 9703] 0 9703 28300 373 12 0 0 run_test.sh [16366.274788] [ 9973] 0 9973 29615 1181 16 40 0 bash [16366.275573] [23461] 89 23461 22451 970 45 20 0 pickup [16366.276357] [30809] 0 30809 4844 163 13 0 0 rpc.idmapd [16366.277161] [30810] 0 30810 10942 703 25 0 0 rpc.mountd [16366.277995] [18940] 0 18940 29615 972 14 33 0 bash [16366.278775] [18941] 0 18941 26990 166 9 0 0 tee [16366.279554] [19128] 0 19128 29577 1194 15 0 0 bash [16366.280336] [24468] 0 24468 29610 1016 14 0 0 bash [16366.281122] [24469] 0 24469 26991 167 9 0 0 tee [16366.281904] [25190] 500 25190 12351 4336 23 512 0 iozone [16366.282700] [25191] 0 25191 26990 166 8 0 0 tee [16366.283455] Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled [16366.284384] CPU: 1 PID: 563 Comm: irqbalance Kdump: loaded Tainted: G OE ------------ 3.10.0-1062.1.1.el7.x86_64 #1 [16366.285507] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 [16366.286057] Call Trace: [16366.286317] [<ffffffff867792c2>] dump_stack+0x19/0x1b [16366.286825] [<ffffffff86772941>] panic+0xe8/0x21f [16366.287315] [<ffffffff861bff55>] check_panic_on_oom+0x55/0x60 [16366.287881] [<ffffffff861c034b>] out_of_memory+0x23b/0x4f0 [16366.288435] [<ffffffff8677477c>] __alloc_pages_slowpath+0x5d6/0x724 [16366.289051] [<ffffffff861c6b84>] __alloc_pages_nodemask+0x404/0x420 [16366.289680] [<ffffffff86218105>] alloc_pages_vma+0xb5/0x200 [16366.290229] [<ffffffff861efa54>] handle_pte_fault+0x984/0xe20 [16366.290804] [<ffffffff862237cd>] ? kmem_cache_alloc_trace+0x17d/0x200 [16366.291449] [<ffffffff8626f456>] ? seq_open+0x106/0x170 [16366.291959] [<ffffffff861f200d>] handle_mm_fault+0x39d/0x9b0 [16366.292531] [<ffffffff86787653>] __do_page_fault+0x213/0x500 [16366.293090] [<ffffffff8638c984>] ? vsnprintf+0x234/0x6a0 [16366.293622] [<ffffffff86787a26>] trace_do_page_fault+0x56/0x150 [16366.294209] [<ffffffff86786fa2>] do_async_page_fault+0x22/0xf0 [16366.294793] [<ffffffff867837a8>] async_page_fault+0x28/0x30 [16366.295348] [<ffffffff8638e240>] ? copy_user_generic_string+0x30/0x40 [16366.295991] [<ffffffff862707dc>] ? seq_read+0x31c/0x440 [16366.296525] [<ffffffff862c09b0>] proc_reg_read+0x40/0x80 [16366.297050] [<ffffffff862484ff>] vfs_read+0x9f/0x170 [16366.297548] [<ffffffff862493bf>] SyS_read+0x7f/0xf0 [16366.298034] [<ffffffff8678ce21>] ? system_call_after_swapgs+0xae/0x146 [16366.298692] [<ffffffff8678cede>] system_call_fastpath+0x25/0x2a [16366.299289] [<ffffffff8678ce21>] ? system_call_after_swapgs+0xae/0x146

            Bobi Jam (bobijam@hotmail.com) uploaded a new patch: https://review.whamcloud.com/36513
            Subject: LU-12864 test: enable unstable check accounting in test
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 2247e29ab79410ba1f3c3a25f634bb55f98c8822

            gerrit Gerrit Updater added a comment

            It looks like these stacks are processes faulting in pages of their own executables, rather than doing writes to the ext4 filesystem. That would explain why the stacks always end up in ext4 code: the binaries live on the local ext4 root, so any page fault under memory pressure goes through ext4_filemap_fault. It also makes my earlier idea that ext4 itself is causing the memory shortage unlikely. Instead, the more likely culprit is the large write (over 3GB, larger than the 2GB of RAM in the VM): the client has pages pinned for in-flight IO but is not throttling the writing process.
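            To put rough numbers on that (reading from the Mem-Info dump above; back-of-the-envelope only):

                    file size under test:  3438440 kB                        (~3.3 GiB)
                    total RAM:              524184 pages * 4 kB = 2096736 kB (~2.0 GiB)
                    inactive_file:          330695 pages * 4 kB = 1322780 kB (~1.3 GiB)
                    writeback:               50958 pages * 4 kB =  203832 kB (~0.2 GiB)

            So roughly two thirds of client memory is page cache for the file being written, a good chunk of it under writeback, and both zones report "all_unreclaimable? yes".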

            At one time Jinshan implemented the "NFS unstable page" handling in order to better balance memory pressure on the client and server, but this was later disabled:

                    /* turn unstable check off by default as it impacts performance */
                    cache->ccc_unstable_check = 0;
                    atomic_long_set(&cache->ccc_unstable_nr, 0);
            

            If we have a good reproducer for this issue, we could try enabling this functionality again to see whether it solves the problem.
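
            A minimal sketch of what re-enabling would look like at cache-setup time, reusing the fields from the snippet above (exactly where this hook sits in the current tree, and whether non-zero is the right value, are assumptions to verify against the source):

                    /* sketch: turn unstable page accounting back on, so pages
                     * pinned for in-flight RPCs count against memory pressure
                     * and throttle the writer instead of piling up until OOM */
                    cache->ccc_unstable_check = 1;  /* assumption: non-zero enables the check */
                    atomic_long_set(&cache->ccc_unstable_nr, 0);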

            It may also be totally unrelated; I can't say for sure. In any case, failing order=0 allocations is serious, as it means the client is completely out of memory.
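
            (For reference, order=0 in the oom-killer line is the smallest possible request, a single 4 KiB page. Illustratively, the failing allocation is equivalent to the following; the gfp flags here are representative, not the exact mask from the log:)

                    /* order 0 => one page; if even this fails after reclaim
                     * has run, the machine is genuinely out of memory */
                    struct page *pg = alloc_pages(GFP_HIGHUSER_MOVABLE, 0);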

            Bobijam, could you please take a look?

            adilger Andreas Dilger added a comment

            I've found two cases where test_bonnie crashes with ext4_filemap_fault in the stack: https://testing.whamcloud.com/test_sets/4155636c-d54c-11e9-98c8-52540065bddc and https://testing.whamcloud.com/test_sets/d989415a-cc35-11e9-a25b-52540065bddc .

            The call trace is:

            [16452.592500] Lustre: DEBUG MARKER: /usr/sbin/lctl mark min OST has 1782784kB available, using 3438736kB file size
            [16452.843542] Lustre: DEBUG MARKER: min OST has 1782784kB available, using 3438736kB file size
            [17330.131283] tuned invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
            [17330.132184] tuned cpuset=/ mems_allowed=0
            [17330.132645] CPU: 0 PID: 1886 Comm: tuned Kdump: loaded Tainted: P           OE  ------------   3.10.0-957.21.3.el7_lustre.x86_64 #1
            [17330.133780] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
            [17330.134342] Call Trace:
            [17330.134640]  [<ffffffff9a963107>] dump_stack+0x19/0x1b
            [17330.135145]  [<ffffffff9a95db2a>] dump_header+0x90/0x229
            [17330.135682]  [<ffffffff9a970b1f>] ? notifier_call_chain+0x4f/0x70
            [17330.136281]  [<ffffffff9a2c80e8>] ? __blocking_notifier_call_chain+0x58/0x70
            [17330.137086]  [<ffffffff9a3ba9de>] check_panic_on_oom+0x2e/0x60
            [17330.137674]  [<ffffffff9a3badfb>] out_of_memory+0x23b/0x4f0
            [17330.138293]  [<ffffffff9a95e62e>] __alloc_pages_slowpath+0x5d6/0x724
            [17330.139061]  [<ffffffff9a3c1454>] __alloc_pages_nodemask+0x404/0x420
            [17330.139895]  [<ffffffff9a40e308>] alloc_pages_current+0x98/0x110
            [17330.140840]  [<ffffffff9a3b6697>] __page_cache_alloc+0x97/0xb0
            [17330.141624]  [<ffffffff9a3b92f8>] filemap_fault+0x298/0x490
            [17330.142319]  [<ffffffffc04691c6>] ext4_filemap_fault+0x36/0x50 [ext4]
            [17330.143027]  [<ffffffff9a3e47ea>] __do_fault.isra.59+0x8a/0x100
            [17330.143618]  [<ffffffff9a3e4d9c>] do_read_fault.isra.61+0x4c/0x1b0
            [17330.144551]  [<ffffffff9a3e9744>] handle_pte_fault+0x2f4/0xd10
            [17330.145335]  [<ffffffff9a4584bd>] ? core_sys_select+0x26d/0x340
            [17330.145985]  [<ffffffff9a3ec27d>] handle_mm_fault+0x39d/0x9b0
            [17330.146929]  [<ffffffff9a970603>] __do_page_fault+0x203/0x4f0
            [17330.147603]  [<ffffffff9a4577d3>] ? poll_select_copy_remaining+0x113/0x180
            [17330.148384]  [<ffffffff9a9709d6>] trace_do_page_fault+0x56/0x150
            [17330.149399]  [<ffffffff9a96ff62>] do_async_page_fault+0x22/0xf0
            [17330.150217]  [<ffffffff9a96c798>] async_page_fault+0x28/0x30
            
            jamesanunez James Nunez (Inactive) added a comment

            Strange, I thought I commented on this ticket when it was first filed...

            It is strange that all of the OOM cases show ext4 in the stack. I'd think that if there were too much Lustre memory usage, the OOM killer would kill iozone. It is entirely possible that there is a CLIO page cleanup issue under memory pressure (I've wondered about this for many years), but the symptoms seen here don't necessarily point in that direction.
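
            For context on what a "CLIO page cleanup issue under memory pressure" would mean mechanically: client caches are expected to hand pages back through the kernel shrinker interface, so if the shrinkers under-report or under-scan, the VM has nothing to reclaim and the OOM killer fires. A generic sketch of that interface (not Lustre's actual registration; the two demo_* helpers are hypothetical):

                    #include <linux/shrinker.h>

                    static unsigned long demo_count(struct shrinker *s,
                                                    struct shrink_control *sc)
                    {
                            /* hypothetical: report how many cached pages are freeable */
                            return demo_cached_page_count();
                    }

                    static unsigned long demo_scan(struct shrinker *s,
                                                   struct shrink_control *sc)
                    {
                            /* hypothetical: free up to sc->nr_to_scan pages, return count freed */
                            return demo_release_cached_pages(sc->nr_to_scan);
                    }

                    static struct shrinker demo_shrinker = {
                            .count_objects = demo_count,
                            .scan_objects  = demo_scan,
                            .seeks         = DEFAULT_SEEKS,
                    };
                    /* register_shrinker(&demo_shrinker) at module init,
                     * unregister_shrinker(&demo_shrinker) at module exit */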

            adilger Andreas Dilger added a comment

            People

              Assignee: bobijam Zhenyu Xu
              Reporter: jamesanunez James Nunez (Inactive)
              Votes: 0
              Watchers: 13

              Dates

                Created:
                Updated: