[LU-11878] sanity test 103b: OOM because of too many bash processes: page allocation stalls for 18420ms Created: 22/Jan/19  Updated: 20/Jan/22  Resolved: 06/Feb/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0, Lustre 2.13.0
Fix Version/s: Lustre 2.13.0, Lustre 2.12.1

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Andreas Dilger
Resolution: Fixed Votes: 0
Labels: arm

Issue Links:
Duplicate
is duplicated by LU-12767 review-ldiskfs-arm crashed during san... Resolved
Related
is related to LU-11200 Centos 8 arm64 server support Resolved
is related to LU-9864 sanity: test_103b failed at ldlm_reso... Open
is related to LU-10073 lnet-selftest test_smoke: lst Error f... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

This issue relates to the following test suite run:
https://testing.whamcloud.com/test_sets/adfb4bd4-1978-11e9-8388-52540065bddc

The test_103b code runs 512 parallel bash processes to verify that different umask values work properly. On the x86 clients there is either not as much kernel debugging enabled, or the smaller pages (== smaller stacks) don't cause as much grief. On the ARM client, which uses 64KB pages (visible in the buddy allocator dump below), allocations stall and the node panics on OOM with the following stack trace:

[ 5945.554571] bash: page allocation stalls for 18420ms, order:0, mode:0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null)
[ 5945.562347] bash cpuset=/ mems_allowed=0
[ 5945.564625] CPU: 1 PID: 20442 Comm: bash Kdump: loaded Tainted: G           OE  ------------   4.14.0-115.2.2.el7a.aarch64 #1
[ 5945.578547] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
[ 5945.586497] Call trace:
[ 5945.588107] [<ffff000008089e14>] dump_backtrace+0x0/0x23c
[ 5945.599468] [<ffff00000808a074>] show_stack+0x24/0x2c
[ 5945.603148] [<ffff000008855c28>] dump_stack+0x84/0xa8
[ 5945.606676] [<ffff000008216e34>] warn_alloc+0x11c/0x1ac
[ 5945.614536] [<ffff000008217ddc>] __alloc_pages_nodemask+0xe90/0xec0
[ 5945.624463] [<ffff00000827bca4>] alloc_pages_vma+0x90/0x1c0
[ 5945.628873] [<ffff00000824b574>] wp_page_copy+0x94/0x670
[ 5945.633271] [<ffff00000824ea40>] do_wp_page+0xbc/0x63c
[ 5945.639748] [<ffff000008251868>] __handle_mm_fault+0x4d0/0x560
[ 5945.650364] [<ffff0000082519d8>] handle_mm_fault+0xe0/0x178
[ 5945.655960] [<ffff000008872dc4>] do_page_fault+0x1c4/0x3cc
[ 5945.663762] [<ffff0000080813e8>] do_mem_abort+0x64/0xe4
[ 5945.756137] Mem-Info:
[ 5945.759687] active_anon:4916 inactive_anon:4896 isolated_anon:584
 active_file:65 inactive_file:50 isolated_file:0
 unevictable:0 dirty:0 writeback:58 unstable:0
 slab_reclaimable:353 slab_unreclaimable:2005
 mapped:86 shmem:5 pagetables:4117 bounce:0
 free:2810 free_pcp:10 free_cma:0
[ 5945.783426] Node 0 active_anon:307392kB inactive_anon:307648kB active_file:2752kB inactive_file:3200kB unevictable:0kB isolated(anon):37376kB isolated(file):0kB mapped:5504kB dirty:0kB writeback:2368kB shmem:320kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[ 5945.800403] Node 0 DMA free:195968kB min:75520kB low:94400kB high:113280kB active_anon:309184kB inactive_anon:311360kB active_file:4928kB inactive_file:4608kB unevictable:0kB writepending:0kB present:2097152kB managed:1537088kB mlocked:0kB kernel_stack:76544kB pagetables:263488kB bounce:0kB free_pcp:640kB local_pcp:320kB free_cma:0kB
[ 5945.817944] lowmem_reserve[]: 0 0 0
[ 5945.820200] Node 0 DMA: 1794*64kB (U) 236*128kB (U) 36*256kB (U) 1*512kB (U) 0*1024kB 1*2048kB (U) 0*4096kB 0*8192kB 1*16384kB (U) 1*32768kB (U) 0*65536kB 0*131072kB 0*262144kB 0*524288kB = 205952kB
[ 5945.830444] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 5945.835293] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=524288kB
[ 5945.845101] 1568 total pagecache pages
[ 5945.850497] 1505 pages in swap cache
[ 5945.854516] Swap cache stats: add 131425, delete 129953, find 93983/140484
[ 5945.861924] Free swap  = 208256kB
[ 5945.864822] Total swap = 2098112kB
[ 5945.867189] 32768 pages RAM
[ 5945.869354] 0 pages HighMem/MovableOnly
[ 5945.873040] 8751 pages reserved
[ 5945.876243] 0 pages hwpoisoned
[ 5979.408965] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[ 5979.414229] [ 1334]     0  1334      237        3       4       2       37             0 systemd-journal
[ 5979.419778] [ 1354]     0  1354     1282        0       4       2       43             0 lvmetad
[ 5979.425682] [ 1364]     0  1364      243        2       4       2       42         -1000 systemd-udevd
:
:
[ 5979.569754] [11382]     0 11382     1739        0       4       2       15             0 run_test.sh
[ 5979.575016] [11652]     0 11652     1785        2       3       2       62             0 bash
[ 5979.579985] [19821]     0 19821     1785        1       3       2       62             0 bash
[ 5979.584878] [19822]     0 19822     1715        1       3       2        8             0 tee
[ 5979.589861] [20003]     0 20003     1828        2       4       2      104             0 bash
[ 5979.594729] [15391]     0 15391     1743        1       5       2       23             0 anacron
[ 5979.599854] [17647]     0 17647     1834        4       4       2      108             0 bash
[ 5979.604748] [17648]     0 17648     1715        1       4       2        9             0 tee
[ 5979.609712] [17832]     0 17832     1831       30       4       2       76             0 bash
[ 5979.614600] [17834]     0 17834     1831        0       4       2      109             0 bash
[ 5979.619561] [17835]     0 17835     1828        9       4       2       97             0 bash
[ 5979.624770] [17836]     0 17836     1831       23       4       2       89             0 bash
[ 5979.629739] [17841]     0 17841     1828        0       4       2      109             0 bash
:
:
[ 5986.229441] [22230]     0 22230     1831       24       4       2       83             0 bash
[ 5986.234602] [22231]     0 22231     1828       26       4       2       79             0 bash
[ 5986.239474] [22232]     0 22232     1834       24       4       2       86             0 bash
[ 5986.244709] [22233]     0 22233     1831       15       4       2       92             0 bash
[ 5986.249630] [22234]     0 22234     1831       20       4       2       86             0 bash
[ 5986.254535] [22235]     0 22235     1834       22       4       2       88             0 bash
[ 5986.259377] Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled

It was initially a bit of a surprise that there was any swap in use, because Lustre runs in the kernel and cannot be swapped out. However, that swap (roughly 1.8GB in use, going by the "Free swap"/"Total swap" lines above) is consumed by the nearly 1000 bash processes running on the node, in addition to many lfs and rm processes.
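
For illustration, the fork-heavy pattern looks roughly like the sketch below. This is not the actual sanity.sh test_103b code; the loop body and the $DIR/$tfile usage are simplified stand-ins, and only the process-creation pattern matters: each umask value gets its own backgrounded bash subshell, so all 512 run at once.

# Illustrative sketch only -- not the real test_103b implementation.
# One backgrounded subshell per umask value, all forked before any wait.
for ((i = 0; i < 512; i++)); do
    (
        umask $(printf '%03o' $i)     # apply one of the 512 possible umask values
        touch $DIR/$tfile.$i          # create a file under that umask
        # ... check that the resulting file mode matches the umask ...
    ) &
done
wait

With 64KB pages, each of those subshells (plus the lfs/rm helpers they spawn) carries noticeably more stack and page-table overhead than on x86, which is what pushes this 2GB node (32768 pages of 64KB) into the allocation stalls and OOM panic above.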

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
sanity test_103b - onyx-90vm17 crashed during sanity test_103b



 Comments   
Comment by Gerrit Updater [ 22/Jan/19 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34082
Subject: LU-11878 tests: don't fork-bomb sanity test_103b
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c2d990eb8ca072e05d2e585212056e1bda5ebbbd
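
For context, one way to avoid the fork bomb is to cap how many subshells are alive at any one time and drain them in batches. The actual change is in the patch linked above and may well differ; the snippet below is only a sketch of the idea, and MAX_JOBS is a made-up name.

# Illustrative sketch only -- see https://review.whamcloud.com/34082 for the real fix.
MAX_JOBS=16
for ((i = 0; i < 512; i++)); do
    (
        umask $(printf '%03o' $i)
        touch $DIR/$tfile.$i
    ) &
    # Drain the current batch before forking more, so at most
    # MAX_JOBS subshells exist at once.
    (( (i + 1) % MAX_JOBS == 0 )) && wait
done
wait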

Comment by Andreas Dilger [ 23/Jan/19 ]

Have also seen crashes on x86:
https://testing.whamcloud.com/test_sets/0d516998-0fb9-11e9-b7d4-52540065bddc

Comment by Gerrit Updater [ 06/Feb/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34082/
Subject: LU-11878 tests: don't fork-bomb sanity test_103b
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 42c5c9c2ca3e44cb1c3e8ecb144bdd20fb35cddb

Comment by Peter Jones [ 06/Feb/19 ]

Landed for 2.13

Comment by Gerrit Updater [ 19/Mar/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34202/
Subject: LU-11878 tests: don't fork-bomb sanity test_103b
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: c84aa2ac96cf11258140834d58b36d8cd11c76c3
