[LU-16140] lnet-selftest test_smoke: OOM crash on aarch64 and arm Created: 07/Sep/22 Updated: 15/Sep/22 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Maloo | Assignee: | Alexey Lyashkov |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

This issue relates to the following test suite run:

test_smoke failed with the following error: onyx-91vm6 crashed during lnet-selftest test_smoke

[ 468.793223] Lustre: DEBUG MARKER: onyx-91vm5.onyx.whamcloud.com: executing lst_setup
[ 469.145323] Lustre: DEBUG MARKER: onyx-91vm6.onyx.whamcloud.com: executing lst_setup
[ 469.232386] Lustre: DEBUG MARKER: onyx-91vm6.onyx.whamcloud.com: executing lst_setup
[ 471.585548] obd_memory max: 3765627, obd_memory current: 3440400
[ 471.585577] NetworkManager invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
[ 471.585593] CPU: 0 PID: 676 Comm: NetworkManager Kdump: loaded Tainted: G OE --------- - - 4.18.0-372.16.1.el8_6.aarch64 #1
[ 471.585601] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
[ 471.585606] Call trace:
[ 471.585609]  dump_backtrace+0x0/0x160
[ 471.585675]  show_stack+0x28/0x38
[ 471.585679]  dump_stack+0x5c/0x74
[ 471.585723]  dump_header+0x4c/0x1e0
[ 471.585751]  out_of_memory+0x410/0x510
[ 471.585755]  __alloc_pages_nodemask+0xd74/0xde8
[ 471.585766]  alloc_pages_vma+0x94/0x1f8
[ 471.585771]  __read_swap_cache_async+0x100/0x280
[ 471.585780]  read_swap_cache_async+0x60/0xa8
[ 471.585786]  swap_cluster_readahead+0x294/0x2f8
[ 471.585791]  swapin_readahead+0x2a4/0x3c4
[ 471.585795]  do_swap_page+0x55c/0x868
[ 471.585803]  __handle_mm_fault+0x4c4/0x590
[ 471.585808]  handle_mm_fault+0xe0/0x180
[ 471.585813]  do_page_fault+0x164/0x488
[ 471.585824]  do_translation_fault+0xa0/0xb0
[ 471.585830]  do_mem_abort+0x54/0xb0
[ 471.585835]  el1_da+0x1c/0x98
[ 471.585840]  do_sys_poll+0x3c0/0x560
[ 471.585854]  __arm64_sys_ppoll+0xc8/0x120
[ 471.585859]  el0_svc_handler+0xb4/0x188
[ 471.585871]  el0_svc+0x8/0xc
[ 471.585878] Mem-Info:
[ 471.585883] active_anon:20 inactive_anon:0 isolated_anon:0 active_file:21 inactive_file:46 isolated_file:0 unevictable:0 dirty:0 writeback:0 slab_reclaimable:425 slab_unreclaimable:926 mapped:92 shmem:1 pagetables:163 bounce:0 free:2328 free_pcp:30 free_cma:0
[ 471.585896] Node 0 active_anon:1280kB inactive_anon:0kB active_file:1344kB inactive_file:2944kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:5888kB dirty:0kB writeback:0kB shmem:64kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB kernel_stack:9920kB pagetables:10432kB all_unreclaimable? no
[ 471.585910] Node 0 DMA32 free:148992kB min:660032kB low:693952kB high:727872kB active_anon:2880kB inactive_anon:0kB active_file:0kB inactive_file:7936kB unevictable:0kB writepending:0kB present:3145728kB managed:2758144kB mlocked:0kB bounce:0kB free_pcp:1920kB local_pcp:960kB free_cma:0kB
[ 471.585927] lowmem_reserve[]: 0 0 0
[ 471.585938] Node 0 DMA32: 232*64kB (UM) 50*128kB (UM) 97*256kB (UM) 46*512kB (M) 26*1024kB (M) 8*2048kB (M) 4*4096kB (M) 3*8192kB (M) 0*16384kB 0*32768kB 0*65536kB 0*131072kB 0*262144kB 0*524288kB = 153600kB
[ 471.587258] Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled

All lnet-selftest runs with aarch64 clients since 2022-09-01 have started crashing due to OOM, while none of the test runs before that date have failed. This means it is very likely a regression caused by a patch landing on 2022-09-01.
The patches that landed at that time are listed below, and since the failure is in lnet-selftest, only the libcfs/lnet patches could be causing the problem:

git log --oneline --after 2022-08-31 --before 2022-09-02
286924f8a0 LU-9859 libcfs: remove Lustre specific bitmap handling
0e48653c27 LU-16085 llite: fix stat attributes_mask
dfc6beade3 LU-16093 kernel: kernel update SLES12 SP5 [4.12.14-122.130.1]
fef1db004c LU-16084 tests: fix lustre-patched filefrag check
162336079d LU-15994 tests: add testing for io_uring via fio
ac6528af8d LU-15548 tests: skip conf-sanity/131 for older servers
d54114d0c5 LU-15873 obd: skip checking read-only fs health
d851381ea6 LU-1904 idl: add checks for OBD_CONNECT flags
155cbc22ba LU-16012 sec: fix detection of SELinux enforcement
807f3a0779 LU-16048 build: Update ZFS version to 2.1.5
afa4c31087 LU-16045 enc: force use of new enc xattr on new servers
2612cf4ad8 LU-16035 kfilnd: Initial kfilnd implementation
62b470a023 LU-16027 tests: sanity:test_66: specify blocksize explicitly
9899144862 LU-15999 tests: format journal with correct block size
52057d85ea LU-15393 tests: check QoS hang with OST failover
26beb8664f LU-16081 lnet: Memory leak on adding existing interface *
d4978678b4 LU-15694 quota: keep grace time while setting default limits
b515c6ec2a LU-15642 obdclass: use consistent stats units
84b1ca8618 LU-15930 lnet: Remove duplicate checks for peer sensitivity *
2431e099b1 LU-15929 lnet: Correct net selection for router ping *
caf6095ade LU-15595 lnet: LNet peer aliveness broken *
8ee85e1541 LU-15595 tests: Add various router tests
ff3322fd0c LU-14955 lnet: Use fatal NI if none other available *
8a7aa8d590 LU-16058 build: proc_ops check fails with SUBARCH undefined
fab404836d LU-12514 target: move server mount code to target layer
f1c8ac1156 LU-15811 llite: Refactor DIO/AIO free code
36c34af607 LU-15811 llite: Unify range unlock
f3fe144b85 LU-15003 sec: use enc pool for bounce pages
9ca348e876 LU-14719 utils: dir migration stop on error
a20b78a81d LU-15357 iokit: fix the obsolete usage of cfg_device
51c1853933 LU-15811 llite: Rework upper/lower DIO/AIO
9e2df7e5cc LU-10391 lnet: change ni_status in lnet_ni to u32* *
c92bdd97d9 LU-16056 libcfs: restore umask handling in kernel threads *
76c3fa96dc LU-16082 ldiskfs: old-style EA inode handling fix
2447564e12 LU-16011 lnet: use preallocate bulk for server ***

It is noteworthy that the three test runs that have passed since 2022-09-01 are based on commit 3f0bee2502. |
| Comments |
| Comment by Andreas Dilger [ 07/Sep/22 ] |
|
Shadow, could you please take a look at this? It looks like your patch https://review.whamcloud.com/47952 " |
| Comment by Alexey Lyashkov [ 07/Sep/22 ] |
|
@Andreas, this patch passed Oleg's testing with 2GB of RAM per VM. For much more memory it is limited by the number of work items. |
| Comment by Alexey Lyashkov [ 07/Sep/22 ] |
|
Please don't forget, o2iblnd has a leak introduced by "babf023227":

kib_connd_waits = {
  next = 0xffff88826f956a18,
  prev = 0xffff88851866ec18
},
crash> list 0xffff88826f956a18 | wc -l
15009

I know the revert has already been created, but did you run an arm64 test with this revert? |
| Comment by Andreas Dilger [ 07/Sep/22 ] |
I don't think Oleg tests with aarch64 clients, and the OOM is not being hit on x86_64 clients.
There are coredumps available, but I think they can only be downloaded directly via our VPN. green, would you be able to help Shadow get access to the coredump? The one from the most recent failure https://testing.whamcloud.com/test_sets/bcf91e7c-563a-4c7f-8137-17f785785b8b reports at the top of the vmcore-dmesg file:

Core file can be found in /scratch/dumps/trevis-108vm2.trevis.whamcloud.com/10.240.44.205-2022-09-07-14:53:37 |
| Comment by Alexey Lyashkov [ 07/Sep/22 ] |
|
BTW, how much RAM is in our aarch64 VMs? I will create my own aarch64 guests with 2GB of RAM (due to a soft-emulation bug I am not able to have more with an IB card). |
| Comment by Andreas Dilger [ 07/Sep/22 ] |
|
I was just looking at the console messages in the original test failure, and they show a more useful stack trace in brw_srpc_init():

[ 474.320712] Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled
[ 474.335774] CPU: 1 PID: 19827 Comm: modprobe Kdump: loaded Tainted: G OE --------- - - 4.18.0-372.16.1.el8_6.aarch64 #1
[ 474.344303] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
[ 474.349142] Call trace:
[ 474.350920]  dump_backtrace+0x0/0x160
[ 474.353462]  show_stack+0x28/0x38
[ 474.355839]  dump_stack+0x5c/0x74
[ 474.358227]  panic+0x140/0x30c
[ 474.360369]  out_of_memory+0x438/0x510
[ 474.363003]  __alloc_pages_nodemask+0xd74/0xde8
[ 474.366230]  srpc_alloc_bulk+0x130/0x5d0 [lnet_selftest]
[ 474.370112]  brw_srpc_init+0x2c/0x60 [lnet_selftest]
[ 474.373562]  srpc_add_service+0x2a4/0x740 [lnet_selftest]
[ 474.377337]  sfw_startup+0x23c/0x898 [lnet_selftest]
[ 474.380852]  lnet_selftest_init+0x5f4/0x10000 [lnet_selftest]
[ 474.384850]  do_one_initcall+0x58/0x1f0
[ 474.387581]  do_init_module+0x64/0x238
[ 474.390197]  load_module+0x13ec/0x1690
[ 474.392839]  __se_sys_finit_module+0xa4/0x100
[ 474.395916]  __arm64_sys_finit_module+0x28/0x38
[ 474.399093]  el0_svc_handler+0xb4/0x188
[ 474.401796]  el0_svc+0x8/0xc
[ 474.405020] SMP: stopping secondary CPUs
[ 474.410883] Starting crashdump kernel...
[ 474.413584] Bye!

It looks like the OOM was hit on both aarch64 nodes (console.onyx-91vm5.log and console.onyx-91vm6.log) at the same time.
There should be 3GB of RAM for the client VMs. I believe all of the testing is with TCP and not IB on these nodes. |
| Comment by Andreas Dilger [ 07/Sep/22 ] |
|
I pushed an (untested) patch https://review.whamcloud.com/48476 with some debugging, to run on aarch64 and show how large the failing allocation is. |
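|
The later log excerpts in this ticket show messages of the form "allocate N bulk pages, M bytes on cpt C" coming from srpc_alloc_bulk() in lnet/selftest/rpc.c. The sketch below shows roughly what such instrumentation could look like; it is illustrative only, not the actual 48476 patch, and the structure layout is an assumption inferred from the 4120-byte size in the logs (a small header plus 256 iov entries of 16 bytes each):

/* Illustrative sketch only -- not the actual debugging patch.
 * CWARN() is the libcfs console-warning macro; the struct layout
 * below is assumed, not copied from selftest.h. */
struct srpc_bulk_sketch {
	unsigned int	bk_len;		/* assumed header fields */
	unsigned int	bk_niov;	/* number of iov entries */
	struct bio_vec	bk_iovs[];	/* one entry per bulk page */
};

static void *srpc_alloc_bulk_sketch(int cpt, unsigned int bulk_npg)
{
	unsigned int size = offsetof(struct srpc_bulk_sketch,
				     bk_iovs[bulk_npg]);

	CWARN("allocate %u bulk pages, %u bytes on cpt %d\n",
	      bulk_npg, size, cpt);

	return kzalloc(size, GFP_KERNEL);
}
|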
| Comment by Alexey Lyashkov [ 08/Sep/22 ] |
|
Andreas, it looks like I found the root cause of it. LST wants something like a " |
| Comment by Xinliang Liu [ 08/Sep/22 ] |
|
Hi Alexey and Andreas, thanks for analyzing and addressing this issue. We also found that lnet-selftest test_smoke has recently started crashing on our arm64 CI. Looking forward to the fix patch |
| Comment by Alexey Lyashkov [ 08/Sep/22 ] |
|
It looks like I am able to replicate this issue locally. I hope the patch will be ready at the beginning of next week. PS: the Lustre test-framework works quite badly with a different path and a different arch without installing from RPM. |
| Comment by Andreas Dilger [ 08/Sep/22 ] |
|
I had just found the same thing with my debug patch:

[ 476.508122] Lustre: 14798:0:(brw_test.c:544:brw_init_test_service()) cache_size = 20578/41157 pages (80 IOVs)
[ 476.515256] Lustre: 14798:0:(brw_test.c:549:brw_init_test_service()) want 2048 work items, got 80
[ 476.522193] LNet: 14798:0:(rpc.c:148:srpc_alloc_bulk()) allocate 256 bulk pages, 4120 bytes on cpt 0
[ 476.528390] LNet: 14798:0:(rpc.c:162:srpc_alloc_bulk()) allocate bulk page 1/256 on cpt 0
[ 476.534440] LNet: 14798:0:(rpc.c:148:srpc_alloc_bulk()) allocate 256 bulk pages, 4120 bytes on cpt 0
**** repeats 70 times ****
[ 477.019823] LNet: 14798:0:(rpc.c:148:srpc_alloc_bulk()) allocate 256 bulk pages, 4120 bytes on cpt 0
[ 477.047753] LNet: 14798:0:(rpc.c:162:srpc_alloc_bulk()) allocate bulk page 12/256 on cpt 0
[ 477.053733] LNet: 14798:0:(rpc.c:162:srpc_alloc_bulk()) Skipped 18442 previous similar messages
[ 477.061430] LNet: 14798:0:(rpc.c:148:srpc_alloc_bulk()) allocate 256 bulk pages, 4120 bytes on cpt 0
**** repeats 68 times ****
[ 478.028197] LNet: 14798:0:(rpc.c:148:srpc_alloc_bulk()) allocate 256 bulk pages, 4120 bytes on cpt 0
[ 478.110577] obd_memory max: 3765625, obd_memory current: 3440400
[ 478.114656] systemd-journal invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0

What is also unexpected is that this is preallocating a lot more than 80 bulk descriptors, about 140 in total before the OOM is hit. Since 80 desc * 256 pages/desc * 64KiB/page = 1310720 KiB = 1.25 GiB is already half of RAM (20480 of 41157 pages), the OOM is happening when over 2.2 GiB (87% of RAM) has been allocated by these descriptors.

So in addition to the 4KB vs. 64KB PAGE_SIZE issue, there is also a problem with the cache_size calculation in brw_init_test_service(). That is twice as much RAM allocated as expected; maybe it is per-NID or per-CPT, or two "services" are started, which is bad. This needs to be limited to a more reasonable amount of memory. |
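|
To make the above arithmetic concrete, here is a standalone back-of-the-envelope calculation (plain userspace C, not Lustre code). The 41157-page RAM total and 256 pages per descriptor come straight from the log above; the descriptor counts of 80 and ~140 are the granted and observed values discussed above:

#include <stdio.h>

int main(void)
{
	/* values taken from the brw_init_test_service()/srpc_alloc_bulk() log lines */
	unsigned long page_size  = 64UL * 1024;          /* aarch64 64KiB pages */
	unsigned long pages_desc = 256;                  /* bulk pages per descriptor */
	unsigned long total_ram  = 41157 * page_size;    /* ~2.5 GiB as reported */

	/* 80 descriptors actually granted as work items */
	unsigned long bulk80 = 80 * pages_desc * page_size;
	printf("80 descriptors:  %lu KiB = %.2f GiB (%.0f%% of RAM)\n",
	       bulk80 >> 10, bulk80 / (double)(1UL << 30),
	       100.0 * bulk80 / total_ram);

	/* ~140 descriptors were observed allocated before the OOM panic */
	unsigned long bulk140 = 140 * pages_desc * page_size;
	printf("140 descriptors: %.2f GiB (%.0f%% of RAM)\n",
	       bulk140 / (double)(1UL << 30), 100.0 * bulk140 / total_ram);

	/* the same descriptor count on x86_64 with 4KiB pages */
	unsigned long bulk_x86 = 140 * pages_desc * 4096UL;
	printf("x86_64 equivalent: %.2f GiB\n", bulk_x86 / (double)(1UL << 30));
	return 0;
}

This prints roughly 1.25 GiB (50% of RAM) for 80 descriptors and 2.19 GiB (87%) for 140, versus about 0.14 GiB for the same descriptor count with 4KiB pages: the 16x PAGE_SIZE factor behind the aarch64-only OOM.
|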
| Comment by Gerrit Updater [ 09/Sep/22 ] |
|
"Alexey Lyashkov <alexey.lyashkov@hpe.com>" uploaded a new patch: https://review.whamcloud.com/48507 |
| Comment by Alexey Lyashkov [ 09/Sep/22 ] |
|
> [ 476.522193] LNet: 14798:0:(rpc.c:148:srpc_alloc_bulk()) allocate 256 bulk pages, 4120 bytes on cpt 0

This is the primary failure (for both the client and server parts). Andreas, can you run the attached patch on our ARM? It passed my verification with 3GB of RAM.

Loading modules from /work/lustre/wc/LU-16140-arm-mem/lustre/lustre
detected 8 online CPUs by sysfs
libcfs will create CPU partition based on online CPUs
../lnet/lnet/lnet options: 'networks="tcp0(eth0)" lnet_peer_discovery_disabled=0 lnet_health_sensitivity=1 accept=all'
gss/krb5 is not supported
loading modules on: 'devel1'
devel1: devel1: executing load_modules_local
devel1: Loading modules from /work/lustre/wc/lustre-basic/lustre
devel1: /work/lustre/wc/lustre-basic/lustre//tests/test-framework.sh: line 693: echo: write error: Device or resource busy
devel1: /work/lustre/wc/lustre-basic/lustre//tests/test-framework.sh: line 694: echo: write error: Device or resource busy
devel1: detected 8 online CPUs by sysfs
devel1: libcfs will create CPU partition based on online CPUs
devel1: ../lnet/lnet/lnet options: 'networks="tcp0(eth0)" lnet_peer_discovery_disabled=0 lnet_health_sensitivity=1 accept=all'
devel1: ptlrpc/ptlrpc options: 'lbug_on_grant_miscount=1'
devel1: gss/krb5 is not supported
devel1: quota/lquota options: 'hash_lqs_cur_bits=3'
192.168.100.1: LNET configured
LNET configured
12345-0@lo
12345-192.168.100.1@tcp
SESSION: read/write FEATURES: 1 TIMEOUT: 300 FORCE: No
192.168.100.1@tcp are added to session
192.168.100.5@tcp are added to session
Test was added successfully
bulk_rw is running now
[LNet Rates of servers]
[R] Avg: 119 RPC/s Min: 119 RPC/s Max: 119 RPC/s
[W] Avg: 123 RPC/s Min: 123 RPC/s Max: 123 RPC/s
[LNet Bandwidth of servers]
[R] Avg: 61.79 MiB/s Min: 61.79 MiB/s Max: 61.79 MiB/s
[W] Avg: 0.01 MiB/s Min: 0.01 MiB/s Max: 0.01 MiB/s
[LNet Rates of servers]
[R] Avg: 119 RPC/s Min: 119 RPC/s Max: 119 RPC/s
[W] Avg: 120 RPC/s Min: 120 RPC/s Max: 120 RPC/s
[LNet Bandwidth of servers]
[R] Avg: 60.27 MiB/s Min: 60.27 MiB/s Max: 60.27 MiB/s
[W] Avg: 0.01 MiB/s Min: 0.01 MiB/s Max: 0.01 MiB/s
[LNet Rates of servers]
[R] Avg: 125 RPC/s Min: 125 RPC/s Max: 125 RPC/s
[W] Avg: 124 RPC/s Min: 124 RPC/s Max: 124 RPC/s
[LNet Bandwidth of servers]
[R] Avg: 61.92 MiB/s Min: 61.92 MiB/s Max: 61.92 MiB/s
[W] Avg: 0.01 MiB/s Min: 0.01 MiB/s Max: 0.01 MiB/s
[LNet Rates of servers]
[R] Avg: 122 RPC/s Min: 122 RPC/s Max: 122 RPC/s
[W] Avg: 122 RPC/s Min: 122 RPC/s Max: 122 RPC/s
[LNet Bandwidth of servers]
[R] Avg: 61.10 MiB/s Min: 61.10 MiB/s Max: 61.10 MiB/s
[W] Avg: 0.01 MiB/s Min: 0.01 MiB/s Max: 0.01 MiB/s
[LNet Rates of servers]
[R] Avg: 123 RPC/s Min: 123 RPC/s Max: 123 RPC/s
[W] Avg: 123 RPC/s Min: 123 RPC/s Max: 123 RPC/s
[LNet Bandwidth of servers]
[R] Avg: 61.66 MiB/s Min: 61.66 MiB/s Max: 61.66 MiB/s
[W] Avg: 0.01 MiB/s Min: 0.01 MiB/s Max: 0.01 MiB/s
session is ended
test-tcp.sh: line 32: 5280 Terminated lst stat servers
pdsh@devel4: devel1: ssh exited with exit code 126
pdsh@devel4: devel1: ssh exited with exit code 126
pdsh@devel4: devel1: ssh exited with exit code 126
pdsh@devel4: devel1: ssh exited with exit code 126
pdsh@devel4: devel1: ssh exited with exit code 126
pdsh@devel4: devel1: ssh exited with exit code 126
pdsh@devel4: devel1: ssh exited with exit code 126
pdsh@devel4: devel1: ssh exited with exit code 126
pdsh@devel4: devel1: ssh exited with exit code 126
Stopping clients: devel4 /mnt/lustre (opts:-f)
Stopping clients: devel4 /mnt/lustre2 (opts:-f)
pdsh@devel4: devel1: ssh exited with exit code 1
pdsh@devel4: devel1: ssh exited with exit code 1
pdsh@devel4: devel1: ssh exited with exit code 1
pdsh@devel4: devel1: ssh exited with exit code 1
unloading modules on: 'devel1'
devel1: devel1: executing unload_modules_local
devel1: /work/lustre/wc/lustre-basic/lustre/scripts/lustre_rmmod: line 99: echo: write error: Device or resource busy
devel1: rmmod: ERROR: Module ksocklnd is in use
pdsh@devel4: devel1: ssh exited with exit code 2
~
[root@devel4 ~]# uname -r
4.18.0-348.7.1.el8_5.aarch64
[root@devel4 ~]# free -h
total used free shared buff/cache available
Mem: 2,9Gi 411Mi 2,1Gi 18Mi 390Mi 2,1Gi
Swap: 2,0Gi 0B 2,0Gi
[root@devel4 ~]#
|
| Comment by Xinliang Liu [ 14/Sep/22 ] |
|
With patch 48507, our Arm CI no longer crashes. See https://testing.whamcloud.com/test_sessions/0f397625-0376-4391-948c-a3c2e8386d09 |
| Comment by Andreas Dilger [ 14/Sep/22 ] |
It didn't crash with OOM, but it didn't pass either... Also, the v1/v2 of the 48507 patch is crashing both aarch64 and x86_64, so that isn't a proper fix yet:
|
| Comment by Gerrit Updater [ 14/Sep/22 ] |
|
"Andreas Dilger <adilger@whamcloud.com>" merged in patch https://review.whamcloud.com/48457/ |