Details
- Bug
- Resolution: Won't Fix
- Major
Description
This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>
This issue relates to the following test suite run:
https://testing.whamcloud.com/test_sets/f550b8a6-bef4-4010-a6ac-a2ca28ee60fe
test_smoke failed with the following error:
onyx-91vm6 crashed during lnet-selftest test_smoke
[ 468.793223] Lustre: DEBUG MARKER: onyx-91vm5.onyx.whamcloud.com: executing lst_setup
[ 469.145323] Lustre: DEBUG MARKER: onyx-91vm6.onyx.whamcloud.com: executing lst_setup
[ 469.232386] Lustre: DEBUG MARKER: onyx-91vm6.onyx.whamcloud.com: executing lst_setup
[ 471.585548] obd_memory max: 3765627, obd_memory current: 3440400
[ 471.585577] NetworkManager invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
[ 471.585593] CPU: 0 PID: 676 Comm: NetworkManager Kdump: loaded Tainted: G OE --------- - - 4.18.0-372.16.1.el8_6.aarch64 #1
[ 471.585601] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
[ 471.585606] Call trace:
[ 471.585609] dump_backtrace+0x0/0x160
[ 471.585675] show_stack+0x28/0x38
[ 471.585679] dump_stack+0x5c/0x74
[ 471.585723] dump_header+0x4c/0x1e0
[ 471.585751] out_of_memory+0x410/0x510
[ 471.585755] __alloc_pages_nodemask+0xd74/0xde8
[ 471.585766] alloc_pages_vma+0x94/0x1f8
[ 471.585771] __read_swap_cache_async+0x100/0x280
[ 471.585780] read_swap_cache_async+0x60/0xa8
[ 471.585786] swap_cluster_readahead+0x294/0x2f8
[ 471.585791] swapin_readahead+0x2a4/0x3c4
[ 471.585795] do_swap_page+0x55c/0x868
[ 471.585803] __handle_mm_fault+0x4c4/0x590
[ 471.585808] handle_mm_fault+0xe0/0x180
[ 471.585813] do_page_fault+0x164/0x488
[ 471.585824] do_translation_fault+0xa0/0xb0
[ 471.585830] do_mem_abort+0x54/0xb0
[ 471.585835] el1_da+0x1c/0x98
[ 471.585840] do_sys_poll+0x3c0/0x560
[ 471.585854] __arm64_sys_ppoll+0xc8/0x120
[ 471.585859] el0_svc_handler+0xb4/0x188
[ 471.585871] el0_svc+0x8/0xc
[ 471.585878] Mem-Info:
[ 471.585883] active_anon:20 inactive_anon:0 isolated_anon:0 active_file:21 inactive_file:46 isolated_file:0 unevictable:0 dirty:0 writeback:0 slab_reclaimable:425 slab_unreclaimable:926 mapped:92 shmem:1 pagetables:163 bounce:0 free:2328 free_pcp:30 free_cma:0
[ 471.585896] Node 0 active_anon:1280kB inactive_anon:0kB active_file:1344kB inactive_file:2944kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:5888kB dirty:0kB writeback:0kB shmem:64kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB kernel_stack:9920kB pagetables:10432kB all_unreclaimable? no
[ 471.585910] Node 0 DMA32 free:148992kB min:660032kB low:693952kB high:727872kB active_anon:2880kB inactive_anon:0kB active_file:0kB inactive_file:7936kB unevictable:0kB writepending:0kB present:3145728kB managed:2758144kB mlocked:0kB bounce:0kB free_pcp:1920kB local_pcp:960kB free_cma:0kB
[ 471.585927] lowmem_reserve[]: 0 0 0
[ 471.585938] Node 0 DMA32: 232*64kB (UM) 50*128kB (UM) 97*256kB (UM) 46*512kB (M) 26*1024kB (M) 8*2048kB (M) 4*4096kB (M) 3*8192kB (M) 0*16384kB 0*32768kB 0*65536kB 0*131072kB 0*262144kB 0*524288kB = 153600kB
[ 471.587258] Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled
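As a reading aid for the OOM report above, the following is a minimal sketch that pulls the kB fields out of the DMA32 zone line and compares free memory against the min watermark. The helper name `zone_fields` and the variable names are hypothetical; the zone line itself is copied from the log. It only restates what the log already shows and does not identify the cause.

```python
import re

# DMA32 zone line copied from the OOM report above.
ZONE_LINE = (
    "Node 0 DMA32 free:148992kB min:660032kB low:693952kB high:727872kB "
    "active_anon:2880kB inactive_anon:0kB active_file:0kB inactive_file:7936kB "
    "unevictable:0kB writepending:0kB present:3145728kB managed:2758144kB "
    "mlocked:0kB bounce:0kB free_pcp:1920kB local_pcp:960kB free_cma:0kB"
)


def zone_fields(line):
    """Return {field_name: value_in_kB} for every 'name:<number>kB' pair."""
    return {name: int(kb) for name, kb in re.findall(r"(\w+):(\d+)kB", line)}


fields = zone_fields(ZONE_LINE)

# Free memory below the min watermark means normal allocations cannot be
# satisfied from this zone without reclaim.
below_min = fields["free"] < fields["min"]

# Share of the zone's managed memory that the min watermark reserves.
min_share = fields["min"] / fields["managed"]

print(f"free={fields['free']}kB min={fields['min']}kB below_min={below_min}")
print(f"min watermark is {min_share:.0%} of managed memory")
```

On this log, free memory is well below the min watermark, and the min watermark is roughly a quarter of the zone's managed memory.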
All lnet-selftest runs with aarch64 clients since 2022-09-01 have started crashing due to OOM, while none of the test runs before that date have failed.
https://testing.whamcloud.com/search?client_architecture_type_id=a697f55a-d8a4-11e8-975a-52540065bddc&test_set_script_id=c24874b2-4a56-11e0-a7f6-52540025f9af&sub_test_script_id=c252c2c8-4a56-11e0-a7f6-52540025f9af&start_date=2022-08-29&end_date=2022-09-07&source=sub_tests#redirect
This means it is very likely a regression caused by a patch that landed on 2022-09-01. The patches that landed around that time are listed below; since the failure is in lnet-selftest, only libcfs/lnet patches could be causing the problem:
git log --oneline --after 2022-08-31 --before 2022-09-02
286924f8a0 LU-9859 libcfs: remove Lustre specific bitmap handling
0e48653c27 LU-16085 llite: fix stat attributes_mask
dfc6beade3 LU-16093 kernel: kernel update SLES12 SP5 [4.12.14-122.130.1]
fef1db004c LU-16084 tests: fix lustre-patched filefrag check
162336079d LU-15994 tests: add testing for io_uring via fio
ac6528af8d LU-15548 tests: skip conf-sanity/131 for older servers
d54114d0c5 LU-15873 obd: skip checking read-only fs health
d851381ea6 LU-1904 idl: add checks for OBD_CONNECT flags
155cbc22ba LU-16012 sec: fix detection of SELinux enforcement
807f3a0779 LU-16048 build: Update ZFS version to 2.1.5
afa4c31087 LU-16045 enc: force use of new enc xattr on new servers
2612cf4ad8 LU-16035 kfilnd: Initial kfilnd implementation
62b470a023 LU-16027 tests: sanity:test_66: specify blocksize explicitly
9899144862 LU-15999 tests: format journal with correct block size
52057d85ea LU-15393 tests: check QoS hang with OST failover
26beb8664f LU-16081 lnet: Memory leak on adding existing interface *
d4978678b4 LU-15694 quota: keep grace time while setting default limits
b515c6ec2a LU-15642 obdclass: use consistent stats units
84b1ca8618 LU-15930 lnet: Remove duplicate checks for peer sensitivity *
2431e099b1 LU-15929 lnet: Correct net selection for router ping *
caf6095ade LU-15595 lnet: LNet peer aliveness broken *
8ee85e1541 LU-15595 tests: Add various router tests
ff3322fd0c LU-14955 lnet: Use fatal NI if none other available *
8a7aa8d590 LU-16058 build: proc_ops check fails with SUBARCH undefined
fab404836d LU-12514 target: move server mount code to target layer
f1c8ac1156 LU-15811 llite: Refactor DIO/AIO free code
36c34af607 LU-15811 llite: Unify range unlock
f3fe144b85 LU-15003 sec: use enc pool for bounce pages
9ca348e876 LU-14719 utils: dir migration stop on error
a20b78a81d LU-15357 iokit: fix the obsolete usage of cfg_device
51c1853933 LU-15811 llite: Rework upper/lower DIO/AIO
9e2df7e5cc LU-10391 lnet: change ni_status in lnet_ni to u32* *
c92bdd97d9 LU-16056 libcfs: restore umask handling in kernel threads *
76c3fa96dc LU-16082 ldiskfs: old-style EA inode handling fix
2447564e12 LU-16011 lnet: use preallocate bulk for server ***
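The narrowing step described above (keeping only libcfs/lnet patches) can be sketched as a small filter over `git log --oneline` output. This is a rough illustration, not the procedure used for this ticket: the function name `candidates`, the regular expression, and the assumption that the component is the word before the first colon of the subject are all hypothetical, and the sample holds only a few lines from the list above with the trailing `*` markers dropped.

```python
import re

# A few lines from the commit list above, in `git log --oneline` form.
SAMPLE = """\
286924f8a0 LU-9859 libcfs: remove Lustre specific bitmap handling
0e48653c27 LU-16085 llite: fix stat attributes_mask
26beb8664f LU-16081 lnet: Memory leak on adding existing interface
2612cf4ad8 LU-16035 kfilnd: Initial kfilnd implementation
2447564e12 LU-16011 lnet: use preallocate bulk for server
"""

# Assumed subject layout: "<sha> LU-<n> <component>: <summary>".
LINE = re.compile(
    r"^(?P<sha>[0-9a-f]+) (?P<ticket>LU-\d+) (?P<component>[\w-]+): (?P<summary>.+)$"
)


def candidates(oneline, components=("libcfs", "lnet")):
    """Return (sha, ticket, summary) for commits whose component matches."""
    picked = []
    for line in oneline.splitlines():
        m = LINE.match(line.strip())
        if m and m.group("component") in components:
            picked.append((m.group("sha"), m.group("ticket"), m.group("summary")))
    return picked


picked = candidates(SAMPLE)
for sha, ticket, summary in picked:
    print(sha, ticket, summary)
```

A subject-prefix filter can miss patches that touch the same code under a different component name. In a checkout, path-limiting the same command (`git log --oneline ... -- libcfs lnet`, assuming the tree has top-level `libcfs/` and `lnet/` directories) is an alternative that selects by files changed instead.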
It is noteworthy that the three test runs that have passed since 2022-09-01 are based on commit 3f0bee2502, "LU-15959 kernel: new kernel [SLES15 SP4 5.14.21-150400.24.18.1]", which is one patch before the first commit 2447564e12 "LU-16011 lnet: use preallocate bulk for server" in the above list.
VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
lnet-selftest test_smoke - onyx-91vm6 crashed during lnet-selftest test_smoke
Issue Links
- is duplicated by: LU-16151 lnet-selftest: Crash OOM on aarch64 (Resolved)