Lustre / LU-16140

lnet-selftest test_smoke: OOM crash on aarch64 and arm

Details

    • Bug
    • Resolution: Won't Fix
    • Major

    Description

      This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

      This issue relates to the following test suite run:
      https://testing.whamcloud.com/test_sets/f550b8a6-bef4-4010-a6ac-a2ca28ee60fe

      test_smoke failed with the following error:

      onyx-91vm6 crashed during lnet-selftest test_smoke
      
      [  468.793223] Lustre: DEBUG MARKER: onyx-91vm5.onyx.whamcloud.com: executing lst_setup
      [  469.145323] Lustre: DEBUG MARKER: onyx-91vm6.onyx.whamcloud.com: executing lst_setup
      [  469.232386] Lustre: DEBUG MARKER: onyx-91vm6.onyx.whamcloud.com: executing lst_setup
      [  471.585548] obd_memory max: 3765627, obd_memory current: 3440400
      [  471.585577] NetworkManager invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
      [  471.585593] CPU: 0 PID: 676 Comm: NetworkManager Kdump: loaded Tainted: G           OE    --------- -  - 4.18.0-372.16.1.el8_6.aarch64 #1
      [  471.585601] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
      [  471.585606] Call trace:
      [  471.585609]  dump_backtrace+0x0/0x160
      [  471.585675]  show_stack+0x28/0x38
      [  471.585679]  dump_stack+0x5c/0x74
      [  471.585723]  dump_header+0x4c/0x1e0
      [  471.585751]  out_of_memory+0x410/0x510
      [  471.585755]  __alloc_pages_nodemask+0xd74/0xde8
      [  471.585766]  alloc_pages_vma+0x94/0x1f8
      [  471.585771]  __read_swap_cache_async+0x100/0x280
      [  471.585780]  read_swap_cache_async+0x60/0xa8
      [  471.585786]  swap_cluster_readahead+0x294/0x2f8
      [  471.585791]  swapin_readahead+0x2a4/0x3c4
      [  471.585795]  do_swap_page+0x55c/0x868
      [  471.585803]  __handle_mm_fault+0x4c4/0x590
      [  471.585808]  handle_mm_fault+0xe0/0x180
      [  471.585813]  do_page_fault+0x164/0x488
      [  471.585824]  do_translation_fault+0xa0/0xb0
      [  471.585830]  do_mem_abort+0x54/0xb0
      [  471.585835]  el1_da+0x1c/0x98
      [  471.585840]  do_sys_poll+0x3c0/0x560
      [  471.585854]  __arm64_sys_ppoll+0xc8/0x120
      [  471.585859]  el0_svc_handler+0xb4/0x188
      [  471.585871]  el0_svc+0x8/0xc
      [  471.585878] Mem-Info:
      [  471.585883] active_anon:20 inactive_anon:0 isolated_anon:0
       active_file:21 inactive_file:46 isolated_file:0
       unevictable:0 dirty:0 writeback:0
       slab_reclaimable:425 slab_unreclaimable:926
       mapped:92 shmem:1 pagetables:163 bounce:0
       free:2328 free_pcp:30 free_cma:0
      [  471.585896] Node 0 active_anon:1280kB inactive_anon:0kB active_file:1344kB inactive_file:2944kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:5888kB dirty:0kB writeback:0kB shmem:64kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB kernel_stack:9920kB pagetables:10432kB all_unreclaimable? no
      [  471.585910] Node 0 DMA32 free:148992kB min:660032kB low:693952kB high:727872kB active_anon:2880kB inactive_anon:0kB active_file:0kB inactive_file:7936kB unevictable:0kB writepending:0kB present:3145728kB managed:2758144kB mlocked:0kB bounce:0kB free_pcp:1920kB local_pcp:960kB free_cma:0kB
      [  471.585927] lowmem_reserve[]: 0 0 0
      [  471.585938] Node 0 DMA32: 232*64kB (UM) 50*128kB (UM) 97*256kB (UM) 46*512kB (M) 26*1024kB (M) 8*2048kB (M) 4*4096kB (M) 3*8192kB (M) 0*16384kB 0*32768kB 0*65536kB 0*131072kB 0*262144kB 0*524288kB = 153600kB
      [  471.587258] Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled
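
The memory figures in the report above can be cross-checked with a short script. This is only a sketch: the two input strings are copied (abridged) from the log, while the helper names `zone_fields` and `buddy_total_kb` are made up for illustration. It sums the buddy-allocator free list and compares the zone's free memory against its `min` watermark, which is one plausible way to read why the allocation failed.

```python
import re

# Abridged copies of two lines from the OOM report above.
ZONE_LINE = ("Node 0 DMA32 free:148992kB min:660032kB low:693952kB "
             "high:727872kB present:3145728kB managed:2758144kB")
BUDDY_LINE = ("Node 0 DMA32: 232*64kB (UM) 50*128kB (UM) 97*256kB (UM) "
              "46*512kB (M) 26*1024kB (M) 8*2048kB (M) 4*4096kB (M) "
              "3*8192kB (M) 0*16384kB")


def zone_fields(line):
    """Return the 'name:NNNkB' pairs of a zone line as {name: kB}."""
    return {name: int(kb) for name, kb in re.findall(r"(\w+):(\d+)kB", line)}


def buddy_total_kb(line):
    """Sum the 'count*sizekB' entries of a buddy free-list line, in kB."""
    return sum(int(count) * int(size)
               for count, size in re.findall(r"(\d+)\*(\d+)kB", line))


zone = zone_fields(ZONE_LINE)
print("buddy free list total (kB):", buddy_total_kb(BUDDY_LINE))
print("zone free below min watermark:", zone["free"] < zone["min"])
print("min watermark as share of managed memory: %.1f%%"
      % (100.0 * zone["min"] / zone["managed"]))
```

The free-list total reproduces the `= 153600kB` figure printed by the kernel, and the zone's free memory is far below its `min` watermark at the time of the panic.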
      

      All lnet-selftest runs with aarch64 clients since 2022-09-01 have crashed due to OOM, while none of the test runs before that date failed:
      https://testing.whamcloud.com/search?client_architecture_type_id=a697f55a-d8a4-11e8-975a-52540065bddc&test_set_script_id=c24874b2-4a56-11e0-a7f6-52540025f9af&sub_test_script_id=c252c2c8-4a56-11e0-a7f6-52540025f9af&start_date=2022-08-29&end_date=2022-09-07&source=sub_tests#redirect

      This makes it very likely that the crash is a regression caused by a patch that landed on 2022-09-01. The patches that landed around that date are listed below. Since the failure is in lnet-selftest, only the libcfs/lnet patches could be causing the problem:

      git log --oneline --after 2022-08-31 --before 2022-09-02
      286924f8a0 LU-9859 libcfs: remove Lustre specific bitmap handling
      0e48653c27 LU-16085 llite: fix stat attributes_mask
      dfc6beade3 LU-16093 kernel: kernel update SLES12 SP5 [4.12.14-122.130.1]
      fef1db004c LU-16084 tests: fix lustre-patched filefrag check
      162336079d LU-15994 tests: add testing for io_uring via fio
      ac6528af8d LU-15548 tests: skip conf-sanity/131 for older servers
      d54114d0c5 LU-15873 obd: skip checking read-only fs health
      d851381ea6 LU-1904 idl: add checks for OBD_CONNECT flags
      155cbc22ba LU-16012 sec: fix detection of SELinux enforcement
      807f3a0779 LU-16048 build: Update ZFS version to 2.1.5
      afa4c31087 LU-16045 enc: force use of new enc xattr on new servers
      2612cf4ad8 LU-16035 kfilnd: Initial kfilnd implementation
      62b470a023 LU-16027 tests: sanity:test_66: specify blocksize explicitly
      9899144862 LU-15999 tests: format journal with correct block size
      52057d85ea LU-15393 tests: check QoS hang with OST failover
      26beb8664f LU-16081 lnet: Memory leak on adding existing interface *
      d4978678b4 LU-15694 quota: keep grace time while setting default limits
      b515c6ec2a LU-15642 obdclass: use consistent stats units
      84b1ca8618 LU-15930 lnet: Remove duplicate checks for peer sensitivity *
      2431e099b1 LU-15929 lnet: Correct net selection for router ping *
      caf6095ade LU-15595 lnet: LNet peer aliveness broken *
      8ee85e1541 LU-15595 tests: Add various router tests
      ff3322fd0c LU-14955 lnet: Use fatal NI if none other available *
      8a7aa8d590 LU-16058 build: proc_ops check fails with SUBARCH undefined
      fab404836d LU-12514 target: move server mount code to target layer
      f1c8ac1156 LU-15811 llite: Refactor DIO/AIO free code
      36c34af607 LU-15811 llite: Unify range unlock
      f3fe144b85 LU-15003 sec: use enc pool for bounce pages
      9ca348e876 LU-14719 utils: dir migration stop on error
      a20b78a81d LU-15357 iokit: fix the obsolete usage of cfg_device
      51c1853933 LU-15811 llite: Rework upper/lower DIO/AIO
      9e2df7e5cc LU-10391 lnet: change ni_status in lnet_ni to u32* *
      c92bdd97d9 LU-16056 libcfs: restore umask handling in kernel threads *
      76c3fa96dc LU-16082 ldiskfs: old-style EA inode handling fix
      2447564e12 LU-16011 lnet: use preallocate bulk for server ***
      

      It is noteworthy that the three test runs that have passed since 2022-09-01 are based on commit 3f0bee2502, "LU-15959 kernel: new kernel [SLES15 SP4 5.14.21-150400.24.18.1]", which is the commit immediately before 2447564e12, "LU-16011 lnet: use preallocate bulk for server". That is the earliest commit in the list above (git log prints newest first, so it appears last).
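
The narrowing-down step described above (keep only libcfs/lnet patches from the `git log --oneline` output) can be sketched as a small filter. The excerpt below is a subset of the list in this ticket with the trailing `*` markers stripped. The function name `candidates` and the choice to match on the `component:` prefix of the subject are assumptions for illustration, not part of any Lustre tooling.

```python
import re

# Excerpt of the `git log --oneline` output quoted in this ticket
# (newest first); the trailing "*" markers have been removed.
COMMITS = """\
286924f8a0 LU-9859 libcfs: remove Lustre specific bitmap handling
0e48653c27 LU-16085 llite: fix stat attributes_mask
2612cf4ad8 LU-16035 kfilnd: Initial kfilnd implementation
26beb8664f LU-16081 lnet: Memory leak on adding existing interface
caf6095ade LU-15595 lnet: LNet peer aliveness broken
8ee85e1541 LU-15595 tests: Add various router tests
c92bdd97d9 LU-16056 libcfs: restore umask handling in kernel threads
2447564e12 LU-16011 lnet: use preallocate bulk for server
"""

# "<sha> <ticket> <component>: <subject>"
LINE_RE = re.compile(
    r"^(?P<sha>[0-9a-f]+)\s+(?P<ticket>LU-\d+)\s+"
    r"(?P<component>[\w-]+):\s+(?P<subject>.*)$"
)


def candidates(log_text, components=("libcfs", "lnet")):
    """Return (sha, ticket, subject) for commits in the given components."""
    found = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match and match.group("component") in components:
            found.append((match.group("sha"), match.group("ticket"),
                          match.group("subject")))
    return found


for sha, ticket, subject in candidates(COMMITS):
    print(sha, ticket, subject)
```

Because the listing is newest first, the last entry returned is the earliest candidate to land, which here is 2447564e12.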

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      lnet-selftest test_smoke - onyx-91vm6 crashed during lnet-selftest test_smoke

            People

              shadow Alexey Lyashkov
              maloo Maloo