LU-17417

(Durham University) Grace Hopper + Rocky 9 aarch64 + kernel-64k + Lustre 2.15.4 client = kernel panic


Details

    • Bug
    • Resolution: Not a Bug
    • Minor
    • None
    • Lustre 2.15.4
    • Rocky 9.3 aarch64
      Lustre 2.15.4 client (2.12.x server)
      NVIDIA Grace Hopper seed unit (integrated Arm CPU + GPU socket)
      InfiniBand (in-tree modules)
      No GPU modules loaded
    • 3

    Description

      Hi there!

      We are lucky enough to have a few single-socket Grace Hopper servers, and we would like them to mount our Lustre filesystem. Unfortunately, starting LNet causes the client to panic, for example:
      ```
      [ 8919.610649] libcfs: loading out-of-tree module taints kernel.
      [ 8919.610870] libcfs: module verification failed: signature and/or required key missing - tainting kernel
      [ 8919.627075] Unable to handle kernel paging request at virtual address 00000196a9025cc5
      [ 8919.635176] Mem abort info:
      [ 8919.638025] ESR = 0x0000000096000005
      [ 8919.641855] EC = 0x25: DABT (current EL), IL = 32 bits
      [ 8919.647282] SET = 0, FnV = 0
      [ 8919.650399] EA = 0, S1PTW = 0
      [ 8919.653606] FSC = 0x05: level 1 translation fault
      [ 8919.658589] Data abort info:
      [ 8919.661531] ISV = 0, ISS = 0x00000005
      [ 8919.665447] CM = 0, WnR = 0
      [ 8919.668473] user pgtable: 64k pages, 48-bit VAs, pgdp=0000000155cd0400
      [ 8919.675150] [00000196a9025cc5] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000
      [ 8919.684050] Internal error: Oops: 0000000096000005 [#1] SMP
      [ 8919.689746] Modules linked in: libcfs(OE+) 8021q garp mrp stp llc nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib rfkill nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink rpcrdma rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_iser libiscsi scsi_transport_iscsi ib_umad rdma_cm iw_cm ib_ipoib ib_cm vfat fat drm_display_helper ast acpi_ipmi drm_shmem_helper ses ipmi_ssif enclosure cec i2c_smbus drm_ttm_helper spi_nor ttm i2c_algo_bit ipmi_devintf drm_kms_helper mtd syscopyarea sysfillrect sysimgblt ipmi_msghandler mlx5_ib ib_uverbs coresight_stm coresight_tmc coresight_funnel stm_core ib_core coresight cppc_cpufreq auth_rpcgss drm sunrpc fuse xfs libcrc32c mlx5_core sg crct10dif_ce ghash_ce sha2_ce sha256_arm64 mpt3sas sha1_ce sbsa_gwdt nvme nvme_core mlxfw tls raid_class scsi_transport_sas nvme_common psample pci_hyperv_intf spi_tegra210_quad acpi_power_meter dm_mirror
      [ 8919.689783] dm_region_hash dm_log dm_mod
      [ 8919.783038] CPU: 38 PID: 105046 Comm: modprobe Kdump: loaded Tainted: G OE ------- --- 5.14.0-362.13.1.el9_3.aarch64+64k #1
      [ 8919.795846] Hardware name: Quanta Cloud Technology Inc. QuantaGrid S74G-2U 1S7GZ9Z0000/S7G MB (CG1), BIOS 3A06 10/05/2023
      [ 8919.807054] pstate: 23400009 (nzCv daif +PAN UAO +TCO +DIT -SSBS BTYPE=-)
      [ 8919.814173] pc : mod_sysfs_setup+0x1a4/0x290
      [ 8919.818542] lr : mod_sysfs_setup+0x174/0x290
      [ 8919.822903] sp : ffff80009682fa70
      [ 8919.826286] x29: ffff80009682fa70 x28: ffff80009682fbf0 x27: ffffa0608ae23948
      [ 8919.833580] x26: ffffa06042663b88 x25: ffff80009682fbf0 x24: ffffa06042630cf8
      [ 8919.840874] x23: ffffa06042648890 x22: ffffa06042663818 x21: ffffa06042663850
      [ 8919.848168] x20: 0000000000000000 x19: ffffa06042663800 x18: 0000000000000000
      [ 8919.855462] x17: 00000000000001a4 x16: ffffa06042640d58 x15: ffffa06088c1a560
      [ 8919.862757] x14: ffffa06088c19e00 x13: 0073656761705f6f x12: 74707972635f636f
      [ 8919.870050] x11: 0000000000000000 x10: 0000000000000000 x9 : ffffa060897f2e6c
      [ 8919.877344] x8 : 0101010101010101 x7 : 7f7f7f7f7f7f7f7f x6 : 736877645e727872
      [ 8919.884639] x5 : 0000000000000000 x4 : 0000000000000030 x3 : 0000000000000000
      [ 8919.891933] x2 : ffffa06042663818 x1 : ffffa06042663850 x0 : 90000196a9025bf5
      [ 8919.899229] Call trace:
      [ 8919.901723] mod_sysfs_setup+0x1a4/0x290
      [ 8919.905728] load_module+0xaec/0xc6c
      [ 8919.909382] __do_sys_finit_module+0xa4/0x110
      [ 8919.913832] __arm64_sys_finit_module+0x24/0x30
      [ 8919.918461] invoke_syscall.constprop.0+0x7c/0xd0
      [ 8919.923276] el0_svc_common.constprop.0+0x140/0x150
      [ 8919.928259] do_el0_svc+0x38/0xa0
      [ 8919.931642] el0_svc+0x38/0x18c
      [ 8919.934853] el0t_64_sync_handler+0xb4/0x130
      [ 8919.939216] el0t_64_sync+0x17c/0x180
      [ 8919.942958] Code: 540004a0 f9401700 aa1603e2 aa1503e1 (f9406800)
      [ 8919.949189] SMP: stopping secondary CPUs
      [ 8919.955258] Starting crashdump kernel...
      [ 8919.959265] Bye!
      ```
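
As a reading aid for the trace above, the fields the kernel printed can be reproduced from the raw ESR value, and the faulting address can be cross-checked against x0 and the trapped instruction. This is a minimal sketch in Python, not part of the original report. It assumes the standard Arm ESR layout for a data abort, and it assumes the kernel reports the fault address with the top (tag) byte stripped by sign-extending from bit 55. The helper names are illustrative only.

```python
# Values copied from the oops in this report.
ESR = 0x96000005                    # "ESR = 0x0000000096000005"
X0 = 0x90000196A9025BF5             # register x0 at the time of the fault
INSN = 0xF9406800                   # "(f9406800)" in the Code: line
REPORTED_FAULT = 0x00000196A9025CC5 # "virtual address 00000196a9025cc5"


def decode_esr(esr: int) -> dict:
    """Split an ESR value into the fields the kernel printed for a data abort."""
    return {
        "ec": (esr >> 26) & 0x3F,   # exception class, bits [31:26]
        "il": (esr >> 25) & 0x1,    # instruction length, 1 = 32-bit
        "iss": esr & 0x1FFFFFF,     # instruction-specific syndrome, bits [24:0]
        "fsc": esr & 0x3F,          # fault status code, bits [5:0]
        "wnr": (esr >> 6) & 0x1,    # write-not-read, 0 = read
    }


def ldr_unsigned_offset(insn: int) -> tuple:
    """Decode a 64-bit 'LDR Xt, [Xn, #imm]' (unsigned offset) instruction.

    Returns (rt, rn, byte_offset); raises ValueError for any other encoding.
    """
    if (insn >> 22) != 0x3E5:
        raise ValueError("not a 64-bit LDR (immediate, unsigned offset)")
    imm12 = (insn >> 10) & 0xFFF
    rn = (insn >> 5) & 0x1F
    rt = insn & 0x1F
    return rt, rn, imm12 * 8        # imm12 is scaled by the 8-byte access size


def untag(addr: int) -> int:
    """Drop the top byte by sign-extending from bit 55 (assumed kernel behaviour)."""
    low56 = addr & ((1 << 56) - 1)
    if addr & (1 << 55):
        return low56 | (0xFF << 56)
    return low56


fields = decode_esr(ESR)
rt, rn, offset = ldr_unsigned_offset(INSN)
print(f"EC=0x{fields['ec']:02x} IL={fields['il']} FSC=0x{fields['fsc']:02x} WnR={fields['wnr']}")
print(f"ldr x{rt}, [x{rn}, #{offset}]")
print(f"x0 + offset, untagged: {untag(X0 + offset):016x}")
```

Under those assumptions, the trapped instruction is a read through x0 at offset 208, and x0 does not resemble the other kernel pointers in the register dump, which all begin with ffff. The sketch only confirms the arithmetic; it does not say where the bad pointer came from.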

      We would prefer a DKMS build but, as we are still testing, the client was built with the more usual procedure:
      ```
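      # Fetch the Lustre source and check out the 2.15.4 tag
      git clone git://git.whamcloud.com/fs/lustre-release.git
      cd lustre-release
      git checkout 2.15.4
      # Build client RPMs against the headers of the running kernel
      kernel=$(uname -r)
      sh autogen.sh
      ./configure --with-linux="/usr/src/kernels/$kernel"
      make rpms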
      ```
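
One sanity check that may help when comparing the 4k, 64k and DKMS builds is whether a built module records the same kernel release as the running kernel. The sketch below is an assumption about a useful diagnostic, not something the report says was done. It relies on `modinfo -F vermagic`, whose first field is the kernel release the module was built for. The function names and the module path in the usage note are hypothetical.

```python
import subprocess


def module_vermagic(module_path: str) -> str:
    """Return the vermagic string recorded in a kernel module, via modinfo."""
    result = subprocess.run(
        ["modinfo", "-F", "vermagic", module_path],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


def built_for_release(vermagic: str, release: str) -> bool:
    """True if the first vermagic field equals the given kernel release."""
    fields = vermagic.split()
    return bool(fields) and fields[0] == release
```

A possible use on the client would be `built_for_release(module_vermagic("/path/to/libcfs.ko"), platform.release())` after `import platform`. A match does not rule out other build mismatches, and a mismatch would not by itself explain the panic.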

      We tried backing off to the more usual 4k-page kernel, building with the same method, and successfully mounted our Lustre filesystem. However, moving to a DKMS build for that 4k kernel strangely brings the panic back.

      Can you help, please?

      Thanks,

      Mark

          People

            wc-triage WC Triage
            bodgerer Mark Dixon