Details
Type: Bug
Resolution: Not a Bug
Priority: Minor
Fix Version/s: None
Affects Version/s: Lustre 2.15.4
Environment:
Rocky 9.3 aarch64
Lustre 2.15.4 client (2.12.x server)
NVIDIA Grace Hopper seed unit (integrated Arm CPU + GPU socket)
InfiniBand (in-tree modules)
No GPU modules loaded
Severity: 3
Description
Hi there!
We are lucky enough to have a few single-socket Grace Hopper servers, and we would like them to mount our Lustre filesystem. Unfortunately, starting LNet causes the client to panic, for example:
```
[ 8919.610649] libcfs: loading out-of-tree module taints kernel.
[ 8919.610870] libcfs: module verification failed: signature and/or required key missing - tainting kernel
[ 8919.627075] Unable to handle kernel paging request at virtual address 00000196a9025cc5
[ 8919.635176] Mem abort info:
[ 8919.638025] ESR = 0x0000000096000005
[ 8919.641855] EC = 0x25: DABT (current EL), IL = 32 bits
[ 8919.647282] SET = 0, FnV = 0
[ 8919.650399] EA = 0, S1PTW = 0
[ 8919.653606] FSC = 0x05: level 1 translation fault
[ 8919.658589] Data abort info:
[ 8919.661531] ISV = 0, ISS = 0x00000005
[ 8919.665447] CM = 0, WnR = 0
[ 8919.668473] user pgtable: 64k pages, 48-bit VAs, pgdp=0000000155cd0400
[ 8919.675150] [00000196a9025cc5] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000
[ 8919.684050] Internal error: Oops: 0000000096000005 [#1] SMP
[ 8919.689746] Modules linked in: libcfs(OE+) 8021q garp mrp stp llc nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib rfkill nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink rpcrdma rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_iser libiscsi scsi_transport_iscsi ib_umad rdma_cm iw_cm ib_ipoib ib_cm vfat fat drm_display_helper ast acpi_ipmi drm_shmem_helper ses ipmi_ssif enclosure cec i2c_smbus drm_ttm_helper spi_nor ttm i2c_algo_bit ipmi_devintf drm_kms_helper mtd syscopyarea sysfillrect sysimgblt ipmi_msghandler mlx5_ib ib_uverbs coresight_stm coresight_tmc coresight_funnel stm_core ib_core coresight cppc_cpufreq auth_rpcgss drm sunrpc fuse xfs libcrc32c mlx5_core sg crct10dif_ce ghash_ce sha2_ce sha256_arm64 mpt3sas sha1_ce sbsa_gwdt nvme nvme_core mlxfw tls raid_class scsi_transport_sas nvme_common psample pci_hyperv_intf spi_tegra210_quad acpi_power_meter dm_mirror
[ 8919.689783] dm_region_hash dm_log dm_mod
[ 8919.783038] CPU: 38 PID: 105046 Comm: modprobe Kdump: loaded Tainted: G OE ------- --- 5.14.0-362.13.1.el9_3.aarch64+64k #1
[ 8919.795846] Hardware name: Quanta Cloud Technology Inc. QuantaGrid S74G-2U 1S7GZ9Z0000/S7G MB (CG1), BIOS 3A06 10/05/2023
[ 8919.807054] pstate: 23400009 (nzCv daif +PAN UAO +TCO +DIT -SSBS BTYPE=-)
[ 8919.814173] pc : mod_sysfs_setup+0x1a4/0x290
[ 8919.818542] lr : mod_sysfs_setup+0x174/0x290
[ 8919.822903] sp : ffff80009682fa70
[ 8919.826286] x29: ffff80009682fa70 x28: ffff80009682fbf0 x27: ffffa0608ae23948
[ 8919.833580] x26: ffffa06042663b88 x25: ffff80009682fbf0 x24: ffffa06042630cf8
[ 8919.840874] x23: ffffa06042648890 x22: ffffa06042663818 x21: ffffa06042663850
[ 8919.848168] x20: 0000000000000000 x19: ffffa06042663800 x18: 0000000000000000
[ 8919.855462] x17: 00000000000001a4 x16: ffffa06042640d58 x15: ffffa06088c1a560
[ 8919.862757] x14: ffffa06088c19e00 x13: 0073656761705f6f x12: 74707972635f636f
[ 8919.870050] x11: 0000000000000000 x10: 0000000000000000 x9 : ffffa060897f2e6c
[ 8919.877344] x8 : 0101010101010101 x7 : 7f7f7f7f7f7f7f7f x6 : 736877645e727872
[ 8919.884639] x5 : 0000000000000000 x4 : 0000000000000030 x3 : 0000000000000000
[ 8919.891933] x2 : ffffa06042663818 x1 : ffffa06042663850 x0 : 90000196a9025bf5
[ 8919.899229] Call trace:
[ 8919.901723] mod_sysfs_setup+0x1a4/0x290
[ 8919.905728] load_module+0xaec/0xc6c
[ 8919.909382] __do_sys_finit_module+0xa4/0x110
[ 8919.913832] __arm64_sys_finit_module+0x24/0x30
[ 8919.918461] invoke_syscall.constprop.0+0x7c/0xd0
[ 8919.923276] el0_svc_common.constprop.0+0x140/0x150
[ 8919.928259] do_el0_svc+0x38/0xa0
[ 8919.931642] el0_svc+0x38/0x18c
[ 8919.934853] el0t_64_sync_handler+0xb4/0x130
[ 8919.939216] el0t_64_sync+0x17c/0x180
[ 8919.942958] Code: 540004a0 f9401700 aa1603e2 aa1503e1 (f9406800)
[ 8919.949189] SMP: stopping secondary CPUs
[ 8919.955258] Starting crashdump kernel...
[ 8919.959265] Bye!
```
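For reference, the syndrome fields the kernel prints under "Mem abort info" can be re-derived from the raw ESR value. The snippet below is a minimal sketch, assuming the standard ESR_EL1 layout for a data abort (EC in bits [31:26], IL in bit [25], ISS in bits [24:0], with FSC in ISS bits [5:0] and WnR in ISS bit [6]). The function name `decode_esr` is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: split an arm64 ESR value into the fields shown in the oops.
# decode_esr is a hypothetical helper name, not part of any tool.
decode_esr() {
    local esr=$(( $1 ))
    local ec=$((  (esr >> 26) & 0x3f ))   # exception class, bits [31:26]
    local il=$((  (esr >> 25) & 0x1 ))    # instruction length, bit [25]
    local iss=$(( esr & 0x1ffffff ))      # instruction specific syndrome
    local fsc=$(( iss & 0x3f ))           # fault status code, ISS[5:0]
    local wnr=$(( (iss >> 6) & 0x1 ))     # write-not-read, ISS[6]
    printf 'EC=0x%02x IL=%d ISS=0x%08x FSC=0x%02x WnR=%d\n' \
        "$ec" "$il" "$iss" "$fsc" "$wnr"
}

# ESR value copied from the log above.
decode_esr 0x96000005
```

With the value from this log, the fields agree with what the kernel reported: EC 0x25 (data abort at the current EL), FSC 0x05 (level 1 translation fault), and WnR 0 (the faulting access was a read).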
We prefer a DKMS build but, as we are still testing, the client was built the more usual way:
```
git clone git://git.whamcloud.com/fs/lustre-release.git
cd lustre-release
git checkout 2.15.4
kernel=`uname -r`
sh autogen.sh
./configure --with-linux=/usr/src/kernels/$kernel
make rpms
```
We tried backing off to the more usual 4k-page kernel using the same build method and successfully mounted our Lustre filesystem. Strangely, moving to a DKMS build for that 4k kernel brings the panic back.
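Since the behaviour changes with both the kernel flavour (4k vs 64k pages) and the build method, one check that may help narrow this down is whether each built module was compiled for the kernel that is actually running. The snippet below is only a diagnostic sketch, not a known cause: it compares the release recorded in a module's `vermagic` string with `uname -r` and prints the page size. The helper name `vermagic_matches` is hypothetical, and the module path is whatever your RPM or DKMS build produced.

```shell
#!/usr/bin/env bash
# Sketch only: compare a module's vermagic with the running kernel.

# Succeeds when the kernel release recorded in a vermagic string
# (its first whitespace-separated field) equals the given release.
vermagic_matches() {
    local vermagic=$1 release=$2
    [ "${vermagic%% *}" = "$release" ]
}

# Optional usage: pass the path to a built module (for example libcfs.ko)
# as the first argument. Nothing runs if no argument is given.
if [ -n "${1:-}" ]; then
    echo "running kernel: $(uname -r), page size: $(getconf PAGESIZE) bytes"
    if vermagic_matches "$(modinfo -F vermagic "$1")" "$(uname -r)"; then
        echo "vermagic matches the running kernel"
    else
        echo "vermagic mismatch: $(modinfo -F vermagic "$1")"
    fi
fi
```

Running it against the same module from the `make rpms` build and from the DKMS build, on both kernels, would show whether either build picked up headers for a different kernel flavour.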
Can you help, please?
Thanks,
Mark