[LU-17417] (Durham University) Grace Hopper + Rocky 9 aarch64 + kernel-64k + Lustre 2.15.4 client = kernel panic Created: 11/Jan/24 Updated: 15/Jan/24 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.15.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Mark Dixon | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | arm | ||
| Environment: |
Rocky 9.3 aarch64 |
||
| Issue Links: |
|
||||
| Severity: | 3 | ||||
| Rank (Obsolete): | 9223372036854775807 | ||||
| Description |
|
Hi there! We are lucky enough to have a few 1 socket Grace Hopper servers and we would like them to mount our Lustre filesystem. Unfortunately, starting up lnet causes the client to panic, for example: We prefer a dkms build but, as we are in testing, the client was built with the more usual: Tried backing off to the more usual 4k kernel using the same method and successfully mounted our lustre filesystem, although attempting to move to a dkms build for that 4k kernel strangely results in the panic returning. Can you help, please? Thanks, Mark |
| Comments |
| Comment by Andreas Dilger [ 11/Jan/24 ] |
|
Have you tried any other kernels or Lustre versions? Was the lustre code built on this node? Just wondering if there is a chance of kernel module version mismatch? |
| Comment by Andreas Dilger [ 11/Jan/24 ] |
|
kevin.zhao, xinliang any thoughts on this? It looks very early in module loading, and hasn't even called the module init function AFAICS. |
| Comment by Mark Dixon [ 11/Jan/24 ] |
|
Hi Andreas, thanks for taking a look at this. For lustre, the code was built on the same node. I first attempted 2.12.9 but it wouldn't complete configure and so quickly switched to 2.15.4. The kernel versions I've played with are the latest Rocky 9.3, so 5.14.0-362.13.1.el9_3.aarch64+64k and its 4k page equivalent, 5.14.0-362.13.1.el9_3.aarch64. I had played with in-tree InfiniBand modules vs. MLNX_OFED_LINUX-23.10-1.1.9.0-rhel9.3-aarch64, but uninstalled MLNX_OFED. With the 64k kernel booted I've just removed the lustre rpms, checked that "find /lib/modules | grep libcfs" didn't return anything, built a fresh set of 2.15.4 rpms, installed as above, checked "find /lib/modules | grep libcfs" reported a real file under /extra and not a symlink under /weak-modules, ran "modprobe libcfs" - then get a similar kernel oops. |
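One way to rule out the module version mismatch asked about above is to compare a module's vermagic string with the running kernel release. This is a minimal sketch, not something taken from the ticket: the helper name `vermagic_matches` is hypothetical, and it assumes the vermagic string reported by `modinfo -F vermagic` begins with the kernel release followed by a space.

```shell
# vermagic_matches VERMAGIC KERNEL_RELEASE
# Hypothetical helper: succeeds when the first whitespace-separated field
# of a module's vermagic string equals the given kernel release.
vermagic_matches() {
    # The vermagic string looks like "<release> SMP ...";
    # "${1%% *}" strips everything from the first space onwards.
    [ "${1%% *}" = "$2" ]
}

# Example use on a live system (not run here):
#   vermagic_matches "$(modinfo -F vermagic libcfs)" "$(uname -r)" \
#       && echo "libcfs matches the running kernel" \
#       || echo "libcfs was built for a different kernel"
```

A match only shows that the module was built for this kernel release; it does not by itself explain or exclude the oops.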
| Comment by Peter Jones [ 11/Jan/24 ] |
|
Mark, given that this is still a bit experimental at this stage, I wonder if it is worth seeing if you have better success with the tip of master. It could well be that some recent useful landings have not been backported to the LTS branch yet... Peter |
| Comment by Mark Dixon [ 11/Jan/24 ] |
|
Thanks Peter. For some reason I never think of trying the bleeding edge - but sadly the same result : ( |
| Comment by Xinliang Liu [ 12/Jan/24 ] |
|
Hi, The address 00000196a9025cc5 is not a valid kernel address (those start with fffxxxxxx), so accessing it causes a data abort oops. See https://www.kernel.org/doc/html/latest/arch/arm64/memory.html Our arm64 master CI is still running on Rocky 9.2; we will try Rocky 9.3, as there is now an ldiskfs patch for Rocky 9.3. See http://213.146.155.72:8080/job/test-periodically-lustre-master-rhel9/ and http://213.146.155.72:8080/job/build-lustre-master-rhel9/
bodgerer, you said in the bug description that the 4k page size kernel does not have this issue. We are also testing 2.15 and master on Rocky 8, which uses a 64K page size, and we do not see this issue there. Could you try with only tcp and no rdma? |
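The address check described in the comment above can be sketched as a small shell helper. The function name `is_arm64_kernel_addr` is hypothetical, and the test is deliberately crude: it only looks for the `fff` prefix mentioned in the comment and does not model the full arm64 memory layout from the linked kernel documentation.

```shell
# is_arm64_kernel_addr HEX
# Hypothetical helper: succeeds when a 64-bit hex address starts with "fff",
# the prefix the comment above gives for valid kernel addresses.
is_arm64_kernel_addr() {
    # Drop an optional 0x prefix and lower-case the hex digits.
    addr=$(printf '%s' "${1#0x}" | tr 'A-F' 'a-f')
    # Only a full 16-digit value can have its top bits set.
    [ "${#addr}" -eq 16 ] || return 1
    case $addr in
        fff*) return 0 ;;
        *)    return 1 ;;
    esac
}

# The faulting address quoted in the comment fails the check.
is_arm64_kernel_addr 00000196a9025cc5 || echo "00000196a9025cc5: not a kernel address"
```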
| Comment by Mark Dixon [ 12/Jan/24 ] |
|
Hi! At the moment we just have the default /etc/lnet.conf, which only contains comments, so it shouldn't be trying to set up any o2ib or tcp devices. We do have InfiniBand devices, but if we get rid of the drivers we won't have any ethernet: the unit also has a Mellanox SFP28+ ethernet card. |
| Comment by Mark Dixon [ 12/Jan/24 ] |
|
I think that 4k vs. 64k page size and InfiniBand are red herrings. I rebuilt the tip of master with --with-o2ib=no and added a modprobe.d file to block the kernel from loading ib_core, mlx5_core and mlxfw, so there shouldn't be any rdma business going on. /etc/lnet.conf is filled with comment lines only. Unfortunately, `modprobe lnet` still panics the 64k kernel:
[ 75.635984] libcfs: loading out-of-tree module taints kernel.
It's not quite true that I said the 4k kernel didn't have this issue. For 2.15.4 I did manage to build and use the kmod lustre client rpm successfully, but the 2.15.4 dkms lustre client rpm generated a panic. I rebuilt the tip of master against the 4k kernel and, with the same settings/options to avoid InfiniBand as above, installing the kmod lustre client rpm and running `modprobe lnet` gave this oops:
[ 945.766464] libcfs: loading out-of-tree module taints kernel.
I also tried rebuilding 2.15.4 on the 4k kernel, and now `modprobe lnet` with the kmod rpm installed generates the panic when it didn't before, so I guess I just got lucky initially. |
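The modprobe.d file mentioned in the comment above is not shown in the ticket. The following is only a sketch of what such a file might contain: the file name `block-rdma.conf` and the `BLOCK_RDMA_CONF` variable are hypothetical, and the choice of directives is an assumption. `blacklist` and `install` are both standard modprobe.d directives; the module list comes from the comment.

```shell
# Hypothetical output path; review the result, then copy it to
# /etc/modprobe.d/ as root.
conf=${BLOCK_RDMA_CONF:-./block-rdma.conf}

: > "$conf"
for mod in ib_core mlx5_core mlxfw; do
    # "blacklist" stops alias-based autoloading; "install ... /bin/false"
    # also makes an explicit "modprobe <mod>" fail.
    printf 'blacklist %s\ninstall %s /bin/false\n' "$mod" "$mod" >> "$conf"
done
```

Writing to the current directory by default keeps the sketch harmless to run; nothing changes on the system until the file is copied into place.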
| Comment by Xinliang Liu [ 15/Jan/24 ] |
|
Hi bodgerer, I couldn't reproduce the oops on my aarch64 test VM with either the 4K or the 64K kernel.
4K page size kernel, tried ~10 times:
...
[rocky@rocky9-test-01 lustre-release]$ sudo modprobe libcfs
[rocky@rocky9-test-01 lustre-release]$ sudo modprobe lnet
[rocky@rocky9-test-01 lustre-release]$ lsmod |grep -E 'libcfs|lnet'
lnet 778240 0
libcfs 237568 1 lnet
sunrpc 626688 2 lnet
[rocky@rocky9-test-01 lustre-release]$ uname -r
5.14.0-362.13.1.el9_3.aarch64
64K page size kernel, tried ~10 times:
...
[rocky@rocky9-test-01 ~]$ sudo modprobe lnet && lsmod | grep -E "(libcfs|lnet)" && sudo modprobe -r lnet
lnet 917504 0
libcfs 458752 1 lnet
sunrpc 851968 2 lnet
[rocky@rocky9-test-01 ~]$ uname -r
5.14.0-362.13.1.el9_3.aarch64+64k
I guess you are encountering a use-after-free, an out-of-bounds access or some other memory corruption caused by something else (here, another ko). You can try the debug kernels, i.e. kernel-debug and kernel-64k-debug [1] (hopefully with KASAN and KFENCE [2] enabled), and see whether the kernel outputs more. You can also try unloading the other ko one by one [3] to find out which ko causes the problem; usually it is a driver ko. It seems a tough issue to troubleshoot, but there are ways to solve it; it just takes time. [1] https://download.rockylinux.org/pub/rocky/9/BaseOS/aarch64/debug/tree/Packages/k/ |
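Before relying on a debug kernel, it may be worth confirming that KASAN or KFENCE is actually built in. This is a minimal sketch: the helper name `debug_options_enabled` is hypothetical, and the example path assumes the distribution installs the kernel config as /boot/config-<release>.

```shell
# debug_options_enabled CONFIG_FILE
# Hypothetical helper: prints the KASAN/KFENCE options that are set to "y"
# in a kernel config file, and fails if neither is enabled.
debug_options_enabled() {
    grep -E '^CONFIG_(KASAN|KFENCE)=y$' "$1"
}

# Example use on a live system (not run here):
#   debug_options_enabled "/boot/config-$(uname -r)" \
#       || echo "neither KASAN nor KFENCE is enabled in this kernel"
```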