Details
Type: Bug
Resolution: Unresolved
Priority: Minor
Fix Version/s: None
Affects Version/s: Lustre 2.15.3
Labels: None
Environment:
Alma Linux 9.2
Kernel 5.14.0-284.30.1.el9_2.x86_64
kmod-ib_qib-1.11-6.el9_2.elrepo.x86_64
Severity: 3
Description
LNet loads the tcp interface fine, but o2ib fails with these kernel messages:
LNetError: 701:0:(o2iblnd.c:2647:kiblnd_hdev_get_attr()) Invalid mr size: 0xffffffffffffffff
LNetError: 701:0:(o2iblnd.c:2880:kiblnd_dev_failover()) Can't get device attributes: -22
LNetError: 701:0:(o2iblnd.c:3354:kiblnd_startup()) ko2iblnd: Can't initialize device: rc = -22
LNetError: 105-4: Error -100 starting up LNI o2ib
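For readers decoding the messages above, here is a small Python sketch that translates the two return codes and checks the reported mr size. It assumes Linux errno numbering, where 22 is EINVAL and 100 is ENETDOWN; the helper names `describe_rc` and `is_all_ones_u64` are made up for illustration and are not part of Lustre. It only shows that 0xffffffffffffffff is the all-ones 64-bit value, and makes no claim about why the driver reports it.

```python
import errno
import os

# Values copied from the LNetError lines in this report.
REPORTED_MR_SIZE = 0xffffffffffffffff
RETURN_CODES = (-22, -100)


def describe_rc(rc: int) -> str:
    """Map a negative kernel-style return code to its errno name and message."""
    code = abs(rc)
    # errno.errorcode is platform-specific; the names assume Linux numbering.
    name = errno.errorcode.get(code, "unknown")
    return f"{rc} = {name} ({os.strerror(code)})"


def is_all_ones_u64(value: int) -> bool:
    """True if value is the largest unsigned 64-bit integer (all 64 bits set)."""
    return value == 2**64 - 1


if __name__ == "__main__":
    for rc in RETURN_CODES:
        print(describe_rc(rc))
    print("mr size is all-ones u64:", is_all_ones_u64(REPORTED_MR_SIZE))
```

An all-ones value is what an unsigned 64-bit field holds when it is set to its maximum (or to -1), which may be how the module signals "no limit"; that reading is a guess, not something confirmed here.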
We are trying (perhaps over-optimistically) to get the Lustre client working on EL9 with old QLogic/Intel TrueScale InfiniBand hardware. Red Hat removed the qib module back in EL8, although it remains in the mainline kernels from kernel.org. The ELRepo repository maintains a few of these RH-deprecated kernel modules, compiled against the RHEL kernel. As of kmod-ib_qib-1.11-6.el9_2.elrepo, this module actually works.
The closest bug report I could find is LU-10549, which suggests a mismatch between the actual and expected data fields reported by the module. I suspect no one has actually tried the EL9 kernel ib_qib with Lustre, considering it only started working last week.
In the meantime, I'll try swapping out the EL9.2 kernel + kmod for the ELRepo-maintained kernel-lt, which includes the standard kernel.org qib module.