[LU-17201] LNetError in o2iblnd.c with qib HCA under EL9.2 Created: 16/Oct/23  Updated: 21/Oct/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.15.3
Fix Version/s: None

Type: Bug
Priority: Minor
Reporter: Nathan Crawford
Assignee: WC Triage
Resolution: Unresolved
Votes: 0
Labels: None
Environment:

Alma Linux 9.2
Kernel 5.14.0-284.30.1.el9_2.x86_64
kmod-ib_qib-1.11-6.el9_2.elrepo.x86_64


Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

  LNet loads the tcp interface fine, but o2ib fails with these kernel messages:
LNetError: 701:0:(o2iblnd.c:2647:kiblnd_hdev_get_attr()) Invalid mr size: 0xffffffffffffffff
LNetError: 701:0:(o2iblnd.c:2880:kiblnd_dev_failover()) Can't get device attributes: -22
LNetError: 701:0:(o2iblnd.c:3354:kiblnd_startup()) ko2iblnd: Can't initialize device: rc = -22
LNetError: 105-4: Error -100 starting up LNI o2ib
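
For anyone triaging similar output, here is a minimal sketch that splits these messages into source location and message text and names the return codes. The helper name parse_lnet_error and the small errno table are illustrative assumptions, not part of Lustre; the table assumes standard Linux errno numbering (22 = EINVAL, 100 = ENETDOWN).

```python
import re

# Assumed Linux errno numbering for the two codes seen in the log above.
LINUX_ERRNO = {22: "EINVAL", 100: "ENETDOWN"}

# Matches the "(file:line:function()) message" part of an LNetError line.
LOCATION_RE = re.compile(
    r"\((?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)\)\s*(?P<msg>.*)"
)
# Matches a standalone negative code such as "-22" (but not the "-4" in "105-4").
CODE_RE = re.compile(r"(?<![\w-])-(\d+)\b")


def parse_lnet_error(line):
    """Hypothetical helper: extract location, message, and errno name."""
    result = {"file": None, "line": None, "func": None,
              "msg": line, "errno": None}
    m = LOCATION_RE.search(line)
    if m:
        result.update(file=m["file"], line=int(m["line"]),
                      func=m["func"], msg=m["msg"])
    c = CODE_RE.search(result["msg"])
    if c:
        num = int(c.group(1))
        # Fall back to the raw number when the code is not in the table.
        result["errno"] = LINUX_ERRNO.get(num, str(num))
    return result


if __name__ == "__main__":
    log = [
        "LNetError: 701:0:(o2iblnd.c:2647:kiblnd_hdev_get_attr()) Invalid mr size: 0xffffffffffffffff",
        "LNetError: 701:0:(o2iblnd.c:2880:kiblnd_dev_failover()) Can't get device attributes: -22",
        "LNetError: 701:0:(o2iblnd.c:3354:kiblnd_startup()) ko2iblnd: Can't initialize device: rc = -22",
        "LNetError: 105-4: Error -100 starting up LNI o2ib",
    ]
    for entry in log:
        print(parse_lnet_error(entry))
```

Read this way, the first message is the root failure in kiblnd_hdev_get_attr(), and the -22 and -100 codes that follow are that failure propagating up through device failover, LND startup, and LNI startup.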

  We are trying (perhaps over-hopefully) to get the Lustre client to work in EL9 on old QLogic/Intel TrueScale InfiniBand hardware. Red Hat removed the qib module back in EL8, although it remains in the mainline kernels from kernel.org. The ELRepo repository maintains a few of these RH-deprecated kernel modules compiled against the RHEL kernel. As of kmod-ib_qib-1.11-6.el9_2.elrepo, this module actually works.

  The closest bug report I could find is LU-10549, which suggests a mismatch between the actual and expected data fields reported by the module. I suspect no one has actually tried the EL9 kernel ib_qib with Lustre, considering it only started working last week.

  In the meantime, I'll try replacing the EL9.2 kernel + kmod with the ELRepo-maintained kernel-lt, which includes the standard kernel.org qib module.

 



 Comments   
Comment by Fredrik Nyström [ 18/Oct/23 ]

Quick question: did you also install infinipath-psm (provides /etc/udev/rules.d/60-ipath.rules)?

We did some tests late last year with Rocky 9 + ib_qib with the patch that is now in elrepo. Had to install infinipath-psm from CentOS 7 (failed to rebuild it for el9). We did not try Lustre o2ib.

Comment by Nathan Crawford [ 21/Oct/23 ]

We haven't tried to install the psm libs, but they weren't needed for the Lustre client on Rocky 8.

The kernel-lt workaround I proposed isn't going to work, as the ELRepo kernel-lt for el9 is already too new (6.1.58). May need to dig a bit more.
