LU-17201: LNetError in o2iblnd.c with qib HCA under EL9.2

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.15.3
    • Labels: None
    • Environment: Alma Linux 9.2
      Kernel 5.14.0-284.30.1.el9_2.x86_64
      kmod-ib_qib-1.11-6.el9_2.elrepo.x86_64
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

        LNet loads the tcp interface fine, but o2ib fails with these kernel messages:
      LNetError: 701:0:(o2iblnd.c:2647:kiblnd_hdev_get_attr()) Invalid mr size: 0xffffffffffffffff
      LNetError: 701:0:(o2iblnd.c:2880:kiblnd_dev_failover()) Can't get device attributes: -22
      LNetError: 701:0:(o2iblnd.c:3354:kiblnd_startup()) ko2iblnd: Can't initialize device: rc = -22
      LNetError: 105-4: Error -100 starting up LNI o2ib
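
For reference, the return codes in the messages above can be decoded with a short script. This is only a minimal sketch, assuming Linux errno numbering; the helper names `errno_name` and `decode` are illustrative and not part of Lustre. Under that numbering, -22 is EINVAL and -100 is ENETDOWN, and the reported mr size is a 64-bit value with every bit set.

```python
import errno
import re

# Kernel messages copied verbatim from this report.
LOG = """\
LNetError: 701:0:(o2iblnd.c:2647:kiblnd_hdev_get_attr()) Invalid mr size: 0xffffffffffffffff
LNetError: 701:0:(o2iblnd.c:2880:kiblnd_dev_failover()) Can't get device attributes: -22
LNetError: 701:0:(o2iblnd.c:3354:kiblnd_startup()) ko2iblnd: Can't initialize device: rc = -22
LNetError: 105-4: Error -100 starting up LNI o2ib
"""


def errno_name(code):
    """Map a negative kernel return code to its symbolic errno name.

    Assumption: the script runs on the same platform family (Linux) as the
    kernel that produced the log, so the errno numbering matches.
    """
    return errno.errorcode.get(abs(code), "UNKNOWN")


def decode(log):
    """Return (codes, mr_size) extracted from the log text.

    codes   -- negative return codes in order of appearance
    mr_size -- the integer after "Invalid mr size:", or None if absent
    """
    # Only match "-N" preceded by whitespace, so the "-4" in the
    # message ID "105-4" is not mistaken for a return code.
    codes = [-int(digits) for digits in re.findall(r"(?<=\s)-(\d+)\b", log)]
    match = re.search(r"Invalid mr size: (0x[0-9a-fA-F]+)", log)
    mr_size = int(match.group(1), 16) if match else None
    return codes, mr_size


if __name__ == "__main__":
    codes, mr_size = decode(LOG)
    for code in codes:
        print(code, errno_name(code))
    if mr_size is not None:
        print("mr size has all 64 bits set:", mr_size == 2**64 - 1)
```

The script only interprets the log text; it does not query the HCA or explain why ko2iblnd rejects the attributes the qib driver reports.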

        We are trying (perhaps over-hopefully) to get the Lustre client to work under EL9 on old QLogic/Intel TrueScale InfiniBand hardware. Red Hat removed the qib module back in EL8, although it remains in the mainline kernels from kernel.org. The ELRepo repository maintains a few of these RH-deprecated kernel modules, compiled against the RHEL kernel. As of kmod-ib_qib-1.11-6.el9_2.elrepo, this module actually works.

        The closest bug report I could find is LU-10549, which suggests a mismatch between the data fields the module reports and the ones Lustre expects. I suspect no one has actually tried the EL9 kernel ib_qib with Lustre, considering it only started working last week.

        In the meantime, I'll try swapping out the EL9.2 kernel + kmod for the ELRepo-maintained kernel-lt, which includes the standard kernel.org qib module.

       

          People

            Assignee: WC Triage (wc-triage)
            Reporter: Nathan Crawford (nathan.crawford@uci.edu)