Lustre / LU-14675

LNet not working over IB (RHEL8.3 MOFED 5.2 ppc64le)

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.6
    • Labels: None
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      Hi,

      I'm trying to get the Lustre client working with RHEL 8.3 and MOFED 5.2 or later on the ppc64le architecture, and have run into trouble.

      After cherry-picking the commit for LU-13783, Lustre 2.12.6 builds. Once it is installed I can configure LNet, but the box is unable to lnetctl ping itself over InfiniBand (its TCP NID responds, its o2ib NID does not):

      [root@infer004 ~]# systemctl start lnet
      [root@infer004 ~]# lnetctl ping 172.16.44.4@tcp
      ping:
          - primary nid: 172.16.44.4@tcp
            Multi-Rail: False
            peer ni:
              - nid: 172.16.50.204@o2ib
              - nid: 172.16.44.4@tcp
      [root@infer004 ~]# lnetctl ping 172.16.50.204@o2ib
      manage:
          - ping:
                errno: -1
                descr: failed to ping 172.16.50.204@o2ib: Input/output error

      Syslog contains:

      May 7 12:51:17 infer004 kernel: LNet: HW NUMA nodes: 2, HW CPU cores: 160, npartitions: 2
      May 7 12:51:17 infer004 kernel: alg: No test for adler32 (adler32-zlib)
      May 7 12:51:17 infer004 kernel: alg: hash: digest failed on test 1 for crc32-table: ret=126
      May 7 12:51:17 infer004 kernel: LNet: Using FastReg for registration
      May 7 12:51:19 infer004 kernel: LNet: Added LNI 172.16.50.204@o2ib [32/1024/0/180]
      May 7 12:51:19 infer004 kernel: LNet: Added LNI 172.16.44.4@tcp [8/256/0/180]
      May 7 12:51:19 infer004 kernel: LNet: Accept secure, port 988
      May 7 12:51:17 infer004 systemd[1]: Starting lnet management...
      May 7 12:51:19 infer004 systemd[1]: Started lnet management.
      May 7 12:51:41 infer004 kernel: LNet: 9655:0:(o2iblnd_cb.c:3420:kiblnd_check_conns()) Timed out tx for 172.16.50.204@o2ib: 217 seconds
      May 7 12:51:42 infer004 kernel: LNetError: 9649:0:(lib-move.c:2955:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-172.16.50.204@o2ib: -125
      May 7 12:51:42 infer004 kernel: LNet: 9655:0:(o2iblnd_cb.c:3420:kiblnd_check_conns()) Timed out tx for 172.16.50.204@o2ib: 218 seconds
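
On a busier node the relevant lines are easy to lose among other kernel messages. A minimal sketch for counting o2iblnd transmit timeouts and LNet send errors per NID is shown here. It assumes the log lines look exactly like the excerpt above, and the function name `lnet_summary` is made up for illustration.

```shell
#!/bin/sh
# Summarise LNet trouble per NID from syslog text read on stdin.
# Assumption: lines match the format in the excerpt above; anything else is ignored.
lnet_summary() {
    awk '
        /Timed out tx for/ {
            nid = $0
            sub(/.*Timed out tx for /, "", nid)   # drop everything before the NID
            sub(/:.*/, "", nid)                   # drop ": NNN seconds"
            timeouts[nid]++
        }
        /LNetError:/ && /Error sending/ {
            nid = $0
            sub(/.*Error sending [A-Z]+ to /, "", nid)
            sub(/:.*/, "", nid)                   # drop the trailing error code
            sub(/^[0-9]+-/, "", nid)              # drop the leading "12345-" prefix
            errors[nid]++
        }
        END {
            for (n in timeouts) printf "%s timeouts=%d\n", n, timeouts[n]
            for (n in errors)   printf "%s send_errors=%d\n", n, errors[n]
        }
    '
}
```

Typical use would be `journalctl -k | lnet_summary` or `lnet_summary < /var/log/messages`, depending on where kernel messages end up on the system.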

      After attempting to ping over InfiniBand, the idle system's load average goes from ~0.00 to 1.00, "systemctl stop lnet" hangs and the following is added to syslog:

      May 7 12:57:01 infer004 systemd[1]: Stopping lnet management...
      May 7 12:57:04 infer004 kernel: LNet: Removed LNI 172.16.44.4@tcp
      May 7 12:57:05 infer004 kernel: LNet: 9702:0:(o2iblnd.c:3012:kiblnd_shutdown()) 172.16.50.204@o2ib: waiting for 1 peers to disconnect
      May 7 12:57:09 infer004 kernel: LNet: 9702:0:(o2iblnd.c:3012:kiblnd_shutdown()) 172.16.50.204@o2ib: waiting for 1 peers to disconnect
      May 7 12:57:17 infer004 kernel: LNet: 9702:0:(o2iblnd.c:3012:kiblnd_shutdown()) 172.16.50.204@o2ib: waiting for 1 peers to disconnect
      May 7 12:57:34 infer004 kernel: LNet: 9702:0:(o2iblnd.c:3012:kiblnd_shutdown()) 172.16.50.204@o2ib: waiting for 1 peers to disconnect
      May 7 12:58:07 infer004 kernel: LNet: 9702:0:(o2iblnd.c:3012:kiblnd_shutdown()) 172.16.50.204@o2ib: waiting for 1 peers to disconnect

      If I downgrade MOFED to 5.1-2.5.8.0 and rebuild Lustre 2.12.6 + LU-13783, the box is able to lnetctl ping itself on its InfiniBand interface.
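
To compare the working and failing setups side by side, it may help to record the same facts on each: the MOFED release, where the `ko2iblnd` and `ib_core` modules are installed, and the LNet network configuration. The following is only a sketch. It assumes `ofed_info`, `modinfo` and `lnetctl` are on the PATH, and the helper names `run_or_note` and `ib_build_report` are hypothetical.

```shell
#!/bin/sh
# Run a command with a header line, or note that it is missing, so the
# report has the same shape on every machine.
run_or_note() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "== $*"
        "$@" 2>&1 || echo "   (exit status $?)"
    else
        echo "== $1: not installed"
    fi
}

# Print kernel, MOFED, module and LNet details for later diffing.
ib_build_report() {
    echo "== kernel: $(uname -r) ($(uname -m))"
    run_or_note ofed_info -s                    # MOFED release string
    run_or_note modinfo -F filename ko2iblnd    # path of the Lustre o2ib LND module
    run_or_note modinfo -F filename ib_core     # path of the RDMA core module
    run_or_note lnetctl net show -v             # configured NIDs and tunables
}
```

Running `ib_build_report > mofed-5.2.txt` on the failing install and `ib_build_report > mofed-5.1.txt` after the downgrade gives two files that can be compared with `diff -u`.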

      Any ideas, please?

      Thanks,

      Mark

          People

            Assignee: WC Triage (wc-triage)
            Reporter: Mark Dixon (bodgerer)