Details
Bug
Resolution: Unresolved
Minor
None
Lustre 2.12.6
None
Client: RHEL 8.3 (4.18.0-240.el8.ppc64le), MOFED 5.2-2.2.0 (prebuilt Mellanox binaries), ppc64le, Lustre 2.12.6 + LU-13783
Lustre client compiled with:
sh autogen.sh && ./configure --with-linux=/usr/src/kernels/4.18.0-240.el8.ppc64le --with-o2ib=/usr/src/ofa_kernel/default && make rpms
ko2iblnd options:
options ko2iblnd peer_credits=32 peer_credits_hiw=16 credits=1024 concurrent_sends=64 ntx=2048 map_on_demand=16 fmr_pool_size=2048 fmr_flush_trigger=512 fmr_cache=1 conns_per_peer=4
lnet.conf:
net:
    - net type: o2ib
      local NI(s):
        - nid: 172.16.50.204@o2ib
          status: up
          interfaces:
              0: ib0
    - net type: tcp
      local NI(s):
        - nid: 172.16.44.4@tcp
          status: up
          interfaces:
              0: enP49p3s0f1
Interfaces:
ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 2044
        inet 172.16.50.204  netmask 255.255.252.0  broadcast 172.16.51.255
        inet6 fe80::1e34:da03:7d:6c0e  prefixlen 64  scopeid 0x20<link>
Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
        infiniband 00:00:10:87:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00  txqueuelen 256  (InfiniBand)
        RX packets 172  bytes 34188 (33.3 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 231  bytes 29724 (29.0 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

enP49p3s0f1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.16.44.4  netmask 255.255.248.0  broadcast 172.16.47.255
        inet6 fe80::a94:efff:fe80:db5f  prefixlen 64  scopeid 0x20<link>
        ether 08:94:ef:80:db:5f  txqueuelen 1000  (Ethernet)
        RX packets 1873  bytes 395482 (386.2 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 2644  bytes 421936 (412.0 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
        device interrupt 59
Server: CentOS 7 + MOFED 4.9 on x86_64, Lustre 2.12.5 (but not touched during this test)
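To rule out a mismatch between the configured ko2iblnd options and what the loaded module is actually using, something like the following could be run on the client. This is only a sketch: the function names are made up for illustration, and it assumes the module exposes its parameters under /sys/module/ko2iblnd/parameters and that the options line is passed in as a single string.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Lustre.

# Print each "key=value" token of an "options ko2iblnd ..." line
# as "key value", one pair per line.
parse_ko2iblnd_opts() {
    # $1: the full options line, e.g. copied from the modprobe.d file
    printf '%s\n' "$1" | tr ' ' '\n' | awk -F= 'NF == 2 { print $1, $2 }'
}

# Report every option whose configured value differs from the value
# the loaded module shows in sysfs (assumed location, see above).
check_ko2iblnd_opts() {
    parse_ko2iblnd_opts "$1" | while read -r key want; do
        f="/sys/module/ko2iblnd/parameters/$key"
        if [ -r "$f" ]; then
            have=$(cat "$f")
            [ "$have" = "$want" ] || echo "$key: configured $want, loaded $have"
        else
            echo "$key: not readable in sysfs"
        fi
    done
}
```

Calling check_ko2iblnd_opts with the options line quoted above prints nothing if every value matches; any line of output points at an option that was not applied as written.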
Description
Hi,
I'm trying to get the Lustre client working with RHEL 8.3 and MOFED 5.2 or later on the ppc64le architecture, and have run into trouble.
After cherry-picking the commit for LU-13783, Lustre 2.12.6 builds. Once it is installed I can configure LNet, and lnetctl ping of the box's own TCP NID works, but the box is unable to lnetctl ping itself over InfiniBand:
[root@infer004 ~]# systemctl start lnet
[root@infer004 ~]# lnetctl ping 172.16.44.4@tcp
ping:
    - primary nid: 172.16.44.4@tcp
      Multi-Rail: False
      peer ni:
        - nid: 172.16.50.204@o2ib
        - nid: 172.16.44.4@tcp
[root@infer004 ~]# lnetctl ping 172.16.50.204@o2ib
manage:
    - ping:
          errno: -1
          descr: failed to ping 172.16.50.204@o2ib: Input/output error
Syslog contains:
May 7 12:51:17 infer004 kernel: LNet: HW NUMA nodes: 2, HW CPU cores: 160, npartitions: 2
May 7 12:51:17 infer004 kernel: alg: No test for adler32 (adler32-zlib)
May 7 12:51:17 infer004 kernel: alg: hash: digest failed on test 1 for crc32-table: ret=126
May 7 12:51:17 infer004 kernel: LNet: Using FastReg for registration
May 7 12:51:19 infer004 kernel: LNet: Added LNI 172.16.50.204@o2ib [32/1024/0/180]
May 7 12:51:19 infer004 kernel: LNet: Added LNI 172.16.44.4@tcp [8/256/0/180]
May 7 12:51:19 infer004 kernel: LNet: Accept secure, port 988
May 7 12:51:17 infer004 systemd[1]: Starting lnet management...
May 7 12:51:19 infer004 systemd[1]: Started lnet management.
May 7 12:51:41 infer004 kernel: LNet: 9655:0:(o2iblnd_cb.c:3420:kiblnd_check_conns()) Timed out tx for 172.16.50.204@o2ib: 217 seconds
May 7 12:51:42 infer004 kernel: LNetError: 9649:0:(lib-move.c:2955:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-172.16.50.204@o2ib: -125
May 7 12:51:42 infer004 kernel: LNet: 9655:0:(o2iblnd_cb.c:3420:kiblnd_check_conns()) Timed out tx for 172.16.50.204@o2ib: 218 seconds
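One detail in the log above may be worth noting: the reported tx ages (217 and 218 seconds) are far longer than the roughly 24 seconds between LNet starting at 12:51:17 and the first timeout message at 12:51:41. A small filter like the one below could be used to pull just the NID and the reported age out of the kernel log. It is a sketch only; the function name is made up, and it assumes the message format shown above.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log lines on stdin and print
# "<nid> <seconds>" for every "Timed out tx" message.
timed_out_tx() {
    sed -n 's/.*Timed out tx for \([^:]*\): \([0-9][0-9]*\) seconds.*/\1 \2/p'
}
```

For example, `dmesg | timed_out_tx` or `journalctl -k | timed_out_tx` would list one line per timeout.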
After attempting the ping over InfiniBand, the load average of the otherwise idle system rises from ~0.00 to 1.00, "systemctl stop lnet" hangs, and the following is added to syslog:
May 7 12:57:01 infer004 systemd[1]: Stopping lnet management...
May 7 12:57:04 infer004 kernel: LNet: Removed LNI 172.16.44.4@tcp
May 7 12:57:05 infer004 kernel: LNet: 9702:0:(o2iblnd.c:3012:kiblnd_shutdown()) 172.16.50.204@o2ib: waiting for 1 peers to disconnect
May 7 12:57:09 infer004 kernel: LNet: 9702:0:(o2iblnd.c:3012:kiblnd_shutdown()) 172.16.50.204@o2ib: waiting for 1 peers to disconnect
May 7 12:57:17 infer004 kernel: LNet: 9702:0:(o2iblnd.c:3012:kiblnd_shutdown()) 172.16.50.204@o2ib: waiting for 1 peers to disconnect
May 7 12:57:34 infer004 kernel: LNet: 9702:0:(o2iblnd.c:3012:kiblnd_shutdown()) 172.16.50.204@o2ib: waiting for 1 peers to disconnect
May 7 12:58:07 infer004 kernel: LNet: 9702:0:(o2iblnd.c:3012:kiblnd_shutdown()) 172.16.50.204@o2ib: waiting for 1 peers to disconnect
If I downgrade MOFED to 5.1-2.5.8.0 and rebuild Lustre 2.12.6 + LU-13783, the box is able to lnetctl ping itself on its InfiniBand interface.
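To make the working and failing builds easier to compare, a snapshot like the following could be captured under each MOFED release and then diffed. This is a sketch under assumptions: the function names are made up, and it assumes `ofed_info -s` prints a single line of the form `MLNX_OFED_LINUX-<version>:` and that the ko2iblnd module is installed where modinfo can find it.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Lustre or MOFED.

# Read the output of `ofed_info -s` on stdin and print only the version.
# Assumed input form: "MLNX_OFED_LINUX-<version>:"
mofed_version() {
    sed -n 's/^MLNX_OFED_LINUX-\(.*\):$/\1/p'
}

# Print the MOFED version plus the kernel and module dependencies that
# the installed ko2iblnd module was built against.
snapshot_o2ib_build() {
    echo "mofed: $(ofed_info -s | mofed_version)"
    echo "vermagic: $(modinfo -F vermagic ko2iblnd)"
    echo "depends: $(modinfo -F depends ko2iblnd)"
}
```

Running `snapshot_o2ib_build > mofed-5.2.txt` on the failing install and the same with a different file name after the downgrade would give two files to diff.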
Any ideas, please?
Thanks,
Mark