Details
- Type: Bug
- Resolution: Unresolved
- Priority: Minor
- Affects Versions: Lustre 2.15.2, Lustre 2.15.3, Lustre 2.15.5
Description
Recently an older mailing list post of mine got a reply from someone seeing very similar symptoms to something I'd run into in the past. In that particular case we were hitting a different bug with OPA on EL8+, but the symptoms were similar: connectivity issues with clients, without any obvious problems in the underlying fabric.
After we worked around our issues with OPA (by moving to socklnd, then ultimately away from OPA to InfiniBand with o2iblnd), we noticed that having discovery enabled in our environment was leading to connectivity issues with clients. I chalked this up to some other eccentricity in our environment, given the lack of similar issues being reported by others, and have been habitually disabling discovery on clients without a second thought ever since.
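For what it's worth, besides the module option shown further down, discovery can also be toggled on a running client. A minimal sketch (the runtime setting does not persist across a reboot, which is why we keep it in modprobe.d as well):

# turn off peer discovery on the running node
lnetctl set discovery 0

# verify the global setting took effect
lnetctl global show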
Jesse's similar issues, coupled with them seemingly being resolved by disabling discovery, do make me think there's a bug in peer discovery somewhere. Or both of us are doing things wrong, which is possible.
I don't have any logs or notes left from when I experienced this firsthand (hopefully Jesse can provide some), but I do have some observations about the current state of things that seem odd.
Our servers are configured with two NIDs, one on tcp0 and one on o2ib0, to allow the volume to be reached both by the remaining OPA clients and by the IB clients. The network the o2ib0 interfaces use is a pkey partition on the fabric, so each node has two logical IB interfaces. I mention that just in case having multiple IB interfaces could contribute somehow.
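For context, the server-side LNet module configuration is along these lines; the interface names here are illustrative, not our exact ones:

# /etc/modprobe.d/lnet.conf on the servers (interface names are placeholders)
options lnet networks=o2ib0(ib0.8a51),tcp0(eth0)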
On a client that has the volume mounted, I see the following peer NIDs:
[root@amp-8 ~]# lnetctl peer list
peer list:
    - nid: 172.16.200.250@o2ib
    - nid: 172.16.100.250@tcp
    - nid: 172.16.200.251@o2ib
    - nid: 172.16.200.252@o2ib
    - nid: 172.16.100.251@tcp
    - nid: 172.16.100.252@tcp
    - nid: 172.16.200.253@o2ib
    - nid: 172.16.100.253@tcp
with the interface configured in a modprobe.d file as:
[root@amp-8 ~]# cat /etc/modprobe.d/lnet.conf
options lnet networks=o2ib0(ibp38s0.8a51)
options lnet lnet_peer_discovery_disabled=1
The 172.16.100.0/24 network is the older OPA network; the 172.16.200.0/24 network is the 8a51 pkey InfiniBand network.
This client shouldn't have any knowledge of the OPA network; it certainly doesn't have any configuration for that network or connectivity to it. But it does see the tcp NIDs of the servers it's connected to via o2ib. I'm wondering whether, with discovery enabled, this leads to Multi-Rail being configured incorrectly, with the servers trying to send traffic down the tcp network to clients that only exist on the o2ib network, or something similar.
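For reference, this is roughly how I've been inspecting what discovery attaches to a server peer, and how a stale NID could be pruned; the assumption that the o2ib NID is the primary NID here is mine:

# show every NID (and its health state) LNet has associated with this server peer
lnetctl peer show --nid 172.16.200.250@o2ib -v

# drop the unreachable tcp NID from that peer entry
lnetctl peer del --prim_nid 172.16.200.250@o2ib --nid 172.16.100.250@tcp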
Okay, we've had enough stability now that I think we can rule out Multi-Rail for my case. It looks like this was entirely due to InfiniBand RDMA communication problems, which I was able to observe as failed rpings and lctl pings between systems using the o2ib LND when the issue occurred. The same IP addresses worked fine over the socklnd. rping showed this useful diagnostic error:
ibv_create_cq failed
setup_qp failed: 12
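For anyone chasing the same thing, the checks that exposed this were along the following lines (if I'm reading it right, the 12 in "setup_qp failed: 12" is ENOMEM); the addresses are the server NIDs from the earlier comment:

# LNet-level ping over the o2ib LND -- this failed while the issue was occurring
lctl ping 172.16.200.250@o2ib

# raw RDMA check with rping: start a server on the target node...
rping -s -a 172.16.200.250 -v
# ...then connect from the other node; this is where ibv_create_cq/setup_qp failed
rping -c -a 172.16.200.250 -v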
I'm not sure if Shane observed the same issue, nor do I think this is necessarily widespread. The systems in question were upgraded in place to Rocky 8 and are running older ConnectX-3 IB hardware, so the RDMA issue here may be specific to this particular combination.
I resolved this issue for us by switching these specific Lustre servers to ksocklnd instead of o2iblnd, using replace_nids and updating the appropriate ZFS properties, and by adding both o2ib and tcp NIDs to the cluster nodes (which communicate with other IB storage and still need o2ib).
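For anyone needing to do something similar, the change was roughly of this shape. The fsname, dataset names, interfaces, and NIDs below are placeholders, I've assumed the relevant ZFS property is lustre:mgsnode, and the exact replace_nids procedure (targets stopped, MGS available) is in the Lustre manual:

# on the MGS, with the other targets stopped: point each target at its new tcp NID
lctl replace_nids testfs-MDT0000 172.16.100.250@tcp
lctl replace_nids testfs-OST0000 172.16.100.251@tcp

# on each ZFS-backed target, update the stored mgsnode property (assumed name) to match
zfs set lustre:mgsnode=172.16.100.250@tcp mdtpool/mdt0
zfs set lustre:mgsnode=172.16.100.250@tcp ostpool/ost0

# /etc/modprobe.d/lnet.conf on the cluster nodes: carry both networks so the
# IB-only storage stays reachable alongside the now-tcp Lustre servers
options lnet networks=tcp0(eth0),o2ib0(ib0)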