Details
-
Bug
-
Resolution: Unresolved
-
Blocker
-
None
-
Lustre 2.10.8
-
None
-
RDMA over Ethernet:
- Mellanox ConnectX-5 adapters, ES7990
- o2iblnd(bond0.881)
- bond0.881: mlx5_0 + mlx5_1 (interfaces from separate HCAs)
- bond0 type is active-passive
-
3
Description
We have encountered a problem when running RoCEv2 over bonded interfaces.
When the bond0 interface is created on top of two slave interfaces belonging to separate HCAs and the primary interface fails after the RDMA QPs have been created, the LNet connection is not properly re-established over the backup link.
In that case the only solutions are to re-enable or fix the primary interface, or to restart LNet by reloading the kernel modules.
The problem has been seen on ES7990 as well as in vanilla Lustre 2.10.*
Normally, when the bond is created on top of two ports belonging to the same HCA, the mlx driver handles link failure by moving the QPs. In the case described above, link failure must be handled in the ko2iblnd driver.
The following message is logged when the problem occurs:
e0-oss03 kernel: LNetError: 4598:0:(o2iblnd.c:831:kiblnd_create_conn()) cmid HCA(mlx5_0), kib_dev(bond0.881) need failover
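To confirm that a node is in the state described above, it may help to compare the HCA named in the log message with the HCA that backs the bond's currently active slave. The following is only a diagnostic sketch, not part of Lustre: the function names are hypothetical, and it assumes the standard Linux sysfs layout (`/sys/class/net/<bond>/bonding/active_slave` and `/sys/class/infiniband/<hca>/device/net/<netdev>`) and that the bond name is the part of `kib_dev` before the VLAN suffix.

```python
import re
from pathlib import Path

# Matches the "need failover" message quoted in this report.
FAILOVER_RE = re.compile(
    r"cmid HCA\((?P<hca>[^)]+)\), kib_dev\((?P<dev>[^)]+)\) need failover"
)


def parse_failover(line):
    """Return (hca, kib_dev) from a 'need failover' log line, or None."""
    m = FAILOVER_RE.search(line)
    return (m.group("hca"), m.group("dev")) if m else None


def active_slave(bond, sysfs="/sys"):
    """Return the bond's active slave netdev name, or None if there is none."""
    path = Path(sysfs) / "class" / "net" / bond / "bonding" / "active_slave"
    return path.read_text().strip() or None


def hca_for_netdev(netdev, sysfs="/sys"):
    """Return the RDMA device (e.g. mlx5_1) whose PCI device owns netdev."""
    ib_root = Path(sysfs) / "class" / "infiniband"
    for hca in sorted(ib_root.glob("*")):
        if (hca / "device" / "net" / netdev).exists():
            return hca.name
    return None


def needs_failover(line, sysfs="/sys"):
    """True if the HCA in the log line differs from the active slave's HCA."""
    parsed = parse_failover(line)
    if parsed is None:
        return False
    log_hca, kib_dev = parsed
    bond = kib_dev.split(".")[0]  # assumption: strip VLAN suffix, bond0.881 -> bond0
    slave = active_slave(bond, sysfs)
    return slave is not None and hca_for_netdev(slave, sysfs) != log_hca
```

Passing a kernel log line to `needs_failover()` on the affected node reports whether the connection is still bound to an HCA other than the one behind the bond's active slave, which is the condition this report describes.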