[LU-12453] ko2iblnd: problem handling link failures on bonded interfaces Created: 19/Jun/19  Updated: 03/Oct/19

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.8
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Lukasz Flis Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None
Environment:

RDMA over Ethernet:

  • Mellanox ConnectX-5 adapters, ES7990
  • o2iblnd(bond0.881)
  • bond0.881: mlx5_0 + mlx5_1 (interfaces from separate HCAs)
  • bond0 type is active-passive

Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

 

We have encountered a problem when running RoCEv2 over bonded interfaces.
When the bond0 interface is created on top of two slave interfaces belonging to separate HCAs and the primary interface fails
after the RDMA QPs have been created, the LNet connection is not properly re-established over the backup link.

In that case the only solution is to re-enable or fix the primary interface, or to restart LNet by reloading the kernel modules.

The problem has been seen on ES7990 as well as in vanilla Lustre 2.10.*

Normally, when bonding is created on top of two ports belonging to the same HCA, the mlx driver handles link failure by moving the QPs. In the case described above, link failure must be handled in the ko2iblnd driver.
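
To check which RDMA device sits behind each bond slave on an affected node, something like the sketch below may help. It is not part of Lustre or ko2iblnd. It assumes the usual Linux bonding sysfs files (`bonding/slaves`, `bonding/active_slave`) and assumes that the RDMA device for a netdev is listed under `device/infiniband/`, which should be verified on the target kernel. The slave names `ens1f0` and `ens2f0` and the helper names are made up for illustration; the demo runs against a throwaway directory tree, not the real `/sys`. Distinct RDMA device names do not by themselves prove that the ports are on physically separate adapters.

```python
import tempfile
from pathlib import Path


def bond_rdma_view(bond, sysfs_root="/sys"):
    """Return (active_slave, {slave: [RDMA device names]}) for a bonding interface."""
    net = Path(sysfs_root) / "class" / "net"
    bonding = net / bond / "bonding"
    active = (bonding / "active_slave").read_text().strip()
    view = {}
    for slave in (bonding / "slaves").read_text().split():
        # Assumption: the RDMA device(s) behind a netdev are listed in this directory.
        ib_dir = net / slave / "device" / "infiniband"
        view[slave] = sorted(p.name for p in ib_dir.iterdir()) if ib_dir.is_dir() else []
    return active, view


def build_fake_sysfs(root):
    """Create a throwaway tree mimicking this ticket's setup (slave names are hypothetical)."""
    net = Path(root) / "class" / "net"
    (net / "bond0" / "bonding").mkdir(parents=True)
    (net / "bond0" / "bonding" / "slaves").write_text("ens1f0 ens2f0\n")
    (net / "bond0" / "bonding" / "active_slave").write_text("ens1f0\n")
    for slave, hca in (("ens1f0", "mlx5_0"), ("ens2f0", "mlx5_1")):
        (net / slave / "device" / "infiniband" / hca).mkdir(parents=True)


# Demo against the fake tree; on a real node call bond_rdma_view("bond0") instead.
with tempfile.TemporaryDirectory() as tmp:
    build_fake_sysfs(tmp)
    active, view = bond_rdma_view("bond0", sysfs_root=tmp)

print(active, view)
```

Note that the bonding files live under the bond master (bond0 here), not under the VLAN interface bond0.881 that o2iblnd is configured on.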

The following message is logged when the problem occurs:

e0-oss03 kernel: LNetError: 4598:0:(o2iblnd.c:831:kiblnd_create_conn()) cmid HCA(mlx5_0), kib_dev(bond0.881) need failover
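
To spot affected nodes, the kernel log can be scanned for this message. The sketch below is a minimal example; the regular expression is modeled only on the single line quoted above, so other Lustre versions may format the message differently, and the function name is hypothetical.

```python
import re

# Pattern derived from the one log line in this ticket; treat it as an assumption.
NEED_FAILOVER = re.compile(
    r"kiblnd_create_conn\(\)\)\s+cmid HCA\((?P<hca>[^)]+)\),"
    r"\s+kib_dev\((?P<dev>[^)]+)\)\s+need failover"
)


def find_failover_events(lines):
    """Return (hca, kib_dev) pairs for every 'need failover' line found."""
    events = []
    for line in lines:
        match = NEED_FAILOVER.search(line)
        if match:
            events.append((match.group("hca"), match.group("dev")))
    return events


sample = (
    "e0-oss03 kernel: LNetError: 4598:0:(o2iblnd.c:831:kiblnd_create_conn()) "
    "cmid HCA(mlx5_0), kib_dev(bond0.881) need failover"
)
events = find_failover_events([sample])
print(events)
```

On a live system the input lines could come from the kernel log or a syslog file instead of the sample string.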

 

 

