  Lustre / LU-12453

ko2iblnd: problem handling link failures on bonded interfaces

Details

    • Bug
    • Resolution: Unresolved
    • Blocker
    • None
    • Lustre 2.10.8
    • None
    • RDMA over Ethernet:
       - Mellanox ConnectX-5 adapters, ES7990
       - o2iblnd(bond0.881)
       - bond0.881: mlx5_0 + mlx5_1 (interfaces from separate HCAs)
       - bond0 type is active-passive
    • 3
    • 9223372036854775807

    Description

       

      We have encountered a problem when running RoCEv2 over bonded interfaces.
      When the bond0 interface is created on top of two slave interfaces belonging to separate HCAs, and the primary interface fails
      after the RDMA QPs have been created, the LNet connection is not properly re-established over the backup link.

      In that case the only solution is to re-enable or fix the primary interface, or to restart LNet by reloading the kernel modules.

      The problem has been seen on ES7990 as well as on vanilla Lustre 2.10.*

      Normally, when the bond is created on top of two ports belonging to the same HCA, the mlx driver handles link failure by moving the QPs. In the case described above, link failure must be handled in the ko2iblnd driver.
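
To check whether a given bond falls into the cross-HCA case, the slave-to-device mapping can be read from sysfs. The following is a minimal diagnostic sketch, not part of ko2iblnd. It assumes the usual Linux layout (`/sys/class/net/<bond>/bonding/slaves` and `/sys/class/infiniband/<dev>/device/net/<ifname>`). The function names and the `sysfs_root` parameter are hypothetical and exist only for illustration. It reports RDMA device names as listed in sysfs, which may not map one-to-one onto physical cards.

```python
import os


def bond_slaves(bond, sysfs_root="/sys"):
    """Return the slave interface names of a bonding device."""
    path = os.path.join(sysfs_root, "class", "net", bond, "bonding", "slaves")
    with open(path) as f:
        return f.read().split()


def netdev_to_rdma_device(sysfs_root="/sys"):
    """Map each network interface name to the RDMA device that backs it."""
    mapping = {}
    ib_dir = os.path.join(sysfs_root, "class", "infiniband")
    if not os.path.isdir(ib_dir):
        return mapping
    for rdma_dev in sorted(os.listdir(ib_dir)):
        net_dir = os.path.join(ib_dir, rdma_dev, "device", "net")
        if os.path.isdir(net_dir):
            for ifname in os.listdir(net_dir):
                mapping[ifname] = rdma_dev
    return mapping


def bond_rdma_devices(bond, sysfs_root="/sys"):
    """Return {slave: rdma_device}; a slave with no RDMA device maps to None."""
    mapping = netdev_to_rdma_device(sysfs_root)
    return {slave: mapping.get(slave) for slave in bond_slaves(bond, sysfs_root)}


def spans_multiple_rdma_devices(bond, sysfs_root="/sys"):
    """True if the bond's slaves are backed by more than one RDMA device."""
    devices = set(bond_rdma_devices(bond, sysfs_root).values())
    devices.discard(None)
    return len(devices) > 1
```

On a node configured as in this report, `spans_multiple_rdma_devices("bond0")` would be expected to return `True`, since the two slaves sit on mlx5_0 and mlx5_1. Note that the bonding directory belongs to `bond0`, not to the VLAN interface `bond0.881`.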

      The following message is logged when the problem occurs:

      e0-oss03 kernel: LNetError: 4598:0:(o2iblnd.c:831:kiblnd_create_conn()) cmid HCA(mlx5_0), kib_dev(bond0.881) need failover
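
For anyone scanning kernel logs across many servers for this condition, the message can be matched with a short script. This is a minimal sketch that assumes the message format is exactly as quoted above. The function name `parse_need_failover` is hypothetical, and other Lustre versions may word the message differently.

```python
import re

# Matches the tail of the kiblnd_create_conn() message quoted in this report.
NEED_FAILOVER_RE = re.compile(
    r"cmid HCA\((?P<hca>[^)]+)\), kib_dev\((?P<dev>[^)]+)\) need failover"
)


def parse_need_failover(line):
    """Return (hca, kib_dev) if the line is a 'need failover' message, else None."""
    match = NEED_FAILOVER_RE.search(line)
    if match is None:
        return None
    return match.group("hca"), match.group("dev")


if __name__ == "__main__":
    sample = (
        "e0-oss03 kernel: LNetError: 4598:0:(o2iblnd.c:831:kiblnd_create_conn()) "
        "cmid HCA(mlx5_0), kib_dev(bond0.881) need failover"
    )
    print(parse_need_failover(sample))
```

The first element is the HCA the connection manager ID was bound to, and the second is the device ko2iblnd is configured on.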

       

       


          People

            wc-triage WC Triage
            lflis Lukasz Flis