[LU-14806] o2iblnd: IB HCA failover with o2ib bonding is broken
Created: 02/Jul/21 | Updated: 17/Sep/21 | Resolved: 31/Jul/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.15.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Serguei Smirnov | Assignee: | Serguei Smirnov |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | bonding, failover, lnet, o2iblnd |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
The ko2iblnd module option "dev_failover=1" does not behave properly on a node that uses o2ib bonding. |
| Comments |
| Comment by Serguei Smirnov [ 05/Jul/21 ] |
|
Steps to reproduce (two IB-enabled nodes are required, one of them with two IB interfaces):
3: ib0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 2044 qdisc mq master bond0 state UP group default qlen 256
    link/infiniband a0:00:02:10:fe:80:00:00:00:00:00:00:00:02:c9:03:00:5a:63:2b brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
4: ib1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 65520 qdisc mq master bond0 state UP group default qlen 256
    link/infiniband a0:00:02:20:fe:80:00:00:00:00:00:00:00:02:c9:03:00:5a:63:2c brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
5: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 2044 qdisc noqueue state UP group default qlen 1000
    link/infiniband a0:00:02:10:fe:80:00:00:00:00:00:00:00:02:c9:03:00:5a:63:2b brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
    inet 10.1.0.9/24 brd 10.1.0.255 scope global noprefixroute bond0
       valid_lft forever preferred_lft forever
    inet6 fe80::202:c903:5a:632b/64 scope link
       valid_lft forever preferred_lft forever
lnetctl lnet configure
lnetctl net add --net o2ib --if bond0
lnetctl discover 10.1.0.10@o2ib
lnetctl ping 10.1.0.10@o2ib (repeat multiple times)
cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: fault-tolerance (active-backup) (fail_over_mac active)
Primary Slave: ib0 (primary_reselect always)
Currently Active Slave: ib0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 100
Down Delay (ms): 100
Peer Notification Delay (ms): 0

Slave Interface: ib0
MII Status: up
Speed: 20000 Mbps
Duplex: full
Link Failure Count: 18
Permanent HW addr: a0:00:02:10:fe:80:00:00:00:00:00:00:00:02:c9:03:00:5a:63:2b
Slave queue ID: 0

Slave Interface: ib1
MII Status: up
Speed: 20000 Mbps
Duplex: full
Link Failure Count: 7
Permanent HW addr: a0:00:02:20:fe:80:00:00:00:00:00:00:00:02:c9:03:00:5a:63:2c
Slave queue ID: 0
ping 10.1.0.10
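When repeating the pings above, it can help to record which slave the bond is currently using and how often each link has failed. The following is a minimal sketch and not part of the original report: the function name `bond_summary` is hypothetical, and it assumes the usual `/proc/net/bonding/<bond>` layout with one "Name: value" field per line.

```shell
#!/bin/sh
# Hypothetical helper (not from the original report): summarize a bonding
# status file. Defaults to /proc/net/bonding/bond0 when no path is given.
bond_summary() {
    status_file="${1:-/proc/net/bonding/bond0}"
    awk -F': ' '
        # Slave that currently carries the traffic
        /^Currently Active Slave:/ { print "active=" $2 }
        # Remember which slave section we are in
        /^Slave Interface:/        { slave = $2 }
        # Report the failure counter of that slave
        /^Link Failure Count:/     { print slave " failures=" $2 }
    ' "$status_file"
}
```

For the status output listed in the steps above, `bond_summary` would print `active=ib0`, `ib0 failures=18`, and `ib1 failures=7`, one per line. Running it before and after each `lnetctl ping` shows whether the active slave or the failure counters changed in between.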
|
| Comment by Gerrit Updater [ 05/Jul/21 ] |
|
Serguei Smirnov (ssmirnov@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/44139 |
| Comment by Gerrit Updater [ 31/Jul/21 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/44139/ |
| Comment by Peter Jones [ 31/Jul/21 ] |
|
Landed for 2.15 |