Details
-
Bug
-
Resolution: Unresolved
-
Minor
-
None
-
Lustre 2.12.1
-
None
-
3
Description
LNet Multi-Rail selects a route by choosing the local device first and the remote device second.
The local device is determined in round-robin fashion,
so even if the health value of remote device (A) is lower than that of another remote device (B),
local device (A), which is the peer of remote device (A), may be selected.
If device (A) and device (B) are on different subnets, a failed route will be selected.
          Subnet1    Subnet2
          (DOWN)   |
  REMOTE  IB(A)    |  IB(B)
            ↑      |
            x      |
            |      |
  LOCAL   IB(A)    |  IB(B)

The local device is selected in round-robin fashion regardless of the status of the remote-side device.
We modify the algorithm that finds the best local device as follows. For each candidate local device:
- get the maximum health value of its remote devices
- if the value is smaller than the best health value so far, don't use this device
- if the value is bigger than the best health value, make this device the best device
- if the value is equal to the best health value, update the best device in the conventional way
- update the best health value
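The steps above can be sketched in C as follows. This is only an illustration, not the actual LNet code: the `remote_dev` and `local_dev` structures, the `seq` round-robin counter, and both function names are hypothetical. The "conventional way" of breaking ties is assumed here to mean preferring the local device with the lower round-robin sequence number.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified types; these are NOT the real LNet structures. */
struct remote_dev {
    int health;                     /* health value, assumed non-negative */
};

struct local_dev {
    const struct remote_dev *peers; /* remote devices reachable from this device */
    size_t n_peers;
    unsigned int seq;               /* assumed round-robin counter: lower = preferred */
};

/* Maximum health among the remote devices of one local device; -1 if it has none. */
static int max_remote_health(const struct local_dev *ld)
{
    int best = -1;

    for (size_t i = 0; i < ld->n_peers; i++)
        if (ld->peers[i].health > best)
            best = ld->peers[i].health;
    return best;
}

/* Returns the index of the selected local device, or -1 if there are none. */
static int select_best_local(const struct local_dev *locals, size_t n)
{
    int best_idx = -1;
    int best_health = -1;

    assert(locals != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        int h = max_remote_health(&locals[i]);

        if (h < best_health)
            continue;   /* remote side is less healthy: don't use this device */
        if (h == best_health && best_idx >= 0 &&
            locals[i].seq >= locals[best_idx].seq)
            continue;   /* tie: round-robin order keeps the current best */

        best_idx = (int)i;   /* healthier remote side, or tie won on seq */
        best_health = h;     /* update the best health value */
    }
    return best_idx;
}
```

With this ordering, a local device whose remote peer has a low health value is skipped even when round-robin order alone would have picked it.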
We've made some changes as part of the Multi-Rail Routing feature, which I believe should address the issues you mentioned here. Could you please take a look at this patch series? It is planned to be merged into master in the near future.
The tip of the patch series starts here: https://review.whamcloud.com/#/c/34580/