[LU-12289] Route with fault remote device selected on separated IB subnet Created: 13/May/19  Updated: 16/Oct/20

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.1
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Tatsushi Takamura Assignee: Tatsushi Takamura
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
is related to LU-13025 LNet Health: Peer net health not cons... Resolved
Epic/Theme: lnet
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

LNet Multi-Rail selects the devices for a route in order, from local to remote.
The local device is determined in round-robin fashion,
so even if the health value of remote device (A) is smaller than that of remote device (B),
local device (A), which is the peer of remote device (A), may be selected.
If device (A) and device (B) are on different subnets, local device (A) cannot reach remote device (B), so a failed route is selected.

          Subnet1     Subnet2

           (DOWN)  |
    REMOTE  IB(A)  |   IB(B)
            ↑      |
            x      |
            |      |
     LOCAL  IB(A)  |   IB(B)

The local device is selected in round-robin fashion, regardless of the status of the remote-side device.
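The situation in the diagram can be reproduced with a small model. This is only an illustrative sketch, not LNet code: the device names, the subnet labels, and the health values (0 for the down device, 1000 for a healthy one) are assumptions made for the example.

```python
from itertools import cycle

# Hypothetical model: health of the only remote device reachable on each subnet.
remote_health = {"subnet1": 0, "subnet2": 1000}  # remote IB(A) is down

# Local devices, each tied to exactly one subnet.
local_nis = [("IB(A)", "subnet1"), ("IB(B)", "subnet2")]

# Plain round-robin over local devices, ignoring remote health.
rr = cycle(local_nis)
picks = [next(rr) for _ in range(4)]

# Every pick of local IB(A) ends up on the route whose remote end is down.
failed = [name for name, subnet in picks if remote_health[subnet] == 0]
```

In this model, half of the selections land on the route through subnet 1, although a healthy route through subnet 2 is available.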

We modify the algorithm that finds the best local device as follows:

  1. Get the maximum health value of the remote device.
  2. If the value is smaller than the best health value, do not use this device.
  3. If the value is bigger than the best health value, update the best device.
  4. If the value is identical to the best health value, update the best device in the conventional way.
  5. Update the best health value.
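The five steps can be sketched as follows. This is a minimal model under stated assumptions, not the actual LNet implementation: the `LocalNI` and `RemoteNI` types and the `select_local_ni` function are hypothetical, "remote device" is read as the remote devices on the same subnet as the candidate local device, and the "conventional way" of breaking ties is modelled as picking the least-used local device.

```python
from dataclasses import dataclass

@dataclass
class RemoteNI:
    name: str
    subnet: str
    health: int  # higher is healthier; values are illustrative

@dataclass
class LocalNI:
    name: str
    subnet: str
    seq: int = 0  # hypothetical use counter for round-robin tie-breaking

def select_local_ni(local_nis, remote_nis):
    """Pick a local device, preferring the one whose remote peers are healthiest."""
    best = None
    best_health = -1
    for ni in local_nis:
        reachable = [r.health for r in remote_nis if r.subnet == ni.subnet]
        if not reachable:
            continue  # no remote device on this subnet
        peer_health = max(reachable)      # step 1
        if peer_health < best_health:     # step 2: worse than the best, skip
            continue
        if peer_health > best_health:     # step 3: strictly better, take it
            best = ni
        elif ni.seq < best.seq:           # step 4: tie, fall back to round-robin
            best = ni
        best_health = peer_health         # step 5
    if best is not None:
        best.seq += 1
    return best

# Usage: remote IB(A) is down, so local IB(B) is chosen every time.
remotes = [RemoteNI("IB(A)", "subnet1", 0), RemoteNI("IB(B)", "subnet2", 1000)]
locals_ = [LocalNI("IB(A)", "subnet1"), LocalNI("IB(B)", "subnet2")]
chosen = [select_local_ni(locals_, remotes).name for _ in range(3)]
```

When both remote devices report the same health, the tie-break in step 4 keeps the existing round-robin behaviour, so traffic still alternates between the two local devices.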


 Comments   
Comment by Amir Shehata (Inactive) [ 16/May/19 ]

We've made some changes as part of the Multi-Rail Routing feature, which I believe should address the issues you mentioned here. Could you please look at this patch series? It is planned to be merged into master in the near future.

The tip of the patch series starts here: https://review.whamcloud.com/#/c/34580/

Generated at Sat Feb 10 02:51:16 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.