[LU-12291] Wrong NI selection on asymmetric Multi-rail environment Created: 13/May/19  Updated: 16/Oct/20

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Tatsushi Takamura Assignee: Tatsushi Takamura
Resolution: Unresolved Votes: 0
Labels: None

Epic/Theme: lnet
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

If the sending node is MultiRail and the receiving node is non-MultiRail,
the sending node always uses the same NI, even when that NI is broken.
This may be the intended MultiRail behavior, but a broken device should not be used.

    REMOTE  IB0              (non-MultiRail node)
            ↑
            x
            |
     LOCAL  IB0      IB1     <- always uses IB0 (not in round-robin fashion)
            failure    

If the receiving node is non-MultiRail, we check whether its device is normal or out of service and reset the device in case of failure.



 Comments   
Comment by Amir Shehata (Inactive) [ 16/May/19 ]

We always stick with the same device because doing otherwise would confuse the non-MR peer. If the non-MR peer initiated the connection on a specific NID, it always expects communication from that same NID. If the MR node uses another NID, the non-MR peer will treat it as communication from a different node.

The reset of the device on failure sounds interesting. How do you do that?
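
To illustrate the sticky-NID reasoning above, here is a minimal toy model in Python. It is not LNet code: the `Peer` and `LocalNode` classes, the `select_ni` method, and the NID strings are all hypothetical names invented for this sketch. It only shows the selection policy being described: round-robin across local NIs for a MultiRail peer, and one pinned local NID for a non-MultiRail peer.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import List, Optional


@dataclass
class Peer:
    """Hypothetical stand-in for a remote node (not an LNet structure)."""
    nid: str
    multi_rail: bool
    # For a non-MR peer: the local NID this peer already associates with us.
    pinned_local_nid: Optional[str] = None


@dataclass
class LocalNode:
    """Toy MultiRail sender that owns several local NIs."""
    local_nids: List[str]
    _rr: count = field(default_factory=count, init=False, repr=False)

    def select_ni(self, peer: Peer) -> str:
        if peer.multi_rail:
            # An MR peer knows all of our NIDs, so rotating between them is safe.
            return self.local_nids[next(self._rr) % len(self.local_nids)]
        # A non-MR peer identifies a node by a single NID. Switching the source
        # NID would make us look like a different node, so the first choice
        # is kept for every later message.
        if peer.pinned_local_nid is None:
            peer.pinned_local_nid = self.local_nids[0]
        return peer.pinned_local_nid


# Illustrative NIDs only; they do not come from the report.
node = LocalNode(["10.0.0.1@o2ib", "10.0.1.1@o2ib1"])
mr_peer = Peer("10.0.0.2@o2ib", multi_rail=True)
non_mr_peer = Peer("10.0.0.3@o2ib", multi_rail=False)

mr_choices = [node.select_ni(mr_peer) for _ in range(4)]
non_mr_choices = [node.select_ni(non_mr_peer) for _ in range(4)]
```

In this sketch `mr_choices` alternates between the two local NIDs, while `non_mr_choices` repeats the first one. The model has no health check, which is exactly the gap this report describes: if the pinned NI fails, the toy sender (like the reported behavior) keeps selecting it.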

Generated at Sat Feb 10 02:51:17 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.