[LU-10153] route via two different networks not supported Created: 24/Oct/17  Updated: 24/Jul/19  Resolved: 24/Jul/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.1
Fix Version/s: Lustre 2.13.0

Type: Bug Priority: Critical
Reporter: Stephen Champion Assignee: Amir Shehata (Inactive)
Resolution: Fixed Votes: 0
Labels: lnet-router

Issue Links:
Duplicate
Related
Epic/Theme: lnet
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

We were trying to set up routers between EDR storage fabrics and an FDR client fabric.

The idea was to have two independent, redundant EDR storage fabrics - o2ib8 and o2ib9 - with routers and servers having interfaces on both fabrics. The routers would have two ports on the single FDR fabric - o2ib3 - with clients having a single port on the FDR fabric.

To do this in a way that uses the route checker to balance between client and router ports, we were going to set up:
Clients with two routes from the FDR network (o2ib3) to each EDR network (o2ib8 & o2ib9), one via each of the router's interfaces on the FDR network.
Servers with two routes, one from each EDR fabric to the FDR network via the two EDR ports on the router.

But we ran into a problem configuring the servers:
[60258.285963] LNetError: 70723:0:(router.c:489:lnet_check_routes()) Routes to o2ib3 via 10.112.5.108@o2ib9 and 10.112.1.88@o2ib8 not supported

Everyone I have asked doubts that this restriction was ever necessary, and it almost certainly does not make sense now.
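
For illustration, the restriction implied by the error message can be modeled in a few lines of Python. This is a sketch of the rule as the log describes it (all gateways used to reach one remote net must be on the same local net), not the actual code in router.c. The helper names nid_net and check_routes are made up for this example, and the two o2ib3 gateway addresses in client_routes are hypothetical, since the router's FDR NIDs are not given in this ticket.

```python
def nid_net(nid: str) -> str:
    """Return the network part of a NID, e.g. '10.112.1.88@o2ib8' -> 'o2ib8'."""
    return nid.rsplit("@", 1)[1]


def check_routes(routes):
    """Model of the restriction described by the error message.

    routes: iterable of (remote_net, gateway_nid) pairs.
    Every gateway used to reach a given remote net must be on the same
    local net as the first gateway seen for that remote net.
    Returns a list of error strings, empty if the table passes.
    """
    first_gw = {}  # remote net -> first gateway NID seen for it
    errors = []
    for remote_net, gw in routes:
        prev = first_gw.setdefault(remote_net, gw)
        if nid_net(prev) != nid_net(gw):
            errors.append(
                f"Routes to {remote_net} via {prev} and {gw} not supported"
            )
    return errors


# Server side: the two gateways named in the log above.
server_routes = [
    ("o2ib3", "10.112.5.108@o2ib9"),
    ("o2ib3", "10.112.1.88@o2ib8"),
]

# Client side: hypothetical router NIDs on o2ib3 (not from the ticket).
client_routes = [
    ("o2ib8", "192.0.2.1@o2ib3"),
    ("o2ib8", "192.0.2.2@o2ib3"),
    ("o2ib9", "192.0.2.1@o2ib3"),
    ("o2ib9", "192.0.2.2@o2ib3"),
]

for msg in check_routes(server_routes):
    print(msg)
```

Under this model the client table passes, because both gateways for each EDR net are on o2ib3, while the server table is rejected, because the two gateways to o2ib3 are on o2ib8 and o2ib9.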



 Comments   
Comment by Peter Jones [ 24/Oct/17 ]

Amir is looking into this

Comment by Doug Oucharek (Inactive) [ 05/Jul/18 ]

Some customers make use of "virtual" lnets for various purposes.  For example, Cray likes to have two gni-based LNets to separate DVS traffic from Lustre traffic.  They use the same GNI network interface; they just have different NIDs (i.e. gni6 and gni99).

Now that we have Dynamic Discovery turned on by default in 2.11, the clients are advertising both of these NIDs to the servers.  If there is an LNet router in the middle (which there is with Cray: gni <-> o2ib), the servers need to have two sets of routing entries: one for gni6 and one for gni99.  With this bug, we cannot set that up.

The only workarounds are to 1- not have the two lnets (not a good solution), or 2- turn off DD/MR.  Number 2 is fine for now until there is a need to have multiple real interfaces in the servers.  Then this becomes an urgent issue.

Comment by Amir Shehata (Inactive) [ 24/Sep/18 ]

This is being fixed as part of a set of routing changes that aim at aligning the routing infrastructure with Multi-Rail.

Comment by Gerrit Updater [ 23/Oct/18 ]

Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33447
Subject: LU-10153 lnet: remove route add restriction
Project: fs/lustre-release
Branch: multi-rail
Current Patch Set: 1
Commit: 1de567d6eb9bed7ebab85cd7fccf1248a8b68e50

Comment by Gerrit Updater [ 07/Jun/19 ]

Amir Shehata (ashehata@whamcloud.com) merged in patch https://review.whamcloud.com/33447/
Subject: LU-10153 lnet: remove route add restriction
Project: fs/lustre-release
Branch: multi-rail
Current Patch Set:
Commit: 79ea6af86f57741bdd0b6bb49b380d8be454bf91

Comment by Chris Horn [ 24/Jul/19 ]

pjones I believe this ticket can be closed with the merge of Amir's MR routing feature.

Comment by Peter Jones [ 24/Jul/19 ]

Thanks for the tipoff Chris - it looks like you're correct.
