[LU-3679] /proc/sys/lnet/routes should accurately reflect routing with ARF when LNet router has one or more down NIs Created: 31/Jul/13  Updated: 04/Feb/14  Resolved: 04/Feb/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.0
Fix Version/s: Lustre 2.6.0, Lustre 2.5.1

Type: Improvement Priority: Minor
Reporter: Chris Horn Assignee: Dmitry Eremin (Inactive)
Resolution: Fixed Votes: 0
Labels: patch

Rank (Obsolete): 9498

 Description   

On a system where an LNet router has more than one NI, ARF is configured on clients and servers, and one or more of the LNet router's NIs goes "down", /proc/sys/lnet/routes on clients/servers should show routes for that router as "down" rather than "up".

The story: A site was doing some tests of FGR where LNet routers had two IB interfaces. After seeing wide variations in packet counts between ib0 and ib1, they noticed that some NIs were down on the routers

> lnet6: nid                     status alive refs peer rtr  max   tx  min
> lnet6: 0@lo                    up         0    2    0   0    0    0    0
> lnet6: 454@gni                 up         0  679   16   0 2048 2048 1664
> lnet6: 10.100.100.160@o2ib1000 up        18    3   63 128 2048 2048 2047
> lnet6: 10.100.100.160@o2ib1002 up        12    4   63 128 2048 2048 2047
> lnet6: 10.100.100.160@o2ib1004 up         0    4   63 128 2048 2048 1859
> lnet6: 10.100.100.161@o2ib1006 down   66420    1   63 128 2048 2048 2048
> lnet6: 10.100.100.161@o2ib1007 down   66420    1   63 128 2048 2048 2048

but were up for IPoIB. This caused some confusion, which was compounded by the fact that clients showed these routes as still functional:

cat /proc/sys/lnet/routes | grep 454
o2ib1000 2 up 454@gni
o2ib1002 2 up 454@gni
o2ib1004 1 up 454@gni
o2ib1006 1 up 454@gni
o2ib1007 2 up 454@gni

This led people to believe that clients were still trying to use routes that were actually down, resulting in performance problems. Since ARF was configured, we know this wasn't actually the case. Clients will not use a router if that router has one or more down NIs. This should be reflected in the output of /proc/sys/lnet/routes.
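
For anyone who wants to script the check above instead of using grep, here is a minimal sketch. It assumes the four-column row layout seen in the listing (network, hop count, state, gateway NID); other Lustre versions may print different columns. Route and parse_routes are hypothetical names used for illustration only and are not part of Lustre.

```python
from typing import List, NamedTuple


class Route(NamedTuple):
    net: str      # remote network, e.g. "o2ib1000"
    hops: int     # hop count as printed
    state: str    # "up" or "down" as printed
    gateway: str  # NID of the router, e.g. "454@gni"


def parse_routes(text: str) -> List[Route]:
    """Parse route rows; the four-column layout is an assumption."""
    routes = []
    for line in text.splitlines():
        fields = line.split()
        # Skip header lines and anything that is not a four-column route row.
        if len(fields) != 4 or not fields[1].isdigit():
            continue
        net, hops, state, gateway = fields
        routes.append(Route(net, int(hops), state, gateway))
    return routes


# Rows copied from the client output quoted in this ticket.
SAMPLE = """\
o2ib1000 2 up 454@gni
o2ib1002 2 up 454@gni
o2ib1004 1 up 454@gni
o2ib1006 1 up 454@gni
o2ib1007 2 up 454@gni
"""

routes = parse_routes(SAMPLE)
via_454 = [r for r in routes if r.gateway == "454@gni"]
reported_up = [r.net for r in via_454 if r.state == "up"]
```

On a live node the text would come from reading /proc/sys/lnet/routes rather than from SAMPLE. With the rows quoted here, every route via 454@gni is reported as "up", including those to o2ib1006 and o2ib1007.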



 Comments   
Comment by Isaac Huang (Inactive) [ 06/Aug/13 ]

Yes, a route should be considered "down" if the router is down or the router NI for the target network is down.

Comment by Chris Horn [ 07/Aug/13 ]

Correct me if I'm wrong, but if the NI for the target network is up and an NI for a different target network is down, the router still won't be used due to ARF, right?

Comment by Isaac Huang (Inactive) [ 07/Aug/13 ]

In that case the route will still be used. For example, if router 454@gni has its @o2ib1000 NI down but its @o2ib1002 NI up, there is no reason why 454@gni can't be used as a route to @o2ib1002. Note that route != router (a router can serve as the next hop in multiple routes). In the example, the route to @o2ib1000 via 454@gni is down, but the route to @o2ib1002 via 454@gni is up.
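
The rule described in this comment can be sketched as a small function. This is an illustration only, not the actual lnet_parse_rc_info() code: route_state and ni_status are hypothetical names, and treating a network on which the router reports no NI as "down" is an assumption of the sketch.

```python
from typing import Dict


def route_state(router_alive: bool, ni_status: Dict[str, str], target_net: str) -> str:
    """Return "up" or "down" for the route to target_net via one router.

    ni_status maps a network name to the status of the router's NI on it.
    """
    if not router_alive:
        return "down"
    # Only the NI on the target network matters; down NIs on other
    # networks do not affect this route. A missing NI counts as down
    # (assumption of this sketch).
    return "up" if ni_status.get(target_net) == "up" else "down"


# The hypothetical case from the comment: router 454@gni has its
# o2ib1000 NI down and its o2ib1002 NI up.
ni_status = {"o2ib1000": "down", "o2ib1002": "up"}
states = {net: route_state(True, ni_status, net) for net in ni_status}
```

Here states marks the route to o2ib1000 as "down" and the route to o2ib1002 as "up", even though both go through the same router.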

Comment by Chris Horn [ 08/Aug/13 ]

Ah right. I had missed the bit of code in lnet_parse_rc_info() that ignored other down NIs on a router if the NI for the destination network was up.

Comment by Chris Horn [ 01/Oct/13 ]

FYI, I have a patch for this awaiting testing and a push into Gerrit for review. Just don't want anyone to duplicate effort here.

Comment by Chris Horn [ 04/Oct/13 ]

For your review: http://review.whamcloud.com/#/c/7857/

Comment by Peter Jones [ 27/Oct/13 ]

Landed for 2.6

Comment by Chris Horn [ 06/Nov/13 ]

Can we get this on b2_5?

Comment by Dmitry Eremin (Inactive) [ 06/Nov/13 ]

The patch for b2_5 is http://review.whamcloud.com/8195

Comment by Dmitry Eremin (Inactive) [ 04/Feb/14 ]

Landed to b2_5
