Details
-
Improvement
-
Resolution: Fixed
-
Minor
-
Lustre 2.5.0
-
9498
Description
On a system where an LNet router has more than one NI, ARF is configured on clients and servers, and one or more of the LNet router's NIs goes "down", /proc/sys/lnet/routes on clients/servers should show routes for that router as "down" rather than "up".
The story: a site was running FGR tests in which the LNet routers had two IB interfaces. After seeing wide variations in packet counts between ib0 and ib1, they noticed that some NIs were down on the routers
> lnet6: nid status alive refs peer rtr max tx min
> lnet6: 0@lo up 0 2 0 0 0 0 0
> lnet6: 454@gni up 0 679 16 0 2048 2048 1664
> lnet6: 10.100.100.160@o2ib1000 up 18 3 63 128 2048 2048 2047
> lnet6: 10.100.100.160@o2ib1002 up 12 4 63 128 2048 2048 2047
> lnet6: 10.100.100.160@o2ib1004 up 0 4 63 128 2048 2048 1859
> lnet6: 10.100.100.161@o2ib1006 down 66420 1 63 128 2048 2048 2048
> lnet6: 10.100.100.161@o2ib1007 down 66420 1 63 128 2048 2048 2048
but were up for IPoIB. This caused some confusion, which was compounded by the fact that clients showed these routes as still functional:
cat /proc/sys/lnet/routes | grep 454
o2ib1000 2 up 454@gni
o2ib1002 2 up 454@gni
o2ib1004 1 up 454@gni
o2ib1006 1 up 454@gni
o2ib1007 2 up 454@gni
This led people to believe that clients were still trying to use routes that were actually down, resulting in performance problems. Since ARF was configured, we know this wasn't actually the case: clients will not use a router if that router has one or more down NIs. This should be reflected in the output of /proc/sys/lnet/routes.
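
To make the requested behaviour concrete, here is a minimal user-space sketch in Python. It is not LNet code and does not reflect how the fix was implemented. It only encodes the rule stated above: a route through a router that has one or more down NIs should be shown as "down". The function names (parse_routes, parse_ni_status, expected_states) are hypothetical, and the four-column route format (net, hops, state, router) is assumed from the output pasted in this ticket.

```python
from typing import Dict, List, NamedTuple


class Route(NamedTuple):
    net: str      # remote network, e.g. "o2ib1000"
    hops: int     # hop count column
    state: str    # "up" or "down" as printed in the routes file
    gateway: str  # router NID, e.g. "454@gni"


def parse_routes(text: str) -> List[Route]:
    """Parse lines shaped like '<net> <hops> <up|down> <router>'.

    Assumption: the column layout matches the output quoted in this ticket.
    Lines that do not fit (headers, blanks) are skipped.
    """
    routes = []
    for line in text.splitlines():
        f = line.split()
        if len(f) == 4 and f[1].isdigit() and f[2] in ("up", "down"):
            routes.append(Route(f[0], int(f[1]), f[2], f[3]))
    return routes


def parse_ni_status(text: str) -> Dict[str, str]:
    """Map each NI's NID to 'up' or 'down' from a router's NI listing.

    Any prefix before the NID (such as '> lnet6:') is ignored: the first
    token containing '@' that is followed by up/down is taken as the NID.
    """
    status = {}
    for line in text.splitlines():
        f = line.split()
        for i, tok in enumerate(f[:-1]):
            if "@" in tok and f[i + 1] in ("up", "down"):
                status[tok] = f[i + 1]
                break
    return status


def expected_states(routes: List[Route],
                    router_nis: Dict[str, Dict[str, str]]) -> List[Route]:
    """Return routes with the state this ticket asks to be displayed.

    router_nis maps a router NID to that router's {NI NID: status} table.
    A router with any down NI has all of its routes reported as 'down'.
    Routers with no NI information keep their reported state.
    """
    result = []
    for r in routes:
        nis = router_nis.get(r.gateway, {})
        any_down = any(s == "down" for s in nis.values())
        result.append(r._replace(state="down" if any_down else r.state))
    return result
```

Fed the route list and the lnet6 NI listing quoted above (with lnet6's NI table keyed by 454@gni), expected_states marks all five routes via 454@gni as "down", because the o2ib1006 and o2ib1007 NIs are down on that router.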