Affects Version/s: Lustre 2.5.3
Fix Version/s: Lustre 2.7.0
When using asymmetric router failure detection (ARF), the system appears unable to detect that an expected interface is missing. A defined but non-functional interface is detected, but clients do not seem to detect the case where they have a route to a network via a router and that router has no means of getting the traffic there.
Take, for example, four nodes: login1, rtr5, rtr6, and mgs. This was demonstrated on live hardware, although the addresses and names in the following example have been changed.
Host: interfaces (routes)
login1: 30@gni1 (o2ib1 via 27@gni1, o2ib1 via 31@gni1)
rtr5: 27@gni1 10.1.1.5@o2ib1 ()
rtr6: 31@gni1 10.1.1.6@o2ib1 ()
mgs: 10.1.1.1@o2ib1 (gni1 via 10.1.1.5@o2ib1 and gni1 via 10.1.1.6@o2ib1)
In other words, we have two routers, each with two interfaces, sitting between o2ib1 and gni1.
Enable ARF in the configuration and ensure it is running.
Configure interface ib0 on rtr5 to not start on boot.
Reboot rtr5 (ifconfig ib0 shows no ib0 down / no IP)
Start LNet (lctl net up).
Show the missing interface on rtr5 via lctl list_nids:
rtr5:~ # lctl list_nids
On login1, ping mgs:
lctl ping 10.1.1.1@o2ib1 (result is 50% success, 50% I/O error)
login1:~ # lctl show_route
net o2ib1 hops 1 gw 27@gni1 up pri 0
net o2ib1 hops 1 gw 31@gni1 up pri 0
Look for down_ni:
login1:~ # cat /proc/sys/lnet/routers
ref rtr_ref alive_cnt state last_ping ping_sent deadline down_ni router
4 1 1 up 28 1 NA 0 27@gni1
4 1 1 up 28 1 NA 0 31@gni1
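For anyone trying to reproduce this, the table above can be checked mechanically. The following is a minimal Python sketch, not part of Lustre: the function names parse_routers and routers_reporting_down_ni are hypothetical, and the sample text is copied from the output above rather than read from the live /proc file. It assumes the file is whitespace-separated with a single header row, as shown here.

```python
# Sample copied from the /proc/sys/lnet/routers output in this report.
ROUTERS_TEXT = """\
ref rtr_ref alive_cnt state last_ping ping_sent deadline down_ni router
4 1 1 up 28 1 NA 0 27@gni1
4 1 1 up 28 1 NA 0 31@gni1
"""


def parse_routers(text):
    """Return one dict per router row, keyed by the header column names."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    rows = []
    for ln in lines[1:]:
        fields = ln.split()
        if len(fields) != len(header):
            continue  # skip rows that do not match the header layout
        rows.append(dict(zip(header, fields)))
    return rows


def routers_reporting_down_ni(rows):
    """Return the NIDs of routers whose down_ni counter is non-zero."""
    return [r["router"] for r in rows if int(r["down_ni"]) != 0]


if __name__ == "__main__":
    rows = parse_routers(ROUTERS_TEXT)
    for r in rows:
        print(r["router"], r["state"], "down_ni=" + r["down_ni"])
    print("flagged:", routers_reporting_down_ni(rows))
```

With the sample above, no router is flagged: both gateways show state up and down_ni 0, even though rtr5 has no o2ib1 interface.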
In other words, there is no way to get to o2ib1 via rtr5, but ARF does not detect this. Presumably, at least in a non-multihop configuration, clients should be concerned not with whether the router has defined routes that aren't working, but with whether the client has a defined route that the router can't handle because an interface is down or missing.
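The check suggested here can be sketched in a few lines of Python. This is only an illustration of the idea for the single-hop case, not how LNet implements router checking. The names net_of and unusable_routes are hypothetical, and the NID list for rtr5 is an assumption: the report does not include the lctl list_nids output, so the sketch assumes rtr5 has only its gni1 NID.

```python
def net_of(nid):
    """Return the network part of a NID, e.g. '10.1.1.5@o2ib1' -> 'o2ib1'."""
    return nid.rsplit("@", 1)[1]


# Client routes as shown by `lctl show_route` on login1: (remote net, gateway).
CLIENT_ROUTES = [("o2ib1", "27@gni1"), ("o2ib1", "31@gni1")]

# NIDs each gateway actually has configured.
# ASSUMPTION: rtr5 (27@gni1) has only its gni1 NID because ib0 never started;
# the report does not show the list_nids output.
GATEWAY_NIDS = {
    "27@gni1": ["27@gni1"],
    "31@gni1": ["31@gni1", "10.1.1.6@o2ib1"],
}


def unusable_routes(routes, gateway_nids):
    """Return routes whose gateway has no NID on the route's remote net.

    Only meaningful for single-hop routes, where the gateway must sit
    directly on the destination network.
    """
    bad = []
    for net, gw in routes:
        nets_on_gw = {net_of(n) for n in gateway_nids.get(gw, [])}
        if net not in nets_on_gw:
            bad.append((net, gw))
    return bad


if __name__ == "__main__":
    print(unusable_routes(CLIENT_ROUTES, GATEWAY_NIDS))
```

Under these assumptions the route to o2ib1 via 27@gni1 is reported as unusable, while the route via 31@gni1 is not. A multihop configuration would need a different test, since an intermediate router legitimately has no interface on the final network.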