Details
- Bug
- Resolution: Fixed
- Major
- Lustre 2.5.3
- None
- 3
- 16875
Description
When using Asymmetric Router Failure (ARF) detection, the system appears unable to detect that an expected interface is missing entirely. A defined but non-functional interface is detected, but clients do not seem to detect the case where they have a route to a network via a router and that router has no means of getting the traffic there.
Take for example a few nodes, login1, rtr5, rtr6, and mgs. This was demonstrated on live hardware, although the following example is abstracted/has changed addresses and names.
Host: interfaces (routes)
login1: 30@gni1 (o2ib1 via 27@gni1, o2ib1 via 31@gni1)
rtr5: 27@gni1 10.1.1.5@o2ib1 ()
rtr6: 31@gni1 10.1.1.6@o2ib1 ()
mgs: 10.1.1.1@o2ib1 (gni1 via 10.1.1.5@o2ib1 and gni1 via 10.1.1.6@o2ib1)
In other words, we have two routers with two interfaces each sitting between o2ib1 and gni1.
Reproduction steps:
Enable ARF via configs, ensure running
Configure interface ib0 on rtr5 to not start on boot.
Reboot rtr5 (ifconfig ib0 shows ib0 down / no IP)
start lnet (lctl net up)
show missing interface on rtr5 via lctl list_nids
rtr5:~ # lctl list_nids
27@gni1
rtr5:~ #
on login1 ping mgs
lctl ping 10.1.1.1@o2ib1 (result is 50% success, 50% I/O error)
show routes
login1:~ # lctl show_route
net o2ib1 hops 1 gw 27@gni1 up pri 0
net o2ib1 hops 1 gw 31@gni1 up pri 0
look for down_ni
login1:~ # cat /proc/sys/lnet/routers
ref rtr_ref alive_cnt state last_ping ping_sent deadline down_ni router
4 1 1 up 28 1 NA 0 27@gni1
4 1 1 up 28 1 NA 0 31@gni1
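As a rough illustration of how the table above reads, the following Python sketch parses the pasted output and lists any gateway that is either not "up" or reports a non-zero down_ni. The helper name parse_routers and the ROUTERS_TEXT constant are hypothetical and exist only for this example; on a live node the text would come from /proc/sys/lnet/routers.

```python
# Output copied from the report above; on a real client this would be read
# from /proc/sys/lnet/routers instead of a hard-coded string.
ROUTERS_TEXT = """\
ref rtr_ref alive_cnt state last_ping ping_sent deadline down_ni router
4 1 1 up 28 1 NA 0 27@gni1
4 1 1 up 28 1 NA 0 31@gni1
"""


def parse_routers(text):
    """Turn the whitespace-separated table into a list of dicts keyed by column name."""
    lines = [line.split() for line in text.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    return [dict(zip(header, row)) for row in rows]


routers = parse_routers(ROUTERS_TEXT)

# A gateway is flagged if it is not "up" or reports at least one down interface.
flagged = [
    r["router"]
    for r in routers
    if r["state"] != "up" or int(r["down_ni"]) > 0
]

print(flagged)
```

With the output shown in this report, neither gateway is flagged, even though rtr5 (27@gni1) has no o2ib1 interface at all.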
In other words, there is no way to get to o2ib1 via rtr5, but ARF does not detect this. Presumably, at least in a non-multihop configuration, clients should be concerned not with whether the router has defined routes that aren't working, but with whether the client has a defined route that a router can't handle due to a down interface or a missing interface.
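The check suggested above can be sketched in Python as follows. This is not how LNet implements ARF; it is a minimal model of the single-hop case only, assuming the client somehow knows each gateway's current NID list (here hard-coded from the lctl list_nids output in this report). The names net_of, unusable_routes, client_routes, and router_nids are hypothetical.

```python
def net_of(nid):
    """Return the network part of a NID, e.g. '27@gni1' -> 'gni1'."""
    return nid.split("@", 1)[1]


# Routes configured on login1: (destination network, gateway NID).
client_routes = [("o2ib1", "27@gni1"), ("o2ib1", "31@gni1")]

# NIDs each gateway actually has, keyed by the gateway NID the client uses.
# rtr5 is missing its o2ib1 NID, as shown by lctl list_nids in the report.
router_nids = {
    "27@gni1": ["27@gni1"],                    # rtr5
    "31@gni1": ["31@gni1", "10.1.1.6@o2ib1"],  # rtr6
}


def unusable_routes(routes, nids_by_gateway):
    """Return routes whose gateway has no interface on the destination network.

    Only valid for single-hop routing: with multiple hops a gateway may
    legitimately forward to a network it is not directly attached to.
    """
    bad = []
    for net, gateway in routes:
        attached = {net_of(nid) for nid in nids_by_gateway.get(gateway, [])}
        if net not in attached:
            bad.append((net, gateway))
    return bad


print(unusable_routes(client_routes, router_nids))
```

For the configuration in this report, the only route reported as unusable is o2ib1 via 27@gni1, which is the route a client would ideally mark down.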