[LU-6060] ARF doesn't detect lack of interface on a router Created: 19/Dec/14  Updated: 14/Jul/15  Resolved: 20/Jan/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.3
Fix Version/s: Lustre 2.7.0

Type: Bug Priority: Major
Reporter: John Lewis (Inactive) Assignee: WC Triage
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-6851 LU-6060 patch breaks multi-hop routin... Resolved
is related to LU-5758 enabling avoid_asym_router_failure pr... Resolved
is related to LU-5485 first mount always fail with avoid_as... Resolved
Severity: 3
Rank (Obsolete): 16875

 Description   

When using asymmetric router failure detection (ARF), the system appears unable to detect that an expected interface is missing. A defined but non-functional interface is detected, but clients do not seem to notice when they have a route to a network via a router that has no interface on that network and therefore no means of getting the traffic there.

Take for example a few nodes: login1, rtr5, rtr6, and mgs. This was demonstrated on live hardware, although the addresses and names in the following example have been changed.

Host: interfaces (routes)
login1: 30@gni1 (o2ib1 via 27@gni1, o2ib1 via 31@gni1)
rtr5: 27@gni1 10.1.1.5@o2ib1 ()
rtr6: 31@gni1 10.1.1.6@o2ib1 ()
mgs: 10.1.1.1@o2ib1 (gni1 via 10.1.1.5@o2ib1 and gni1 via 10.1.1.6@o2ib1)

In other words, we have two routers with two interfaces each sitting between o2ib1 and gni1.

Reproduction steps:
Enable ARF via configs, ensure running
Configure interface ib0 on rtr5 to not start on boot.
Reboot rtr5 (ifconfig ib0 shows no ib0 down / no IP)
start lnet (lctl net up)

show missing interface on rtr5 via lctl list_nids
rtr5:~ # lctl list_nids
27@gni1
rtr5:~ #

on login1 ping mgs
lctl ping 10.1.1.1@o2ib1 (result is 50% success, 50% I/O error)

show routes
login1:~ # lctl show_route
net o2ib1 hops 1 gw 27@gni1 up pri 0
net o2ib1 hops 1 gw 31@gni1 up pri 0

look for down_ni
login1:~ # cat /proc/sys/lnet/routers
ref rtr_ref alive_cnt state last_ping ping_sent deadline down_ni router
4 1 1 up 28 1 NA 0 27@gni1
4 1 1 up 28 1 NA 0 31@gni1
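
The table above can also be checked programmatically. The following is a minimal sketch, assuming the whitespace-separated layout shown in this report; the helper names parse_routers and suspicious are hypothetical and not part of any Lustre tool. On the sample from this report it flags nothing, which is exactly the symptom: both gateways look healthy to the client.

```python
# Sample copied from the report above; column names come from its header line.
ROUTERS_TABLE = """\
ref rtr_ref alive_cnt state last_ping ping_sent deadline down_ni router
4 1 1 up 28 1 NA 0 27@gni1
4 1 1 up 28 1 NA 0 31@gni1
"""


def parse_routers(text):
    """Turn the whitespace-separated routers table into a list of dicts."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    return [dict(zip(header, ln.split())) for ln in lines[1:]]


def suspicious(rows):
    """Return routers that are not 'up' or that report a nonzero down_ni."""
    return [r["router"] for r in rows
            if r["state"] != "up" or int(r["down_ni"]) > 0]


rows = parse_routers(ROUTERS_TABLE)
print(suspicious(rows))
```

On a live node the same text would presumably be read from /proc/sys/lnet/routers instead of the embedded string.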

In other words, there is no way to get to o2ib1 via rtr5, but ARF does not detect this. Presumably, at least in a non-multihop configuration, clients should be concerned not with whether the router has defined routes that aren't working, but with whether the client has a defined route that the router can't handle due to a down or missing interface.
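
The 50% failure rate seen in the reproduction is consistent with the client alternating between two gateways it considers equally good. The following is a small simulation of that reading, assuming strict round-robin between equal-priority routes (an assumption, not something confirmed in this ticket) and using the abstracted NIDs from the example. The names ROUTER_NIDS, can_forward, and simulate_pings are illustrative only.

```python
from itertools import cycle

# NIDs each router actually has, per the example above.
ROUTER_NIDS = {
    "27@gni1": ["27@gni1"],                    # rtr5: o2ib1 NID missing after reboot
    "31@gni1": ["31@gni1", "10.1.1.6@o2ib1"],  # rtr6: both interfaces present
}
# login1's gateways for o2ib1, as printed by lctl show_route.
GATEWAYS = ["27@gni1", "31@gni1"]


def net_of(nid):
    """Return the network part of a NID, e.g. 'o2ib1' for '10.1.1.6@o2ib1'."""
    return nid.split("@", 1)[1]


def can_forward(gateway, target_net):
    """A gateway can forward only if it has an NID on the target network."""
    return any(net_of(n) == target_net for n in ROUTER_NIDS[gateway])


def simulate_pings(count, target_net="o2ib1"):
    """Assumption: the client strictly alternates between its gateways."""
    rr = cycle(GATEWAYS)
    return [can_forward(next(rr), target_net) for _ in range(count)]


results = simulate_pings(10)
success_rate = sum(results) / len(results)
unusable = [g for g in GATEWAYS if not can_forward(g, "o2ib1")]
print(success_rate, unusable)
```

Under these assumptions half of the pings fail, and the only unusable gateway is 27@gni1, the one the client still reports as up.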



 Comments   
Comment by Liang Zhen (Inactive) [ 20/Dec/14 ]

Hi John, do you have the patch for LU-5485 in your environment?

Comment by James A Simmons [ 20/Dec/14 ]

Yes the patch for LU-5485 is included. Without the patch we can't mount a file system with ARF enabled.

Comment by Gerrit Updater [ 21/Dec/14 ]

Liang Zhen (liang.zhen@intel.com) uploaded a new patch: http://review.whamcloud.com/13162
Subject: LU-6060 lnet: set downis to 1 if there's no NI for remote net
Project: fs/lustre-release
Branch: b2_5
Current Patch Set: 1
Commit: 5ed16f284871b6b898591815cb7d5468ae2c3fca

Comment by Liang Zhen (Inactive) [ 21/Dec/14 ]

James, I think the issue here is that we do not record downis if there is no NI for the target network. The above patch should fix this problem. Also, I'm wondering if this is the same problem as LU-5758; could you please comment on LU-5758?
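
The idea described in this comment can be sketched as a simplified model. This is not the patch itself and does not mirror the LNet C code; the function down_ni_count and its treat_missing_as_down flag are hypothetical, used only to contrast the behaviour before and after the change as described here.

```python
def net_of(nid):
    """Return the network part of a NID, e.g. 'o2ib1' for '10.1.1.5@o2ib1'."""
    return nid.split("@", 1)[1]


def down_ni_count(reported_nis, remote_net, treat_missing_as_down):
    """Model of the down-NI count a client records for one route.

    reported_nis: list of (nid, status) pairs the router reports,
                  where status is 'up' or 'down'.
    remote_net:   the network the client wants to reach via this router.
    """
    # A working NI on the remote net means the route is usable.
    if any(net_of(nid) == remote_net and st == "up" for nid, st in reported_nis):
        return 0
    down = sum(1 for _, st in reported_nis if st == "down")
    # The case from this ticket: nothing is reported down, yet the router
    # has no NI at all on the remote net. Record it as one down NI.
    if down == 0 and treat_missing_as_down:
        return 1
    return down


rtr5 = [("27@gni1", "up")]                             # o2ib1 NI missing
rtr6 = [("31@gni1", "up"), ("10.1.1.6@o2ib1", "up")]   # healthy
print(down_ni_count(rtr5, "o2ib1", False),  # behaviour reported in this ticket
      down_ni_count(rtr5, "o2ib1", True),   # behaviour described for the fix
      down_ni_count(rtr6, "o2ib1", True))
```

In this model a defined but down interface is counted either way, which matches the observation in the description that only the missing-interface case went undetected.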

Comment by James A Simmons [ 15/Jan/15 ]

Can you make a patch for master as well? Testing looks good for the patch you provided.

Comment by Gerrit Updater [ 15/Jan/15 ]

Jian Yu (jian.yu@intel.com) uploaded a new patch: http://review.whamcloud.com/13417
Subject: LU-6060 lnet: set downis to 1 if there's no NI for remote net
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 826353849a9a51a4d4c53accd449e9427386f57c

Comment by Gerrit Updater [ 19/Jan/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13417/
Subject: LU-6060 lnet: set downis to 1 if there's no NI for remote net
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 749dc54622b2c3267c6c97eb770702b437a7897d

Comment by Peter Jones [ 20/Jan/15 ]

Landed for 2.7
