[LU-13785] router ib interface was not configured on boot. gni clients mis-classified the router as multi-hop leading to evictions Created: 14/Jul/20  Updated: 24/Aug/22  Resolved: 15/Apr/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.15.0

Type: Bug Priority: Major
Reporter: Chris Horn Assignee: Chris Horn
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

An LNet router hit an odd problem. The router completed its reboot at 20:41:12:

[Mon Jul 13 20:41:12 2020] Sending ec_node_info with boot code 8 (NODE_INFO_OS_BOOT_SUCCEEDED) for nid 602

but its ib0 interface didn't come up until 21:11:59, about 30 minutes later:

[Mon Jul 13 20:39:17 2020] ib0: enabling connected mode will cause multicast packet drops
[Mon Jul 13 20:39:17 2020] ib0: mtu > 4092 will cause multicast packet drops.
[Mon Jul 13 20:39:17 2020] IPv6: ADDRCONF(NETDEV_UP): ib0: link is not ready
...
[Mon Jul 13 21:11:59 2020] IPv6: ADDRCONF(NETDEV_CHANGE): ib0: link becomes ready

Because of this change:

commit 28324781942780cc149555ccfd3dcf9a8d2ffdfb
Author: Amir Shehata <ashehata@whamcloud.com>
Date:   Thu Nov 28 15:44:27 2019 -0800

    LU-13029 lnet: fix asym routing with multi-hop

the gni clients classified the router as "multi-hop" and continued to use it. It should have been considered "down" (because of avoid_asym_router_failure). This led to a bunch of evictions.

We can keep the detection code, because it is useful to spot when things go awry, but when we actually determine route aliveness we should use the configured hop count.
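
The proposal above can be illustrated with a small model. This is only a sketch, not the LNet code: the Route class, its field names, and the three functions are hypothetical, and it assumes the route was configured as a single hop on the gni clients and that the router reported no healthy interface on the remote network while ib0 was down.

```python
from dataclasses import dataclass


@dataclass
class Route:
    """Simplified, hypothetical stand-in for an LNet route entry."""
    configured_hops: int       # hop count from the route configuration
    detected_single_hop: bool  # result of the single-hop/multi-hop detection
    gateway_alive: bool        # the gateway itself is reachable
    remote_net_healthy: bool   # gateway reports a healthy interface on the remote net


def alive_using_detection(route, avoid_asym=True):
    """Remote-net check is applied only if detection says single-hop."""
    if not route.gateway_alive:
        return False
    if avoid_asym and route.detected_single_hop:
        return route.remote_net_healthy
    return True


def alive_using_configured_hops(route, avoid_asym=True):
    """Remote-net check is applied whenever the configured hop count is 1."""
    if not route.gateway_alive:
        return False
    if avoid_asym and route.configured_hops == 1:
        return route.remote_net_healthy
    return True


def hop_mismatch(route):
    """Detection kept as a diagnostic: flag disagreement with the configuration."""
    return (route.configured_hops == 1) != route.detected_single_hop


# Assumed state of the route during the incident: configured single-hop,
# detected as multi-hop, gateway up, ib0 side not healthy.
incident = Route(configured_hops=1, detected_single_hop=False,
                 gateway_alive=True, remote_net_healthy=False)

print(alive_using_detection(incident))        # True: route stays in use
print(alive_using_configured_hops(incident))  # False: route is considered down
print(hop_mismatch(incident))                 # True: detection disagrees with config
```

In this model the detection result still surfaces the misconfiguration through hop_mismatch, but it no longer decides whether the route is used.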



 Comments   
Comment by Gerrit Updater [ 14/Jul/20 ]

Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/39362
Subject: LU-13785 lnet: Use lr_hops for avoid_asym_router_failure
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 171e35b7f8a733f6489b23deb82baf0237b68a17

Comment by Gerrit Updater [ 15/Apr/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39362/
Subject: LU-13785 lnet: Use lr_hops for avoid_asym_router_failure
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 2e07619477684f287a2399ccdbbde0a71289574b

Comment by Peter Jones [ 15/Apr/21 ]

Landed for 2.15
