[LU-13708] lnet_notify can set route aliveness incorrectly Created: 23/Jun/20  Updated: 22/Aug/22  Resolved: 11/Mar/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.15.0

Type: Bug Priority: Major
Reporter: Chris Horn Assignee: Chris Horn
Resolution: Fixed Votes: 0
Labels: None

Severity: 3

 Description   

lnet_notify() modifies route aliveness in two ways:
1. By setting the lp_alive field of the lnet_peer struct.
2. By setting the lr_alive field of the lnet_route struct (via a call to lnet_set_route_aliveness()).

In both cases, the aliveness value assigned is determined by a call
to lnet_is_peer_ni_alive(), but that value only reflects the aliveness
of a particular peer NI. A gateway may have multiple peer NIs, so the
aliveness of a gateway peer (lp_alive) is not necessarily equivalent
to the aliveness of one of its NIs. Furthermore, the lr_alive field
is only used to determine route aliveness for path selection if
discovery is disabled locally or on the gateway (see
lnet_find_route_locked() and lnet_is_route_alive()).
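
To illustrate the distinction drawn above, here is a minimal sketch in C. It is a toy model, not the LNet source: the toy_* structs and functions are hypothetical stand-ins, the "gateway is reachable if at least one NI is alive" rule is an assumption made for illustration, and using lp_alive when discovery is enabled is likewise an assumption (the description only states that lr_alive is consulted when discovery is disabled).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; these are not the real LNet structs. */
struct toy_peer_ni {
	bool alive;                 /* aliveness of one network interface */
};

struct toy_gateway {
	bool lp_alive;              /* gateway-level aliveness */
	bool discovery_disabled;    /* disabled locally or on the gateway */
	size_t nr_nis;
	struct toy_peer_ni nis[4];
};

struct toy_route {
	bool lr_alive;              /* per-route cached aliveness */
	struct toy_gateway *gw;
};

/*
 * Assumption for illustration: the gateway counts as reachable when at
 * least one of its NIs is alive. One dead NI does not imply a dead gateway.
 */
static bool toy_gateway_any_ni_alive(const struct toy_gateway *gw)
{
	assert(gw != NULL);
	for (size_t i = 0; i < gw->nr_nis; i++)
		if (gw->nis[i].alive)
			return true;
	return false;
}

/*
 * lr_alive is consulted only when discovery is disabled. What is used
 * otherwise (lp_alive here) is an assumption of this sketch.
 */
static bool toy_route_usable(const struct toy_route *rt)
{
	assert(rt != NULL && rt->gw != NULL);
	if (rt->gw->discovery_disabled)
		return rt->lr_alive;
	return rt->gw->lp_alive;
}
```

In this model, a gateway with one dead NI and one live NI is still reachable, so copying the dead NI's state into lp_alive would wrongly mark the whole gateway down.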

In general, we should not set lp_alive based on an lnet_notify()
call, and we should only set lr_alive if discovery is disabled. For
lr_alive specifically, we should only set it for those routes that
have the peer NI as a next-hop.

An exception to the above exists when the reset argument to
lnet_notify() is set. The gnilnd uses this flag in its calls to
lnet_notify() because gnilnd receives out-of-band notifications of
node up and down events. Thus, when gnilnd calls lnet_notify() we
actually know whether the gateway peer is up or down and we can set
lp_alive and lr_alive appropriately.
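
The policy described above can be sketched as follows. This is a simplified model under stated assumptions, not the patch that landed: sketch_gw, sketch_route, sketch_notify() and the integer NI ids are hypothetical, the "discovery disabled locally or on the gateway" condition is folded into a single flag, and updating every route through the gateway in the reset case is an assumption about what "appropriately" means.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; these are not the real LNet structs. */
struct sketch_gw {
	bool lp_alive;              /* gateway-level aliveness */
	bool discovery_disabled;    /* disabled locally or on the gateway */
};

struct sketch_route {
	bool lr_alive;              /* per-route cached aliveness */
	int next_hop_ni;            /* id of the gateway NI used as next hop */
};

/*
 * Apply a notification that the NI `ni_id` is alive or not.
 * `routes` holds the routes that go through gateway `gw`.
 */
static void sketch_notify(struct sketch_gw *gw, struct sketch_route *routes,
			  size_t nr_routes, int ni_id, bool ni_alive,
			  bool reset)
{
	assert(gw != NULL);
	assert(routes != NULL || nr_routes == 0);

	if (reset) {
		/*
		 * Out-of-band node up/down event (the gnilnd case): the
		 * state of the whole gateway is known. Assumption: all
		 * routes through this gateway follow it.
		 */
		gw->lp_alive = ni_alive;
		for (size_t i = 0; i < nr_routes; i++)
			routes[i].lr_alive = ni_alive;
		return;
	}

	/* Never derive lp_alive from the state of a single NI. */
	if (!gw->discovery_disabled)
		return;

	/* Discovery disabled: touch only routes using this NI as next hop. */
	for (size_t i = 0; i < nr_routes; i++)
		if (routes[i].next_hop_ni == ni_id)
			routes[i].lr_alive = ni_alive;
}
```

With discovery enabled and reset unset, a notification changes nothing in this model; with discovery disabled, only the routes whose next hop is the reported NI are updated and lp_alive is left alone.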



 Comments   
Comment by Gerrit Updater [ 23/Jun/20 ]

Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/39160
Subject: LU-13708 lnet: lnet_notify sets route aliveness incorrectly
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: d89e4cf9d6178177dbc4cf786a4f78ed6d5923ad

Comment by Gerrit Updater [ 10/Mar/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39160/
Subject: LU-13708 lnet: lnet_notify sets route aliveness incorrectly
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: e24471a722a6f23fb0051b4511f3fee2662d0e4e

Comment by Peter Jones [ 11/Mar/21 ]

Landed for 2.15
