[LU-14206] Router ping timeouts don't mark routes down if DD is disabled Created: 09/Dec/20  Updated: 30/Aug/22  Resolved: 28/Apr/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.15.0

Type: Bug Priority: Minor
Reporter: Chris Horn Assignee: Chris Horn
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Discovery pings are used to determine the health of gateways and
associated routes. Ping replies from gateways with dynamic discovery
(DD) disabled (or when DD is disabled locally) are handled in
a special routine, lnet_router_discovery_ping_reply(), but this
function and its related code do not handle the case where a discovery
ping hits the response tracker timeout and is unlinked by the
monitor thread. In this case, an UNLINK event is generated and we
do not call lnet_router_discovery_ping_reply(). For gateways
with DD enabled (and DD enabled locally), we handle this case
in lnet_router_discovery_complete(). If discovery failed, then
lp_dc_error is set and we mark all routes down for the gateway. We
can simply extend this logic to the case of gateways with DD disabled
(or DD disabled locally).
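
The logic described above can be sketched in C as a simplified model. This is not the actual Lustre code: the types and functions below (model_route, model_gateway, model_mark_routes_down, model_discovery_complete) are hypothetical stand-ins, and only the lp_dc_error field name comes from the description. The sketch assumes that a timed-out ping is recorded in lp_dc_error, and it shows a completion path that checks only that error, not whether DD is enabled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for LNet structures (not the real ones). */
struct model_route {
	bool alive;
};

struct model_gateway {
	int lp_dc_error;            /* field name taken from the ticket text */
	bool dd_enabled;            /* true if DD is enabled on both ends */
	struct model_route *routes; /* routes that use this gateway */
	size_t nroutes;
};

/* Mark every route through the gateway down; return how many changed state. */
size_t model_mark_routes_down(struct model_gateway *gw)
{
	size_t changed = 0;
	size_t i;

	for (i = 0; i < gw->nroutes; i++) {
		if (gw->routes[i].alive) {
			gw->routes[i].alive = false;
			changed++;
		}
	}
	return changed;
}

/*
 * Modeled completion handler. dd_enabled is deliberately not consulted:
 * a recorded failure marks the routes down for DD-enabled and DD-disabled
 * gateways alike. Returns the number of routes marked down.
 */
size_t model_discovery_complete(struct model_gateway *gw)
{
	assert(gw != NULL);

	if (gw->lp_dc_error == 0)
		return 0; /* ping or discovery succeeded; leave routes alone */

	return model_mark_routes_down(gw);
}
```

In this model, a gateway with dd_enabled set to false and a non-zero lp_dc_error has all of its routes marked down through the same path that a DD-enabled gateway would take.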



 Comments   
Comment by Gerrit Updater [ 09/Dec/20 ]

Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/40923
Subject: LU-14206 lnet: Router ping timeout with discovery disabled
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 136ff5da3df3bdb908266101a09456e58e4c665d

Comment by Chris Horn [ 02/Mar/21 ]

Test notes for the fix (LUS-9612 is the HPE-internal issue for LU-14206)

Build cray-2.12-int to reproduce:

hornc@sles15build01 lustre-filesystem $ ./LUSTRE-VERSION-GEN
2.12.4.2_cray_253_gd8f8bfe
hornc@sles15build01 lustre-filesystem $ make -j 32
...
sles15build01:~ # for i in sles15s01 sles15s02 sles15c01; do rsync -avr /home/hornc/lustre-filesystem $i:/home/hornc ; done

Test node config:

sles15s01:

sles15s01:~ # lctl list_nids
192.168.2.30@tcp1
sles15s01:~ # lctl show_route
net               tcp2 hops 4294967295 gw                192.168.2.32@tcp1 up pri 0
sles15s01:~ #

sles15c01:

sles15c01:~ # lctl list_nids
192.168.2.38@tcp2
sles15c01:~ # lctl show_route
net               tcp1 hops 4294967295 gw                192.168.2.33@tcp2 up pri 0
sles15c01:~ #

sles15s02 (router w/DD disabled):

sles15s02:~ # lctl list_nids
192.168.2.32@tcp1
192.168.2.33@tcp2
sles15s02:~ # lnetctl global show | grep disc
    discovery: 0
sles15s02:~ #

Stop LNet on the router.

sles15s02:~ # lctl net down
LNET ready to unload
sles15s02:~ # lustre_rmmod
sles15s02:~ #

Wait for the route to be marked down on the peer. Check the dk log to confirm that we do not trigger the "Router discovery failed" code path.

sles15c01:~ # lctl show_route
net               tcp1 hops 4294967295 gw                192.168.2.33@tcp2 down pri 0
sles15c01:~ # lctl dk > /tmp/dk.log
sles15c01:~ # grep 'Router discovery failed' /tmp/dk.log
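
If the route state check needs to be automated, a small parser for the "lctl show_route" lines shown above might look like the following. This is only a sketch: the field layout is inferred from the output in these notes rather than from any documented format, and route_entry, parse_show_route_line and route_is_down are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* One parsed line of "lctl show_route" output (layout inferred from the notes). */
struct route_entry {
	char net[32];       /* remote network, e.g. "tcp1" */
	unsigned long hops; /* hop count as printed */
	char gw[64];        /* gateway NID, e.g. "192.168.2.33@tcp2" */
	char state[8];      /* "up" or "down" */
	unsigned int pri;   /* route priority */
};

/* Parse a single route line. Returns 0 on success, -1 if it does not match. */
int parse_show_route_line(const char *line, struct route_entry *out)
{
	int n;

	assert(line != NULL && out != NULL);

	/* Whitespace in the format matches any run of blanks in the input. */
	n = sscanf(line, " net %31s hops %lu gw %63s %7s pri %u",
		   out->net, &out->hops, out->gw, out->state, &out->pri);
	return n == 5 ? 0 : -1;
}

/* True if the parsed route is reported as down. */
bool route_is_down(const struct route_entry *r)
{
	return strcmp(r->state, "down") == 0;
}
```

A caller could feed each line of the command output to parse_show_route_line() and poll until route_is_down() returns true for the gateway of interest.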

Build/deploy fix:

hornc@sles15build01 lustre-filesystem $ git fetch https://es-gerrit.dev.cray.com/lustre-wc-rel refs/changes/88/158188/2 && git cherry-pick FETCH_HEAD
remote: Counting objects: 9, done
remote: Finding sources: 100% (5/5)
remote: Total 5 (delta 4), reused 5 (delta 4)
Unpacking objects: 100% (5/5), done.
From https://es-gerrit.dev.cray.com/lustre-wc-rel
 * branch                  refs/changes/88/158188/2 -> FETCH_HEAD
[task/2.12-int/test-LUS-9612 9571e895bb] LUS-9612 lnet: Router ping timeout with discovery disabled
 Date: Wed Dec 9 14:38:57 2020 -0600
 1 file changed, 4 insertions(+), 4 deletions(-)
hornc@sles15build01 lustre-filesystem $ make -j 32
...
sles15build01:~ # for i in sles15s01 sles15s02 sles15c01; do rsync -avr /home/hornc/lustre-filesystem $i:/home/hornc ; done

With the fix applied, we can see that we take the correct "Router discovery failed" code path.

sles15s02:~ # lctl net down
LNET ready to unload
sles15s02:~ # lustre_rmmod
sles15s02:~ #

sles15c01:~ # lctl show_route
net               tcp1 hops 4294967295 gw                192.168.2.33@tcp2 down pri 0
sles15c01:~ # lctl dk > /tmp/dk.log
sles15c01:~ # grep 'Router discovery failed' /tmp/dk.log
00000400:00000200:0.0:1611332945.989428:0:10139:0:(router.c:540:lnet_router_discovery_complete()) 192.168.2.33@tcp2: Router discovery failed -111
sles15c01:~ #
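
For scripted verification, the gateway NID and error code could be pulled out of a matching dk log line with a helper along these lines. This is a sketch that assumes the message layout seen in the line above ("<NID>: Router discovery failed <error>"); parse_discovery_failure is a hypothetical name, not part of Lustre.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Extract the NID and error code from a "Router discovery failed" log line.
 * Returns 0 on success, -1 if the line does not match or the NID does not fit.
 */
int parse_discovery_failure(const char *line, char *nid, size_t nid_len,
			    int *err)
{
	static const char marker[] = ": Router discovery failed ";
	const char *m;
	const char *start;
	size_t len;

	assert(line != NULL && nid != NULL && err != NULL);

	m = strstr(line, marker);
	if (m == NULL)
		return -1;

	/* The NID is the space-delimited token immediately before the marker. */
	start = m;
	while (start > line && start[-1] != ' ')
		start--;

	len = (size_t)(m - start);
	if (len == 0 || len >= nid_len)
		return -1;
	memcpy(nid, start, len);
	nid[len] = '\0';

	/* The error code follows the marker as a signed decimal integer. */
	if (sscanf(m + sizeof(marker) - 1, "%d", err) != 1)
		return -1;

	return 0;
}
```

Applied to the log line above, this would yield the NID 192.168.2.33@tcp2 and the error code -111.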

Comment by Gerrit Updater [ 28/Apr/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40923/
Subject: LU-14206 lnet: Router ping timeout with discovery disabled
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 173d86c6e9a704a84de36ae57a337a3fdae7b1ed

Comment by Peter Jones [ 28/Apr/21 ]

Landed for 2.15
