[LU-14206] Router ping timeouts don't mark routes down if DD is disabled Created: 09/Dec/20 Updated: 30/Aug/22 Resolved: 28/Apr/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.15.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Chris Horn | Assignee: | Chris Horn |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Discovery pings are used to determine the health of gateways and |
| Comments |
| Comment by Gerrit Updater [ 09/Dec/20 ] |
|
Chris Horn (chris.horn@hpe.com) uploaded a new patch: https://review.whamcloud.com/40923 |
| Comment by Chris Horn [ 02/Mar/21 ] |
|
Test notes for the fix (LUS-9612 is the HPE internal issue for this fix).

Build cray-2.12-int to reproduce:
hornc@sles15build01 lustre-filesystem $ ./LUSTRE-VERSION-GEN
2.12.4.2_cray_253_gd8f8bfe
hornc@sles15build01 lustre-filesystem $ make -j 32
...
sles15build01:~ # for i in sles15s01 sles15s02 sles15c01; do rsync -avr /home/hornc/lustre-filesystem $i:/home/hornc ; done

Test node config:

sles15s01:
sles15s01:~ # lctl list_nids
192.168.2.30@tcp1
sles15s01:~ # lctl show_route
net tcp2 hops 4294967295 gw 192.168.2.32@tcp1 up pri 0
sles15s01:~ #

sles15c01:
sles15c01:~ # lctl list_nids
192.168.2.38@tcp2
sles15c01:~ # lctl show_route
net tcp1 hops 4294967295 gw 192.168.2.33@tcp2 up pri 0
sles15c01:~ #

sles15s02 (router w/DD disabled):
sles15s02:~ # lctl list_nids
192.168.2.32@tcp1
192.168.2.33@tcp2
sles15s02:~ # lnetctl global show | grep disc
discovery: 0
sles15s02:~ #

Stop LNet on the router.
sles15s02:~ # lctl net down
LNET ready to unload
sles15s02:~ # lustre_rmmod
sles15s02:~ #

Wait for route to be marked down on peer. Check dk log to show we do not trigger "Router discovery failed" code path.
sles15c01:~ # lctl show_route
net tcp1 hops 4294967295 gw 192.168.2.33@tcp2 down pri 0
sles15c01:~ # lctl dk > /tmp/dk.log
sles15c01:~ # grep 'Router discovery failed' /tmp/dk.log

Build/deploy fix:
hornc@sles15build01 lustre-filesystem $ git fetch https://es-gerrit.dev.cray.com/lustre-wc-rel refs/changes/88/158188/2 && git cherry-pick FETCH_HEAD
remote: Counting objects: 9, done
remote: Finding sources: 100% (5/5)
remote: Total 5 (delta 4), reused 5 (delta 4)
Unpacking objects: 100% (5/5), done.
From https://es-gerrit.dev.cray.com/lustre-wc-rel
 * branch refs/changes/88/158188/2 -> FETCH_HEAD
[task/2.12-int/test-LUS-9612 9571e895bb] LUS-9612 lnet: Router ping timeout with discovery disabled
 Date: Wed Dec 9 14:38:57 2020 -0600
 1 file changed, 4 insertions(+), 4 deletions(-)
hornc@sles15build01 lustre-filesystem $ make -j 32
...
sles15build01:~ # for i in sles15s01 sles15s02 sles15c01; do rsync -avr /home/hornc/lustre-filesystem $i:/home/hornc ; done

We can see we take the correct "Router discovery failed" code path.
sles15s02:~ # lctl net down
LNET ready to unload
sles15s02:~ # lustre_rmmod
sles15s02:~ #
sles15c01:~ # lctl show_route
net tcp1 hops 4294967295 gw 192.168.2.33@tcp2 down pri 0
sles15c01:~ # lctl dk > /tmp/dk.log
sles15c01:~ # grep 'Router discovery failed' /tmp/dk.log
00000400:00000200:0.0:1611332945.989428:0:10139:0:(router.c:540:lnet_router_discovery_complete()) 192.168.2.33@tcp2: Router discovery failed -111
sles15c01:~ # |
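The two manual checks in the test notes above (the route state on the peer, and the presence of "Router discovery failed" in the dk log) could be scripted roughly as follows. This is only a sketch: route_state and discovery_failures are hypothetical helper names, not part of Lustre or lctl, and the parsing assumes the output formats match the samples shown in this ticket.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Lustre). They only parse text on stdin,
# assuming the `lctl show_route` and `lctl dk` formats seen in this ticket.

# Print the state ("up" or "down") reported for gateway NID $1.
# Prints nothing if the gateway is not listed.
route_state() {
    awk -v gw="$1" '{
        for (i = 1; i < NF; i++)
            if ($i == "gw" && $(i + 1) == gw)
                print $(i + 2)
    }'
}

# Print "<nid> <error code>" for every "Router discovery failed" line.
discovery_failures() {
    awk '/Router discovery failed/ {
        for (i = 2; i < NF; i++)
            if ($i == "Router" && $(i + 1) == "discovery") {
                nid = $(i - 1)
                sub(/:$/, "", nid)   # drop the colon that follows the NID
                print nid, $NF
            }
    }'
}
```

On the peer, these would be used as `lctl show_route | route_state 192.168.2.33@tcp2` and `discovery_failures < /tmp/dk.log`. With the fix applied, the second command should report the gateway NID together with the error code from the log (-111 in the run above, which on Linux corresponds to ECONNREFUSED); without the fix it should print nothing.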
| Comment by Gerrit Updater [ 28/Apr/21 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40923/ |
| Comment by Peter Jones [ 28/Apr/21 ] |
|
Landed for 2.15 |