[LU-15595] Checking route aliveness should be a lookup rather than a calculation Created: 25/Feb/22  Updated: 29/Jul/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Minor
Reporter: Chris Horn Assignee: Chris Horn
Resolution: Unresolved Votes: 0
Labels: None

Rank (Obsolete): 9223372036854775807

 Description   

Every send to a remote network results in the sender calculating the aliveness of every route to that network. In the worst case this involves checking the health of every local and every remote interface (as determined by discovery pings as well as the LNet health feature) of every router. The aliveness of a route changes much less frequently than sends occur, so it makes sense to instead calculate the aliveness only when a router's interface status or health changes. That way, on the send path, we simply look up the current aliveness value.
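
A rough userspace sketch of this idea follows. It is not LNet code: every name (sketch_route, sketch_route_recalc(), and so on) is hypothetical, and the aliveness rule used here (at least one local and one remote gateway interface is up) is an assumption chosen only to keep the example small.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_NIS 8

/* Hypothetical, simplified stand-in for one route through one gateway. */
struct sketch_route {
	bool          local_ni_up[SKETCH_MAX_NIS];  /* gateway NIs on our net */
	size_t        nr_local;
	bool          remote_ni_up[SKETCH_MAX_NIS]; /* gateway NIs on remote net */
	size_t        nr_remote;
	bool          alive;    /* cached result, read on the send path */
	unsigned long recalcs;  /* number of recalculations, for illustration */
};

static bool sketch_any_up(const bool *ni_up, size_t nr)
{
	for (size_t i = 0; i < nr; i++)
		if (ni_up[i])
			return true;
	return false;
}

/* Slow path: walk the interfaces and cache the result (assumed rule). */
static void sketch_route_recalc(struct sketch_route *rt)
{
	assert(rt->nr_local <= SKETCH_MAX_NIS);
	assert(rt->nr_remote <= SKETCH_MAX_NIS);
	rt->alive = sketch_any_up(rt->local_ni_up, rt->nr_local) &&
		    sketch_any_up(rt->remote_ni_up, rt->nr_remote);
	rt->recalcs++;
}

/* Status change event: recalculate only if something actually changed. */
static void sketch_set_remote_ni(struct sketch_route *rt, size_t idx, bool up)
{
	if (idx >= rt->nr_remote || rt->remote_ni_up[idx] == up)
		return;
	rt->remote_ni_up[idx] = up;
	sketch_route_recalc(rt);
}

/* Fast path: the send path only reads the cached value. */
static bool sketch_route_is_alive(const struct sketch_route *rt)
{
	return rt->alive;
}
```

With this shape, any number of calls to sketch_route_is_alive() leaves the recalcs counter untouched; only sketch_set_remote_ni() calls that change a status pay for the interface walk.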

I propose to:
1. Convert the lnet_route::lr_alive field to an atomic_t to avoid any need for special locking when updating the lr_alive value.
2. Consolidate the logic that interprets discovery ping buffers (there is currently separate logic for routers that have discovery enabled and those that do not).
3. The logic in #2 should set the lr_alive value based on the current state of the interfaces as well as the contents of the ping buffer.
4. lnet_is_route_alive() simply returns (or appropriately interprets) the current value of lr_alive.
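
The four steps above could look roughly like the following userspace sketch. Only the lr_alive name comes from this ticket. C11 atomic_int stands in for the kernel's atomic_t, and the ping buffer layout, the status enum, and the rule for deciding aliveness are all hypothetical simplifications.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical status values; not the on-wire LNet constants. */
enum sketch_ni_status { SKETCH_NI_DOWN = 0, SKETCH_NI_UP = 1 };

/* Hypothetical, flattened view of one NI entry in a ping buffer. */
struct sketch_ping_ni {
	unsigned int          net;     /* network this gateway NI is on */
	enum sketch_ni_status status;  /* status reported by the gateway */
};

struct sketch_route {
	unsigned int remote_net;  /* network reached via this route */
	atomic_int   lr_alive;    /* userspace stand-in for atomic_t */
};

/*
 * Steps 2 and 3: a single interpretation path, used whether or not the
 * gateway has discovery enabled. Assumed rule: the route is alive if our
 * local path to the gateway is up and the ping buffer reports at least
 * one UP interface on the route's remote network.
 */
static void sketch_route_update_from_ping(struct sketch_route *rt,
					   const struct sketch_ping_ni *nis,
					   size_t nr_nis, bool local_ni_up)
{
	int alive = 0;

	assert(rt != NULL && (nis != NULL || nr_nis == 0));
	for (size_t i = 0; local_ni_up && i < nr_nis; i++) {
		if (nis[i].net == rt->remote_net &&
		    nis[i].status == SKETCH_NI_UP) {
			alive = 1;
			break;
		}
	}
	/* Step 1: a plain atomic store, no extra lock taken here. */
	atomic_store(&rt->lr_alive, alive);
}

/* Step 4: the send path just reads the current value. */
static bool sketch_is_route_alive(struct sketch_route *rt)
{
	return atomic_load(&rt->lr_alive) != 0;
}
```

The atomic only protects the flag itself. Whatever serializes ping buffer processing in the real code would still be needed to keep concurrent updates from storing stale results.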

There are a few other places where route status is modified, and these can be converted appropriately:
1. lnet_notify()
1.1 When notified that some lpni is DOWN we can set routes down as appropriate
1.2 When notified that some lpni is UP we currently set those routes as UP, but I think this is probably too aggressive. We should instead queue the router for discovery. Since we know the lpni is UP, we should be able to discover it successfully and get an accurate accounting of route status through the gateway.
2. lnet_parse()
2.1 When we receive a message from a router we can make some reasonable assumptions about the status of routes through that router (see LUS-9088).
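
A minimal sketch of the lnet_notify() handling described in 1.1 and 1.2, written as standalone userspace C. All names are hypothetical, and marking every route through the gateway down on a DOWN notification is a simplifying assumption standing in for "as appropriate". The lnet_parse() case is left out because the assumptions it would make are not spelled out here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_ROUTES 4

/* Hypothetical, simplified gateway with its cached per-route state. */
struct sketch_gateway {
	bool   route_alive[SKETCH_MAX_ROUTES];
	size_t nr_routes;
	bool   discovery_queued;  /* gateway is waiting for rediscovery */
};

/* Notification about one of the gateway's peer NIs. */
static void sketch_notify(struct sketch_gateway *gw, bool lpni_up)
{
	assert(gw->nr_routes <= SKETCH_MAX_ROUTES);
	if (!lpni_up) {
		/* 1.1: DOWN, so mark routes down (assumed: all of them). */
		for (size_t i = 0; i < gw->nr_routes; i++)
			gw->route_alive[i] = false;
		return;
	}
	/* 1.2: UP, so do not flip routes; let discovery decide. */
	gw->discovery_queued = true;
}

/* Discovery finished: apply the per-route results it produced. */
static void sketch_discovery_done(struct sketch_gateway *gw,
				  const bool *route_up, size_t nr)
{
	assert(nr == gw->nr_routes);
	for (size_t i = 0; i < nr; i++)
		gw->route_alive[i] = route_up[i];
	gw->discovery_queued = false;
}
```

The point of the UP branch is that routes stay in their previous state until discovery reports what the gateway can actually reach.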

Lastly, the health value of a router's peer NIs is currently one component of the route aliveness calculation. As such, any time the health of one of these peer NIs is modified we'll need to recalculate the route aliveness. The current functions for manipulating health values will need to be modified so that we can detect when the health value actually changes (they currently perform a blind increment/decrement regardless of whether the health value is already at its maximum or already 0).
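
One possible shape for change-aware health updates is sketched below in userspace C. The names and the ceiling of 1000 are assumptions for illustration, not the existing LNet functions. Each helper clamps the new value and reports whether it differs from the old one, so a caller can recalculate route aliveness only on a real change.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_MAX_HEALTH 1000  /* assumed ceiling for this sketch */

/* Hypothetical stand-in for a gateway peer NI. */
struct sketch_peer_ni {
	int healthv;
};

/* Clamp to [0, SKETCH_MAX_HEALTH]; return true only on a real change. */
static bool sketch_health_set(struct sketch_peer_ni *lpni, int value)
{
	int old = lpni->healthv;

	if (value < 0)
		value = 0;
	if (value > SKETCH_MAX_HEALTH)
		value = SKETCH_MAX_HEALTH;
	lpni->healthv = value;
	return value != old;
}

static bool sketch_health_dec(struct sketch_peer_ni *lpni, int delta)
{
	assert(delta >= 0);
	return sketch_health_set(lpni, lpni->healthv - delta);
}

static bool sketch_health_inc(struct sketch_peer_ni *lpni, int delta)
{
	assert(delta >= 0);
	return sketch_health_set(lpni, lpni->healthv + delta);
}
```

A caller would then do something like "if (sketch_health_dec(lpni, 100)) recalculate the routes through this gateway", which skips the recalculation when the value was already pinned at 0 or at the maximum.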



 Comments   
Comment by Gerrit Updater [ 25/Feb/22 ]

"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/46622
Subject: LU-15595 tests: Add various router tests
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: fd57595ecb323922cf3132b8789f8d4d7818e497

Comment by Gerrit Updater [ 25/Feb/22 ]

"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/46623
Subject: LU-15595 lnet: LNet peer aliveness broken
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 6ba081b278f8f0a349c26aef469291fa9bed8197

Comment by Gerrit Updater [ 25/Feb/22 ]

"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/46624
Subject: LU-15595 lnet: Always use ping reply to set route lr_alive
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: ab886f01c3f36149c42e20b0920f478ac4aec104

Comment by Gerrit Updater [ 25/Feb/22 ]

"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/46625
Subject: LU-15595 lnet: Tweak route updates in lnet_notify
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: bfeb655da6c029035e74ccb7a402c2393cc714b0

Comment by Gerrit Updater [ 25/Feb/22 ]

"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/46626
Subject: LU-15595 lnet: Remove duplicate checks for peer sensitivity
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 7051155306f6024db79bedb7f4dbfef9529b14bc

Comment by Gerrit Updater [ 25/Feb/22 ]

"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/46627
Subject: LU-15595 lnet: Update lnet_route::lr_alive on health change
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: b2aa610ed7f8b8a3b4c9d93cd5037114036ef1f5

Comment by Gerrit Updater [ 25/Feb/22 ]

"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/46628
Subject: LU-15595 lnet: Always set lnet_route::lr_alive
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 54d81c4e67e8907e672fae88a01df9ed3247d78c

Comment by Gerrit Updater [ 25/Feb/22 ]

"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/46629
Subject: LU-15595 lnet: Do not calculate route aliveness on send
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 4e567b1be4e70e0e3d669488c6d3abdb5c481b31

Comment by Gerrit Updater [ 28/Feb/22 ]

"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/46653
Subject: LU-15595 debug: unload_modules_local debug
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 169e57fefff121a293168568feb9cf5c440eccc5

Comment by Gerrit Updater [ 01/Sep/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/46622/
Subject: LU-15595 tests: Add various router tests
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 8ee85e15412d32fbe60f70c474c0a28ff15b8351

Comment by Gerrit Updater [ 01/Sep/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/46623/
Subject: LU-15595 lnet: LNet peer aliveness broken
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: caf6095ade66f70d4bad99ced7a918814a3af092

Comment by Andreas Dilger [ 15/Sep/22 ]

The patch https://review.whamcloud.com/46622 "LU-15595 tests: Add various router tests" added sanity-lnet test_220 - test_227 and was run with "Test-Parameters: trivial" as is typical for LNet tests. However, it looks like these tests are all failing on aarch64 (ARM):
https://testing.whamcloud.com/test_sets/edf3fd8c-47d3-4045-808e-b2886e56c8f4

CMD: trevis-108vm12 /usr/sbin/lnetctl set routing 1
trevis-108vm12: add:
trevis-108vm12:     - routing:
trevis-108vm12:           errno: -12
trevis-108vm12:           descr: "cannot enable routing Cannot allocate memory"
pdsh@trevis-108vm11: trevis-108vm12: ssh exited with exit code 244
 sanity-lnet test_220: @@@@@@ FAIL: Unable to enable routing on trevis-108vm12 

Note that this error started on 2022-09-01 when 46622 was landed (since it first added those subtests), but was also hit after the LU-16140 "lnet: revert "LU-16011 lnet: use preallocate bulk for server"" patch was landed, so the "Cannot allocate memory" error is not directly related to the LU-16011 patch (which only affected lnet-selftest).

Separately, there is a different error on x86_64 testing, but only when run with "full" test sessions.
https://testing.whamcloud.com/test_sets/1b87b224-2c5b-4035-a68d-99a776eddc6f

onyx-60vm3: onyx-60vm3.onyx.whamcloud.com: executing load_lnet config_on_load=1
onyx-60vm3: rpc.sh: line 21: load_lnet: command not found
pdsh@onyx-60vm1: onyx-60vm3: ssh exited with exit code 127
 sanity-lnet test_227: @@@@@@ FAIL: Failed to load and configure LNet 

It looks like this is failing because it is trying to test against 2.12.9 servers, which do not have the "load_lnet" command. These tests need to add a version check so that they are skipped with older servers (the "load_lnet" function was added in commit v2_15_0-RC2-42-ge41f91dc90):

        (( $MDS1_VERSION >= $(version_code 2.15.0) )) ||
                skip "need at least 2.15.0 for load_lnet"

Chris, could you please push a patch.

Comment by Chris Horn [ 15/Sep/22 ]

> Chris, could you please push a patch.

Okay, I should be able to cook something up tomorrow.

Comment by Gerrit Updater [ 16/Sep/22 ]

"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/48578
Subject: LU-15595 tests: Router test interop check and aarch fix
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 927255bd07531a24fc8cb4296d78285630549d5c

Comment by Gerrit Updater [ 10/Oct/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/48578/
Subject: LU-15595 tests: Router test interop check and aarch fix
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 1aba6b0d9b661d3699cbd4624e9db334a13fc647

Comment by Gerrit Updater [ 03/Jul/23 ]

"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51543
Subject: LU-15595 tests: Add various router tests
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: 652ac5f81857e1f52c2a34a511a1a2c57e6de4e7

Comment by Gerrit Updater [ 03/Jul/23 ]

"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51544
Subject: LU-15595 lnet: LNet peer aliveness broken
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: bdb5fe08201cfd5129f27a53b1485849e819f59d

Comment by Gerrit Updater [ 03/Jul/23 ]

"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51545
Subject: LU-15595 lnet: Always use ping reply to set route lr_alive
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: d9e08a1be5e3488e804e6048cbcddb268cb1f5c9

Comment by Gerrit Updater [ 03/Jul/23 ]

"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51546
Subject: LU-15595 tests: Router test interop check and aarch fix
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: b2ae00a58db4580a591dd82bb6acdb461de327d7

Generated at Sat Feb 10 03:19:40 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.