[LU-15595] Checking route aliveness should be a lookup rather than a calculation Created: 25/Feb/22 Updated: 29/Jul/23 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Minor |
| Reporter: | Chris Horn | Assignee: | Chris Horn |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Every send to a remote network results in the sender calculating the aliveness of every route to the remote network. In the worst case this involves checking the health of every local and every remote interface (as determined by discovery pings as well as the LNet health feature) of every router.

The aliveness of a route changes much less frequently than this send activity, so it makes sense to instead calculate the aliveness whenever a router's interface status or health changes. That way, on the send path, we simply look up the current aliveness value. I propose to:

There are a few other places where route status is modified, and these can be converted appropriately:

Lastly, a current component in the route aliveness calculation is the health value of a router's peer NIs. As such, anytime the health of one of these peer NIs is modified we will need to re-calculate the route aliveness. The current functions for manipulating health values will need to be modified so that we can detect when there is an actual change in the health value (they currently do a blind increment/decrement regardless of whether the health value is already maxed out or already 0).
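As a rough sketch of the idea (all names and structures below are hypothetical stand-ins, not the actual LNet code), the health helpers would report whether the value really changed, the route aliveness would be cached only when something changes, and the send path would reduce to reading the cached flag:

#include <stdbool.h>
#include <stdio.h>

#define MAX_HEALTH 1000

struct peer_ni {
        int health;             /* 0 .. MAX_HEALTH */
};

struct route {
        struct peer_ni *gw_ni;  /* router interface used to reach the gateway */
        bool alive;             /* cached aliveness, updated only on change */
};

/* Recompute and cache aliveness; called only when something changed. */
static void route_set_aliveness(struct route *rt)
{
        rt->alive = rt->gw_ni->health > 0;
}

/*
 * Decrement health, but report whether the value actually changed so the
 * caller knows whether route aliveness needs to be re-evaluated (rather
 * than a blind decrement that ignores an already-zero value).
 */
static bool peer_ni_dec_health(struct peer_ni *pni, int delta)
{
        int old = pni->health;

        pni->health = old > delta ? old - delta : 0;
        return pni->health != old;
}

/* Same idea for recovery: report whether the value actually changed. */
static bool peer_ni_inc_health(struct peer_ni *pni, int delta)
{
        int old = pni->health;

        pni->health = old + delta < MAX_HEALTH ? old + delta : MAX_HEALTH;
        return pni->health != old;
}

/* Send path: a lookup of the cached value, not a recalculation. */
static bool route_is_alive(const struct route *rt)
{
        return rt->alive;
}

int main(void)
{
        struct peer_ni ni = { .health = MAX_HEALTH };
        struct route rt = { .gw_ni = &ni };

        route_set_aliveness(&rt);

        /* A send failure degrades health; recompute only on a real change. */
        if (peer_ni_dec_health(&ni, MAX_HEALTH))
                route_set_aliveness(&rt);
        printf("route alive: %d\n", route_is_alive(&ni ? &rt : &rt));

        /* Further decrements of an already-zero health change nothing. */
        if (peer_ni_dec_health(&ni, 100))
                route_set_aliveness(&rt);

        /* Recovery (e.g. a successful ping) works the same way. */
        if (peer_ni_inc_health(&ni, MAX_HEALTH))
                route_set_aliveness(&rt);
        printf("route alive after recovery: %d\n", route_is_alive(&rt));

        return 0;
}

With this structure, aliveness is re-evaluated only on a genuine health or interface-status transition, while the per-send cost becomes a single cached read. |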
| Comments |
| Comment by Gerrit Updater [ 25/Feb/22 ] |
|
"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/46622 |
| Comment by Gerrit Updater [ 25/Feb/22 ] |
|
"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/46623 |
| Comment by Gerrit Updater [ 25/Feb/22 ] |
|
"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/46624 |
| Comment by Gerrit Updater [ 25/Feb/22 ] |
|
"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/46625 |
| Comment by Gerrit Updater [ 25/Feb/22 ] |
|
"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/46626 |
| Comment by Gerrit Updater [ 25/Feb/22 ] |
|
"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/46627 |
| Comment by Gerrit Updater [ 25/Feb/22 ] |
|
"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/46628 |
| Comment by Gerrit Updater [ 25/Feb/22 ] |
|
"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/46629 |
| Comment by Gerrit Updater [ 28/Feb/22 ] |
|
"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/46653 |
| Comment by Gerrit Updater [ 01/Sep/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/46622/ |
| Comment by Gerrit Updater [ 01/Sep/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/46623/ |
| Comment by Andreas Dilger [ 15/Sep/22 ] |
|
The patch https://review.whamcloud.com/46622 "LU-15595 tests: Add various router tests" added sanity-lnet test_220 - test_227 and was run with "Test-Parameters: trivial" as is typical for LNet tests. However, it looks like these tests are all failing on aarch64 (ARM):

CMD: trevis-108vm12 /usr/sbin/lnetctl set routing 1
trevis-108vm12: add:
trevis-108vm12: - routing:
trevis-108vm12: errno: -12
trevis-108vm12: descr: "cannot enable routing Cannot allocate memory"
pdsh@trevis-108vm11: trevis-108vm12: ssh exited with exit code 244
sanity-lnet test_220: @@@@@@ FAIL: Unable to enable routing on trevis-108vm12

Note that this error started on 2022-09-01 when 46622 was landed (since it first added those subtests), but was also hit after the LU-16140 "lnet: revert "

Separately, there is a different error on x86_64 testing, but only when run with "full" test sessions:

onyx-60vm3: onyx-60vm3.onyx.whamcloud.com: executing load_lnet config_on_load=1
onyx-60vm3: rpc.sh: line 21: load_lnet: command not found
pdsh@onyx-60vm1: onyx-60vm3: ssh exited with exit code 127
sanity-lnet test_227: @@@@@@ FAIL: Failed to load and configure LNet

It looks like this is failing because it is trying to test against 2.12.9 servers, which do not have the "load_lnet" command. These tests need to add a version check so that they are skipped with older servers (the "load_lnet" function was added in commit v2_15_0-RC2-42-ge41f91dc90):

(( $MDS1_VERSION >= $(version_code 2.15.0) )) ||
        skip "need at least 2.15.0 for load_lnet"

Chris, could you please push a patch? |
| Comment by Chris Horn [ 15/Sep/22 ] |
|
Okay, I should be able to cook something up tomorrow. |
| Comment by Gerrit Updater [ 16/Sep/22 ] |
|
"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/48578 |
| Comment by Gerrit Updater [ 10/Oct/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/48578/ |
| Comment by Gerrit Updater [ 03/Jul/23 ] |
|
"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51543 |
| Comment by Gerrit Updater [ 03/Jul/23 ] |
|
"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51544 |
| Comment by Gerrit Updater [ 03/Jul/23 ] |
|
"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51545 |
| Comment by Gerrit Updater [ 03/Jul/23 ] |
|
"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51546 |