Checking route aliveness should be a lookup rather than a calculation

Details

    • Improvement
    • Resolution: Unresolved
    • Minor

    Description

      Every send to a remote network results in the sender calculating the aliveness of every route to that network. In the worst case this involves checking the health of every local and every remote interface (as determined by discovery pings as well as the LNet health feature) of every router. The aliveness of a route changes much less frequently than sends occur, so it makes sense to calculate the aliveness only when a router's interface status or health changes. That way, on the send path, we simply look up the current aliveness value.

      I propose to:
      1. Convert the lnet_route::lr_alive field to an atomic_t to avoid any need for special locking when updating the lr_alive value.
      2. Consolidate the logic that interprets discovery ping buffers (there is currently separate logic for routers that have discovery enabled and those that do not).
      3. Have the logic in #2 set the lr_alive value based on the current state of the interfaces as well as the contents of the ping buffer.
      4. Have lnet_is_route_alive() simply return (or appropriately interpret) the current value of lr_alive.
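
The four steps above can be sketched in userspace C as follows. This is only an illustration under stated assumptions: C11 `atomic_int` stands in for the kernel `atomic_t`, the types `gw_iface_sketch` and `route_sketch` and both functions are hypothetical names rather than LNet code, and the rule used here (a route is alive if at least one gateway interface is up and healthy) is a placeholder, not the actual LNet rule.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one gateway interface; not an LNet struct. */
struct gw_iface_sketch {
	bool up;      /* status as reported in the ping buffer */
	bool healthy; /* health value is considered good enough */
};

/* Hypothetical stand-in for lnet_route; only the cached flag is modeled. */
struct route_sketch {
	atomic_int alive; /* plays the role of lr_alive as an atomic_t */
};

/* Slow path: run only when interface status or health changes. */
static void route_recalc_alive(struct route_sketch *r,
			       const struct gw_iface_sketch *ifs, size_t n)
{
	int alive = 0;

	assert(r != NULL);
	for (size_t i = 0; i < n; i++) {
		/* Placeholder rule: one usable interface is enough. */
		if (ifs[i].up && ifs[i].healthy) {
			alive = 1;
			break;
		}
	}
	/* A single atomic store; no extra lock is taken for the update. */
	atomic_store(&r->alive, alive);
}

/* Fast path: the send path only reads the cached value. */
static bool route_is_alive(struct route_sketch *r)
{
	return atomic_load(&r->alive) != 0;
}
```

With this split, the per-send cost is one atomic load, and the walk over the gateway's interfaces happens only when something about the gateway changes.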

      There are a few other places where route status is modified, and these can be converted appropriately:
      1. lnet_notify()
      1.1 When notified that some lpni is DOWN we can set routes down as appropriate
      1.2 When notified that some lpni is UP we currently set those routes as UP, but I think this is probably too aggressive. We should instead queue the router for discovery. Since we know the lpni is UP, we should be able to discover it successfully and get an accurate accounting of route status through the gateway.
      2. lnet_parse()
      2.1 When we receive a message from a router we can make some reasonable assumptions about the status of routes through that router (see LUS-9088).
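
The notification handling described in items 1.1 and 1.2 could look roughly like the sketch below. It is a simplification under stated assumptions: the gateway is modeled with a single peer NI, so a DOWN notification takes the route down outright; `gateway_sketch`, `gateway_notify()` and `gateway_discovery_done()` are hypothetical names, and C11 `atomic_int` stands in for the kernel `atomic_t`.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of a gateway with one peer NI and one route. */
struct gateway_sketch {
	atomic_int route_alive; /* cached aliveness, as with lr_alive */
	bool discovery_queued;  /* gateway is waiting for (re)discovery */
};

/* Called when the caller is told that the gateway's peer NI went UP or DOWN. */
static void gateway_notify(struct gateway_sketch *gw, bool ni_up)
{
	if (!ni_up) {
		/* DOWN: the cached route state can be cleared right away. */
		atomic_store(&gw->route_alive, 0);
		return;
	}
	/* UP: do not mark the route alive yet; let discovery decide. */
	gw->discovery_queued = true;
}

/* Called when discovery of the gateway completes. */
static void gateway_discovery_done(struct gateway_sketch *gw, bool route_ok)
{
	gw->discovery_queued = false;
	atomic_store(&gw->route_alive, route_ok ? 1 : 0);
}
```

The point of the asymmetry is that a DOWN notification is enough evidence to clear the cached state, while an UP notification only says the peer NI is reachable, not that routes through the gateway are usable.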

      Lastly, a current component in the route aliveness calculation is the health value of a router's peer NIs. As such, any time the health of one of these peer NIs is modified we'll need to recalculate the route aliveness. The current functions for manipulating health values will need to be modified so that we can detect when there's an actual change in health value (they currently do a blind increment/decrement regardless of whether the health value is already maxed out or already 0).
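
One way to detect an actual change in health value is to have the adjust function report whether the stored value moved. The sketch below assumes C11 atomics in place of the kernel `atomic_t`; `health_adjust()` and `SKETCH_MAX_HEALTH` are hypothetical names and the ceiling is an arbitrary choice for illustration, not taken from the LNet source.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define SKETCH_MAX_HEALTH 1000 /* assumed ceiling, for illustration only */

/*
 * Add delta (which may be negative) to the health value, clamped to
 * [0, SKETCH_MAX_HEALTH]. Returns true only if the stored value changed,
 * so the caller knows whether route aliveness must be recalculated.
 */
static bool health_adjust(atomic_int *healthv, int delta)
{
	int old = atomic_load(healthv);
	int next;

	do {
		next = old + delta;
		if (next > SKETCH_MAX_HEALTH)
			next = SKETCH_MAX_HEALTH;
		if (next < 0)
			next = 0;
		if (next == old)
			return false; /* already saturated: nothing changed */
		/* On failure, old is reloaded and the loop recomputes next. */
	} while (!atomic_compare_exchange_weak(healthv, &old, next));

	return true;
}
```

A caller would recalculate route aliveness only when `health_adjust()` returns true, which skips the recalculation for the common case of incrementing a health value that is already at its maximum.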

          Activity

            [LU-15595] Checking route aliveness should be a lookup rather than a calculation

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51546/
            Subject: LU-15595 tests: Router test interop check and aarch fix
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set:
            Commit: a3200ab63ad83241bf5383c9be750c42bf104239

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51543/
            Subject: LU-15595 tests: Add various router tests
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set:
            Commit: 19dbb06af087e9a754bf803c40a0ae12139c6d43

            "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51546
            Subject: LU-15595 tests: Router test interop check and aarch fix
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set: 1
            Commit: b2ae00a58db4580a591dd82bb6acdb461de327d7

            "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51545
            Subject: LU-15595 lnet: Always use ping reply to set route lr_alive
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set: 1
            Commit: d9e08a1be5e3488e804e6048cbcddb268cb1f5c9

            "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51544
            Subject: LU-15595 lnet: LNet peer aliveness broken
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set: 1
            Commit: bdb5fe08201cfd5129f27a53b1485849e819f59d

            "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51543
            Subject: LU-15595 tests: Add various router tests
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set: 1
            Commit: 652ac5f81857e1f52c2a34a511a1a2c57e6de4e7

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/48578/
            Subject: LU-15595 tests: Router test interop check and aarch fix
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 1aba6b0d9b661d3699cbd4624e9db334a13fc647

            "Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/48578
            Subject: LU-15595 tests: Router test interop check and aarch fix
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 927255bd07531a24fc8cb4296d78285630549d5c

            hornc Chris Horn added a comment -

            Chris, could you please push a patch.

            Okay, I should be able to cook something up tomorrow.

            The patch https://review.whamcloud.com/46622 "LU-15595 tests: Add various router tests" added sanity-lnet test_220 - test_227 and was run with "Test-Parameters: trivial" as is typical for LNet tests. However, it looks like these tests are all failing on aarch64 (ARM):
            https://testing.whamcloud.com/test_sets/edf3fd8c-47d3-4045-808e-b2886e56c8f4

            CMD: trevis-108vm12 /usr/sbin/lnetctl set routing 1
            trevis-108vm12: add:
            trevis-108vm12:     - routing:
            trevis-108vm12:           errno: -12
            trevis-108vm12:           descr: "cannot enable routing Cannot allocate memory"
            pdsh@trevis-108vm11: trevis-108vm12: ssh exited with exit code 244
             sanity-lnet test_220: @@@@@@ FAIL: Unable to enable routing on trevis-108vm12 
            

            Note that this error started on 2022-09-01 when 46622 was landed (since it first added those subtests), but was also hit after the LU-16140 "lnet: revert 'LU-16011 lnet: use preallocate bulk for server'" patch was landed, so the "Cannot allocate memory" error is not directly related to the LU-16011 patch (which only affected lnet-selftest).

            Separately, there is a different error on x86_64 testing, but only when run with "full" test sessions.
            https://testing.whamcloud.com/test_sets/1b87b224-2c5b-4035-a68d-99a776eddc6f

            onyx-60vm3: onyx-60vm3.onyx.whamcloud.com: executing load_lnet config_on_load=1
            onyx-60vm3: rpc.sh: line 21: load_lnet: command not found
            pdsh@onyx-60vm1: onyx-60vm3: ssh exited with exit code 127
             sanity-lnet test_227: @@@@@@ FAIL: Failed to load and configure LNet 
            

            It looks like this is failing because it is trying to test against 2.12.9 servers, which do not have the "load_lnet" command. These tests need a version check so that they are skipped with older servers (the "load_lnet" function was added in commit v2_15_0-RC2-42-ge41f91dc90):

                    (( $MDS1_VERSION >= $(version_code 2.15.0) )) ||
                            skip "need at least 2.15.0 for load_lnet"
            

            Chris, could you please push a patch.

            adilger Andreas Dilger added the comment above.

            People

              hornc Chris Horn
              hornc Chris Horn