  1. Lustre
  2. LU-15595

Checking route aliveness should be a lookup rather than a calculation


Details

    • Improvement
    • Resolution: Unresolved
    • Minor

    Description

      Every send to a remote network causes the sender to calculate the aliveness of every route to that network. In the worst case this involves checking the health of every local and every remote interface (as determined by discovery pings as well as the LNet health feature) of every router. Route aliveness changes far less frequently than sends occur, so it makes sense to calculate it only when a router's interface status or health changes. That way, on the send path, we simply look up the current aliveness value.

      I propose to:
      1. Convert the lnet_route::lr_alive field to an atomic_t to avoid any need for special locking when updating the lr_alive value.
      2. Consolidate the logic that interprets discovery ping buffers (there is currently separate logic for routers that have discovery enabled and those that do not).
      3. Have the logic from item 2 set the lr_alive value based on the current state of the interfaces as well as the contents of the ping buffer.
      4. Have lnet_is_route_alive() simply return (or appropriately interpret) the current value of lr_alive.
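
The split between a slow recalculation path and a fast lookup path might look roughly like the sketch below. It is a userspace model only: C11 `atomic_int` stands in for the kernel's atomic_t, and `struct model_ni`, `struct model_route`, `model_route_recalc()` and `model_route_is_alive()` are hypothetical names, not existing LNet code. The aliveness policy (alive if at least one interface is both up and healthy) is an assumption chosen to keep the example short.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one gateway interface; not the real LNet struct. */
struct model_ni {
	bool up;      /* status as reported in the discovery ping buffer */
	bool healthy; /* result of the health check for this interface */
};

/* Hypothetical stand-in for struct lnet_route. */
struct model_route {
	atomic_int lr_alive;        /* 1 = alive, 0 = down */
	const struct model_ni *nis; /* interfaces this route depends on */
	size_t nnis;
};

/*
 * Slow path: run only when interface status or health changes.
 * Assumed policy: the route is alive if any interface is up and healthy.
 */
static void model_route_recalc(struct model_route *rt)
{
	int alive = 0;

	assert(rt != NULL);
	for (size_t i = 0; i < rt->nnis; i++) {
		if (rt->nis[i].up && rt->nis[i].healthy) {
			alive = 1;
			break;
		}
	}
	atomic_store(&rt->lr_alive, alive);
}

/* Fast path: the send path only reads the cached value. */
static bool model_route_is_alive(struct model_route *rt)
{
	return atomic_load(&rt->lr_alive) != 0;
}
```

With this shape, the cost of walking the interfaces is paid once per state change rather than once per send, and the cached value is stale only until the next recalculation.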

      There are a few other places where route status is modified, and these can be converted appropriately:
      1. lnet_notify()
      1.1 When notified that some lpni is DOWN we can set routes down as appropriate.
      1.2 When notified that some lpni is UP we currently set those routes as UP, but I think this is probably too aggressive. We should instead queue the router for discovery. Since we know the lpni is UP, we should be able to discover it successfully and get an accurate accounting of route status through the gateway.
      2. lnet_parse()
      2.1 When we receive a message from a router we can make some reasonable assumptions about the status of routes through that router (see LUS-9088).
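
The asymmetric handling proposed for lnet_notify() could be modelled as in the sketch below. It is not the real lnet_notify() signature or logic: `struct model_gateway` and `model_notify()` are hypothetical, the gateway is assumed to have a single lpni so that a DOWN notification marks all of its routes down, and a boolean flag stands in for actually queueing the router for discovery.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_ROUTES 4

/* Hypothetical model of a gateway and the routes that pass through it. */
struct model_gateway {
	atomic_int route_alive[MODEL_MAX_ROUTES]; /* cached lr_alive per route */
	size_t nroutes;
	bool discovery_queued; /* stands in for queueing the router for discovery */
};

/*
 * DOWN: mark routes down immediately (assumes a single-lpni gateway).
 * UP:   do not mark routes up; queue discovery and let its result
 *       determine the route status.
 */
static void model_notify(struct model_gateway *gw, bool lpni_up)
{
	assert(gw != NULL);
	assert(gw->nroutes <= MODEL_MAX_ROUTES);

	if (lpni_up) {
		gw->discovery_queued = true;
		return;
	}

	for (size_t i = 0; i < gw->nroutes; i++)
		atomic_store(&gw->route_alive[i], 0);
}
```

The point of the asymmetry is that a DOWN notification is enough to stop using a route, while an UP notification says nothing about the gateway's other interfaces, so the sketch defers to discovery before treating routes as alive again.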

      Lastly, one component of the current route aliveness calculation is the health value of a router's peer NIs. As such, any time the health of one of these peer NIs is modified we'll need to recalculate the route aliveness. The current functions for manipulating health values will need to be modified so that we can detect when there's an actual change in health value (they currently perform a blind increment/decrement regardless of whether the health value is already maxed out or already 0).
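
One way to detect an actual change is to have the health helpers clamp the value and report whether it moved, as in the sketch below. The names `model_health_inc()`, `model_health_dec()` and `MODEL_HEALTH_MAX`, and the cap of 1000, are assumptions for illustration and are not taken from the LNet source.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_HEALTH_MAX 1000 /* assumed cap for this sketch */

/* Decrease health, clamped at 0. Returns true only if the value changed. */
static bool model_health_dec(int *health, int amount)
{
	int old = *health;
	int next;

	assert(amount >= 0);
	next = old - amount;
	if (next < 0)
		next = 0;
	*health = next;
	return next != old;
}

/* Increase health, clamped at the cap. Returns true only if the value changed. */
static bool model_health_inc(int *health, int amount)
{
	int old = *health;
	int next;

	assert(amount >= 0);
	next = old + amount;
	if (next > MODEL_HEALTH_MAX)
		next = MODEL_HEALTH_MAX;
	*health = next;
	return next != old;
}
```

A caller would then trigger the route aliveness recalculation only when one of these helpers returns true, so an increment on an already maxed-out value or a decrement on a value that is already 0 costs nothing extra.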

          People

            hornc Chris Horn