  1. Lustre
  2. LU-13708

lnet_notify can set route aliveness incorrectly

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: Lustre 2.15.0
    • Labels: None
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      lnet_notify() modifies route aliveness in two ways:
      1. By setting the lp_alive field of the lnet_peer struct.
      2. By setting the lr_alive field of the lnet_route struct (via a call to lnet_set_route_aliveness()).

      In both cases, the aliveness value assigned is determined by a call
      to lnet_is_peer_ni_alive(), but that value only reflects the aliveness
      of a particular peer NI. A gateway may have multiple peer NIs, so the
      aliveness of a gateway peer (lp_alive) is not necessarily equivalent
      to the aliveness of one of its NIs. Furthermore, the lr_alive field
      is only used to determine route aliveness for path selection if
      discovery is disabled locally or on the gateway (see
      lnet_find_route_locked() and lnet_is_route_alive()).
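
      To make the mismatch concrete, here is a minimal sketch in C. It is
      not the LNet code: the toy_* types, the discovery_disabled flag, the
      "any NI alive" rule, and the fallback to lp_alive when discovery is
      enabled are all assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; NOT the real LNet structures. */
struct toy_peer_ni {
	bool alive;                 /* state of one network interface */
};

struct toy_gateway {
	const struct toy_peer_ni *nis;
	size_t n_nis;
	bool lp_alive;              /* gateway-level aliveness */
	bool discovery_disabled;    /* assumed flag: gateway has discovery off */
};

struct toy_route {
	const struct toy_gateway *gw;
	bool lr_alive;              /* per-route aliveness */
};

/* Assumption for illustration: a gateway is reachable if any NI is alive. */
static bool toy_gateway_has_alive_ni(const struct toy_gateway *gw)
{
	for (size_t i = 0; i < gw->n_nis; i++) {
		if (gw->nis[i].alive)
			return true;
	}
	return false;
}

/*
 * lr_alive is consulted only when discovery is disabled locally or on the
 * gateway. Falling back to lp_alive otherwise is a simplification.
 */
static bool toy_route_is_alive(const struct toy_route *rt,
			       bool local_discovery_disabled)
{
	if (local_discovery_disabled || rt->gw->discovery_disabled)
		return rt->lr_alive;
	return rt->gw->lp_alive;
}
```

      With a two-NI gateway where only the first NI is down,
      toy_gateway_has_alive_ni() still returns true, so copying that one
      NI's state into lp_alive would wrongly mark the whole gateway down.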

      In general, we should not set lp_alive based on an lnet_notify()
      call, and we should only set lr_alive if discovery is disabled. For
      lr_alive specifically, we should only set it for those routes that
      have the peer NI as a next-hop.

      An exception to the above exists when the reset argument to
      lnet_notify() is set. The gnilnd uses this flag in its calls to
      lnet_notify() because gnilnd receives out-of-band notifications of
      node up and down events. Thus, when gnilnd calls lnet_notify() we
      actually know whether the gateway peer is up or down and we can set
      lp_alive and lr_alive appropriately.
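
      The rules proposed in this description could be sketched roughly as
      follows. This is a simplified model, not the actual fix: the
      sketch_* names, the integer NI index, and the choice to update every
      route in the reset case are assumptions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model; names do not match the LNet source. */
struct sketch_route {
	int next_hop_ni;            /* index of the gateway NI used as next hop */
	bool lr_alive;
};

struct sketch_gateway {
	bool lp_alive;
	bool discovery_disabled;    /* assumed flag: gateway has discovery off */
	struct sketch_route *routes;
	size_t n_routes;
};

static void sketch_notify(struct sketch_gateway *gw, int ni, bool alive,
			  bool reset, bool local_discovery_disabled)
{
	bool use_lr_alive = local_discovery_disabled || gw->discovery_disabled;

	if (reset)
		gw->lp_alive = alive;   /* out-of-band event: whole gateway state is known */
	else if (!use_lr_alive)
		return;                 /* discovery enabled: one NI event changes nothing */

	for (size_t i = 0; i < gw->n_routes; i++) {
		/* Without reset, touch only routes whose next hop is this NI. */
		if (reset || gw->routes[i].next_hop_ni == ni)
			gw->routes[i].lr_alive = alive;
	}
}
```

      In this model a plain notification never touches lp_alive, and it
      changes lr_alive only for routes through the reported NI when
      discovery is disabled on either side.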

          People

            Assignee: Chris Horn (hornc)
            Reporter: Chris Horn (hornc)