LNet health router sensitivity may lead to no routes alive

Details

    • Improvement
    • Resolution: Unresolved
    • Minor

    Description

      On an o2ib filesystem connected through LNet routers to tcp clients, we observed that a client crash could cause the health of the o2ib router peer NIs to be decremented.

      For example, an OSS sends bulk payload as an LNet PUT with an ACK requested. The router only sends the ACK after the message is successfully forwarded to the client. Because the client has crashed, the message cannot be forwarded and the ACK is never sent back to the OSS. The OSS then hits an LNet "response timeout" and decrements the health of the router's peer NI, which places the NI in recovery.
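
The sequence described above can be modelled in a few lines of C. This is a simplified sketch, not LNet code: `struct model_peer_ni`, `model_put_with_ack()`, and the constants `MODEL_MAX_HEALTH` (1000) and `MODEL_HEALTH_SENSITIVITY` (100) are hypothetical stand-ins chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not taken from the LNet source. */
#define MODEL_MAX_HEALTH         1000
#define MODEL_HEALTH_SENSITIVITY 100

/* Simplified view of the OSS's record of one router peer NI. */
struct model_peer_ni {
        int  healthv;      /* current health value */
        bool in_recovery;  /* set once a timeout has been charged to this NI */
};

/*
 * Model one PUT with an ACK requested, sent via a router.
 * The router only returns the ACK after forwarding to the client, so a
 * crashed client means no ACK. The sender sees that as a response timeout
 * and charges it against the router's peer NI.
 * Returns true if the ACK arrived.
 */
static bool
model_put_with_ack(struct model_peer_ni *router_ni, bool client_alive)
{
        assert(router_ni != NULL);

        if (client_alive)
                return true;            /* forwarded, ACK returned */

        /* Response timeout: decrement health, with a floor of zero. */
        router_ni->healthv -= MODEL_HEALTH_SENSITIVITY;
        if (router_ni->healthv < 0)
                router_ni->healthv = 0;
        router_ni->in_recovery = true;
        return false;
}
```

In this model the router itself never misbehaves; only the `client_alive` flag changes, yet the router's peer NI is the one that loses health.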

      /*
       * A peer NI is alive if it satisfies the following two conditions:
       *  1. peer NI health >= LNET_MAX_HEALTH_VALUE * router_sensitivity_percentage
       *  2. the cached NI status received when we discover the peer is UP
       */
      static inline bool
      lnet_is_peer_ni_alive(struct lnet_peer_ni *lpni)
      {
              bool halive = false;
      
              halive = (atomic_read(&lpni->lpni_healthv) >=
                       (LNET_MAX_HEALTH_VALUE * router_sensitivity_percentage / 100));
      
              return halive && lpni->lpni_ns_status == LNET_NI_STATUS_UP;
      }
      
      static struct lnet_route *
      lnet_find_route_locked(struct lnet_remotenet *rnet, __u32 src_net,
                             struct lnet_peer_ni *remote_lpni,
                             struct lnet_route **prev_route,
                             struct lnet_peer_ni **gwni)
      {
      ...
              list_for_each_entry(route, &rnet->lrn_routes, lr_list) {
                      if (!lnet_is_route_alive(route))
                              continue;
      

      If a route is not considered "alive", it is not used for any sends. If no routes are "alive", the send fails, e.g.:

      [5236447.659951] LNetError: 1850029:0:(lib-move.c:2341:lnet_handle_find_routed_path()) no route to 10.112.48.209@tcp5700 from 172.22.12.164@o2ib21

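      If a router NI's health is decremented below the threshold computed in lnet_is_peer_ni_alive() (with router_sensitivity_percentage at 100, any decrement is enough), then it is considered dead/down. If all NIs belonging to a router are dead/down then the route is dead/down.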
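
To make the threshold concrete, the following standalone sketch reproduces the health comparison from lnet_is_peer_ni_alive() with plain integers. The maximum health of 1000 is an assumed value for illustration, and the `model_` names are hypothetical; only the comparison itself comes from the code quoted above.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_MAX_HEALTH 1000   /* assumed value, for illustration only */

/* Same comparison as the health half of lnet_is_peer_ni_alive(). */
static bool
model_health_alive(int healthv, int sensitivity_pct)
{
        assert(sensitivity_pct >= 0 && sensitivity_pct <= 100);
        return healthv >= MODEL_MAX_HEALTH * sensitivity_pct / 100;
}

/* A router is usable only if at least one of its NIs passes the check. */
static bool
model_router_alive(const int *ni_health, int nr_nis, int sensitivity_pct)
{
        for (int i = 0; i < nr_nis; i++)
                if (model_health_alive(ni_health[i], sensitivity_pct))
                        return true;
        return false;
}
```

Under these assumptions, a sensitivity of 100 makes a health value of 999 fail the check, while a sensitivity of 75 keeps an NI alive down to a health value of 750.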

      Thus, crashing clients can cause all of a server's routes to go down. This could further hinder availability of an OSS.

      We should modify the route selection to avoid this issue. One idea is to remove consideration of the peer NI health value from lnet_is_peer_ni_alive(). We could instead use the health value as a selection criterion (i.e. prefer "healthier" routers).
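
One way to picture that idea is sketched below. It is an illustration of the proposal only, not the content of any patch: `struct model_gw_ni` and `model_pick_gateway()` are hypothetical names, and the real selection logic in lnet_find_route_locked() weighs more than these two fields.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of one candidate gateway NI. */
struct model_gw_ni {
        int  healthv;    /* health value; higher is healthier */
        bool status_up;  /* cached NI status is UP */
};

/*
 * Pick the candidate to send through. Status alone decides eligibility;
 * health only ranks the eligible candidates. Returns NULL when no
 * candidate reports UP.
 */
static const struct model_gw_ni *
model_pick_gateway(const struct model_gw_ni *cand, size_t nr)
{
        const struct model_gw_ni *best = NULL;

        assert(cand != NULL || nr == 0);

        for (size_t i = 0; i < nr; i++) {
                if (!cand[i].status_up)
                        continue;
                if (best == NULL || cand[i].healthv > best->healthv)
                        best = &cand[i];
        }
        return best;
}
```

With this shape, a gateway whose health has been driven to zero by crashed clients is still chosen when it is the only one reporting UP, so the send does not fail for lack of a route.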

        Activity

          [LU-18444] LNet health router sensitivity may lead to no routes alive

          "Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/57493
          Subject: LU-18444 lnet: Remove per-peer health sensitivity
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 8d266f74c15687d7a8cb15d52c50676a5bc96a5f

          "Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/57492
          Subject: LU-18444 lnet: Use only NI status for route aliveness
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: fd7cc7013f70fd8bebc7e35400330ceb1a7cbe09

          People

            hornc Chris Horn