Lustre / LU-19827

MR Forwarding not working correctly after LU-18444


Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.18.0

    Description

      The LU-18444 change introduced a regression in the multi-rail forwarding code. MR forwarding allows a router to select a different target interface for a message when the one chosen by the sender is down or unhealthy.

              /* Determine whether to allow MR forwarding for this message.
               * NB: MR forwarding is allowed if the message originator and the
               * destination are both MR capable, and the destination lpni that was
               * originally chosen by the originator is unhealthy or down.
               * We check the MR capability of the destination further below
               */
              mr_forwarding_allowed = false;
              if (final_hop) {
                      struct lnet_peer *src_lp;
                      struct lnet_peer_ni *src_lpni;
      
                      src_lpni = lnet_peerni_by_nid_locked(&msg->msg_hdr.src_nid,
                                                         NULL, cpt);
                      /* We don't fail the send if we hit any errors here. We'll just
                       * try to send it via non-multi-rail criteria
                       */
                      if (!IS_ERR(src_lpni)) {
                              /* Drop ref taken by lnet_nid2peerni_locked() */
                              lnet_peer_ni_decref_locked(src_lpni);
                              src_lp = lpni->lpni_peer_net->lpn_peer;
                              if (lnet_peer_is_multi_rail(src_lp) &&
                                  !lnet_is_peer_ni_alive(lpni)) /* <<< Bug here */
                                      mr_forwarding_allowed = true;
      
                      }
                      CDEBUG(D_NET, "msg %p MR forwarding %s\n", msg,
                             mr_forwarding_allowed ? "allowed" : "not allowed");
              }
      

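      Prior to LU-18444, lnet_is_peer_ni_alive() checked both the down and the unhealthy criteria. It now checks only the down criterion, so the condition above no longer allows MR forwarding when the destination peer NI is up but unhealthy. We need to add an explicit check of the peer NI's health value. There is also a second bug here: src_lp is assigned from lpni (the destination peer NI) rather than from src_lpni, so the multi-rail test is not applied to the message originator's peer. We can fix that at the same time.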
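      The intended decision can be sketched as a small standalone model. This is not the LNet patch: the model_* types, fields, and functions are hypothetical stand-ins for the kernel structures, and treating "unhealthy" as a health value below a maximum is an assumption made for illustration. The sketch only shows the two corrections described above, namely taking the multi-rail flag from the originator's peer and testing both the down and the unhealthy criteria on the destination peer NI.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model, not kernel code. Assumed: health is an integer
 * and a value below MODEL_MAX_HEALTH means "unhealthy".
 */
#define MODEL_MAX_HEALTH 1000

struct model_peer {
	bool multi_rail;		/* peer is MR capable */
};

struct model_peer_ni {
	struct model_peer *peer;	/* owning peer */
	bool alive;			/* "down" criterion */
	int health;			/* "unhealthy" criterion */
};

/* Destination NI is unusable if it is down OR unhealthy. */
static bool model_peer_ni_unusable(const struct model_peer_ni *lpni)
{
	return !lpni->alive || lpni->health < MODEL_MAX_HEALTH;
}

/* MR forwarding is considered only on the final hop, only when the
 * originator's peer (looked up via src_lpni, not the destination lpni)
 * is multi-rail, and only when the originally chosen destination NI is
 * down or unhealthy. A failed source lookup (NULL) is not an error; it
 * simply leaves MR forwarding disabled.
 */
static bool model_mr_forwarding_allowed(bool final_hop,
					const struct model_peer_ni *src_lpni,
					const struct model_peer_ni *dst_lpni)
{
	assert(dst_lpni != NULL);

	if (!final_hop || src_lpni == NULL || src_lpni->peer == NULL)
		return false;

	return src_lpni->peer->multi_rail &&
	       model_peer_ni_unusable(dst_lpni);
}

/* Convenience wrapper that builds one scenario and evaluates it. */
static bool model_check(bool final_hop, bool src_mr,
			bool dst_alive, int dst_health)
{
	struct model_peer src_peer = { .multi_rail = src_mr };
	struct model_peer dst_peer = { .multi_rail = true };
	struct model_peer_ni src_lpni = {
		.peer = &src_peer, .alive = true, .health = MODEL_MAX_HEALTH,
	};
	struct model_peer_ni dst_lpni = {
		.peer = &dst_peer, .alive = dst_alive, .health = dst_health,
	};

	return model_mr_forwarding_allowed(final_hop, &src_lpni, &dst_lpni);
}
```

      In this model, model_check(true, true, true, 500) is the regressed case: the destination NI is up but unhealthy, and MR forwarding should still be allowed. Swapping the source and destination peers in model_mr_forwarding_allowed() would reproduce the second bug, since the multi-rail flag would then come from the wrong peer.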

          People

            hornc Chris Horn