Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-18572

Regression in 2.15.4 backport of b341288179 LU-14668 lnet: Lock primary NID logic

    XMLWordPrintable

Details

    • Bug
    • Resolution: Unresolved
    • Major
    • Lustre 2.15.7
    • None
    • None
    • 3
    • 9223372036854775807

    Description

      In 2.15, lnet_peer_add() takes an lnet_nid_t nid argument, but the backport uses an uninitalized large nid variable in various places. This results in incorrect behavior.

      static int
      lnet_peer_add(lnet_nid_t nid4, unsigned int flags)
      {
              struct lnet_nid nid;
      ...
              lpni = lnet_find_peer_ni_locked(nid4);
              if (lpni) {
                      /* A peer with this NID already exists. */
                      lp = lpni->lpni_peer_net->lpn_peer;
                      lnet_peer_ni_decref_locked(lpni);
                      if (lp->lp_state & LNET_PEER_CONFIGURED) {
                              if (lnet_nid_to_nid4(&lp->lp_primary_nid) != nid4)
                                      rc = -EEXIST;
                              else if ((lp->lp_state ^ flags) & LNET_PEER_MULTI_RAIL)
                                      rc = -EPERM;
                              goto out;
                      } else if (lp->lp_state & LNET_PEER_LOCK_PRIMARY) {
                              if (nid_same(&lp->lp_primary_nid, &nid)) <<<<<<<<<<< 'nid' not initialized
                                      rc = -EEXIST;
                              else
                                      rc = -EALREADY;
                              goto out;
                      } else if (!(flags &
                                 (LNET_PEER_LOCK_PRIMARY | LNET_PEER_CONFIGURED))) {
                              if (nid_same(&lp->lp_primary_nid, &nid)) { <<<<<<<<<<< 'nid' not initialized
                                      rc = -EEXIST;
                                      goto out;
                              }
                      }
                      rc = lnet_peer_del(lp);
                      if (rc)
                              goto out;
              }
      
              /* Create peer, peer_net, and peer_ni. */
              rc = -ENOMEM;
              lnet_nid4_to_nid(nid4, &nid);  <<<<<<<<<<<<< 'nid' initialized here
      

      Three cases to consider w.r.t. impact of this bug:

      • Statically defined peers.
        • Peers that are defined statically will have LNET_PEER_CONFIGURED bit set in lp_state, so they will not hit the buggy code.
      • lock_prim_nid enabled (default).
        • In this case, peers will have LNET_PEER_LOCK_PRIMARY bit set in lp_state. The bug causes lnet_peer_add() to return -EALREADY instead of -EEXIST for some cases.
        • Only LNetAddPeer() cares about this status. When LNetAddPeer() sees -EALREADY it will perform a lookup of "already existing" peer to grab its primary NID. In the EEXIST case, this is just the same peer so it is kind of a no-op. There should not be a problem here.
      • lock_prim_nid disabled.
        • In this case the bug prevents us from detecting when we are trying to add a peer that already exists. This causes us to unnecessarily delete and re-create peers. This should be benign as the DLC code is intended to allow deleting and creating peers at will. lock_prim_nid is enabled by default so we can avoid this case. At the same time, the presence of the bug will limit our ability to provide flexibility in enabling/disabling the feature.

      Attachments

        Issue Links

          Activity

            People

              hornc Chris Horn
              hornc Chris Horn
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated: