Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-9919

unsafe peer access in lnet_select_pathway()

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.11.0
    • None
    • None
    • 3
    • 9223372036854775807

    Description

      Looking at the changes to lnet_select_pathway() in https://review.whamcloud.com/#/c/25789/24/lnet/lnet/lib-move.c the accesses to peer appear to be unsafe. In the CDEBUG() peer may be invalid:

                      lnet_peer_ni_decref_locked(lpni);
                      lnet_net_unlock(cpt);
      
                      CDEBUG(D_NET, "%s pending discovery\n",
                             libcfs_nid2str(peer->lp_primary_nid));
      
                      return LNET_DC_WAIT;
      

      And similarly after the lpni has been released:

              }
              lnet_peer_ni_decref_locked(lpni);
      
              /* If peer is not healthy then can not send anything to it */
              if (!lnet_is_peer_healthy_locked(peer)) {
                      lnet_net_unlock(cpt);
                      return -EHOSTUNREACH;
              }
      

      Attachments

        Activity

          [LU-9919] unsafe peer access in lnet_select_pathway()
          pjones Peter Jones added a comment -

          Landed for 2.11

          pjones Peter Jones added a comment - Landed for 2.11

          Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/28771/
          Subject: LU-9919 lnet: safe access in debug print
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 7c9ffeac1b69f8cead86307d49c8019f4c51821b

          gerrit Gerrit Updater added a comment - Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/28771/ Subject: LU-9919 lnet: safe access in debug print Project: fs/lustre-release Branch: master Current Patch Set: Commit: 7c9ffeac1b69f8cead86307d49c8019f4c51821b
          jhammond John Hammond added a comment -

          There are a few more places in this function where we call lnet_net_unlock(cpt) and then use best_ni or best_lpni in a debug statement. Could you fix those in you change or create a new change?

          jhammond John Hammond added a comment - There are a few more places in this function where we call lnet_net_unlock(cpt) and then use best_ni or best_lpni in a debug statement. Could you fix those in you change or create a new change?

          Amir Shehata (amir.shehata@intel.com) uploaded a new patch: https://review.whamcloud.com/28771
          Subject: LU-9919 lnet: safe access in debug print
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: b721c22413cefd878e046ee5b3faf3a46bc996b0

          gerrit Gerrit Updater added a comment - Amir Shehata (amir.shehata@intel.com) uploaded a new patch: https://review.whamcloud.com/28771 Subject: LU-9919 lnet: safe access in debug print Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: b721c22413cefd878e046ee5b3faf3a46bc996b0
          jhammond John Hammond added a comment -

          Are you sure about using peer in the CDEBUG()?

          jhammond John Hammond added a comment - Are you sure about using peer in the CDEBUG()?

          These accesses are ok. If you found an lpni that means there is already a refcount on it, because it's on the list. When you call lnet_nid2peerni_locked() it adds another refcount. So you'd have 1 refcount for it on the list and another refcount because you found it. That latter is what we don't want to keep. Also because we have the cpt locked, and because we do LOCK_EX whenever we delete or add, then that peer structure tree (peer->peer_net->peer_ni) will not be freed while we have the lock. The only time we should worry, is if we drop the lock and regain it again, then another thread can go in there and remove the peer.

          ashehata Amir Shehata (Inactive) added a comment - These accesses are ok. If you found an lpni that means there is already a refcount on it, because it's on the list. When you call lnet_nid2peerni_locked() it adds another refcount. So you'd have 1 refcount for it on the list and another refcount because you found it. That latter is what we don't want to keep. Also because we have the cpt locked, and because we do LOCK_EX whenever we delete or add, then that peer structure tree (peer->peer_net->peer_ni) will not be freed while we have the lock. The only time we should worry, is if we drop the lock and regain it again, then another thread can go in there and remove the peer.

          People

            ashehata Amir Shehata (Inactive)
            jhammond John Hammond
            Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: