[LU-9919] unsafe peer access in lnet_select_pathway() Created: 25/Aug/17  Updated: 06/Feb/18  Resolved: 06/Feb/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.11.0

Type: Bug Priority: Minor
Reporter: John Hammond Assignee: Amir Shehata (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Looking at the changes to lnet_select_pathway() in https://review.whamcloud.com/#/c/25789/24/lnet/lnet/lib-move.c the accesses to peer appear to be unsafe. In the CDEBUG() peer may be invalid:

                lnet_peer_ni_decref_locked(lpni);
                lnet_net_unlock(cpt);

                CDEBUG(D_NET, "%s pending discovery\n",
                       libcfs_nid2str(peer->lp_primary_nid));

                return LNET_DC_WAIT;

And similarly after the lpni has been released:

        }
        lnet_peer_ni_decref_locked(lpni);

        /* If peer is not healthy then can not send anything to it */
        if (!lnet_is_peer_healthy_locked(peer)) {
                lnet_net_unlock(cpt);
                return -EHOSTUNREACH;
        }


 Comments   
Comment by Amir Shehata (Inactive) [ 25/Aug/17 ]

These accesses are ok. If you found an lpni that means there is already a refcount on it, because it's on the list. When you call lnet_nid2peerni_locked() it adds another refcount. So you'd have 1 refcount for it on the list and another refcount because you found it. That latter is what we don't want to keep. Also because we have the cpt locked, and because we do LOCK_EX whenever we delete or add, then that peer structure tree (peer->peer_net->peer_ni) will not be freed while we have the lock. The only time we should worry, is if we drop the lock and regain it again, then another thread can go in there and remove the peer.

Comment by John Hammond [ 28/Aug/17 ]

Are you sure about using peer in the CDEBUG()?

Comment by Gerrit Updater [ 28/Aug/17 ]

Amir Shehata (amir.shehata@intel.com) uploaded a new patch: https://review.whamcloud.com/28771
Subject: LU-9919 lnet: safe access in debug print
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: b721c22413cefd878e046ee5b3faf3a46bc996b0

Comment by John Hammond [ 30/Aug/17 ]

There are a few more places in this function where we call lnet_net_unlock(cpt) and then use best_ni or best_lpni in a debug statement. Could you fix those in you change or create a new change?

Comment by Gerrit Updater [ 06/Feb/18 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/28771/
Subject: LU-9919 lnet: safe access in debug print
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 7c9ffeac1b69f8cead86307d49c8019f4c51821b

Comment by Peter Jones [ 06/Feb/18 ]

Landed for 2.11

Generated at Sat Feb 10 02:30:28 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.