[LU-9919] unsafe peer access in lnet_select_pathway() Created: 25/Aug/17 Updated: 06/Feb/18 Resolved: 06/Feb/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.11.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | John Hammond | Assignee: | Amir Shehata (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Looking at the changes to lnet_select_pathway() in https://review.whamcloud.com/#/c/25789/24/lnet/lnet/lib-move.c the accesses to peer appear to be unsafe. In the CDEBUG() peer may be invalid: lnet_peer_ni_decref_locked(lpni);
lnet_net_unlock(cpt);
CDEBUG(D_NET, "%s pending discovery\n",
libcfs_nid2str(peer->lp_primary_nid));
return LNET_DC_WAIT;
And similarly after the lpni has been released: }
lnet_peer_ni_decref_locked(lpni);
/* If peer is not healthy then can not send anything to it */
if (!lnet_is_peer_healthy_locked(peer)) {
lnet_net_unlock(cpt);
return -EHOSTUNREACH;
}
|
| Comments |
| Comment by Amir Shehata (Inactive) [ 25/Aug/17 ] |
|
These accesses are ok. If you found an lpni that means there is already a refcount on it, because it's on the list. When you call lnet_nid2peerni_locked() it adds another refcount. So you'd have 1 refcount for it on the list and another refcount because you found it. That latter is what we don't want to keep. Also because we have the cpt locked, and because we do LOCK_EX whenever we delete or add, then that peer structure tree (peer->peer_net->peer_ni) will not be freed while we have the lock. The only time we should worry, is if we drop the lock and regain it again, then another thread can go in there and remove the peer. |
| Comment by John Hammond [ 28/Aug/17 ] |
|
Are you sure about using peer in the CDEBUG()? |
| Comment by Gerrit Updater [ 28/Aug/17 ] |
|
Amir Shehata (amir.shehata@intel.com) uploaded a new patch: https://review.whamcloud.com/28771 |
| Comment by John Hammond [ 30/Aug/17 ] |
|
There are a few more places in this function where we call lnet_net_unlock(cpt) and then use best_ni or best_lpni in a debug statement. Could you fix those in you change or create a new change? |
| Comment by Gerrit Updater [ 06/Feb/18 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/28771/ |
| Comment by Peter Jones [ 06/Feb/18 ] |
|
Landed for 2.11 |