Details
Bug
Resolution: Fixed
Minor
Description
The LU-18444 change introduced a regression in the multi-rail forwarding code. MR forwarding allows a router to select a different target interface for a message when the one chosen by the sender is down or unhealthy.
/* Determine whether to allow MR forwarding for this message.
 * NB: MR forwarding is allowed if the message originator and the
 * destination are both MR capable, and the destination lpni that was
 * originally chosen by the originator is unhealthy or down.
 * We check the MR capability of the destination further below
 */
mr_forwarding_allowed = false;
if (final_hop) {
	struct lnet_peer *src_lp;
	struct lnet_peer_ni *src_lpni;

	src_lpni = lnet_peerni_by_nid_locked(&msg->msg_hdr.src_nid,
					     NULL, cpt);
	/* We don't fail the send if we hit any errors here. We'll just
	 * try to send it via non-multi-rail criteria
	 */
	if (!IS_ERR(src_lpni)) {
		/* Drop ref taken by lnet_nid2peerni_locked() */
		lnet_peer_ni_decref_locked(src_lpni);
		src_lp = lpni->lpni_peer_net->lpn_peer;
		if (lnet_peer_is_multi_rail(src_lp) &&
		    !lnet_is_peer_ni_alive(lpni)) <<< Bug here
			mr_forwarding_allowed = true;
	}
	CDEBUG(D_NET, "msg %p MR forwarding %s\n", msg,
	       mr_forwarding_allowed ? "allowed" : "not allowed");
}
Prior to LU-18444, the call to lnet_is_peer_ni_alive() checked both the down and unhealthy criteria; it now checks only the down criterion. We need to add a check for the peer NI's health value. There is also a second bug here: src_lp is assigned the wrong value, because it is derived from lpni (the destination peer NI) rather than from src_lpni, so the multi-rail test is not applied to the message originator. We can fix that at the same time.
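To make the intended decision concrete, here is a minimal standalone sketch of the corrected logic. It is a model, not the LNet patch: every name in it (model_peer, model_peer_ni, MODEL_MAX_HEALTH, model_mr_forwarding_allowed) is hypothetical, and it assumes that "healthy" means the health value is at its maximum. It shows only the two changes described above: the originator's peer is taken from the source NI, and the destination NI is tested for both the down and the unhealthy condition.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the maximum peer NI health value. */
#define MODEL_MAX_HEALTH 1000

/* Hypothetical, simplified peer: only the multi-rail capability. */
struct model_peer {
	bool multi_rail;
};

/* Hypothetical, simplified peer NI. */
struct model_peer_ni {
	struct model_peer *peer;	/* peer that owns this NI */
	bool alive;			/* false when the NI is down */
	int healthv;			/* below MODEL_MAX_HEALTH when unhealthy */
};

/* "Down" check only, mirroring the post-LU-18444 behaviour described above. */
static bool model_ni_alive(const struct model_peer_ni *ni)
{
	return ni->alive;
}

/* Assumption: an NI is healthy only at the maximum health value. */
static bool model_ni_healthy(const struct model_peer_ni *ni)
{
	return ni->healthv == MODEL_MAX_HEALTH;
}

/*
 * Allow MR forwarding when the originator is multi-rail capable and the
 * destination NI chosen by the originator is down or unhealthy.
 * src_ni may be NULL when the originator lookup failed; in that case the
 * message falls back to non-multi-rail handling.
 */
static bool model_mr_forwarding_allowed(const struct model_peer_ni *src_ni,
					const struct model_peer_ni *dst_ni)
{
	const struct model_peer *src_peer;

	if (src_ni == NULL)
		return false;

	/* Second fix: derive the peer from the source NI, not the destination. */
	src_peer = src_ni->peer;

	/* First fix: test the health value in addition to the alive state. */
	return src_peer->multi_rail &&
	       (!model_ni_alive(dst_ni) || !model_ni_healthy(dst_ni));
}
```

In this model a destination NI that is up but has a reduced health value permits forwarding, which is the case the regression missed. A non-multi-rail originator never permits it, whatever the capability of the destination's peer.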