Details
-
Bug
-
Resolution: Fixed
-
Minor
-
None
-
None
-
3
-
9223372036854775807
Description
there is a race condition with checking, setting and clearing the kp_hello_pending flag that can result in multiple hello requests being sent for the same peer.
The issue can occur when there is a hello request outstanding, and multiple threads sending when the hello request deadline is exceeded.
The sendings threads each call kfilnd_peer_needs_hello(). This function sees deadline is expired and clears the kp_hello_pending flag:
static inline bool kfilnd_peer_needs_hello(struct kfilnd_peer *kp,
bool proactive_handshake)
{
...
} else if (ktime_before(kp->kp_hello_ts + lnet_get_lnd_timeout(),
ktime_get_seconds())) {
/* Sent hello but never received reply */
CDEBUG(D_NET,
"No response from %s(%p):0x%llx after %lld\n",
libcfs_nid2str(kp->kp_nid), kp, kp->kp_addr,
ktime_sub(ktime_get_seconds(), kp->kp_hello_ts));
kfilnd_peer_clear_hello_pending(kp);
The sending thread then issues a new hello request which sets the hello_pending flag, but the kp_hello_ts is not updated right away:
int kfilnd_send_hello_request(struct kfilnd_dev *dev, int cpt,
struct kfilnd_peer *kp)
{
struct kfilnd_transaction *tn;
int rc;
if (kfilnd_peer_set_check_hello_pending(kp)) { <<<<<<<<<<<<<<<<<<<<<<<<< kp_hello_pending set
CDEBUG(D_NET, "Hello already pending to peer %s(%px)\n",
libcfs_nid2str(kp->kp_nid), kp);
return 0;
}
tn = kfilnd_tn_alloc_for_peer(dev, cpt, kp, true, true, false);
if (IS_ERR(tn)) {
rc = PTR_ERR(tn);
CERROR("Failed to allocate transaction struct: rc=%d\n", rc);
kfilnd_peer_clear_hello_pending(kp);
return rc;
}
/* +1 for tn->tn_kp. This ref is dropped when this transaction is
* finalized
*/
refcount_inc(&kp->kp_cnt);
tn->msg_type = KFILND_MSG_HELLO_REQ;
kp->kp_hello_ts = ktime_get_seconds(); <<<<<<<<<<<<<<<<<<<<<<<<< kp_hello_ts updated
kfilnd_tn_event_handler(tn, TN_EVENT_TX_HELLO, 0);
return 0;
}
In the window between kp_hello_pending getting set and kp_hello_ts being updated, another sending thread can call kfilnd_peer_needs_hello() and do the same thing.
# Hello sent 00000800:00000200:22.0:1699556200.100120:0:17048:0:(kfilnd_tn.c:649:kfilnd_tn_state_idle()) KFILND_MSG_HELLO_REQ Transact ion ID 0000000038261ad5: 1@kfi:5 -> 1@kfi(000000009029990e):0x0 TN_EVENT_TX_HELLO event status 0 # Three threads all see hello expired and clear hello pending: 00000800:00000200:23.0:1699556412.994968:0:7169:0:(kfilnd.h:308:kfilnd_peer_needs_hello()) No response from 1@kfi(000000009029990e):0x2 after 213 00000800:00000200:22.0:1699556412.994968:0:5606:0:(kfilnd.h:308:kfilnd_peer_needs_hello()) No response from 1@kfi(000000009029990e):0x2 after 213 00000800:00000200:21.0:1699556412.994968:0:15493:0:(kfilnd.h:308:kfilnd_peer_needs_hello()) No response from 1@kfi(000000009029990e):0x2 after 213 # Each thread sends a hello: 00000800:00000200:23.0:1699556412.994971:0:7169:0:(kfilnd_tn.c:649:kfilnd_tn_state_idle()) KFILND_MSG_HELLO_REQ Transaction ID 000000001b6bcb14: 1@kfi:5 -> 1@kfi(000000009029990e):0x0 TN_EVENT_TX_HELLO event status 0 00000800:00000200:22.0:1699556412.994971:0:5606:0:(kfilnd_tn.c:649:kfilnd_tn_state_idle()) KFILND_MSG_HELLO_REQ Transaction ID 00000000bab5bd32: 1@kfi:5 -> 1@kfi(000000009029990e):0x0 TN_EVENT_TX_HELLO event status 0 00000800:00000200:21.0:1699556412.994971:0:15493:0:(kfilnd_tn.c:649:kfilnd_tn_state_idle()) KFILND_MSG_HELLO_REQ Transaction ID 00000000e19578f9: 1@kfi:5 -> 1@kfi(000000009029990e):0x0 TN_EVENT_TX_HELLO event status 0