Details
-
Bug
-
Resolution: Fixed
-
Minor
-
None
-
None
-
3
-
9223372036854775807
Description
there is a race condition with checking, setting and clearing the kp_hello_pending flag that can result in multiple hello requests being sent for the same peer.
The issue can occur when there is a hello request outstanding, and multiple threads sending when the hello request deadline is exceeded.
The sendings threads each call kfilnd_peer_needs_hello(). This function sees deadline is expired and clears the kp_hello_pending flag:
static inline bool kfilnd_peer_needs_hello(struct kfilnd_peer *kp, bool proactive_handshake) { ... } else if (ktime_before(kp->kp_hello_ts + lnet_get_lnd_timeout(), ktime_get_seconds())) { /* Sent hello but never received reply */ CDEBUG(D_NET, "No response from %s(%p):0x%llx after %lld\n", libcfs_nid2str(kp->kp_nid), kp, kp->kp_addr, ktime_sub(ktime_get_seconds(), kp->kp_hello_ts)); kfilnd_peer_clear_hello_pending(kp);
The sending thread then issues a new hello request which sets the hello_pending flag, but the kp_hello_ts is not updated right away:
int kfilnd_send_hello_request(struct kfilnd_dev *dev, int cpt, struct kfilnd_peer *kp) { struct kfilnd_transaction *tn; int rc; if (kfilnd_peer_set_check_hello_pending(kp)) { <<<<<<<<<<<<<<<<<<<<<<<<< kp_hello_pending set CDEBUG(D_NET, "Hello already pending to peer %s(%px)\n", libcfs_nid2str(kp->kp_nid), kp); return 0; } tn = kfilnd_tn_alloc_for_peer(dev, cpt, kp, true, true, false); if (IS_ERR(tn)) { rc = PTR_ERR(tn); CERROR("Failed to allocate transaction struct: rc=%d\n", rc); kfilnd_peer_clear_hello_pending(kp); return rc; } /* +1 for tn->tn_kp. This ref is dropped when this transaction is * finalized */ refcount_inc(&kp->kp_cnt); tn->msg_type = KFILND_MSG_HELLO_REQ; kp->kp_hello_ts = ktime_get_seconds(); <<<<<<<<<<<<<<<<<<<<<<<<< kp_hello_ts updated kfilnd_tn_event_handler(tn, TN_EVENT_TX_HELLO, 0); return 0; }
In the window between kp_hello_pending getting set and kp_hello_ts being updated, another sending thread can call kfilnd_peer_needs_hello() and do the same thing.
# Hello sent 00000800:00000200:22.0:1699556200.100120:0:17048:0:(kfilnd_tn.c:649:kfilnd_tn_state_idle()) KFILND_MSG_HELLO_REQ Transact ion ID 0000000038261ad5: 1@kfi:5 -> 1@kfi(000000009029990e):0x0 TN_EVENT_TX_HELLO event status 0 # Three threads all see hello expired and clear hello pending: 00000800:00000200:23.0:1699556412.994968:0:7169:0:(kfilnd.h:308:kfilnd_peer_needs_hello()) No response from 1@kfi(000000009029990e):0x2 after 213 00000800:00000200:22.0:1699556412.994968:0:5606:0:(kfilnd.h:308:kfilnd_peer_needs_hello()) No response from 1@kfi(000000009029990e):0x2 after 213 00000800:00000200:21.0:1699556412.994968:0:15493:0:(kfilnd.h:308:kfilnd_peer_needs_hello()) No response from 1@kfi(000000009029990e):0x2 after 213 # Each thread sends a hello: 00000800:00000200:23.0:1699556412.994971:0:7169:0:(kfilnd_tn.c:649:kfilnd_tn_state_idle()) KFILND_MSG_HELLO_REQ Transaction ID 000000001b6bcb14: 1@kfi:5 -> 1@kfi(000000009029990e):0x0 TN_EVENT_TX_HELLO event status 0 00000800:00000200:22.0:1699556412.994971:0:5606:0:(kfilnd_tn.c:649:kfilnd_tn_state_idle()) KFILND_MSG_HELLO_REQ Transaction ID 00000000bab5bd32: 1@kfi:5 -> 1@kfi(000000009029990e):0x0 TN_EVENT_TX_HELLO event status 0 00000800:00000200:21.0:1699556412.994971:0:15493:0:(kfilnd_tn.c:649:kfilnd_tn_state_idle()) KFILND_MSG_HELLO_REQ Transaction ID 00000000e19578f9: 1@kfi:5 -> 1@kfi(000000009029990e):0x0 TN_EVENT_TX_HELLO event status 0