Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-17838

kfilnd may issue multiple, redundant hello requests

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.16.0
    • None
    • None
    • 3
    • 9223372036854775807

    Description

      there is a race condition with checking, setting and clearing the kp_hello_pending flag that can result in multiple hello requests being sent for the same peer.

      The issue can occur when there is a hello request outstanding, and multiple threads sending when the hello request deadline is exceeded.

      The sendings threads each call kfilnd_peer_needs_hello(). This function sees deadline is expired and clears the kp_hello_pending flag:

      static inline bool kfilnd_peer_needs_hello(struct kfilnd_peer *kp,
                                                 bool proactive_handshake)
      {
      ...
              } else if (ktime_before(kp->kp_hello_ts + lnet_get_lnd_timeout(),
                                      ktime_get_seconds())) {
                      /* Sent hello but never received reply */
                      CDEBUG(D_NET,
                             "No response from %s(%p):0x%llx after %lld\n",
                             libcfs_nid2str(kp->kp_nid), kp, kp->kp_addr,
                             ktime_sub(ktime_get_seconds(), kp->kp_hello_ts));
      
                      kfilnd_peer_clear_hello_pending(kp);
      

      The sending thread then issues a new hello request which sets the hello_pending flag, but the kp_hello_ts is not updated right away:

      int kfilnd_send_hello_request(struct kfilnd_dev *dev, int cpt,
                                    struct kfilnd_peer *kp)
      {
              struct kfilnd_transaction *tn;
              int rc;
      
              if (kfilnd_peer_set_check_hello_pending(kp)) { <<<<<<<<<<<<<<<<<<<<<<<<< kp_hello_pending set
                      CDEBUG(D_NET, "Hello already pending to peer %s(%px)\n",
                             libcfs_nid2str(kp->kp_nid), kp);
                      return 0;
              }
      
              tn = kfilnd_tn_alloc_for_peer(dev, cpt, kp, true, true, false);
              if (IS_ERR(tn)) {
                      rc = PTR_ERR(tn);
                      CERROR("Failed to allocate transaction struct: rc=%d\n", rc);
                      kfilnd_peer_clear_hello_pending(kp);
                      return rc;
              }
      
              /* +1 for tn->tn_kp. This ref is dropped when this transaction is
               * finalized
               */
              refcount_inc(&kp->kp_cnt);
      
              tn->msg_type = KFILND_MSG_HELLO_REQ;
      
              kp->kp_hello_ts = ktime_get_seconds(); <<<<<<<<<<<<<<<<<<<<<<<<< kp_hello_ts updated
      
              kfilnd_tn_event_handler(tn, TN_EVENT_TX_HELLO, 0);
      
              return 0;
      }
      

      In the window between kp_hello_pending getting set and kp_hello_ts being updated, another sending thread can call kfilnd_peer_needs_hello() and do the same thing.

      # Hello sent
      00000800:00000200:22.0:1699556200.100120:0:17048:0:(kfilnd_tn.c:649:kfilnd_tn_state_idle()) KFILND_MSG_HELLO_REQ Transact
      ion ID 0000000038261ad5: 1@kfi:5 -> 1@kfi(000000009029990e):0x0 TN_EVENT_TX_HELLO event status 0
      
      # Three threads all see hello expired and clear hello pending:
      00000800:00000200:23.0:1699556412.994968:0:7169:0:(kfilnd.h:308:kfilnd_peer_needs_hello()) No response from 1@kfi(000000009029990e):0x2 after 213
      00000800:00000200:22.0:1699556412.994968:0:5606:0:(kfilnd.h:308:kfilnd_peer_needs_hello()) No response from 1@kfi(000000009029990e):0x2 after 213
      00000800:00000200:21.0:1699556412.994968:0:15493:0:(kfilnd.h:308:kfilnd_peer_needs_hello()) No response from 1@kfi(000000009029990e):0x2 after 213
      
      # Each thread sends a hello:
      00000800:00000200:23.0:1699556412.994971:0:7169:0:(kfilnd_tn.c:649:kfilnd_tn_state_idle()) KFILND_MSG_HELLO_REQ Transaction ID 000000001b6bcb14: 1@kfi:5 -> 1@kfi(000000009029990e):0x0 TN_EVENT_TX_HELLO event status 0
      00000800:00000200:22.0:1699556412.994971:0:5606:0:(kfilnd_tn.c:649:kfilnd_tn_state_idle()) KFILND_MSG_HELLO_REQ Transaction ID 00000000bab5bd32: 1@kfi:5 -> 1@kfi(000000009029990e):0x0 TN_EVENT_TX_HELLO event status 0
      00000800:00000200:21.0:1699556412.994971:0:15493:0:(kfilnd_tn.c:649:kfilnd_tn_state_idle()) KFILND_MSG_HELLO_REQ Transaction ID 00000000e19578f9: 1@kfi:5 -> 1@kfi(000000009029990e):0x0 TN_EVENT_TX_HELLO event status 0
      

      Attachments

        Activity

          People

            hornc Chris Horn
            hornc Chris Horn
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: