  1. Lustre
  2. LU-16149

LNet Discovery queue and deletion race

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.16.0, Lustre 2.15.3
    • Lustre 2.16.0, Lustre 2.15.2
    • None
    • 3

    Description

      lnet_peer_deletion() can race with another thread calling
      lnet_peer_queue_for_discovery().

      Discovery thread:

      • Calls lnet_peer_deletion()
      • Clears the LNET_PEER_DISCOVERING bit from lnet_peer::lp_state
      • Releases lnet_peer::lp_lock

      Another thread:

      • Acquires lnet_net_lock/EX
      • Calls lnet_peer_queue_for_discovery()
      • Takes lnet_peer::lp_lock
      • Sets LNET_PEER_DISCOVERING bit
      • Releases lnet_peer::lp_lock
      • Sees that lnet_peer::lp_dc_list is not empty (the peer is still on
        the ln_dc_working list), so it does not add the peer to the
        discovery request queue
      • lnet_peer_queue_for_discovery() returns, and lnet_net_lock/EX is
        released

      Discovery thread:

      • Acquires lnet_net_lock/EX
      • Deletes peer from ln_dc_working list
      • Performs the peer deletion

      At this point, the peer is not on any discovery list, yet it still
      has the LNET_PEER_DISCOVERING bit set. This peer is now stranded, and
      any messages on the peer's lnet_peer::lp_dc_pendq are likewise
      stranded.
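
The interleaving above can be replayed deterministically in a small single-threaded model. This is only a sketch, not LNet code: struct race_peer, RACE_PEER_DISCOVERING, and the race_* functions are hypothetical stand-ins, the on_dc_list flag approximates "lp_dc_list is not empty", and all locking is omitted because the three phases simply run in the problematic order.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the LNET_PEER_DISCOVERING state bit. */
#define RACE_PEER_DISCOVERING 0x1u

/* Hypothetical, heavily simplified stand-in for struct lnet_peer. */
struct race_peer {
	unsigned int state; /* models lnet_peer::lp_state */
	bool on_dc_list;    /* true while lp_dc_list is linked on a discovery list */
};

/*
 * Models the behaviour described for lnet_peer_queue_for_discovery():
 * set the DISCOVERING bit, then queue the peer only if it is not
 * already on a discovery list. Returns true if the peer was queued.
 */
static bool race_queue_for_discovery(struct race_peer *lp)
{
	lp->state |= RACE_PEER_DISCOVERING;
	if (lp->on_dc_list)
		return false;
	lp->on_dc_list = true;
	return true;
}

/* Replays the three phases in order; returns true if the peer is stranded. */
static bool race_replay(void)
{
	/* The peer starts out being discovered and on the working list. */
	struct race_peer lp = { RACE_PEER_DISCOVERING, true };
	bool queued;

	/* Discovery thread: deletion clears the bit, then drops lp_lock. */
	lp.state &= ~RACE_PEER_DISCOVERING;

	/* Another thread: sets the bit again but does not queue the peer. */
	queued = race_queue_for_discovery(&lp);
	assert(!queued);
	(void)queued;

	/* Discovery thread: removes the peer from the working list. */
	lp.on_dc_list = false;

	/* Stranded: DISCOVERING is set, but the peer is on no list. */
	return (lp.state & RACE_PEER_DISCOVERING) != 0 && !lp.on_dc_list;
}
```

In this model race_replay() returns true, which corresponds to the stranded state described above.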

      To solve this, modify lnet_peer_deletion() so that it waits to clear
      the LNET_PEER_DISCOVERING bit until it has completed deleting the
      peer and re-acquired the lnet_peer::lp_lock. This ensures we cannot
      race with any other thread that may add the LNET_PEER_DISCOVERING bit
      back to the peer.

      Furthermore, do not bother deleting the peer from the ln_dc_working
      list in lnet_peer_deletion(). This will be done by
      lnet_peer_discovery_complete().
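
The effect of the proposed ordering can be illustrated with the same kind of single-threaded model. Again this is a sketch under assumptions, not the actual patch: struct fix_peer, FIX_PEER_DISCOVERING, and the fix_* functions are hypothetical, locking is omitted, and the final step only approximates what the description attributes to lnet_peer_discovery_complete().

```c
#include <stdbool.h>

/* Hypothetical stand-in for the LNET_PEER_DISCOVERING state bit. */
#define FIX_PEER_DISCOVERING 0x1u

/* Hypothetical, heavily simplified stand-in for struct lnet_peer. */
struct fix_peer {
	unsigned int state; /* models lnet_peer::lp_state */
	bool on_dc_list;    /* true while lp_dc_list is linked on a discovery list */
};

/* Sets the bit, and queues the peer only if it is on no discovery list. */
static bool fix_queue_for_discovery(struct fix_peer *lp)
{
	lp->state |= FIX_PEER_DISCOVERING;
	if (lp->on_dc_list)
		return false;
	lp->on_dc_list = true;
	return true;
}

/* Runs the same interleaving, but with the proposed ordering. */
static void fix_run(struct fix_peer *lp)
{
	/* Discovery thread enters deletion; the bit stays set for now. */

	/* Another thread: the bit is already set, and nothing is queued. */
	(void)fix_queue_for_discovery(lp);

	/*
	 * Discovery thread: the deletion has completed and lp_lock has
	 * been re-acquired, so only now is the bit cleared.
	 */
	lp->state &= ~FIX_PEER_DISCOVERING;

	/* Models lnet_peer_discovery_complete() unlinking the peer. */
	lp->on_dc_list = false;
}

/* Returns true if the peer ends up stranded (bit set, on no list). */
static bool fix_replay_stranded(void)
{
	struct fix_peer lp = { FIX_PEER_DISCOVERING, true };

	fix_run(&lp);
	return (lp.state & FIX_PEER_DISCOVERING) != 0 && !lp.on_dc_list;
}

/* Returns true if a later request can queue the peer again. */
static bool fix_requeue_after_deletion(void)
{
	struct fix_peer lp = { FIX_PEER_DISCOVERING, true };

	fix_run(&lp);
	return fix_queue_for_discovery(&lp);
}
```

In this model the peer finishes with the bit clear and on no list, so fix_replay_stranded() returns false and a later queue request succeeds.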

          People

            hornc Chris Horn