LNet Discovery queue and deletion race

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.16.0, Lustre 2.15.3
    • Affects Version/s: Lustre 2.16.0, Lustre 2.15.2
    • Labels: None
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      lnet_peer_deletion() can race with another thread calling
      lnet_peer_queue_for_discovery(). The interleaving is as follows.

      Discovery thread:

      • Calls lnet_peer_deletion()
      • Clears the LNET_PEER_DISCOVERING bit from lnet_peer::lp_state
      • Releases lnet_peer::lp_lock

      Another thread:

      • Acquires lnet_net_lock/EX
      • Calls lnet_peer_queue_for_discovery()
      • Takes lnet_peer::lp_lock
      • Sets LNET_PEER_DISCOVERING bit
      • Releases lnet_peer::lp_lock
      • Sees lnet_peer::lp_dc_list is not empty, so it does not add peer
        to dc request queue
      • lnet_peer_queue_for_discovery() returns, and lnet_net_lock/EX is
        released

      Discovery thread:

      • Acquires lnet_net_lock/EX
      • Deletes peer from ln_dc_working list
      • Performs the peer deletion

      At this point, the peer is not on any discovery list, yet it has
      the LNET_PEER_DISCOVERING bit set. The peer is now stranded, and any
      messages on the peer's lnet_peer::lp_dc_pendq are likewise stranded.
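
      The interleaving above can be replayed deterministically on a
      single thread. The following is a minimal sketch, not the LNet
      source: struct model_peer, MODEL_PEER_DISCOVERING and every
      model_* function are hypothetical stand-ins, a boolean replaces
      the lp_dc_list linkage, and locking is omitted because the steps
      run in a fixed order.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for LNET_PEER_DISCOVERING; the value is arbitrary here. */
#define MODEL_PEER_DISCOVERING 0x1u

/* Hypothetical, heavily simplified stand-in for struct lnet_peer. */
struct model_peer {
	unsigned int state;  /* models lnet_peer::lp_state */
	bool on_dc_list;     /* true while lp_dc_list is linked on a list */
	int pendq_msgs;      /* models messages parked on lp_dc_pendq */
};

/*
 * Models lnet_peer_queue_for_discovery() as described above: set the
 * bit, then queue the peer only if it is not already on a list.
 * Returns true if the peer was added to the request queue.
 */
static bool model_queue_for_discovery(struct model_peer *lp)
{
	lp->state |= MODEL_PEER_DISCOVERING;
	if (lp->on_dc_list)
		return false;
	lp->on_dc_list = true;
	return true;
}

/* Stranded: marked as discovering, but no discovery list holds the peer. */
static bool model_peer_is_stranded(const struct model_peer *lp)
{
	return (lp->state & MODEL_PEER_DISCOVERING) && !lp->on_dc_list;
}

/*
 * Replays the three phases from the description in order.
 * Returns true if the peer (and its pending message) ends up stranded.
 */
bool model_replay_race(void)
{
	/* Peer under discovery: bit set, on the working list, one message
	 * waiting for discovery to finish. */
	struct model_peer lp = {
		.state = MODEL_PEER_DISCOVERING,
		.on_dc_list = true,
		.pendq_msgs = 1,
	};
	assert(!model_peer_is_stranded(&lp));

	/* Discovery thread, first half of deletion: clear the bit early. */
	lp.state &= ~MODEL_PEER_DISCOVERING;

	/* Another thread slips in; the peer is still linked, so it is
	 * not added to the request queue, but the bit is set again. */
	bool queued = model_queue_for_discovery(&lp);

	/* Discovery thread, second half: unlink from the working list. */
	lp.on_dc_list = false;

	return !queued && model_peer_is_stranded(&lp) && lp.pendq_msgs == 1;
}
```

      Calling model_replay_race() returns true in this model: the bit is
      set, nothing links the peer to a discovery list, and the pending
      message has no path forward.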

      To solve this, modify lnet_peer_deletion() so that it waits to clear
      the LNET_PEER_DISCOVERING bit until it has completed deleting the
      peer and re-acquired the lnet_peer::lp_lock. This ensures we cannot
      race with any other thread that may add the LNET_PEER_DISCOVERING bit
      back to the peer.

      Furthermore, do not bother deleting the peer from the ln_dc_working
      list in lnet_peer_deletion(). This will be done by
      lnet_peer_discovery_complete().
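
      The reordering proposed above can be checked against the same
      interleaving with a small single-threaded model. This is a sketch
      under stated assumptions, not the merged patch: struct model_peer,
      MODEL_PEER_DISCOVERING and the model_* functions are hypothetical,
      and having the completion step drain the pending messages is an
      assumption of the model rather than something the description
      states.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for LNET_PEER_DISCOVERING; the value is arbitrary here. */
#define MODEL_PEER_DISCOVERING 0x1u

/* Hypothetical, heavily simplified stand-in for struct lnet_peer. */
struct model_peer {
	unsigned int state;  /* models lnet_peer::lp_state */
	bool on_dc_list;     /* true while lp_dc_list is linked on a list */
	int pendq_msgs;      /* models messages parked on lp_dc_pendq */
};

/* Same model of lnet_peer_queue_for_discovery() as in the description:
 * set the bit, queue only if the peer is not already on a list. */
static bool model_queue_for_discovery(struct model_peer *lp)
{
	lp->state |= MODEL_PEER_DISCOVERING;
	if (lp->on_dc_list)
		return false;
	lp->on_dc_list = true;
	return true;
}

/*
 * Models the role given to lnet_peer_discovery_complete(): it, and not
 * the deletion path, unlinks the peer from the working list.
 * Assumption of this sketch: it also releases the pending messages.
 */
static void model_discovery_complete(struct model_peer *lp)
{
	lp->on_dc_list = false;
	lp->pendq_msgs = 0;
}

/*
 * Replays the same interleaving with the proposed ordering.
 * Returns true if the peer ends up in a consistent, idle state.
 */
bool model_replay_fixed(void)
{
	struct model_peer lp = {
		.state = MODEL_PEER_DISCOVERING,
		.on_dc_list = true,
		.pendq_msgs = 1,
	};

	/* Deletion starts: the bit stays set and the peer stays linked. */

	/* Another thread slips in; setting the bit is a no-op and the
	 * peer is still linked, so nothing changes. */
	bool queued = model_queue_for_discovery(&lp);
	assert(lp.state & MODEL_PEER_DISCOVERING);
	assert(lp.on_dc_list);

	/* Deletion has finished and the peer lock is held again:
	 * only now is the bit cleared. */
	lp.state &= ~MODEL_PEER_DISCOVERING;

	/* Completion unlinks the peer from the working list. */
	model_discovery_complete(&lp);

	return !queued &&
	       !(lp.state & MODEL_PEER_DISCOVERING) &&
	       !lp.on_dc_list &&
	       lp.pendq_msgs == 0;
}
```

      model_replay_fixed() returns true here because the bit and the
      list membership never disagree at the point where the second
      thread looks at them. The model replays only the one interleaving
      from the description and does not cover other schedules.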

        Activity

          [LU-16149] LNet Discovery queue and deletion race

          gerrit Gerrit Updater added a comment -

          "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49772/
          Subject: LU-16149 lnet: Discovery queue and deletion race
          Project: fs/lustre-release
          Branch: b2_15
          Current Patch Set:
          Commit: 7caade3078a168c3d39f6318c485490322604ab4

          gerrit Gerrit Updater added a comment -

          "Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49772
          Subject: LU-16149 lnet: Discovery queue and deletion race
          Project: fs/lustre-release
          Branch: b2_15
          Current Patch Set: 1
          Commit: d464cf8747032b92d0d0daa7a9a2153a2b30b6d5

          pjones Peter Jones added a comment -

          Landed for 2.16

          gerrit Gerrit Updater added a comment -

          "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/48532/
          Subject: LU-16149 lnet: Discovery queue and deletion race
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: a966b624ac76e34e8ed28c6980c3f58cb441eeb0

          gerrit Gerrit Updater added a comment -

          "Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/48532
          Subject: LU-16149 lnet: Discovery queue and deletion race
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 27a2481daf1026883f004109bd9a766cf5798161

          People

            Assignee: hornc Chris Horn
            Reporter: hornc Chris Horn