[LU-16149] LNet Discovery queue and deletion race Created: 12/Sep/22  Updated: 23/Feb/23  Resolved: 25/Oct/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.16.0, Lustre 2.15.2
Fix Version/s: Lustre 2.16.0, Lustre 2.15.3

Type: Bug Priority: Major
Reporter: Chris Horn Assignee: Chris Horn
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

lnet_peer_deletion() can race with another thread calling
lnet_peer_queue_for_discovery().

Discovery thread:

  • Calls lnet_peer_deletion()
  • Clears the LNET_PEER_DISCOVERING bit from lnet_peer::lp_state
  • Releases lnet_peer::lp_lock

Another thread:

  • Acquires lnet_net_lock/EX
  • Calls lnet_peer_queue_for_discovery()
  • Takes lnet_peer::lp_lock
  • Sets LNET_PEER_DISCOVERING bit
  • Releases lnet_peer::lp_lock
  • Sees lnet_peer::lp_dc_list is not empty, so it does not add peer
    to dc request queue
  • lnet_peer_queue_for_discovery() returns, lnet_net_lock/EX is released

Discovery thread:

  • Acquires lnet_net_lock/EX
  • Deletes peer from ln_dc_working list
  • Performs the peer deletion

At this point, the peer is not on any discovery list, yet it still has
the LNET_PEER_DISCOVERING bit set. This peer is now stranded, and any
messages on the peer's lnet_peer::lp_dc_pendq are likewise stranded.
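
The interleaving above can be replayed in a small user-space model. This is
only a sketch: struct model_peer, enum dc_list and every model_* function are
hypothetical stand-ins, not the real LNet definitions, and the locks are
omitted because the three steps are simply run in the racing order.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative value only; not taken from the LNet headers. */
#define LNET_PEER_DISCOVERING (1u << 0)

/* Which discovery list the peer is linked on (models lp_dc_list). */
enum dc_list { DC_NONE, DC_REQUEST, DC_WORKING };

struct model_peer {
	unsigned int lp_state;   /* models lnet_peer::lp_state */
	enum dc_list lp_dc_list; /* models membership via lnet_peer::lp_dc_list */
};

/* Discovery thread, first half: the bit is cleared before lp_lock is dropped. */
static void model_deletion_begin(struct model_peer *lp)
{
	lp->lp_state &= ~LNET_PEER_DISCOVERING;
}

/* Other thread: models lnet_peer_queue_for_discovery(). */
static void model_queue_for_discovery(struct model_peer *lp)
{
	lp->lp_state |= LNET_PEER_DISCOVERING;
	/* Only queue the peer if it is not already linked on a list. */
	if (lp->lp_dc_list == DC_NONE)
		lp->lp_dc_list = DC_REQUEST;
}

/* Discovery thread, second half: unlink from the working list and delete. */
static void model_deletion_finish(struct model_peer *lp)
{
	lp->lp_dc_list = DC_NONE;
}

/* Stranded: marked as discovering, but on no discovery list. */
static bool model_peer_is_stranded(const struct model_peer *lp)
{
	return (lp->lp_state & LNET_PEER_DISCOVERING) &&
	       lp->lp_dc_list == DC_NONE;
}

/* Replays the racing order described above; returns true if stranded. */
static bool model_run_racy_interleaving(void)
{
	struct model_peer lp = { LNET_PEER_DISCOVERING, DC_WORKING };

	assert(lp.lp_dc_list == DC_WORKING); /* discovery is in progress */
	model_deletion_begin(&lp);           /* discovery thread */
	model_queue_for_discovery(&lp);      /* another thread */
	model_deletion_finish(&lp);          /* discovery thread */
	return model_peer_is_stranded(&lp);
}
```

In this model, model_run_racy_interleaving() reports the peer as stranded:
the second thread finds the peer still linked on the working list and skips
queueing it, and the discovery thread then unlinks it.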

To solve this, modify lnet_peer_deletion() so that it waits to clear
the LNET_PEER_DISCOVERING bit until it has completed deleting the
peer and re-acquired the lnet_peer::lp_lock. This ensures we cannot
race with any other thread that may add the LNET_PEER_DISCOVERING bit
back to the peer.

Furthermore, do not bother deleting the peer from the ln_dc_working
list in lnet_peer_deletion(). This will be done by
lnet_peer_discovery_complete().
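
A minimal sketch of the proposed ordering, again as a user-space model rather
than the actual patch. struct model_peer, enum dc_list and the model_*
functions are hypothetical; the racer callback stands in for a thread that
runs while lnet_peer::lp_lock is dropped, and calling
model_discovery_complete() right after deletion is an assumption based on the
description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative value only; not taken from the LNet headers. */
#define LNET_PEER_DISCOVERING (1u << 0)

enum dc_list { DC_NONE, DC_REQUEST, DC_WORKING };

struct model_peer {
	unsigned int lp_state;   /* models lnet_peer::lp_state */
	enum dc_list lp_dc_list; /* models membership via lnet_peer::lp_dc_list */
};

/* Other thread: models lnet_peer_queue_for_discovery(). */
static void model_queue_for_discovery(struct model_peer *lp)
{
	lp->lp_state |= LNET_PEER_DISCOVERING;
	if (lp->lp_dc_list == DC_NONE)
		lp->lp_dc_list = DC_REQUEST;
}

/*
 * Deletion with the proposed ordering: the bit stays set for the whole
 * deletion and is cleared only at the end, and the peer is left on the
 * working list.
 */
static void model_deletion_fixed(struct model_peer *lp,
				 void (*racer)(struct model_peer *))
{
	assert(lp->lp_state & LNET_PEER_DISCOVERING);

	/* Window where lp_lock is dropped: a competing thread may run. */
	if (racer != NULL)
		racer(lp);

	/* ... the peer deletion work would happen here ... */

	/* lp_lock re-acquired: only now is the bit cleared. */
	lp->lp_state &= ~LNET_PEER_DISCOVERING;
}

/* Models lnet_peer_discovery_complete() unlinking the peer. */
static void model_discovery_complete(struct model_peer *lp)
{
	lp->lp_dc_list = DC_NONE;
}

/* Replays the same race against the fixed ordering; true if stranded. */
static bool model_run_fixed_interleaving(void)
{
	struct model_peer lp = { LNET_PEER_DISCOVERING, DC_WORKING };

	model_deletion_fixed(&lp, model_queue_for_discovery);
	model_discovery_complete(&lp);
	return (lp.lp_state & LNET_PEER_DISCOVERING) &&
	       lp.lp_dc_list == DC_NONE;
}
```

Here the competing call changes nothing, because the bit is already set and
the peer is still linked on the working list. When the model finishes, the
bit is clear and the peer is off every list, so
model_run_fixed_interleaving() does not report a stranded peer.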



 Comments   
Comment by Gerrit Updater [ 12/Sep/22 ]

"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/48532
Subject: LU-16149 lnet: Discovery queue and deletion race
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 27a2481daf1026883f004109bd9a766cf5798161

Comment by Gerrit Updater [ 25/Oct/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/48532/
Subject: LU-16149 lnet: Discovery queue and deletion race
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: a966b624ac76e34e8ed28c6980c3f58cb441eeb0

Comment by Peter Jones [ 25/Oct/22 ]

Landed for 2.16

Comment by Gerrit Updater [ 25/Jan/23 ]

"Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49772
Subject: LU-16149 lnet: Discovery queue and deletion race
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: d464cf8747032b92d0d0daa7a9a2153a2b30b6d5

Comment by Gerrit Updater [ 23/Feb/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49772/
Subject: LU-16149 lnet: Discovery queue and deletion race
Project: fs/lustre-release
Branch: b2_15
Current Patch Set:
Commit: 7caade3078a168c3d39f6318c485490322604ab4
