Details
Bug
Resolution: Fixed
Major
Lustre 2.16.0, Lustre 2.15.2
None
3
Description
lnet_peer_deletion() can race with another thread calling
lnet_peer_queue_for_discovery().
Discovery thread:
- Calls lnet_peer_deletion():
  - LNET_PEER_DISCOVERING bit is cleared from lnet_peer::lp_state
  - Releases lnet_peer::lp_lock
Another thread:
- Acquires lnet_net_lock/EX
- Calls lnet_peer_queue_for_discovery():
  - Takes lnet_peer::lp_lock
  - Sets LNET_PEER_DISCOVERING bit
  - Releases lnet_peer::lp_lock
  - Sees lnet_peer::lp_dc_list is not empty, so it does not add peer
    to dc request queue
- lnet_peer_queue_for_discovery() returns, lnet_net_lock/EX is released
Discovery thread:
- Acquires lnet_net_lock/EX
- Deletes peer from ln_dc_working list
- Performs the peer deletion
At this point, the peer is not on any discovery list, and it has the
LNET_PEER_DISCOVERING bit set. This peer is now stranded, and any
messages on the peer's lnet_peer::lp_dc_pendq are likewise stranded.
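The interleaving above can be replayed in a small single-threaded model.
This is only a sketch, not LNet code: struct toy_peer, the TOY_ flag, and
every toy_*() function are hypothetical stand-ins, locks are omitted
because the steps run in a fixed order, and a single boolean stands in for
"lp_dc_list is not empty".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for LNET_PEER_DISCOVERING. */
#define TOY_PEER_DISCOVERING 0x1u

/* Hypothetical stand-in for the relevant parts of struct lnet_peer. */
struct toy_peer {
	unsigned int state;	/* models lnet_peer::lp_state */
	bool on_dc_list;	/* models "lp_dc_list is not empty" */
};

/*
 * Models the queueing path: set the bit, then add the peer to the
 * request queue only if it is not already on a discovery list.
 * Returns true if the peer was newly queued.
 */
static bool toy_queue_for_discovery(struct toy_peer *lp)
{
	lp->state |= TOY_PEER_DISCOVERING;
	if (lp->on_dc_list)
		return false;
	lp->on_dc_list = true;
	return true;
}

/* Racy deletion, first half: the bit is cleared before the peer is gone. */
static void toy_deletion_clear_bit_early(struct toy_peer *lp)
{
	lp->state &= ~TOY_PEER_DISCOVERING;
}

/* Racy deletion, second half: the peer leaves the working list. */
static void toy_deletion_remove_from_list(struct toy_peer *lp)
{
	lp->on_dc_list = false;
}

/* A peer is stranded if it is marked as discovering but is on no list. */
static bool toy_peer_is_stranded(const struct toy_peer *lp)
{
	return (lp->state & TOY_PEER_DISCOVERING) && !lp->on_dc_list;
}

/* Replays the three phases in order; returns true if the peer strands. */
static bool toy_replay_race(void)
{
	struct toy_peer lp = { TOY_PEER_DISCOVERING, true };
	bool queued;

	toy_deletion_clear_bit_early(&lp);	/* discovery thread */
	queued = toy_queue_for_discovery(&lp);	/* another thread */
	assert(!queued);			/* peer was still listed */
	toy_deletion_remove_from_list(&lp);	/* discovery thread */

	return toy_peer_is_stranded(&lp);
}
```

In this model toy_replay_race() returns true: the second thread re-sets
the bit but skips the enqueue, and the discovery thread then takes the
peer off the only list it was on.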
To solve this, modify lnet_peer_deletion() so that it waits to clear
the LNET_PEER_DISCOVERING bit until it has completed deleting the
peer and re-acquired the lnet_peer::lp_lock. This ensures we cannot
race with any other thread that may add the LNET_PEER_DISCOVERING bit
back to the peer.
Furthermore, do not bother deleting the peer from the ln_dc_working
list in lnet_peer_deletion(). This will be done by
lnet_peer_discovery_complete().
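The reordering described above can be sketched with a single-threaded toy
model. It is not the actual patch: struct toy_peer, the TOY_ flag, and all
toy_*() functions are hypothetical, locking is reduced to comments, and
toy_discovery_complete() only models the list removal that the description
attributes to lnet_peer_discovery_complete().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for LNET_PEER_DISCOVERING. */
#define TOY_PEER_DISCOVERING 0x1u

/* Hypothetical stand-in for the relevant parts of struct lnet_peer. */
struct toy_peer {
	unsigned int state;	/* models lnet_peer::lp_state */
	bool on_dc_list;	/* models "lp_dc_list is not empty" */
};

/* Models the queueing path; returns true if the peer was newly queued. */
static bool toy_queue_for_discovery(struct toy_peer *lp)
{
	lp->state |= TOY_PEER_DISCOVERING;
	if (lp->on_dc_list)
		return false;
	lp->on_dc_list = true;
	return true;
}

/*
 * Reordered deletion: the bit stays set while the peer is deleted, and
 * the working list is left alone. racer_runs injects the competing
 * queueing call into the window where lp_lock would be dropped.
 */
static void toy_deletion_fixed(struct toy_peer *lp, bool racer_runs)
{
	/* lp_lock would be released here; the bit is still set. */
	if (racer_runs) {
		bool queued = toy_queue_for_discovery(lp);

		assert(!queued);	/* peer is still listed: no-op */
	}
	/* ... peer deletion work would happen here ... */

	/* lp_lock would be re-acquired here; only now clear the bit. */
	lp->state &= ~TOY_PEER_DISCOVERING;
}

/* Models the completion step taking the peer off the working list. */
static void toy_discovery_complete(struct toy_peer *lp)
{
	lp->on_dc_list = false;
}

/* A peer is stranded if it is marked as discovering but is on no list. */
static bool toy_peer_is_stranded(const struct toy_peer *lp)
{
	return (lp->state & TOY_PEER_DISCOVERING) && !lp->on_dc_list;
}

/* Replays the same interleaving against the reordered deletion. */
static bool toy_replay_fixed(void)
{
	struct toy_peer lp = { TOY_PEER_DISCOVERING, true };

	toy_deletion_fixed(&lp, true);
	toy_discovery_complete(&lp);

	return toy_peer_is_stranded(&lp);
}
```

In this model toy_replay_fixed() returns false: the competing call finds
the bit already set and the peer already listed, so it changes nothing,
and the bit is cleared only after the deletion work is done.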