Details
Type: Bug
Resolution: Fixed
Priority: Minor
Description
ian.ziemba@hpe.com found a race between kfilnd peer deletion and RKEY re-use that could result in data corruption.
kfilnd_peer object deletion is a two-step process. First, a call to kfilnd_peer_del() atomically sets a flag in the object (kfilnd_peer::kp_remove_peer = 1) to mark it for removal. Then the next caller of kfilnd_peer_put() atomically updates the flag again (kfilnd_peer::kp_remove_peer = 2) to denote that it is removing the peer from the rhashtable, before it actually removes the object.
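The two-step transition described above can be sketched as a small user-space model. This is only an illustration, not the kfilnd source: the names model_peer, model_peer_del() and model_peer_put(), the enum values, and the in_table flag standing in for rhashtable membership are all assumptions, and C11 atomics stand in for the kernel's atomic operations.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of the kp_remove_peer states; not the kfilnd code. */
enum model_remove_state {
	MODEL_PEER_ACTIVE   = 0, /* peer is live */
	MODEL_PEER_MARKED   = 1, /* marked for removal by the "del" step */
	MODEL_PEER_REMOVING = 2, /* a "put" caller has claimed the removal */
};

struct model_peer {
	atomic_int remove_peer; /* models kfilnd_peer::kp_remove_peer */
	bool in_table;          /* models membership in the rhashtable */
};

/* Step 1: mark the peer for removal. True only for the caller that
 * performed the 0 -> 1 transition. */
static bool model_peer_del(struct model_peer *p)
{
	int expected = MODEL_PEER_ACTIVE;

	return atomic_compare_exchange_strong(&p->remove_peer, &expected,
					      MODEL_PEER_MARKED);
}

/* Step 2: the first put after marking claims the removal (1 -> 2) and
 * then takes the peer out of the table. True only for that caller. */
static bool model_peer_put(struct model_peer *p)
{
	int expected = MODEL_PEER_MARKED;

	if (!atomic_compare_exchange_strong(&p->remove_peer, &expected,
					    MODEL_PEER_REMOVING))
		return false;

	/* Until the next line runs, a lookup can still find this peer. */
	p->in_table = false;
	return true;
}
```

In this model only one caller can win each transition, but nothing stops another thread from looking the peer up while remove_peer is already 1 or 2 and in_table is still true.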
The window between marking a peer for deletion and removing it from the peer cache allows a race where an RKEY may be re-used. For example:
Thread 1: Posts tagged receive with RKEY based on peerA::kp_local_session_key X and tn_mr_key Y
Thread 1: Cancels tagged receive
Thread 1: kfilnd_peer_del() -> peerA::kp_remove_peer = 1
Thread 2: kfilnd_peer_put() -> peerA::kp_remove_peer = 2
Thread 1: kfilnd_peer_put() -> kfilnd_tn_finalize() -> releases tn_mr_key Y
Thread 3: Allocates tn_mr_key Y
Thread 3: Fetches peerA with kp_local_session_key X
Thread 2: Removes peerA from rhashtable
At this point, thread 3 holds the same RKEY (kp_local_session_key X and tn_mr_key Y) that thread 1 used.
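The interleaving above can be replayed deterministically in a single-threaded model to show where the two RKEYs collide. This is a sketch under stated assumptions, not kfilnd code: replay_peer, replay_rkey, the lowest-free-slot key allocator, and the representation of an RKEY as a (session key, MR key) pair are all hypothetical, since the ticket only says the RKEY is "based on" those two values.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define REPLAY_MR_KEYS 4

/* Assumption: an RKEY is modeled as the pair (session key, MR key). */
struct replay_rkey {
	uint32_t session_key;
	uint32_t mr_key;
};

struct replay_peer {
	uint32_t local_session_key; /* models kp_local_session_key */
	int remove_peer;            /* models kp_remove_peer: 0, 1 or 2 */
	bool in_table;              /* models rhashtable membership */
};

static bool replay_mr_key_used[REPLAY_MR_KEYS];

/* Assumption: the allocator hands out the lowest free key, so a key
 * that was just released is re-issued immediately. */
static int replay_mr_key_alloc(void)
{
	for (int i = 0; i < REPLAY_MR_KEYS; i++) {
		if (!replay_mr_key_used[i]) {
			replay_mr_key_used[i] = true;
			return i;
		}
	}
	return -1;
}

static void replay_mr_key_free(int key)
{
	replay_mr_key_used[key] = false;
}

/* A lookup succeeds for as long as the peer is still in the table. */
static struct replay_peer *replay_peer_lookup(struct replay_peer *p)
{
	return p->in_table ? p : NULL;
}

/* Replays the steps from the ticket in order; returns true if thread 3
 * ends up with the same RKEY as thread 1. */
static bool replay_race(void)
{
	struct replay_peer peer_a = { .local_session_key = 0x1234,
				      .remove_peer = 0, .in_table = true };

	/* Thread 1: posts (and later cancels) a tagged receive. */
	int y1 = replay_mr_key_alloc();
	struct replay_rkey t1 = { peer_a.local_session_key, (uint32_t)y1 };

	peer_a.remove_peer = 1;  /* Thread 1: del marks the peer */
	peer_a.remove_peer = 2;  /* Thread 2: put claims the removal */
	replay_mr_key_free(y1);  /* Thread 1: put -> finalize frees Y */

	int y3 = replay_mr_key_alloc();  /* Thread 3: gets Y again */
	struct replay_peer *found = replay_peer_lookup(&peer_a);
	if (found == NULL) {
		replay_mr_key_free(y3);
		return false;
	}
	struct replay_rkey t3 = { found->local_session_key, (uint32_t)y3 };

	peer_a.in_table = false; /* Thread 2: removes peerA, too late */
	replay_mr_key_free(y3);

	return t1.session_key == t3.session_key && t1.mr_key == t3.mr_key;
}
```

In this model the collision disappears if the lookup fails once remove_peer is non-zero, or if the peer leaves the table before the MR key is released. Whether either matches the actual fix is not stated in this ticket.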