Lustre / LU-17840

Race between peer delete and RKEY re-use

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.16.0

    Description

      ian.ziemba@hpe.com found a race between kfilnd peer deletion and RKEY re-use that could result in data corruption.

      kfilnd_peer object deletion is a two-step process. First, a call to kfilnd_peer_del() atomically sets a flag in the object (kfilnd_peer::kp_remove_peer = 1) to mark it for removal. Then the next caller of kfilnd_peer_put() atomically updates the flag again (kfilnd_peer::kp_remove_peer = 2) to denote that it is removing the peer from the rhashtable, and only after that does it actually remove the object.
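
      The two-step flag transition described above can be sketched in userspace C11. This is only an illustration, not the kfilnd code: struct peer_sketch, peer_sketch_del(), peer_sketch_claim_removal() and peer_sketch_put() are hypothetical names, C11 atomics stand in for whatever kernel atomic primitives kfilnd uses, and reference counting is left out entirely.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for kfilnd_peer; only the removal flag is modeled. */
struct peer_sketch {
	atomic_int remove_peer;  /* 0 = live, 1 = marked, 2 = removal claimed */
	bool in_table;           /* stand-in for rhashtable membership */
};

/* Step 1 (models kfilnd_peer_del()): mark the peer for removal, 0 -> 1.
 * Returns true only for the caller that actually performed the marking. */
static bool peer_sketch_del(struct peer_sketch *p)
{
	int expected = 0;

	return atomic_compare_exchange_strong(&p->remove_peer, &expected, 1);
}

/* Step 2a: claim the removal, 1 -> 2. Exactly one caller can win this. */
static bool peer_sketch_claim_removal(struct peer_sketch *p)
{
	int expected = 1;

	return atomic_compare_exchange_strong(&p->remove_peer, &expected, 2);
}

/* Step 2b (models the kfilnd_peer_put() path): the winner of the claim
 * removes the peer from the table. Between the claim and this store the
 * peer is still visible to lookups, which is the window at issue. */
static void peer_sketch_put(struct peer_sketch *p)
{
	if (peer_sketch_claim_removal(p))
		p->in_table = false;
}
```

      In this sketch a put on a peer that was never marked leaves it in the table, and a second peer_sketch_del() on an already marked peer fails, so each transition happens once.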
       
      The window between marking a peer for deletion and removing it from the peer cache allows a race where an RKEY may be re-used. For example:
       
      Thread 1: Posts tagged receive with RKEY based on peerA::kp_local_session_key X and tn_mr_key Y
      Thread 1: Cancels tagged receive
      Thread 1: kfilnd_peer_del() -> peerA::kp_remove_peer = 1
      Thread 2: kfilnd_peer_put() -> peerA::kp_remove_peer = 2
      Thread 1: kfilnd_peer_put() -> kfilnd_tn_finalize() -> releases tn_mr_key Y
      Thread 3: allocates tn_mr_key Y
      Thread 3: Fetches peerA with kp_local_session_key X
      Thread 2: Removes peerA from rhashtable
       
      At this point, thread 3 holds the same RKEY that thread 1 used, because both were derived from kp_local_session_key X and tn_mr_key Y.
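
      The interleaving above can be replayed deterministically in a single-threaded C sketch to show where the collision comes from. Everything here is an assumption made for illustration: struct peer_model, make_rkey(), peer_lookup() and replay_race() are hypothetical, the RKEY layout (session key in the upper 32 bits, MR key in the lower 32) is invented, and the key values are arbitrary. The only behaviour taken from the description is that a lookup still returns peerA until it is removed from the table, regardless of kp_remove_peer.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the fields of peerA that matter for the race. */
struct peer_model {
	uint32_t local_session_key; /* models kp_local_session_key */
	int remove_peer;            /* models kp_remove_peer: 0, 1 or 2 */
	bool in_table;              /* models rhashtable membership */
};

/* Assumed RKEY layout, for illustration only. */
static uint64_t make_rkey(uint32_t session_key, uint32_t mr_key)
{
	return ((uint64_t)session_key << 32) | mr_key;
}

/* Lookup that ignores remove_peer: the peer stays visible until it is
 * actually taken out of the table. */
static struct peer_model *peer_lookup(struct peer_model *p)
{
	return p->in_table ? p : NULL;
}

/* Replays the steps in order; returns true if thread 3 ends up with the
 * RKEY that thread 1 used. */
static bool replay_race(void)
{
	struct peer_model peer_a = { .local_session_key = 0xA1, .in_table = true };
	const uint32_t mr_key_y = 7;
	bool mr_key_y_busy = false;

	/* Thread 1: posts tagged receive with RKEY from X and Y. */
	mr_key_y_busy = true;
	uint64_t rkey_t1 = make_rkey(peer_a.local_session_key, mr_key_y);

	/* Thread 1: cancels the receive, then marks the peer (flag = 1). */
	peer_a.remove_peer = 1;

	/* Thread 2: put claims the removal (flag = 2), table not yet updated. */
	peer_a.remove_peer = 2;

	/* Thread 1: put -> finalize releases tn_mr_key Y. */
	mr_key_y_busy = false;

	/* Thread 3: allocates tn_mr_key Y, which is free again. */
	if (mr_key_y_busy)
		return false;
	mr_key_y_busy = true;

	/* Thread 3: fetches peerA, still in the table, with session key X. */
	struct peer_model *found = peer_lookup(&peer_a);
	if (found == NULL)
		return false;
	uint64_t rkey_t3 = make_rkey(found->local_session_key, mr_key_y);

	/* Thread 2: finally removes peerA from the table, too late. */
	peer_a.in_table = false;

	return rkey_t1 == rkey_t3;
}
```

      In this model the collision depends only on thread 3's lookup happening before thread 2's final removal; if the removal ran first, peer_lookup() would return NULL and thread 3 would not obtain session key X from peerA.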

          People

            hornc Chris Horn
            hornc Chris Horn
