resent reint rpc failure due to reused reply data slot

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.12.0

    Description

      The following scenario leads to failure of a resent reint rpc:

      1. The mdt server is handling a number of rpcs, among them rpc 1
      from client A and rpc 2 from client B.

      2. Shutdown of the server starts.

      3. rpc 1 is processed and its reply data is added, but client A gets
      ENODEV in the reply (ptlrpc_send_reply()) because shutdown is running.

      4. Shutdown reaches class_disconnect_exports() and links export A
      to the list of zombie exports.

      5. The obd_zombid thread wakes up and destroys export A, which
      includes freeing its reply data list and clearing the corresponding
      bits in lut->lut_reply_bitmap (tgt_free_reply_data()).

      6. Export B is still processing rpc 2 and looks for a free bit in
      lut->lut_reply_bitmap to store its reply data
      (tgt_add_reply_data()). If it finds a bit that has just been freed by
      the obd_zombid thread, the reply data from export A is overwritten
      in the reply_data file with the reply data from export B.

      7. After failover, reply data is restored with
      tgt_reply_data_init(). The reply data of rpc 1 from client A is
      missing.

      8. Client A reconnects and resends rpc 1. The server does not find
      the reply data and processes the rpc as if it had not been seen
      before. In the case of unlink, the directory entry no longer exists,
      so rpc 1 fails.
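
      The slot reuse described above can be reproduced with a toy model.
      This is only a sketch and not Lustre code: the ToyTarget class, its
      methods and the keep_bits_on_shutdown flag are hypothetical names
      made up for illustration. The flag is an assumption about what the
      patch subject "keep reply data bit set on failover" means, namely
      that the bitmap bit is not cleared when an export is destroyed
      during shutdown.

```python
class ToyTarget:
    """Toy stand-in for a target's reply-data bookkeeping (not Lustre code)."""

    def __init__(self, nslots, keep_bits_on_shutdown=False):
        self.bitmap = [False] * nslots     # models lut->lut_reply_bitmap
        self.reply_file = [None] * nslots  # models the on-disk reply_data file
        self.exports = {}                  # client -> slots held in memory
        self.shutting_down = False
        self.keep_bits_on_shutdown = keep_bits_on_shutdown  # assumed fix

    def add_reply_data(self, client, xid):
        """Models tgt_add_reply_data(): take the first free bit, write the slot."""
        slot = self.bitmap.index(False)    # raises ValueError if no bit is free
        self.bitmap[slot] = True
        self.reply_file[slot] = (client, xid)
        self.exports.setdefault(client, []).append(slot)
        return slot

    def destroy_export(self, client):
        """Models tgt_free_reply_data() run for a zombie export."""
        for slot in self.exports.pop(client, []):
            if self.shutting_down and self.keep_bits_on_shutdown:
                continue                   # assumed fix: leave the bit set
            self.bitmap[slot] = False      # bit is reusable, file record remains

    def recover(self):
        """Models tgt_reply_data_init(): rebuild reply data from the file."""
        return {rec for rec in self.reply_file if rec is not None}


def run_scenario(keep_bits_on_shutdown):
    """Return True if client A's reply data survives the failover."""
    tgt = ToyTarget(nslots=4, keep_bits_on_shutdown=keep_bits_on_shutdown)
    tgt.shutting_down = True        # shutdown has started
    tgt.add_reply_data("A", 1)      # rpc 1 processed, client A only sees ENODEV
    tgt.destroy_export("A")         # zombie export A is destroyed
    tgt.add_reply_data("B", 2)      # rpc 2 stores its reply data
    recovered = tgt.recover()       # failover: reply data reloaded from file
    return ("A", 1) in recovered    # can the resent rpc 1 be recognised?
```

      In this model, run_scenario(False) loses the record for client A
      because export B reuses the freed slot, while run_scenario(True)
      keeps it because export B is forced into a different slot.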

        Activity

          [LU-11131] resent reint rpc failure due to reused reply data slot
          pjones Peter Jones added a comment -

          Landed for 2.12

          gerrit Gerrit Updater added a comment -

          Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/32798/
          Subject: LU-11131 target: keep reply data bit set on failover
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 58edec38160f44be0ef784fecfab830a43f92fa8

          gerrit Gerrit Updater added a comment -

          Vladimir Saveliev (c17830@cray.com) uploaded a new patch: https://review.whamcloud.com/32798
          Subject: LU-11131 target: keep reply data bit set on failover
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: fe72c36f256d061862adb8bcf14a596eb0709c31

          People

            Assignee: vsaveliev Vladimir Saveliev
            Reporter: vsaveliev Vladimir Saveliev