
Data corruption during IOR testing with network error simulation

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.12.0
    • Affects Version/s: Lustre 2.10.0, Lustre 2.11.0
    • None
    • 3
    • 9223372036854775807

    Description

      During large-scale IOR testing with injected network failures, Cray found data corruption issues.
      The first issue is related to the 4MB BRW patch set and has existed for a long time. A bulk transfer is marked as failed only on a real network error; if part of the data is lost and the request times out, the transfer is treated as done.
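
The first issue can be illustrated with a small model. This is only a sketch under stated assumptions: `struct bulk_desc`, its fields, and both check functions are hypothetical names invented for illustration and are not taken from the Lustre source. It contrasts a completion check that looks only at an explicit network error with one that also rejects a timed-out or short transfer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model of a bulk transfer descriptor.
 * None of these names come from the Lustre source tree. */
struct bulk_desc {
	size_t nob_expected;    /* bytes the request asked to move */
	size_t nob_transferred; /* bytes that actually arrived */
	bool   net_error;       /* network layer reported a failure */
	bool   timed_out;       /* request deadline expired */
};

/* Check resembling the reported behaviour: only an explicit
 * network error marks the bulk as failed. */
static bool bulk_ok_buggy(const struct bulk_desc *d)
{
	assert(d != NULL);
	return !d->net_error;
}

/* Stricter check: a timeout or a short transfer also fails the bulk,
 * so the caller can resend instead of trusting partial data. */
static bool bulk_ok_strict(const struct bulk_desc *d)
{
	assert(d != NULL);
	if (d->net_error || d->timed_out)
		return false;
	return d->nob_transferred == d->nob_expected;
}
```

With a 4 MiB bulk of which only 3 MiB arrived before the request timed out, `bulk_ok_buggy` reports success while `bulk_ok_strict` reports failure.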

      The second issue is related to a cleanup landed as commit 49d8a7ccd73, in which the "rc" parameter of the obd_commit function was replaced with a local variable, so any errors that occurred before the commit are ignored.
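
The second issue follows a common pattern that is easy to show in isolation. The sketch below is an assumption-laden illustration, not the actual Lustre code: `do_commit`, `commit_write_buggy`, `commit_write_checked`, and the `old_rc` parameter are hypothetical names. It shows how replacing an incoming status with a fresh local variable lets a commit proceed after an earlier failure.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the storage-side commit step.
 * It counts how many commits were actually performed. */
static int do_commit(int *committed)
{
	assert(committed != NULL);
	(*committed)++;
	return 0;
}

/* Pattern resembling the reported regression: the caller's status is
 * replaced by a fresh local, so an earlier failure is forgotten and
 * the data is committed anyway. */
static int commit_write_buggy(int old_rc, int *committed)
{
	int rc = 0; /* old_rc is never consulted */

	(void)old_rc;
	rc = do_commit(committed);
	return rc;
}

/* Pattern that honours the earlier status: skip the commit and
 * propagate the first error to the caller. */
static int commit_write_checked(int old_rc, int *committed)
{
	if (old_rc != 0)
		return old_rc;
	return do_commit(committed);
}
```

Calling `commit_write_buggy(-EIO, &n)` returns 0 and performs the commit, whereas `commit_write_checked(-EIO, &n)` returns `-EIO` and leaves the commit counter untouched.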

      Attachments

        Activity

          [LU-11169] Data corruption during IOR testing with network error simulation

          gerrit Gerrit Updater added a comment:

          Alexandr Boyko (c17825@cray.com) uploaded a new patch: https://review.whamcloud.com/35571
          Subject: LU-11169 ptlrpc: handle reply and resend reorder
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 88c2a9a0c840d01d50762a71f04319e58c9affef

          shadow Alexey Lyashkov added a comment:

          The last patch addresses a theoretical problem that cannot occur at present.

          gerrit Gerrit Updater added a comment:

          Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/32906/
          Subject: LU-11169 obdclass: fix old return code usage
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 1db258b57e5669e07934fe848861817f88102475

          shadow Alexey Lyashkov added a comment:

          No. The server-side part ignores any error on the server side, so the commit write (which actually performs the write) does not know about errors that occurred before it.

          adilger Andreas Dilger added a comment:

          The OSC checksum can detect the client-side problem and the "read" part of the server-side problem.

          It should also cause the client to resend a write if one of the RDMAs was missing data (up to the 10x retry limit).
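
The checksum-and-resend behaviour described in this comment can be sketched as follows. This is a toy model under stated assumptions, not the OSC implementation: `toy_checksum`, `struct lossy_link`, `link_send`, and `write_with_resend` are hypothetical names, the additive checksum is deliberately simplistic, and `MAX_SENDS` treats the "10x retry limit" as ten sends in total, which is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_SENDS 10 /* assumption: the retry limit counts total sends */

/* Toy additive checksum, for illustration only. It would miss
 * reordered bytes; a real checksum such as a CRC would not. */
static uint32_t toy_checksum(const unsigned char *buf, size_t len)
{
	uint32_t sum = 0;

	for (size_t i = 0; i < len; i++)
		sum += buf[i];
	return sum;
}

/* Hypothetical lossy transport: for the first `drops` sends it
 * silently delivers only the first half of the payload. */
struct lossy_link {
	int drops;
};

static void link_send(struct lossy_link *l, unsigned char *dst,
		      const unsigned char *src, size_t len)
{
	size_t n = len;

	assert(l != NULL);
	if (l->drops > 0) {
		l->drops--;
		n = len / 2; /* second half is lost without any error */
	}
	memcpy(dst, src, n);
}

/* Send, verify the receiver's checksum against the sender's, and
 * resend on mismatch. Returns the number of sends used, or -1 if
 * the data still mismatches after MAX_SENDS attempts. */
static int write_with_resend(struct lossy_link *l, unsigned char *dst,
			     const unsigned char *src, size_t len)
{
	uint32_t want = toy_checksum(src, len);

	for (int attempt = 1; attempt <= MAX_SENDS; attempt++) {
		link_send(l, dst, src, len);
		if (toy_checksum(dst, len) == want)
			return attempt;
	}
	return -1;
}
```

The point of the sketch is that the transport never reports an error; only the checksum comparison reveals the missing data and triggers the resend.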

          shadow Alexey Lyashkov added a comment:

          Patrick,

          In general, the file ends up with random data: the pages are left unchanged by the failed bulk transfer, but the client and server both think all is OK. Zeros are just luck.

          paf Patrick Farrell (Inactive) added a comment:

          Ihara,

          Basically, it sometimes fails to notice and resend when there's a transfer error. On reads, this shows up as zeroes read by the client, when the data on disk is correct. On writes, the result would be whatever data was present before the write (or, I believe, zeroes if the write is to a new region of the file).

          shadow Alexey Lyashkov added a comment:

          It depends on what you are asking. The server-side bug was found as zeros being read from the file after a bulk transfer error.

          ihara Shuichi Ihara (Inactive) added a comment:

          Interesting. How is the file corrupted? Could you share an example of a corrupted file?

          shadow Alexey Lyashkov added a comment:

          Andreas,

          You are right. The OSC checksum can detect the client-side problem and the "read" part of the server-side problem.

          People

            Assignee: shadow Alexey Lyashkov
            Reporter: shadow Alexey Lyashkov
            Votes: 0
            Watchers: 6
