LU-10958: BRW RPC reordering causes data corruption when the writethrough cache is disabled

    Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: Lustre 2.14.0
    • Labels:
      None
    • Severity:
      3

      Description

      We ran IOR with simulated LNet router failures and hit data corruption that appears to be reproducible on master.

      The following scenario happens:

      1. a client thread writes some data to page N of file X
      2. page N is transferred to the OSS, but the OSS thread processing the request goes to sleep before writing it
      3. the original BRW request times out and the client resends page N
      4. the resent page N is successfully written to disk, the client receives the reply and clears PG_Writeback
      5. a client thread writes different data to the same page N of file X
      6. page N with the new data is successfully written to disk, the client receives the reply and clears PG_Writeback
      7. the OSS thread from step 2 wakes up and writes its stale copy of page N to disk, overwriting the new data [data corruption]
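
      The ordering above can be illustrated with a small toy model. This is only a sketch and not Lustre code: the ToyOSS class, its receive() and wake_stalled() methods, and run_scenario() are hypothetical names invented for illustration, the "disk" is a plain dictionary, and the sleeping service thread is modelled as a deferred write rather than a real thread.

      ```python
      # Toy model of the BRW reordering described above. Not Lustre code;
      # every name here is hypothetical and exists only for illustration.


      class ToyOSS:
          """Keeps one value per (file, page) and can stall a write before commit."""

          def __init__(self):
              self.disk = {}     # (file, page) -> bytes currently "on disk"
              self.stalled = []  # writes whose service thread is asleep

          def receive(self, key, data, stall=False):
              """Handle one write request; return True if a reply is sent."""
              if stall:
                  # Step 2: the request has arrived, but the thread sleeps
                  # before committing, so the client never gets a reply.
                  self.stalled.append((key, data))
                  return False
              self.disk[key] = data
              return True  # client receives the reply and clears PG_Writeback

          def wake_stalled(self):
              """Step 7: sleeping threads resume and commit what they hold."""
              for key, data in self.stalled:
                  self.disk[key] = data
              self.stalled.clear()


      def run_scenario():
          oss = ToyOSS()
          page = ("X", "N")

          # Steps 1-2: first write is sent; the OSS thread stalls.
          assert not oss.receive(page, b"old", stall=True)
          # Steps 3-4: the request times out, the client resends, write succeeds.
          assert oss.receive(page, b"old")
          # Steps 5-6: the client rewrites the same page with new data.
          assert oss.receive(page, b"new")
          before_wake = oss.disk[page]
          # Step 7: the stalled thread wakes up and commits its stale copy.
          oss.wake_stalled()
          return before_wake, oss.disk[page]


      if __name__ == "__main__":
          before, after = run_scenario()
          print("on disk before wake-up:", before)
          print("on disk after wake-up: ", after)
      ```

      In this model the page holds the new data until the stalled write is applied, after which it holds the stale data again, even though the client saw both of its writes acknowledged.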

      A reproducer will be uploaded shortly.

            People

            • Assignee:
              panda Andrew Perepechko
            • Reporter:
              panda Andrew Perepechko