Lustre / LU-10958

BRW RPC reordering causes data corruption when the writethrough cache is disabled

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Lustre 2.14.0
    • None
    • None
    • 3

    Description

      We ran IOR with simulated LNet router failures and encountered data corruption, which appears to be reproducible on master.

      The following scenario happens:

      1. a client thread writes some data to page N of file X
      2. page N is transferred to the OSS, the processing thread sleeps somewhere
      3. the original BRW request times out and the client resends page N
      4. page N is successfully written to disk, the client receives the reply and clears PG_Writeback
      5. a client thread writes different data to the same page N of file X
      6. page N with the new data is successfully written to disk, the client receives the reply and clears PG_Writeback
      7. the OSS thread from step 2 wakes up and writes stale data to disk [data corruption]
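
The ordering above can be replayed as a toy model. This is only an illustrative sketch, not Lustre code: `ToyOSS`, `commit` and `replay_scenario` are hypothetical names, and the model simply assumes that nothing on the server orders the delayed write from step 2 behind the newer write from step 6.

```python
# Toy replay of the seven-step scenario. All names here are hypothetical
# and chosen for illustration; none of them come from the Lustre code base.
from dataclasses import dataclass, field


@dataclass
class ToyOSS:
    """Stand-in for the server side: a dict plays the role of the disk."""
    disk: dict = field(default_factory=dict)

    def commit(self, page, data):
        # Assumption of this model: no ordering check between requests,
        # so whichever commit runs last determines the on-disk contents.
        self.disk[page] = data


def replay_scenario():
    oss = ToyOSS()
    page_n = 0

    # Steps 1-2: the client sends page N; the service thread stalls
    # while still holding the data it received.
    stalled_request = (page_n, b"old")

    # Steps 3-4: the original request times out, the resend is committed,
    # and the client sees the write as complete.
    oss.commit(page_n, b"old")

    # Steps 5-6: the client rewrites page N and the new data is committed.
    oss.commit(page_n, b"new")
    expected = oss.disk[page_n]

    # Step 7: the stalled thread wakes up and commits its stale copy.
    oss.commit(*stalled_request)
    return expected, oss.disk[page_n]


expected, on_disk = replay_scenario()
print(f"client expects {expected!r}, disk holds {on_disk!r}")
```

In this model the client has been told twice that its writes completed, yet the final on-disk contents are the older data, which is the corruption described in step 7.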

      A reproducer will be uploaded shortly.

          People

            Assignee: Andrew Perepechko (panda)
            Reporter: Andrew Perepechko (panda)