[LU-10958] brw rpc reordering causes data corruption when the writethrough cache is disabled

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version: Lustre 2.14.0

    Description

      We ran IOR with LNet router failure simulation and encountered data corruption which seems to be reproducible on master.

      The following scenario happens:

      1. a client thread writes some data to page N of file X
      2. page N is transferred to the OSS, the processing thread sleeps somewhere
      3. the original BRW request times out and the client resends page N
      4. page N is successfully written to disk, the client receives the reply and clears PG_Writeback
      5. a client thread writes different data to the same page N of file X
      6. page N with the new data is successfully written to disk, the client receives the reply and clears PG_Writeback
      7. the OSS thread from step 2 wakes up and writes stale data to disk [data corruption]
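
      The seven steps above can be replayed as a small deterministic model. This is only a sketch of the ordering, not Lustre code: the `disk` dictionary, the `oss_write` helper, the page index 0 standing in for N, and the "old"/"new" payloads are all hypothetical names chosen for illustration.

```python
# Deterministic replay of the scenario. Every name here is hypothetical;
# this models only the order of disk writes, not the real client or OSS.

disk = {}   # (file, page) -> data currently on disk
log = []    # order in which writes reach the disk


def oss_write(tag, file, page, data):
    """Model an OSS thread committing one BRW write to disk."""
    disk[(file, page)] = data
    log.append(tag)


# Steps 1-2: the original BRW carries "old"; its OSS thread stalls
# before the disk write, so nothing is committed yet.
delayed = ("original", "X", 0, "old")

# Steps 3-4: the request times out, the client resends, the resend lands.
oss_write("resend", "X", 0, "old")

# Steps 5-6: the client rewrites the same page and the new BRW lands.
oss_write("new", "X", 0, "new")

# Step 7: the stalled thread wakes up and writes its stale copy last.
oss_write(*delayed)

print(log)              # order of disk writes
print(disk[("X", 0)])   # content left on disk
```

      In this model the last write to complete wins, so the page ends up holding the data from step 1 even though the client has already seen step 6 acknowledged.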

      A reproducer will be uploaded shortly.

        Activity

          eaujames Etienne Aujames added a comment -

          Hello,

          Is a backport planned for the b2_12 branch for this issue?
          pjones Peter Jones added a comment -

          Landed for 2.14


          gerrit Gerrit Updater added a comment -

          Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/32281/
          Subject: LU-10958 ofd: data corruption due to RPC reordering
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 35679a730bf0b7a8d4ce84cadc3ecc7c289ef491

          spitzcor Cory Spitz added a comment -

          Proposed for 2.14.0. With -RC1 already available, I realize that its candidacy might not hold.

          panda Andrew Perepechko added a comment - edited

          No, I don't think so. There's nothing wrong in the md layer except the delay itself, which makes it possible for the resent RPC and the RPC after it to complete before the initial, delayed RPC. This delay is the analogue of OBD_FAIL_OST_BRW_PAUSE_BULK2 from https://review.whamcloud.com/#/c/32165/6/lustre/tests/recovery-small.sh

          A delay can happen anywhere on a non-RTOS system.

          adilger Andreas Dilger added a comment -

          In our scenario the real delay happens in the dm/mdraid layer after the bulk transfer succeeded.

          Isn't that a problem of the DM/mdraid layer, that it is reordering writes incorrectly? If the OST thread submits writes to disk as A, A', B, but the disk writes A', B, A because A was blocked in the IO stack, then there isn't much we can do about it.
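
          To make the A, A', B example concrete, here is a minimal sketch of what serializing writes to a single page in submission order would look like. It assumes the ordering can be enforced at the point where the writes are committed; it does not model the dm/mdraid layer, and it is not meant to describe the change that landed in patch 32281. The `PageWriter` class and its ticket scheme are hypothetical.

```python
import threading


class PageWriter:
    """Hypothetical per-page serializer: commits happen in submission order."""

    def __init__(self):
        self.data = None
        self._next_ticket = 0    # ticket handed out at submission time
        self._now_serving = 0    # ticket that is allowed to commit next
        self._cond = threading.Condition()

    def submit(self):
        """Record the submission order by taking a ticket."""
        with self._cond:
            ticket = self._next_ticket
            self._next_ticket += 1
            return ticket

    def commit(self, ticket, data):
        """Block until every earlier submission has committed, then write."""
        with self._cond:
            self._cond.wait_for(lambda: self._now_serving == ticket)
            self.data = data
            self._now_serving += 1
            self._cond.notify_all()


page = PageWriter()
payload = {"A": "old", "A'": "old", "B": "new"}

# Submission order is A, A', B.
tickets = {name: page.submit() for name in ("A", "A'", "B")}

threads = {
    name: threading.Thread(target=page.commit, args=(tickets[name], payload[name]))
    for name in tickets
}

# A is "delayed": its thread starts last, after A' and B are already waiting.
for name in ("A'", "B", "A"):
    threads[name].start()
for t in threads.values():
    t.join()

print(page.data)   # content of the page after all three writes
```

          With the tickets in place, A' and B wait until the delayed A has committed, so the page ends with B's data. Without such ordering, the last write to complete would win and A would overwrite B.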

          People

            Assignee: panda Andrew Perepechko
            Reporter: panda Andrew Perepechko
            Votes: 0
            Watchers: 10
