Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-9491

v2.9: silent data corruption with writev() and O_APPEND

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.10.0
    • Lustre 2.8.0, Lustre 2.9.0
    • None
    • RHEL6
      RHEL7
    • 3
    • 9223372036854775807

    Description

      Lustre client version 2.9.0 will silently corrupt data if:

      • a file is opened with mode O_APPEND, and
      • is written to using the writev() syscall using two or more iovs, and
      • the file offset written to by any but the last iov touches a page boundary.

      If all of the above hold true, the last iov overwrites some or all data from the previous iovs. It can be easily reproduces with eg.

      /usr/lib64/lustre/tests/rwv -w -a -n 2 4096 4096 /lustre/test131.out

      On a Lustre client running v2.9.0, the first chunk is lost, and test.out ends up as a 4k-sized file. Earlier versions create an 8k-sized test.out containing both chunks. In both cases, the writev() syscall returns 8192, indicating that the complete data set had been written.

      Test 131b from sanity.sh already triggers this bug. It only checks return codes, but doesn't verify the written file, and therefore currently won't catch the error. (Obvious patch to follow.)

      A git bisect run revealed that commit 1101120d3258509fa74f952cd8664bfdc17bd97d (LU-4257 llite: fix up iov_iter implementation) had introduced the error. So far, I've verified that v2.9.0 is affected on both RHEL6 and RHEL7, both of which use the various iov_iter_*() functions from lustre/llite/llite_internal.h introduced by this changeset. There's a chance that more recent kernel versions that implement iov_iter_*() themselves are unaffected, but I haven't been able to check.

      I'll attach a brief vfstrace from running the rwv reproducer mentioned above. Note the second call to ll_write_begin() logs "Writing 0 of 0 to 4096 bytes" just like the first invocation whereas it should have advanced a page, ie. "Writing 1 of 0 to 4096 bytes" would be the expected output. This means that ll_write_begin() may be called with an incorrect pos field for all but the first iov in the list, but I haven't been able to pinpoint the spot where it goes wrong.

      If O_APPEND is dropped from the open() call (ie. rwv is called without the -a switch), everything works fine, and the second call to ll_write_begin() indeed logs "Writing 1 of 0 to 4096 bytes" as expected.

      Attachments

        Issue Links

          Activity

            [LU-9491] v2.9: silent data corruption with writev() and O_APPEND
            pjones Peter Jones added a comment -

            Landed for 2.10

            pjones Peter Jones added a comment - Landed for 2.10

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27086/
            Subject: LU-9491 llite: Handle multi-vec append write correctly
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 6352eb778d032d99e2a87e4d8b3a5562b585e752

            gerrit Gerrit Updater added a comment - Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27086/ Subject: LU-9491 llite: Handle multi-vec append write correctly Project: fs/lustre-release Branch: master Current Patch Set: Commit: 6352eb778d032d99e2a87e4d8b3a5562b585e752
            adilger Andreas Dilger added a comment - - edited

            A very good test of this code is write_append_truncate, though it needs MPI to run. That said, it doesn't look like it triggered this problem because it isn't using writev().

            adilger Andreas Dilger added a comment - - edited A very good test of this code is write_append_truncate , though it needs MPI to run. That said, it doesn't look like it triggered this problem because it isn't using writev().

            The iov_offset of iov_iter could not be zero then your proposal will write wrong data to file. The old way worked because it manipulated iovec in I/O stack.

            jay Jinshan Xiong (Inactive) added a comment - The iov_offset of iov_iter could not be zero then your proposal will write wrong data to file. The old way worked because it manipulated iovec in I/O stack.
            kobras Daniel Kobras (Inactive) added a comment - - edited

            Thanks! With 9a231894578dcecb1ad7280f235ffaa0b7d18447 on top of 2.9.0, I was unable to reproduce the problem. So this looks good!

             

            That said, after staring at __generic_file_aio_write() and descendants for a while, the need for the added complexity in llite_internal's __generic_file_write_iter() still isn't clear to me. Just wrapping it like

            static inline ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *iter)
            {
                   return __generic_file_aio_write(iocb, iter->iov, iter->nr_segs, &iocb->ki_pos);
            }
            

             

            matches the old code prior to 1101120d3258509fa74f952cd8664bfdc17bd97d, and doesn't look less efficient. Why the need to iterate in __generic_file_write_iter() ourselves in the general (non-O_APPEND) case, rather than relying on common code that already seems to do the right thing?

            kobras Daniel Kobras (Inactive) added a comment - - edited Thanks! With 9a231894578dcecb1ad7280f235ffaa0b7d18447 on top of 2.9.0, I was unable to reproduce the problem. So this looks good!   That said, after staring at __generic_file_aio_write() and descendants for a while, the need for the added complexity in llite_internal's __generic_file_write_iter() still isn't clear to me. Just wrapping it like static inline ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *iter) { return __generic_file_aio_write(iocb, iter->iov, iter->nr_segs, &iocb->ki_pos); }   matches the old code prior to 1101120d3258509fa74f952cd8664bfdc17bd97d, and doesn't look less efficient. Why the need to iterate in __generic_file_write_iter() ourselves in the general (non-O_APPEND) case, rather than relying on common code that already seems to do the right thing?

            Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: https://review.whamcloud.com/27086
            Subject: LU-9491 llite: Handle multi-vec append write correctly
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 9a231894578dcecb1ad7280f235ffaa0b7d18447

            gerrit Gerrit Updater added a comment - Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: https://review.whamcloud.com/27086 Subject: LU-9491 llite: Handle multi-vec append write correctly Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 9a231894578dcecb1ad7280f235ffaa0b7d18447

            Reverting the changes to vvp_io_write_start() in 1101120d3258509fa74f952cd8664bfdc17bd97d fixes the issue for me.

            kobras Daniel Kobras (Inactive) added a comment - Reverting the changes to vvp_io_write_start() in 1101120d3258509fa74f952cd8664bfdc17bd97d fixes the issue for me.
            pjones Peter Jones added a comment -

            Jinshan

            Can you please look into this issue?

            Thanks

            Peter

            pjones Peter Jones added a comment - Jinshan Can you please look into this issue? Thanks Peter

            Sorry, mistyped the command line for the reproducer. Correct sequence of options is:

            /usr/lib64/lustre/tests/rwv -f /lustre/test131.out -w -a -n 2 4096 4096

            kobras Daniel Kobras (Inactive) added a comment - Sorry, mistyped the command line for the reproducer. Correct sequence of options is: /usr/lib64/lustre/tests/rwv -f /lustre/test131.out -w -a -n 2 4096 4096

            People

              jay Jinshan Xiong (Inactive)
              kobras Daniel Kobras (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: