Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-9491

v2.9: silent data corruption with writev() and O_APPEND

    XMLWordPrintable

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.10.0
    • Lustre 2.8.0, Lustre 2.9.0
    • None
    • RHEL6
      RHEL7
    • 3
    • 9223372036854775807

    Description

      Lustre client version 2.9.0 will silently corrupt data if:

      • a file is opened with mode O_APPEND, and
      • is written to using the writev() syscall using two or more iovs, and
      • the file offset written to by any but the last iov touches a page boundary.

      If all of the above hold true, the last iov overwrites some or all data from the previous iovs. It can be easily reproduces with eg.

      /usr/lib64/lustre/tests/rwv -w -a -n 2 4096 4096 /lustre/test131.out

      On a Lustre client running v2.9.0, the first chunk is lost, and test.out ends up as a 4k-sized file. Earlier versions create an 8k-sized test.out containing both chunks. In both cases, the writev() syscall returns 8192, indicating that the complete data set had been written.

      Test 131b from sanity.sh already triggers this bug. It only checks return codes, but doesn't verify the written file, and therefore currently won't catch the error. (Obvious patch to follow.)

      A git bisect run revealed that commit 1101120d3258509fa74f952cd8664bfdc17bd97d (LU-4257 llite: fix up iov_iter implementation) had introduced the error. So far, I've verified that v2.9.0 is affected on both RHEL6 and RHEL7, both of which use the various iov_iter_*() functions from lustre/llite/llite_internal.h introduced by this changeset. There's a chance that more recent kernel versions that implement iov_iter_*() themselves are unaffected, but I haven't been able to check.

      I'll attach a brief vfstrace from running the rwv reproducer mentioned above. Note the second call to ll_write_begin() logs "Writing 0 of 0 to 4096 bytes" just like the first invocation whereas it should have advanced a page, ie. "Writing 1 of 0 to 4096 bytes" would be the expected output. This means that ll_write_begin() may be called with an incorrect pos field for all but the first iov in the list, but I haven't been able to pinpoint the spot where it goes wrong.

      If O_APPEND is dropped from the open() call (ie. rwv is called without the -a switch), everything works fine, and the second call to ll_write_begin() indeed logs "Writing 1 of 0 to 4096 bytes" as expected.

      Attachments

        Issue Links

          Activity

            People

              jay Jinshan Xiong (Inactive)
              kobras Daniel Kobras (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: