[LU-9491] v2.9: silent data corruption with writev() and O_APPEND Created: 11/May/17  Updated: 03/May/19  Resolved: 20/May/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0, Lustre 2.9.0
Fix Version/s: Lustre 2.10.0

Type: Bug Priority: Critical
Reporter: Daniel Kobras (Inactive) Assignee: Jinshan Xiong (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

RHEL6
RHEL7


Attachments: File writev_append.debug    
Issue Links:
Duplicate
Related
is related to LU-7067 vvp_io.c:1076:vvp_io_write_start()) A... Resolved
is related to LU-4257 parallel dds are slower than serial dds Resolved
Epic/Theme: lustre-2.9
Severity: 3
Epic: client
Rank (Obsolete): 9223372036854775807

 Description   

Lustre client version 2.9.0 will silently corrupt data if:

  • a file is opened with mode O_APPEND, and
  • is written to using the writev() syscall using two or more iovs, and
  • the file offset written to by any but the last iov touches a page boundary.

If all of the above hold true, the last iov overwrites some or all data from the previous iovs. It can be easily reproduces with eg.

/usr/lib64/lustre/tests/rwv -w -a -n 2 4096 4096 /lustre/test131.out

On a Lustre client running v2.9.0, the first chunk is lost, and test.out ends up as a 4k-sized file. Earlier versions create an 8k-sized test.out containing both chunks. In both cases, the writev() syscall returns 8192, indicating that the complete data set had been written.

Test 131b from sanity.sh already triggers this bug. It only checks return codes, but doesn't verify the written file, and therefore currently won't catch the error. (Obvious patch to follow.)

A git bisect run revealed that commit 1101120d3258509fa74f952cd8664bfdc17bd97d (LU-4257 llite: fix up iov_iter implementation) had introduced the error. So far, I've verified that v2.9.0 is affected on both RHEL6 and RHEL7, both of which use the various iov_iter_*() functions from lustre/llite/llite_internal.h introduced by this changeset. There's a chance that more recent kernel versions that implement iov_iter_*() themselves are unaffected, but I haven't been able to check.

I'll attach a brief vfstrace from running the rwv reproducer mentioned above. Note the second call to ll_write_begin() logs "Writing 0 of 0 to 4096 bytes" just like the first invocation whereas it should have advanced a page, ie. "Writing 1 of 0 to 4096 bytes" would be the expected output. This means that ll_write_begin() may be called with an incorrect pos field for all but the first iov in the list, but I haven't been able to pinpoint the spot where it goes wrong.

If O_APPEND is dropped from the open() call (ie. rwv is called without the -a switch), everything works fine, and the second call to ll_write_begin() indeed logs "Writing 1 of 0 to 4096 bytes" as expected.



 Comments   
Comment by Daniel Kobras (Inactive) [ 11/May/17 ]

Sorry, mistyped the command line for the reproducer. Correct sequence of options is:

/usr/lib64/lustre/tests/rwv -f /lustre/test131.out -w -a -n 2 4096 4096

Comment by Peter Jones [ 11/May/17 ]

Jinshan

Can you please look into this issue?

Thanks

Peter

Comment by Daniel Kobras (Inactive) [ 11/May/17 ]

Reverting the changes to vvp_io_write_start() in 1101120d3258509fa74f952cd8664bfdc17bd97d fixes the issue for me.

Comment by Gerrit Updater [ 11/May/17 ]

Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: https://review.whamcloud.com/27086
Subject: LU-9491 llite: Handle multi-vec append write correctly
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 9a231894578dcecb1ad7280f235ffaa0b7d18447

Comment by Daniel Kobras (Inactive) [ 12/May/17 ]

Thanks! With 9a231894578dcecb1ad7280f235ffaa0b7d18447 on top of 2.9.0, I was unable to reproduce the problem. So this looks good!

 

That said, after staring at __generic_file_aio_write() and descendants for a while, the need for the added complexity in llite_internal's __generic_file_write_iter() still isn't clear to me. Just wrapping it like

static inline ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *iter)
{
       return __generic_file_aio_write(iocb, iter->iov, iter->nr_segs, &iocb->ki_pos);
}

 

matches the old code prior to 1101120d3258509fa74f952cd8664bfdc17bd97d, and doesn't look less efficient. Why the need to iterate in __generic_file_write_iter() ourselves in the general (non-O_APPEND) case, rather than relying on common code that already seems to do the right thing?

Comment by Jinshan Xiong (Inactive) [ 12/May/17 ]

The iov_offset of iov_iter could not be zero then your proposal will write wrong data to file. The old way worked because it manipulated iovec in I/O stack.

Comment by Andreas Dilger [ 15/May/17 ]

A very good test of this code is write_append_truncate, though it needs MPI to run. That said, it doesn't look like it triggered this problem because it isn't using writev().

Comment by Gerrit Updater [ 20/May/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27086/
Subject: LU-9491 llite: Handle multi-vec append write correctly
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 6352eb778d032d99e2a87e4d8b3a5562b585e752

Comment by Peter Jones [ 20/May/17 ]

Landed for 2.10

Generated at Sat Feb 10 02:26:40 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.