[LU-9491] v2.9: silent data corruption with writev() and O_APPEND Created: 11/May/17 Updated: 03/May/19 Resolved: 20/May/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0, Lustre 2.9.0 |
| Fix Version/s: | Lustre 2.10.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Daniel Kobras (Inactive) | Assignee: | Jinshan Xiong (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
RHEL6 |
||
| Attachments: |
|
||||||||||||||||
| Issue Links: |
|
||||||||||||||||
| Epic/Theme: | lustre-2.9 | ||||||||||||||||
| Severity: | 3 | ||||||||||||||||
| Epic: | client | ||||||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||||||
| Description |
|
Lustre client version 2.9.0 will silently corrupt data if:
If all of the above hold true, the last iov overwrites some or all data from the previous iovs. It can be easily reproduces with eg. /usr/lib64/lustre/tests/rwv -w -a -n 2 4096 4096 /lustre/test131.out On a Lustre client running v2.9.0, the first chunk is lost, and test.out ends up as a 4k-sized file. Earlier versions create an 8k-sized test.out containing both chunks. In both cases, the writev() syscall returns 8192, indicating that the complete data set had been written. Test 131b from sanity.sh already triggers this bug. It only checks return codes, but doesn't verify the written file, and therefore currently won't catch the error. (Obvious patch to follow.) A git bisect run revealed that commit 1101120d3258509fa74f952cd8664bfdc17bd97d ( I'll attach a brief vfstrace from running the rwv reproducer mentioned above. Note the second call to ll_write_begin() logs "Writing 0 of 0 to 4096 bytes" just like the first invocation whereas it should have advanced a page, ie. "Writing 1 of 0 to 4096 bytes" would be the expected output. This means that ll_write_begin() may be called with an incorrect pos field for all but the first iov in the list, but I haven't been able to pinpoint the spot where it goes wrong. If O_APPEND is dropped from the open() call (ie. rwv is called without the -a switch), everything works fine, and the second call to ll_write_begin() indeed logs "Writing 1 of 0 to 4096 bytes" as expected. |
| Comments |
| Comment by Daniel Kobras (Inactive) [ 11/May/17 ] |
|
Sorry, mistyped the command line for the reproducer. Correct sequence of options is: /usr/lib64/lustre/tests/rwv -f /lustre/test131.out -w -a -n 2 4096 4096 |
| Comment by Peter Jones [ 11/May/17 ] |
|
Jinshan Can you please look into this issue? Thanks Peter |
| Comment by Daniel Kobras (Inactive) [ 11/May/17 ] |
|
Reverting the changes to vvp_io_write_start() in 1101120d3258509fa74f952cd8664bfdc17bd97d fixes the issue for me. |
| Comment by Gerrit Updater [ 11/May/17 ] |
|
Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: https://review.whamcloud.com/27086 |
| Comment by Daniel Kobras (Inactive) [ 12/May/17 ] |
|
Thanks! With 9a231894578dcecb1ad7280f235ffaa0b7d18447 on top of 2.9.0, I was unable to reproduce the problem. So this looks good!
That said, after staring at __generic_file_aio_write() and descendants for a while, the need for the added complexity in llite_internal's __generic_file_write_iter() still isn't clear to me. Just wrapping it like static inline ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *iter) { return __generic_file_aio_write(iocb, iter->iov, iter->nr_segs, &iocb->ki_pos); }
matches the old code prior to 1101120d3258509fa74f952cd8664bfdc17bd97d, and doesn't look less efficient. Why the need to iterate in __generic_file_write_iter() ourselves in the general (non-O_APPEND) case, rather than relying on common code that already seems to do the right thing? |
| Comment by Jinshan Xiong (Inactive) [ 12/May/17 ] |
|
The iov_offset of iov_iter could not be zero then your proposal will write wrong data to file. The old way worked because it manipulated iovec in I/O stack. |
| Comment by Andreas Dilger [ 15/May/17 ] |
|
A very good test of this code is write_append_truncate, though it needs MPI to run. That said, it doesn't look like it triggered this problem because it isn't using writev(). |
| Comment by Gerrit Updater [ 20/May/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27086/ |
| Comment by Peter Jones [ 20/May/17 ] |
|
Landed for 2.10 |