[LU-16025] Read past file size after truncate from another client Created: 19/Jul/22 Updated: 30/Jan/24 Resolved: 30/Jan/24 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.8 |
| Fix Version/s: | Lustre 2.16.0, Lustre 2.15.3 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Xing Huang | Assignee: | Zhenyu Xu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
RHEL 8.5 kernel 4.18.0-348.23.1.el8 |
||
| Issue Links: |
|
||||||||||||
| Severity: | 3 | ||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||
| Description |
|
It was found that if a file which was already accessed on node1 is truncated by node2 and then opened and read again by node1, calls to read() can get data past the end of the file. The extra bytes are filled with zeroes. This is usually not seen as a call to (f)stat() on the file before being opened or read will actually trigger a glimpse lock, refreshing the actual file size on node1, which is the case for most usual unix tools. Note that reading the same file on node2 a second time will actually get it right.
Here is a reproducer:{}
[seb@node1 ~]$ cat mycat.c #include <stdio.h> #include <unistd.h> #include <fcntl.h> int main(int argc, char **argv) { int i; int s; int fd; char buffer[4096]; for (i = 1; i < argc; i++) { fd = open(argv[i], O_RDONLY); if (fd < 0) { perror("Could not open file"); return (1); } while ( (s = read(fd, buffer, sizeof(buffer))) > 0) { write(1, buffer, s); } close(fd); } } [seb@node1 ~]$ gcc -Wall mycat.c -o mycat
[seb@node1 ~]$ cp somefile.txt /lustre/seb/somefile.txt # Trigger a read of the file on node2 to fill up the inode informations # Notice the file at this time is several megabytes [seb@node1 ~]$ ssh node2 'cat /lustre/seb/somefile.txt > /dev/null; ls -l /lustre/seb/somefile.txt' -rw-r----- 1 seb seb 114052401 Jun 16 13:51 /lustre/seb/somefile.txt [seb@node1 ~]$ truncate -s 100000 /lustre/seb/somefile.txt # Now read the file from a remote node making sure no stat() call occurs before [seb@node1 ~]$ ssh node2 '~/mycat /lustre/seb/somefile.txt | hexdump -C | tail -n 4' 00018690 63 73 74 30 31 5b 4f 53 54 3a 33 36 5d 0a 32 30 |cst01[OST:36].20| 000186a0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| * 00019000 # bytes above 100000 up to 102400 (completing the last page) are showing zeroes [seb@node1 ~]$ ssh node2 '~/mycat /lustre/seb/somefile.txt | hexdump -C | tail -n 4' 00018670 31 31 33 37 34 38 20 20 36 33 38 37 39 38 31 37 |113748 63879817| 00018680 31 36 20 20 34 36 25 20 2f 6c 75 73 2f 68 31 74 |16 46% /lus/h1t| 00018690 63 73 74 30 31 5b 4f 53 54 3a 33 36 5d 0a 32 30 |cst01[OST:36].20| 000186a0 This was done with a simple ftruncate() call here, but the same problem occurs with an open( "/lustre/seb/somefile.txt", O_WRONLY|O_TRUNC) + writes to a lower size than original.
|
| Comments |
| Comment by Peter Jones [ 19/Jul/22 ] |
|
Patch pushed to master - https://review.whamcloud.com/47896 |
| Comment by Carlos Thomaz [ 08/Aug/22 ] |
|
Hi Peter Any news on this one? Thank you Carlos
|
| Comment by Xing Huang [ 09/Aug/22 ] |
|
Hi Carlos, we are actively working on the fix patch, now we are solving the test failures of the patch. |
| Comment by Andreas Dilger [ 12/Oct/22 ] |
|
Is patch also needed on b_es6_0? |
| Comment by Gerrit Updater [ 25/Oct/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/47896/ |
| Comment by Peter Jones [ 25/Oct/22 ] |
|
Landed for 2.16 |
| Comment by Gerrit Updater [ 19/Apr/23 ] |
|
"Sebastien Buisson <sbuisson@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50689 |
| Comment by Gerrit Updater [ 20/May/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50689/ |
| Comment by Shuichi Ihara [ 15/Jun/23 ] |
|
an constant small performance regression (-10%) detected after patch https://review.whamcloud.com/#/c/fs/lustre-release/+/50689 landed in b2_15. a workload is single process and 4k buffered IO # salloc -p 40n -N 1 --ntasks-per-node=1 /usr/mpi/gcc/openmpi-4.1.5a1/bin/mpirun --allow-run-as-root /work/tools/bin/ior -i 1 -w -r -b 256g -t 4k -C -Q 2 -e -vv -F -o /exafs/d0/d1/d2/ior.single.out/file here is list of b2_15 branch - snip - a4e89bffe6 LU-16369 ldiskfs: do not check enc context at lookup Max Write: 539.39 MiB/sec (565.60 MB/sec) Max Read: 2503.16 MiB/sec (2624.75 MB/sec) 48d247fcc9 LU-16025 llite: adjust read count as file got truncated Max Write: 538.69 MiB/sec (564.85 MB/sec) Max Read: 2416.74 MiB/sec (2534.13 MB/sec) b74560d74a LU-16286 ldiskfs: reimplement nodelalloc optimization Max Write: 533.40 MiB/sec (559.31 MB/sec) Max Read: 2761.09 MiB/sec (2895.21 MB/sec) Is this expected performance degradation to solve original defect, or is it possible to fix without performance impacts? NOTE: Max Write: 529.53 MiB/sec (555.25 MB/sec) Max Read: 2704.63 MiB/sec (2836.01 MB/sec) |
| Comment by Patrick Farrell [ 15/Jun/23 ] |
|
bobijam , I posted a comment in https://review.whamcloud.com/c/fs/lustre-release/+/50689 - I think I know why we have a regression in 4K reads, and maybe have a suggestion? But Ihara, fyi, this patch is an important correctness fix so definitely no revert. |
| Comment by Shuichi Ihara [ 15/Jun/23 ] |
|
OK, that's I thought. Thanks for confirmation, Patrick. It's a trade off to fix an critical problem first and will imporve the codes later if it's possible. |
| Comment by Patrick Farrell [ 16/Jun/23 ] |
|
Bobi commented in Gerrit - He tried it and my suggested approach doesn't work (for anyone following along here). https://review.whamcloud.com/c/fs/lustre-release/+/50689 Basically the fix adds slight overhead to each read, so it's more significant for smaller reads and probably can't even be measured for large reads. But we need the fix and the cost isn't too high, so that's where we're at for now. |
| Comment by Zhenyu Xu [ 26/Jan/24 ] |
|
iocb->ki_pos could be wrong so that next read could start from wrong offset. |
| Comment by Gerrit Updater [ 26/Jan/24 ] |
|
"Zhenyu Xu <bobijam@hotmail.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53827 |
| Comment by Zhenyu Xu [ 30/Jan/24 ] |
|
short read reset ki_pos issue is tracked at LU-17482 |