[LU-16025] Read past file size after truncate from another client Created: 19/Jul/22  Updated: 30/Jan/24  Resolved: 30/Jan/24

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.8
Fix Version/s: Lustre 2.16.0, Lustre 2.15.3

Type: Bug Priority: Critical
Reporter: Xing Huang Assignee: Zhenyu Xu
Resolution: Fixed Votes: 0
Labels: None
Environment:

RHEL 8.5 kernel 4.18.0-348.23.1.el8
Whamcloud release 2.12.8


Issue Links:
Duplicate
Related
is related to LU-17469 GPF in ll_writepages()-> lov_io_init+... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

It was found that if a file which was already accessed on node1 is truncated by node2 and then opened and read again by node1, calls to read() can get data past the end of the file. The extra bytes are filled with zeroes.

This is usually not seen as a call to (f)stat() on the file before being opened or read will actually trigger a glimpse lock, refreshing the actual file size on node1, which is the case for most usual unix tools.

Note that reading the same file on node2 a second time will actually get it right.

 

Here is a reproducer:{}

 

[seb@node1 ~]$ cat mycat.c
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
int main(int argc, char **argv)
{
        int i;
        int s;
        int fd;
        char buffer[4096];
        for (i = 1; i < argc; i++) {
                fd = open(argv[i], O_RDONLY);
                if (fd < 0) {
                        perror("Could not open file");
                        return (1);
                }
                while ( (s = read(fd, buffer, sizeof(buffer))) > 0) {
                        write(1, buffer, s);
                }
                close(fd);
        }
} 
[seb@node1 ~]$ gcc -Wall mycat.c -o mycat 

 

 

[seb@node1 ~]$ cp somefile.txt /lustre/seb/somefile.txt
# Trigger a read of the file on node2 to fill up the inode informations
# Notice the file at this time is several megabytes
[seb@node1 ~]$ ssh node2 'cat /lustre/seb/somefile.txt > /dev/null; ls -l /lustre/seb/somefile.txt'
-rw-r----- 1 seb seb 114052401 Jun 16 13:51 /lustre/seb/somefile.txt
[seb@node1 ~]$ truncate -s 100000 /lustre/seb/somefile.txt
# Now read the file from a remote node making sure no stat() call occurs before
[seb@node1 ~]$ ssh node2 '~/mycat /lustre/seb/somefile.txt | hexdump -C | tail -n 4'
00018690  63 73 74 30 31 5b 4f 53  54 3a 33 36 5d 0a 32 30  |cst01[OST:36].20| 
000186a0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................| 
* 
00019000 
# bytes above 100000 up to 102400 (completing the last page) are showing zeroes
[seb@node1 ~]$ ssh node2 '~/mycat /lustre/seb/somefile.txt | hexdump -C | tail -n 4'
00018670  31 31 33 37 34 38 20 20  36 33 38 37 39 38 31 37  |113748  63879817| 
00018680  31 36 20 20 34 36 25 20  2f 6c 75 73 2f 68 31 74  |16  46% /lus/h1t| 
00018690  63 73 74 30 31 5b 4f 53  54 3a 33 36 5d 0a 32 30  |cst01[OST:36].20| 
000186a0 

This was done with a simple ftruncate() call here, but the same problem occurs with an open( "/lustre/seb/somefile.txt", O_WRONLY|O_TRUNC) + writes to a lower size than original.

 



 Comments   
Comment by Peter Jones [ 19/Jul/22 ]

Patch pushed to master - https://review.whamcloud.com/47896

Comment by Carlos Thomaz [ 08/Aug/22 ]

Hi Peter

Any news on this one? 

Thank you

Carlos

 

Comment by Xing Huang [ 09/Aug/22 ]

Hi Carlos, we are actively working on the fix patch, now we are solving the test failures of the patch.

Comment by Andreas Dilger [ 12/Oct/22 ]

Is patch also needed on b_es6_0?

Comment by Gerrit Updater [ 25/Oct/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/47896/
Subject: LU-16025 llite: adjust read count as file got truncated
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 4468f6c9d92448cb72c5a616ec74653e83ee8e10

Comment by Peter Jones [ 25/Oct/22 ]

Landed for 2.16

Comment by Gerrit Updater [ 19/Apr/23 ]

"Sebastien Buisson <sbuisson@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50689
Subject: LU-16025 llite: adjust read count as file got truncated
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: 8acf9a3ebd3f45b193436b18a4d115d3327c7a56

Comment by Gerrit Updater [ 20/May/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50689/
Subject: LU-16025 llite: adjust read count as file got truncated
Project: fs/lustre-release
Branch: b2_15
Current Patch Set:
Commit: 48d247fcc97533ae10b204dd059c908344ab1d06

Comment by Shuichi Ihara [ 15/Jun/23 ]

an constant small performance regression (-10%) detected after patch https://review.whamcloud.com/#/c/fs/lustre-release/+/50689 landed in b2_15.

a workload is single process and 4k buffered IO

# salloc -p 40n -N 1 --ntasks-per-node=1 /usr/mpi/gcc/openmpi-4.1.5a1/bin/mpirun --allow-run-as-root /work/tools/bin/ior -i 1 -w -r -b 256g -t 4k -C -Q 2 -e -vv -F -o /exafs/d0/d1/d2/ior.single.out/file

here is list of b2_15 branch

- snip -
a4e89bffe6 LU-16369 ldiskfs: do not check enc context at lookup 
Max Write: 539.39 MiB/sec (565.60 MB/sec)
Max Read:  2503.16 MiB/sec (2624.75 MB/sec)

48d247fcc9 LU-16025 llite: adjust read count as file got truncated
Max Write: 538.69 MiB/sec (564.85 MB/sec)
Max Read:  2416.74 MiB/sec (2534.13 MB/sec)

b74560d74a LU-16286 ldiskfs: reimplement nodelalloc optimization
Max Write: 533.40 MiB/sec (559.31 MB/sec)
Max Read:  2761.09 MiB/sec (2895.21 MB/sec)

Is this expected performance degradation to solve original defect, or is it possible to fix without performance impacts?

NOTE:
the performance was back after reverted patch https://review.whamcloud.com/#/c/fs/lustre-release/+/50689 from b2_15

Max Write: 529.53 MiB/sec (555.25 MB/sec)
Max Read:  2704.63 MiB/sec (2836.01 MB/sec)
Comment by Patrick Farrell [ 15/Jun/23 ]

bobijam , I posted a comment in https://review.whamcloud.com/c/fs/lustre-release/+/50689 - I think I know why we have a regression in 4K reads, and maybe have a suggestion?  But Ihara, fyi, this patch is an important correctness fix so definitely no revert.

Comment by Shuichi Ihara [ 15/Jun/23 ]

OK, that's I thought. Thanks for confirmation, Patrick. It's a trade off to fix an critical problem first and will imporve the codes later if it's possible.

Comment by Patrick Farrell [ 16/Jun/23 ]

Bobi commented in Gerrit - He tried it and my suggested approach doesn't work (for anyone following along here).

https://review.whamcloud.com/c/fs/lustre-release/+/50689

Basically the fix adds slight overhead to each read, so it's more significant for smaller reads and probably can't even be measured for large reads.  But we need the fix and the cost isn't too high, so that's where we're at for now.

Comment by Zhenyu Xu [ 26/Jan/24 ]

iocb->ki_pos could be wrong so that next read could start from wrong offset.

Comment by Gerrit Updater [ 26/Jan/24 ]

"Zhenyu Xu <bobijam@hotmail.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53827
Subject: LU-16025 llite: short read could mess up next read offset
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: be710a787d4af182d51866dd5dc09b71b880688b

Comment by Zhenyu Xu [ 30/Jan/24 ]

short read reset ki_pos issue is tracked at LU-17482

Generated at Sat Feb 10 03:23:19 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.