Details
Type: Bug
Resolution: Unresolved
Priority: Major
Affects Version/s: Lustre 2.10.0
Description
A complete description of the bug we are seeing is rather lengthy, so I'll just explain a simplified version of the bug that can be easily reproduced.
In short, when Lustre page cache pages are put into a pipe buffer by ll_file_splice_read(), a concurrent blocking AST can truncate them (even more confusingly, before LU-8633 they would also be marked not uptodate). If the first page has been truncated by the time data is transferred out of the pipe, userspace will see ENODATA, EIO or 0, depending on the exact kernel routine and kernel version. A return of 0 is usually treated as a bug by applications because, by VFS convention, it marks EOF (applications usually will not restart a read that makes no progress at all).
A simple reproducer for master is as follows:
[root@panda-testbox lustre]# cat splice.fio
[file]
ioengine=splice
iodepth=1
rw=read
bs=4k
size=1G
[root@panda-testbox lustre]# while true; do for i in /sys/fs/lustre/ldlm/namespaces/*OST*osc*/lru_size; do echo clear > $i; done ; done > /dev/null 2>&1 &
[1] 2422
[root@panda-testbox lustre]# fio splice.fio
file: (g=0): rw=read, bs=4K-4K/4K-4K/4K-4K, ioengine=splice, iodepth=1
fio-2.1.10
Starting 1 process
fio: pid=2425, err=61/file:engines/splice.c:140, func=vmsplice, error=No data available
The exact scenario leading to this bug is as follows:
fio-1373 [003] 7061.034844: p_ll_readpage_0: (ll_readpage+0x0/0x1c40 [lustre]) arg1=ffffea0002af5888
fio-1373 [003] 7061.037857: p_page_cache_pipe_buf_confirm_0: (page_cache_pipe_buf_confirm+0x0/0x90) arg1=ffffea0002af5888
ptlrpcd_00_00-27942 [003] 7061.039328: p_vvp_page_export_0: (vvp_page_export+0x0/0x90 [lustre]) arg1=1 arg2=ffffea0002af5888
<...>-30290 [000] 7061.039338: p_ll_invalidatepage_0: (ll_invalidatepage+0x0/0x180 [lustre]) arg1=ffffea0002af5888
fio-1373 [002] 7061.039379: r_page_cache_pipe_buf_confirm_0: (pipe_to_user+0x31/0x130 <- page_cache_pipe_buf_confirm) arg1=ffffffc3
So,
1) splice allocates the page cache pages, locks the pages and calls ->readpage()
2) pages are put into the pipe buffer
3) another thread (actually, the same thread in this scenario) requests data from the pipe (via vmsplice - in this scenario)
4) page_cache_pipe_buf_confirm() fails the PageUptodate check because the read has not completed yet
5) the reads complete and the pages are marked uptodate by vvp_page_export()
6) a concurrent blocking AST truncates the pages
7) page_cache_pipe_buf_confirm() finds that the page was truncated and returns ENODATA
From the perspective of this scenario, it seems that Lustre has been truncating pages in a broken way for many years. No other filesystem truncates pages without an actual truncate, clear_inode or I/O error. Even NFS only invalidates (but does not truncate) pages in the context of mapping revalidation.
However, not truncating pages when the corresponding DLM lock is revoked raises cache coherency concerns. So we need to decide how to fix that.
A possible solution is to replace the generic_file_splice_read() call with a copy of it that waits until the first page becomes uptodate, so that page_cache_pipe_buf_confirm() always passes the PageUptodate() check. An even more straightforward solution is to use default_file_splice_read(); however, that removes any zero-copy behaviour and performance can drop significantly.
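One possible shape of such a copy is sketched below. This is pseudocode only: ll_file_splice_read_sync() is a hypothetical name, the real fix would have to integrate the wait into the splice loop itself, and a window remains if the blocking AST arrives after the wait (the coherency question discussed above).

/* Pseudocode sketch, not a tested implementation. */
static ssize_t ll_file_splice_read_sync(struct file *in, loff_t *ppos,
					struct pipe_inode_info *pipe,
					size_t len, unsigned int flags)
{
	struct page *page = find_get_page(in->f_mapping, *ppos >> PAGE_SHIFT);

	if (page) {
		wait_on_page_locked(page);	/* read in flight? wait for it */
		if (!PageUptodate(page)) {
			put_page(page);
			return -EIO;		/* or retry the read */
		}
		put_page(page);
	}
	/* the first page is now stable; proceed with the zero-copy path */
	return generic_file_splice_read(in, ppos, pipe, len, flags);
}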
We would like to hear Intel's opinion on this defect before we propose our solution.
P.S. The original bug is the NFS client getting EIO if the OST is failed over under some fio load. The kernel NFSD uses ->splice_read when it is available (see nfsd_vfs_read()). A short read is converted to EIO on the NFS client (see nfs_readpage_retry()).