[LU-9989] a concurrent blocking AST can cause short read from a splice buffer Created: 14/Sep/17 Updated: 14/Sep/17 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Andrew Perepechko | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
A complete description of the bug we are seeing is rather lengthy, so I'll just explain a simplified version of the bug that can be easily reproduced. In short, when Lustre page cache pages are put into a pipe buffer by ll_file_splice_read(), a concurrent blocking AST can truncate them (more confusingly, even before the reader has had a chance to consume them from the pipe), and the splice reader then observes a short read.

A simple reproducer for master is as follows:

[root@panda-testbox lustre]# cat splice.fio
[file]
ioengine=splice
iodepth=1
rw=read
bs=4k
size=1G
[root@panda-testbox lustre]# while true; do for i in /sys/fs/lustre/ldlm/namespaces/*OST*osc*/lru_size; do echo clear > $i; done ; done > /dev/null 2>&1 &
[1] 2422
[root@panda-testbox lustre]# fio splice.fio
file: (g=0): rw=read, bs=4K-4K/4K-4K/4K-4K, ioengine=splice, iodepth=1
fio-2.1.10
Starting 1 process
fio: pid=2425, err=61/file:engines/splice.c:140, func=vmsplice, error=No data available

The exact scenario leading to this bug is as follows:

fio-1373 [003] 7061.034844: p_ll_readpage_0: (ll_readpage+0x0/0x1c40 [lustre]) arg1=ffffea0002af5888
fio-1373 [003] 7061.037857: p_page_cache_pipe_buf_confirm_0: (page_cache_pipe_buf_confirm+0x0/0x90) arg1=ffffea0002af5888
ptlrpcd_00_00-27942 [003] 7061.039328: p_vvp_page_export_0: (vvp_page_export+0x0/0x90 [lustre]) arg1=1 arg2=ffffea0002af5888
<...>-30290 [000] 7061.039338: p_ll_invalidatepage_0: (ll_invalidatepage+0x0/0x180 [lustre]) arg1=ffffea0002af5888
fio-1373 [002] 7061.039379: r_page_cache_pipe_buf_confirm_0: (pipe_to_user+0x31/0x130 <- page_cache_pipe_buf_confirm) arg1=ffffffc3
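For context, the check that fails here lives in the kernel's page_cache_pipe_buf_confirm() (fs/splice.c). The following is a paraphrase from kernels of roughly this era; details may differ by version:

static int page_cache_pipe_buf_confirm(struct pipe_inode_info *pipe,
                                       struct pipe_buffer *buf)
{
        struct page *page = buf->page;
        int err;

        if (!PageUptodate(page)) {
                lock_page(page);

                /* Page was truncated/unhashed while it sat in the pipe:
                 * fail the buffer with -ENODATA. */
                if (!page->mapping) {
                        err = -ENODATA;
                        goto error;
                }

                /* Read completed but page still not uptodate: I/O error. */
                if (!PageUptodate(page)) {
                        err = -EIO;
                        goto error;
                }

                unlock_page(page);
        }
        return 0;
error:
        unlock_page(page);
        return err;
}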
The return value arg1=ffffffc3 in the last trace line is -61, i.e. -ENODATA, which matches fio's "No data available" error: the page was exported and invalidated by the blocking AST after ll_readpage() had completed, but before page_cache_pipe_buf_confirm() could verify that it was still uptodate.

From the perspective of this scenario, it seems that Lustre has been truncating pages in a broken way for many years. No other filesystem truncates pages without a real truncate, clear_inode or I/O error; even NFS only invalidates (but does not truncate) pages, in the context of mapping revalidation. However, not truncating pages when the corresponding DLM lock is revoked raises cache coherency concerns, so we need to decide how to fix this.

A possible solution is to replace the generic_file_splice_read() call with a copy of it which waits until the first page becomes uptodate, so that page_cache_pipe_buf_confirm() should always pass the PageUptodate() check; a sketch of this approach follows below. An even more straightforward solution is to use default_file_splice_read(); however, that removes any sort of "zero-copy" and the performance can drop significantly.

We would like to hear Intel's opinion on this defect before we propose our solution.

P.S. The original bug is an NFS client getting EIO when the OST is failed over during some fio load. The kernel NFSD uses ->splice_read when it is available (see nfsd_vfs_read()), and a short read is converted to EIO on the NFS client (see nfs_readpage_retry()). |
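A minimal sketch of the wait-for-uptodate idea, under the assumption that it is called on each page inside a Lustre-private copy of generic_file_splice_read() just before the page is linked into the pipe; the helper name is hypothetical and this is not a definitive implementation:

/* Hypothetical helper: block until the page has finished reading in,
 * then decide whether it is safe to link into the pipe.  lock_page()
 * sleeps until any in-flight read completes, since the read path
 * unlocks the page on completion. */
static int ll_splice_wait_page_uptodate(struct page *page)
{
        lock_page(page);

        /* Truncated under us by a blocking AST. */
        if (!page->mapping) {
                unlock_page(page);
                return -ENODATA;
        }

        /* Read completed but the page is still not uptodate: I/O error. */
        if (!PageUptodate(page)) {
                unlock_page(page);
                return -EIO;
        }

        unlock_page(page);
        return 0;
}

Note that this narrows the race window rather than closing it: a blocking AST arriving after this check but before the reader calls vmsplice() can still clear PG_uptodate, which is why copying the data (as suggested in the comment below) may be the more robust option.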
| Comments |
| Comment by Andreas Dilger [ 14/Sep/17 ] |
|
I haven't looked at this code in detail, but I think the Lustre client can't wait indefinitely for the userspace thread to read the pages from the pipe before cancelling the lock and dropping the pages from cache. It might be hours or days before a process is woken up to read pages from a pipe. Two solutions come to mind, in addition to those presented above.
Since there is no guarantee of data consistency when reading from a pipe (e.g. "cat largefile.txt | more" might show old data if data is stuffed into the pipe and the file is then changed, even before the data is shown to the user), I don't think there is a consistency problem with making a copy of the data to put into the pipe before the lock is cancelled and the pages are dropped, even if another client is immediately modifying that data. It is just a normal concurrent read-write race. |
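A minimal sketch of this copy-before-cancel idea, assuming a hypothetical helper called from the splice path in place of linking the cached page directly; alloc_page(), copy_highpage() and SetPageUptodate() are standard kernel primitives, everything else is illustrative:

#include <linux/mm.h>
#include <linux/highmem.h>
#include <linux/err.h>

/* Hypothetical helper: duplicate a cached page into a private page that
 * is not subject to DLM lock cancellation, so a later blocking AST
 * cannot invalidate what sits in the pipe. */
static struct page *ll_splice_copy_page(struct page *cached)
{
        struct page *copy = alloc_page(GFP_KERNEL);

        if (!copy)
                return ERR_PTR(-ENOMEM);

        copy_highpage(copy, cached);    /* kmap-safe page-to-page copy */
        SetPageUptodate(copy);          /* contents valid by construction */
        return copy;
}

This costs one page copy per splice, roughly the price default_file_splice_read() pays, but keeps the read itself on the cached-page path.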