<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:31:04 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-9989] a concurrent blocking AST can cause short read from a splice buffer</title>
                <link>https://jira.whamcloud.com/browse/LU-9989</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;A complete description of the bug we are seeing is rather lengthy, so I&apos;ll just explain a simplified version of the bug that can be easily reproduced.&lt;/p&gt;

&lt;p&gt;In short, when Lustre page cache pages are put into a pipe buffer by ll_file_splice_read(), a concurrent blocking AST can truncate them (more confusingly, before &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-8633&quot; title=&quot;SIGBUS under memory pressure with fast_read enabled&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-8633&quot;&gt;&lt;del&gt;LU-8633&lt;/del&gt;&lt;/a&gt;, they would even be marked NON-uptodate). If the first page has been truncated by the time data is transferred out of the pipe, userspace will see ENODATA, EIO or 0, depending on the exact kernel routine and the kernel version. A return of 0 is particularly harmful because, by VFS convention, it marks EOF, and applications usually will not retry a read that makes no progress at all.&lt;/p&gt;
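To illustrate why a spurious zero return is so damaging, here is a minimal sketch (plain Python file I/O, purely illustrative, not Lustre-specific) of the conventional read loop that most applications use: a read returning no data is taken as EOF, so a zero return in the middle of a file silently truncates the result instead of being retried.

```python
import os
import tempfile

def read_all(fd, chunk_size=4096):
    """Conventional VFS-style read loop: an empty read means EOF."""
    data = bytearray()
    while True:
        chunk = os.read(fd, chunk_size)
        if not chunk:
            # Zero bytes read: treated as end-of-file, the loop stops
            # here and never retries, even if more data exists.
            break
        data.extend(chunk)
    return bytes(data)

# Write 8 KiB to a temporary file and read it back with the loop.
with tempfile.TemporaryFile() as f:
    f.write(b"x" * 8192)
    f.seek(0)
    assert len(read_all(f.fileno())) == 8192
```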

&lt;p&gt;A simple reproducer for master is as follows:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;[root@panda-testbox lustre]# cat splice.fio 
[file]
ioengine=splice
iodepth=1
rw=read
bs=4k
size=1G
[root@panda-testbox lustre]# &lt;span class=&quot;code-keyword&quot;&gt;while&lt;/span&gt; &lt;span class=&quot;code-keyword&quot;&gt;true&lt;/span&gt;; &lt;span class=&quot;code-keyword&quot;&gt;do&lt;/span&gt; &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; i in /sys/fs/lustre/ldlm/namespaces&lt;span class=&quot;code-comment&quot;&gt;/*OST*osc*/&lt;/span&gt;lru_size; &lt;span class=&quot;code-keyword&quot;&gt;do&lt;/span&gt; echo clear &amp;gt; $i; done ; done &amp;gt; /dev/&lt;span class=&quot;code-keyword&quot;&gt;null&lt;/span&gt; 2&amp;gt;&amp;amp;1 &amp;amp;
[1] 2422
[root@panda-testbox lustre]# fio splice.fio 
file: (g=0): rw=read, bs=4K-4K/4K-4K/4K-4K, ioengine=splice, iodepth=1
fio-2.1.10
Starting 1 process
fio: pid=2425, err=61/file:engines/splice.c:140, func=vmsplice, error=No data available
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The exact scenario leading to this bug is as follows:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;             fio-1373  [003]  7061.034844: p_ll_readpage_0: (ll_readpage+0x0/0x1c40 [lustre]) arg1=ffffea0002af5888
             fio-1373  [003]  7061.037857: p_page_cache_pipe_buf_confirm_0: (page_cache_pipe_buf_confirm+0x0/0x90) arg1=ffffea0002af5888
   ptlrpcd_00_00-27942 [003]  7061.039328: p_vvp_page_export_0: (vvp_page_export+0x0/0x90 [lustre]) arg1=1 arg2=ffffea0002af5888
           &amp;lt;...&amp;gt;-30290 [000]  7061.039338: p_ll_invalidatepage_0: (ll_invalidatepage+0x0/0x180 [lustre]) arg1=ffffea0002af5888
             fio-1373  [002]  7061.039379: r_page_cache_pipe_buf_confirm_0: (pipe_to_user+0x31/0x130 &amp;lt;- page_cache_pipe_buf_confirm) arg1=ffffffc3
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;So,&lt;br/&gt;
1) splice allocates the page cache pages, locks the pages and calls -&amp;gt;readpage()&lt;br/&gt;
2) pages are put into the pipe buffer&lt;br/&gt;
3) another thread (actually, the same thread in this scenario) requests data from the pipe (via vmsplice - in this scenario)&lt;br/&gt;
4) page_cache_pipe_buf_confirm() fails the PageUptodate check because the pages have not been read yet&lt;br/&gt;
5) the reads complete and the pages are marked uptodate by vvp_page_export()&lt;br/&gt;
6) a concurrent blocking AST truncates the pages&lt;br/&gt;
7) page_cache_pipe_buf_confirm() finds that the page was truncated and returns ENODATA&lt;/p&gt;
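The decision that produces the ENODATA in step 7 can be modeled with a toy, pure-Python stand-in for page_cache_pipe_buf_confirm() (the Page class and all names here are illustrative, not kernel code): an uptodate page is confirmed, a page detached from its mapping (truncated while sitting in the pipe) yields -ENODATA, and a still-mapped page whose read never completed yields -EIO.

```python
import errno

class Page:
    """Toy stand-in for a page cache page (illustrative only)."""
    def __init__(self, uptodate=False, mapping="inode"):
        self.uptodate = uptodate
        self.mapping = mapping   # None models a truncated page

def pipe_buf_confirm(page):
    """Toy model of the page_cache_pipe_buf_confirm() decision."""
    if not page.uptodate:
        # The kernel locks the page here and re-checks its state.
        if page.mapping is None:
            return -errno.ENODATA  # page truncated while in the pipe
        return -errno.EIO          # read failed, page never became uptodate
    return 0

assert pipe_buf_confirm(Page(uptodate=True)) == 0
assert pipe_buf_confirm(Page(uptodate=False, mapping=None)) == -errno.ENODATA
assert pipe_buf_confirm(Page(uptodate=False)) == -errno.EIO
```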

&lt;p&gt;Judging by this scenario, Lustre has been truncating pages in a broken way for many years. No other filesystem truncates pages outside of a real truncate, clear_inode or I/O error. Even NFS only invalidates (but does not truncate) pages in the context of mapping revalidation.&lt;/p&gt;

&lt;p&gt;However, not truncating pages when the corresponding DLM lock is revoked raises cache coherency concerns. So we need to decide how to fix that.&lt;/p&gt;

&lt;p&gt;A possible solution is to replace the generic_file_splice_read() call with a copy of it that waits until the first page becomes uptodate, so that page_cache_pipe_buf_confirm() always passes the PageUptodate() check. An even more straightforward solution is to use default_file_splice_read(); however, it removes any sort of &quot;zero-copy&quot; behavior, and performance can drop significantly.&lt;/p&gt;
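The first proposed fix, waiting for the first page to become uptodate before it is exposed through the pipe, can be sketched as a toy synchronization model (pure Python with hypothetical names; the real change would live in a Lustre-private copy of generic_file_splice_read()):

```python
import threading

class ToyPage:
    """Toy page whose 'uptodate' flag is an event set by read I/O."""
    def __init__(self):
        self.uptodate = threading.Event()

def splice_read_first_page(page, timeout=5.0):
    """Toy model of the fix: block until the first page is uptodate
    before handing it to the pipe, so a later confirm() check cannot
    fail merely because the read has not completed yet."""
    if not page.uptodate.wait(timeout):
        raise TimeoutError("read never completed")
    return page

page = ToyPage()
# Simulate read completion from another thread
# (stands in for vvp_page_export() marking the page uptodate).
threading.Timer(0.05, page.uptodate.set).start()
assert splice_read_first_page(page).uptodate.is_set()
```

Note this only closes the "not yet uptodate" window; it does not by itself prevent a later truncation of a page already sitting in the pipe.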

&lt;p&gt;We would like to hear Intel&apos;s opinion on this defect before we propose our solution.&lt;/p&gt;

&lt;p&gt;P.S. The original bug is an NFS client getting EIO if the OST fails over during some fio load. The kernel NFSD uses -&amp;gt;splice_read when it is available (see nfsd_vfs_read()). A short read is converted to EIO on the NFS client (see nfs_readpage_retry()).&lt;/p&gt;
                <environment></environment>
        <key id="48308">LU-9989</key>
            <summary>a concurrent blocking AST can cause short read from a splice buffer</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="wc-triage">WC Triage</assignee>
                                    <reporter username="panda">Andrew Perepechko</reporter>
                        <labels>
                    </labels>
                <created>Thu, 14 Sep 2017 13:03:39 +0000</created>
                <updated>Thu, 14 Sep 2017 18:05:03 +0000</updated>
                                            <version>Lustre 2.10.0</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>3</watches>
                                                                            <comments>
<comment id="208386" author="adilger" created="Thu, 14 Sep 2017 18:05:03 +0000"  >&lt;p&gt;I haven&apos;t looked at this code in detail, but I think the Lustre client can&apos;t wait indefinitely for the userspace thread to read the pages from the pipe before cancelling the lock and dropping the pages from cache.  It might be hours or days before a process is woken up to read pages from a pipe.&lt;/p&gt;

&lt;p&gt;Two solutions come to mind, in addition to those presented above:&lt;/p&gt;
&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;use &lt;tt&gt;generic_file_splice_read()&lt;/tt&gt; to do zero-copy reads from the pages in the normal case where there is no AST on the lock, but in the lock callback handler copy the pages into the pipe in case of an AST.  That avoids overhead in the most common scenarios&lt;/li&gt;
	&lt;li&gt;return an error like &lt;tt&gt;-EAGAIN&lt;/tt&gt; or &lt;tt&gt;-EINTR&lt;/tt&gt; or &lt;tt&gt;-ERESTARTSYS&lt;/tt&gt; to the caller so that the read will be retried if the pages are dropped from cache.  This would add overhead if there are lots of lock conflicts, and potentially prevent forward progress on reading from the pipe, so the &quot;fall back to copy&quot; solution is probably better&lt;/li&gt;
&lt;/ul&gt;
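The "fall back to copy" idea above can be modeled in a few lines (toy Python with hypothetical names, not kernel code): the pipe normally holds a zero-copy reference to the cached page, but the lock callback path substitutes a private copy before the page is dropped, so the consumer always sees valid data.

```python
class ToyPipeBuf:
    """Toy pipe buffer: references a cached page zero-copy in the
    common case, but takes a private copy when a blocking AST must
    drop the page (names are illustrative only)."""
    def __init__(self, page_data):
        self.ref = page_data   # zero-copy reference to the cache page
        self.private = None    # private copy, filled only on AST

    def on_blocking_ast(self):
        # Lock callback handler: copy the data before the page goes away.
        self.private = bytes(self.ref)
        self.ref = None        # the page cache page is now dropped

    def read(self):
        data = self.ref if self.ref is not None else self.private
        assert data is not None, "would be -ENODATA without the copy"
        return bytes(data)

page = bytearray(b"cached data")
buf = ToyPipeBuf(page)
buf.on_blocking_ast()                # lock revoked, page dropped
assert buf.read() == b"cached data"  # consumer still gets valid data
```

The copy cost is paid only on the (rare) conflicting-AST path, which is why this avoids overhead in the common case.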


&lt;p&gt;Since there is no guarantee of data consistency when reading from a pipe (i.e. &lt;tt&gt;cat &quot;largefile.txt&quot; | more&lt;/tt&gt; might show old data if data is stuffed into the pipe and the file is changed, even before the data is shown to the user), I don&apos;t think there is a consistency problem with making a copy of the data to put into the pipe before the lock is cancelled and the pages are dropped, even if another client is immediately modifying that data.  It is just a normal concurrent read-write race.&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzzk6f:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>