<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 03:21:34 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-15819] Executables run from Lustre may receive spurious SIGBUS signals</title>
                <link>https://jira.whamcloud.com/browse/LU-15819</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;We received several reports of applications (IOR and other unrelated user-provided programs) started from Lustre receiving SIGBUS signals.&lt;/p&gt;

&lt;p&gt;We were able to reproduce the issue with IOR on a client running a RHEL7 kernel.&lt;/p&gt;

&lt;p&gt;It seems to be caused by &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14541&quot; title=&quot;Memory reclaim caused a stale data read&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14541&quot;&gt;&lt;del&gt;LU-14541&lt;/del&gt;&lt;/a&gt;, and the mechanism is as follows:&lt;/p&gt;

&lt;p&gt;1) a major fault in the IOR code happens&lt;/p&gt;

&lt;p&gt;2) ll_fault()-&amp;gt;...-&amp;gt;filemap_fault()&lt;/p&gt;

&lt;p&gt;3) ll_readpage() is issued from filemap_fault()&lt;/p&gt;

&lt;p&gt;4) wait_on_page_locked() is issued from filemap_fault()&lt;/p&gt;

&lt;p&gt;5) the uptodate check in filemap_fault() fails due to a parallel ClearPageUptodate() called from a blocking AST handler&lt;/p&gt;

&lt;p&gt;6) VM_FAULT_SIGBUS is returned&lt;/p&gt;</description>
                <environment></environment>
        <key id="70140">LU-15819</key>
            <summary>Executables run from Lustre may receive spurious SIGBUS signals</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="1" iconUrl="https://jira.whamcloud.com/images/icons/priorities/blocker.svg">Blocker</priority>
                        <status id="6" iconUrl="https://jira.whamcloud.com/images/icons/statuses/closed.png" description="The issue is considered finished, the resolution is correct. Issues which are closed can be reopened.">Closed</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="3">Duplicate</resolution>
                                        <assignee username="panda">Andrew Perepechko</assignee>
                                    <reporter username="panda">Andrew Perepechko</reporter>
                        <labels>
                    </labels>
                <created>Wed, 4 May 2022 16:21:55 +0000</created>
                <updated>Thu, 5 May 2022 17:17:27 +0000</updated>
                            <resolved>Wed, 4 May 2022 17:23:41 +0000</resolved>
                                    <version>Upstream</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>8</watches>
                                                                            <comments>
                            <comment id="333785" author="pjones" created="Wed, 4 May 2022 17:07:34 +0000"  >&lt;p&gt;Panda&lt;/p&gt;

&lt;p&gt;Isn&apos;t this a duplicate of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-15815&quot; title=&quot;fast_read/stale data/reclaim workround causes SIGBUS&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-15815&quot;&gt;&lt;del&gt;LU-15815&lt;/del&gt;&lt;/a&gt;?&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="333786" author="panda" created="Wed, 4 May 2022 17:23:17 +0000"  >&lt;p&gt;Peter, yes, it is. Sorry, I searched for similar bugs some time ago but forgot to do it right before opening the ticket. We can close this one.&lt;/p&gt;</comment>
                            <comment id="333809" author="adilger" created="Wed, 4 May 2022 20:56:29 +0000"  >&lt;p&gt;While this definitely seems like a duplicate of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-15815&quot; title=&quot;fast_read/stale data/reclaim workround causes SIGBUS&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-15815&quot;&gt;&lt;del&gt;LU-15815&lt;/del&gt;&lt;/a&gt;, is anyone from HPE currently looking at this issue?  There were comments in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14541&quot; title=&quot;Memory reclaim caused a stale data read&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14541&quot;&gt;&lt;del&gt;LU-14541&lt;/del&gt;&lt;/a&gt; that Shadow was looking into this same issue:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;This patch is tested already - it solve a problem, but some problems exist.&lt;br/&gt;
1) page isn&apos;t freed and stay in memory for long time until page cache LRU will flush it.&lt;br/&gt;
2) page without uptodate flag may cause a EIO in some cases, a specially with splice. Don&apos;t sure - but possible.&lt;/p&gt;

&lt;p&gt;I have a different patch with change a cl_page states change to avoid own a CPS_FREED pages, but no resources to verify it.&lt;br/&gt;
in our case, it reproduced with overstripe with 5000 stripes and sysctl -w vm.drop_caches=3 on client nodes in parallel to the IOR.&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;It looks like John Hammond has an MPI reproducer on &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-15815&quot; title=&quot;fast_read/stale data/reclaim workround causes SIGBUS&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-15815&quot;&gt;&lt;del&gt;LU-15815&lt;/del&gt;&lt;/a&gt;, and an even simpler reproducer on &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14541&quot; title=&quot;Memory reclaim caused a stale data read&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14541&quot;&gt;&lt;del&gt;LU-14541&lt;/del&gt;&lt;/a&gt; that can be run on a single client, so that should allow testing potential fixes much more easily.&lt;/p&gt;

&lt;p&gt;This issue is one of the few 2.15.0 blockers that does not have a fix yet.&lt;/p&gt;</comment>
                            <comment id="333816" author="panda" created="Wed, 4 May 2022 21:43:31 +0000"  >&lt;p&gt;We discussed a few possible solutions specifically for this (&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-15815&quot; title=&quot;fast_read/stale data/reclaim workround causes SIGBUS&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-15815&quot;&gt;&lt;del&gt;LU-15815&lt;/del&gt;&lt;/a&gt;/&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-15819&quot; title=&quot;Executables run from Lustre may receive spurious SIGBUS signals&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-15819&quot;&gt;&lt;del&gt;LU-15819&lt;/del&gt;&lt;/a&gt;) issue.&lt;/p&gt;

&lt;p&gt;E.g. we could add the page-&amp;gt;mapping == NULL check like the kernel does in do_generic_file_read() to work around invalidate_mapping_pages(). However, we want to avoid copying filemap_fault() implementations for every supported kernel.&lt;/p&gt;
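&lt;p&gt;The idea of that check can be sketched with a small toy model. This is only an illustration under stated assumptions, not kernel or Lustre code: Page, PageCache, read_page() and fault() are hypothetical names, and the model assumes the racing invalidation both clears the uptodate flag and detaches the page from the cache. With the check enabled, the fault path looks the page up again instead of returning VM_FAULT_SIGBUS.&lt;/p&gt;

```python
# Toy model only: none of these names are real kernel or Lustre APIs.
VM_FAULT_OK = "VM_FAULT_OK"
VM_FAULT_SIGBUS = "VM_FAULT_SIGBUS"


class Page:
    def __init__(self, mapping):
        self.mapping = mapping   # owning cache; None once the page is detached
        self.uptodate = False


class PageCache:
    def __init__(self):
        self.pages = {}

    def find_or_create(self, index):
        if index not in self.pages:
            self.pages[index] = Page(self)
        return self.pages[index]

    def invalidate(self, index):
        # Assumption: the racing invalidation clears the uptodate flag
        # and also removes the page from the cache.
        page = self.pages.pop(index, None)
        if page is not None:
            page.uptodate = False
            page.mapping = None


def read_page(page):
    # Stand-in for the read that fills the page and marks it uptodate.
    page.uptodate = True


def fault(cache, index, racer=None, check_mapping=False, max_retries=3):
    for attempt in range(max_retries):
        page = cache.find_or_create(index)
        read_page(page)
        if racer is not None and attempt == 0:
            racer(cache, index)          # the race window is hit once
        if page.uptodate:
            return VM_FAULT_OK
        if check_mapping and page.mapping is None:
            continue                     # page left the cache: look it up again
        return VM_FAULT_SIGBUS
    return VM_FAULT_SIGBUS


def race(cache, index):
    cache.invalidate(index)


without_check = fault(PageCache(), 0, racer=race)
with_check = fault(PageCache(), 0, racer=race, check_mapping=True)
print(without_check, with_check)
```

&lt;p&gt;In this model the first call fails the uptodate check and reports SIGBUS, while the second call retries with a fresh page. Whether the real fault path can retry this way on every supported kernel is exactly the open question above.&lt;/p&gt;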

&lt;p&gt;As for alternative fixes for &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14541&quot; title=&quot;Memory reclaim caused a stale data read&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14541&quot;&gt;&lt;del&gt;LU-14541&lt;/del&gt;&lt;/a&gt;, such as the one mentioned by &lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=alyashkov&quot; class=&quot;user-hover&quot; rel=&quot;alyashkov&quot;&gt;alyashkov&lt;/a&gt;, I believe this kind of fix was too complicated and was abandoned in favour of ClearPageUptodate(). While it was understood that calling ClearPageUptodate() was not fully legitimate, the page fault path was not considered or properly tested. I asked him to add a comment explaining why the patch was abandoned and whether it can be restored and finished.&lt;/p&gt;</comment>
                            <comment id="333919" author="jhammond" created="Thu, 5 May 2022 17:17:27 +0000"  >&lt;p&gt;&amp;gt; It looks like John Hammond has an MPI reproducer on &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-15815&quot; title=&quot;fast_read/stale data/reclaim workround causes SIGBUS&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-15815&quot;&gt;&lt;del&gt;LU-15815&lt;/del&gt;&lt;/a&gt;, and an even simpler reproducer on &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14541&quot; title=&quot;Memory reclaim caused a stale data read&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14541&quot;&gt;&lt;del&gt;LU-14541&lt;/del&gt;&lt;/a&gt; that can be run on a single client, so that should allow testing potential fixes much more easily.&lt;/p&gt;

&lt;p&gt;The MPI reproducer does not require multiple clients. It only requires 2 procs. You can use oversubscribe to place them both on the same node. It&apos;s unlikely that it has much to do with MPI; it&apos;s just that the application generates the right memory map access pattern to reproduce the bug. I did try a bit to make something simpler but was not successful. It would be good to have a simplified reproducer.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10010">
                    <name>Duplicate</name>
                                            <outwardlinks description="duplicates">
                                        <issuelink>
            <issuekey id="70132">LU-15815</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="63444">LU-14541</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i02oyv:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>