<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:15:39 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-1325] loading a large enough binary from Lustre triggers the OOM killer during page_fault while a large amount of memory is available</title>
                <link>https://jira.whamcloud.com/browse/LU-1325</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;While loading a large enough binary, we hit the OOM killer during a page fault while the system still has a lot of free memory available (in our case, 60 GB free on a node with 64 GB installed).&lt;/p&gt;

&lt;p&gt;The problem doesn&apos;t show up if the binary is not big enough and if there isn&apos;t enough concurrency. A simple ls works, and so does a small program, but if the size grows to a few MB with some DSOs around and the binary is run with mpirun, the page fault appears to be interrupted by a signal in cl_lock_state_wait; the error code then propagates up to ll_fault0, where it is replaced by VM_FAULT_ERROR, which triggers the OOM.&lt;/p&gt;

&lt;p&gt;Here is the extract from the trace collected (and attached) :&lt;br/&gt;
  (cl_lock.c:986:cl_lock_state_wait()) Process leaving (rc=18446744073709551612 : -4 : fffffffffffffffc)&lt;br/&gt;
  (cl_lock.c:1310:cl_enqueue_locked()) Process leaving (rc=18446744073709551612 : -4 : fffffffffffffffc)&lt;br/&gt;
  (cl_lock.c:2175:cl_lock_request()) Process leaving (rc=18446744073709551612 : -4 : fffffffffffffffc)&lt;br/&gt;
  (cl_io.c:393:cl_lockset_lock_one()) Process leaving (rc=18446744073709551612 : -4 : fffffffffffffffc)&lt;br/&gt;
  (cl_io.c:444:cl_lockset_lock()) Process leaving (rc=18446744073709551612 : -4 : fffffffffffffffc)&lt;br/&gt;
  (cl_io.c:479:cl_io_lock()) Process leaving (rc=18446744073709551612 : -4 : fffffffffffffffc)&lt;br/&gt;
  (cl_io.c:1033:cl_io_loop()) Process leaving (rc=18446744073709551612 : -4 : fffffffffffffffc)&lt;br/&gt;
  (llite_mmap.c:298:ll_fault0()) Process leaving (rc=51 : 51 : 33)&lt;/p&gt;
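&lt;p&gt;For reference, the trace prints each return code three ways: unsigned 64-bit decimal, signed decimal, and hex. The large unsigned value 18446744073709551612 is the two&apos;s-complement encoding of -4, i.e. -EINTR, which is consistent with the page fault being interrupted by a signal. A minimal, illustrative Python sketch to decode the printed value:&lt;/p&gt;

```python
import ctypes
import errno

# Value as printed in the first column of the Lustre debug trace.
raw = 18446744073709551612

# Reinterpret the unsigned 64-bit value as a signed 64-bit integer.
rc = ctypes.c_int64(raw).value
print(rc)                    # -4

# Map the errno magnitude back to its symbolic name.
print(errno.errorcode[-rc])  # EINTR
```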

&lt;p&gt;We are able to reproduce the problem at will by scheduling, through the batch scheduler, an MPI job of 32 cores across 2 nodes (16 cores per node) on the customer system. I haven&apos;t been able to reproduce it on another system.&lt;/p&gt;

&lt;p&gt;I also tried to retrieve the culprit signal by setting panic_on_oom, but unfortunately it seems to have been cleared during the OOM handling. Using strace is too complicated with the MPI layer.&lt;/p&gt;

&lt;p&gt;Alex.&lt;/p&gt;</description>
                <environment></environment>
        <key id="14021">LU-1325</key>
            <summary>loading a large enough binary from Lustre triggers the OOM killer during page_fault while a large amount of memory is available</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="3">Duplicate</resolution>
                                        <assignee username="jay">Jinshan Xiong</assignee>
                                    <reporter username="louveta">Alexandre Louvet</reporter>
                        <labels>
                    </labels>
                <created>Mon, 16 Apr 2012 04:28:38 +0000</created>
                <updated>Thu, 7 Jun 2012 11:38:28 +0000</updated>
                            <resolved>Thu, 7 Jun 2012 11:38:28 +0000</resolved>
                                    <version>Lustre 2.1.0</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>2</watches>
                                                                            <comments>
                            <comment id="34812" author="pjones" created="Mon, 16 Apr 2012 12:30:31 +0000"  >&lt;p&gt;Jinshan will look into this one&lt;/p&gt;</comment>
                            <comment id="35054" author="jay" created="Wed, 18 Apr 2012 17:18:58 +0000"  >&lt;p&gt;Please try patch &lt;a href=&quot;http://review.whamcloud.com/2574&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/2574&lt;/a&gt;.&lt;/p&gt;</comment>
                            <comment id="38817" author="bfaccini" created="Tue, 15 May 2012 10:42:33 +0000"  >&lt;p&gt;A nasty side-effect of this problem is that it often (always??) leaves processes stuck on at least one mm_struct-&amp;gt;mmap_sem while the owner of the semaphore is impossible to find.&lt;/p&gt;

&lt;p&gt;This may come from a hole/bug in the OOM algorithm that allows a process either to take the semaphore and leave, or to self-deadlock on it ...&lt;/p&gt;

&lt;p&gt;The bad thing is that an affected node ultimately has to be rebooted, since commands like &quot;ps/pidof/swapoff/...&quot; also block forever on these semaphores.&lt;/p&gt;</comment>
                            <comment id="38838" author="jay" created="Tue, 15 May 2012 13:51:06 +0000"  >&lt;p&gt;Hi Bruno, which version of the patch are you running? I saw this problem in earlier versions, but it should have been fixed in patch set 7.&lt;/p&gt;</comment>
                            <comment id="39267" author="bfaccini" created="Wed, 23 May 2012 06:01:33 +0000"  >&lt;p&gt;I will ask our/Bull integration team and let you know.&lt;/p&gt;</comment>
                            <comment id="39905" author="pjones" created="Mon, 4 Jun 2012 05:00:18 +0000"  >&lt;p&gt;Bruno&lt;/p&gt;

&lt;p&gt;Any answer on this yet? Can we mark this as a duplicate of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-1299&quot; title=&quot;running truncated executable causes spewing of lock debug messages&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-1299&quot;&gt;&lt;del&gt;LU-1299&lt;/del&gt;&lt;/a&gt;?&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="40192" author="louveta" created="Thu, 7 Jun 2012 11:22:10 +0000"  >&lt;p&gt;To answer Jinshan&apos;s question, we never got any patch from this Jira (nor &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-1299&quot; title=&quot;running truncated executable causes spewing of lock debug messages&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-1299&quot;&gt;&lt;del&gt;LU-1299&lt;/del&gt;&lt;/a&gt;).&lt;br/&gt;
On our side, this LU has been a duplicate of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-1299&quot; title=&quot;running truncated executable causes spewing of lock debug messages&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-1299&quot;&gt;&lt;del&gt;LU-1299&lt;/del&gt;&lt;/a&gt; for a while, so yes, you can mark it as a duplicate of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-1299&quot; title=&quot;running truncated executable causes spewing of lock debug messages&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-1299&quot;&gt;&lt;del&gt;LU-1299&lt;/del&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Alex.&lt;/p&gt;</comment>
                            <comment id="40193" author="pjones" created="Thu, 7 Jun 2012 11:38:28 +0000"  >&lt;p&gt;ok thanks Alexandre.&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                            <attachment id="11151" name="trace.107493.txt.gz" size="194016" author="louveta" created="Mon, 16 Apr 2012 04:28:38 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                    <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzvh2f:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>6414</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>