<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:13:30 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-1099] Lustre OSS OOMs repeatedly</title>
                <link>https://jira.whamcloud.com/browse/LU-1099</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Today we had an IB-connected 2.1 server OOM out of the blue.  After rebooting the node, the OSS OOMs again after it has been in recovery for a little while.  This OSS serves 15 OSTs.&lt;/p&gt;

&lt;p&gt;With one of the reboots, we added the &quot;malloc&quot; debug level and used the debug daemon to collect a log.  Note that we saw messages about dropped log messages, so be aware that lines are missing from it.  I will upload that to the ftp site, as it is too large for jira.  Filename will be sumom31-lustre.log.txt.bz2.&lt;/p&gt;
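
&lt;p&gt;For reference, the debug flag and daemon were enabled with roughly the following commands (the output file name and buffer size here are just illustrative):&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# add the malloc debug flag on the OSS and start the debug daemon
lctl set_param debug=+malloc
lctl debug_daemon start /tmp/sumom31-lustre.log 1024
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;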

&lt;p&gt;We also extracted a lustre log at our default logging level from the original crash dump after the first oom.  I will attach that here.&lt;/p&gt;

&lt;p&gt;Note also that we have the obdfilter writethrough and read caches disabled at this time.&lt;/p&gt;
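
&lt;p&gt;The caches are turned off with lctl set_param shortly after mount; the script does something like the following (exact contents may differ slightly):&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# disable the obdfilter read and writethrough caches on all OSTs
lctl set_param obdfilter.*.read_cache_enable=0
lctl set_param obdfilter.*.writethrough_cache_enable=0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;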

&lt;p&gt;Using the crash &quot;kmem&quot; command, it is clear that most of the memory is used in slab, but not attributed to any of the Lustre named slabs.  Here is the short kmem -i:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;crash&amp;gt; kmem -i
              PAGES        TOTAL      PERCENTAGE
 TOTAL MEM  6117058      23.3 GB         ----
      FREE    37513     146.5 MB    0% of TOTAL MEM
      USED  6079545      23.2 GB   99% of TOTAL MEM
    SHARED     3386      13.2 MB    0% of TOTAL MEM
   BUFFERS     3297      12.9 MB    0% of TOTAL MEM
    CACHED    26240     102.5 MB    0% of TOTAL MEM
      SLAB  5908658      22.5 GB   96% of TOTAL MEM
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The OSS has 24G of RAM total.&lt;/p&gt;

&lt;p&gt;The biggest consumers by far are size-8192, size-1024, and size-2048.  I will attach the full &quot;kmem -s&quot; output as well.&lt;/p&gt;
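
&lt;p&gt;(Those were pulled out of crash with something like the command below; the exact grep pattern is only for illustration.)&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;crash&amp;gt; kmem -s | grep -E &apos;size-(8192|2048|1024) &apos;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;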

&lt;p&gt;We attempted to work around the problem by starting one OST at a time, and allowing it to fully recover before starting the next OST.  By the end of the third OST&apos;s recovery, memory usage was normal.  During the fourth OST&apos;s recovery the memory usage spiked and the node OOMed.&lt;/p&gt;

&lt;p&gt;We finally gave up and mounted with the abort_recovery option, and things seem to be running fine at the moment.&lt;/p&gt;</description>
                <environment>Lustre 2.1.0-21chaos (github.com/chaos/lustre)</environment>
        <key id="13170">LU-1099</key>
            <summary>Lustre OSS OOMs repeatedly</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="3">Duplicate</resolution>
                                        <assignee username="green">Oleg Drokin</assignee>
                                    <reporter username="morrone">Christopher Morrone</reporter>
                        <labels>
                            <label>llnl</label>
                    </labels>
                <created>Mon, 13 Feb 2012 21:26:28 +0000</created>
                <updated>Wed, 28 Feb 2018 20:27:23 +0000</updated>
                            <resolved>Wed, 28 Feb 2018 20:27:23 +0000</resolved>
                                    <version>Lustre 2.1.0</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>8</watches>
                                                                            <comments>
                            <comment id="28609" author="green" created="Tue, 14 Feb 2012 00:09:23 +0000"  >&lt;p&gt;Is this one with server side read and write-through caches disabled?&lt;/p&gt;</comment>
                            <comment id="28645" author="morrone" created="Tue, 14 Feb 2012 12:45:04 +0000"  >&lt;p&gt;That is what I reported, yes.  But now that I think about it, the caches were probably enabled still on this server cluster when we saw the first occurance of an OOM.&lt;/p&gt;

&lt;p&gt;The admins are using lctl set_param to disable the caches from a lustre script after mount occurs.  That should be fairly early in the recovery process.&lt;/p&gt;</comment>
                            <comment id="28776" author="green" created="Wed, 15 Feb 2012 13:35:40 +0000"  >&lt;p&gt;Ah, indeed I missed the caches disabled line somehow.&lt;/p&gt;

&lt;p&gt;Anyway, the default log unfortunately does not have allocation information collected, so it&apos;s hard to know what was actually allocating.&lt;br/&gt;
I suspected it might be related to ldlm locks, but apparently not, based on the data in the kmem -s output.&lt;/p&gt;

&lt;p&gt;I wonder what sort of allocation we might have tied to the connected clients other than ldlm locks, and nothing obvious comes to mind, esp. at these sizes.&lt;/p&gt;

&lt;p&gt;You say you&apos;ll upload the log with malloc debug under the name sumom31-lustre.log.txt.bz2, but it&apos;s nowhere to be found on our ftp. Any chance I can take a look at it?&lt;/p&gt;</comment>
                            <comment id="28837" author="morrone" created="Wed, 15 Feb 2012 21:14:18 +0000"  >&lt;p&gt;Hmmm, I could have sworn that I uploaded the file.  Maybe I got distracted.  Sorry about that.  I just pushed sumom31-lustre.log.txt.bz2 to the ftp uploads directory.&lt;/p&gt;</comment>
                            <comment id="28838" author="green" created="Wed, 15 Feb 2012 21:50:16 +0000"  >&lt;p&gt;Ok, so the huge number of 8k allocations comes from rqbd allocs and they are never freed (during the log anyway).&lt;br/&gt;
I imagine if you have a big enough cluster with quite a bit of active clients, a lot of them might have uncommitted transactions that we would accumulate in transaction order to replay, esp. if there is a dead (or just very slow to reconnect) client out there that happens to hold some early transaction that is not yet committed.&lt;br/&gt;
Unfortunately without D_HA debug level I cannot know how far from the truth that idea is.&lt;/p&gt;</comment>
                            <comment id="39978" author="liang" created="Tue, 5 Jun 2012 00:03:17 +0000"  >&lt;p&gt;Although ptlrpc service sets &quot;Lazy&quot; for request portal, which means request could be blocked inside LNet layer (instead of dropping), but ptlrpc service will always try to allocate enough buffer (rqbd) and ptlrpc_request for incoming requests, these rqbds will not be freed until shutting down service, this is kind of thing we might improve, but my question is, is it expected that we have so many pending requests on this OSS? &lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;ffff88033fd304c0 size-8192               8192    1413671   1413672 1413672     8k
ffff88033fcd0340 size-1024               1024    8298628   8298832 2074708     4k
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I remember each ptlrpc_request should take about 900 bytes, so I think most of these 1K objects are ptlrpc_requests, and there are over 8 million requests here (the 8,298,628 size-1024 objects alone are roughly 8 GB, and the 1.4 million 8k rqbds another ~11 GB, which together accounts for most of the 22.5 GB of slab)...&lt;/p&gt;</comment>
                            <comment id="39981" author="niu" created="Tue, 5 Jun 2012 00:17:30 +0000"  >&lt;p&gt;Was the OOM triggered during recovery or after recovery? from the log (sumom31_oss_after_reboot_lustre.log.txt.bz2), I see all the 15 OSTs have finished recovery.&lt;/p&gt;

&lt;p&gt;The huge rqbd buffer (a grow-only buffer) could be caused by request bursting during recovery; only lock replay isn&apos;t throttled during recovery (resend after recovery isn&apos;t throttled either, but I don&apos;t think there are many requests to resend). Was there a huge number of cached write locks on the clients before the OSS reboot?&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10010">
                    <name>Duplicate</name>
                                            <outwardlinks description="duplicates">
                                        <issuelink>
            <issuekey id="45601">LU-9372</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="10832" name="kmem_s_first_time.txt" size="19685" author="morrone" created="Mon, 13 Feb 2012 21:26:28 +0000"/>
                            <attachment id="10833" name="sumom31_oss_after_reboot_lustre.log.txt.bz2" size="2061549" author="morrone" created="Mon, 13 Feb 2012 21:26:28 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10490" key="com.atlassian.jira.plugin.system.customfieldtypes:datepicker">
                        <customfieldname>End date</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>Fri, 27 Jun 2014 21:26:28 +0000</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                            <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzw0db:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>10139</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                        <customfield id="customfield_10493" key="com.atlassian.jira.plugin.system.customfieldtypes:datepicker">
                        <customfieldname>Start date</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>Mon, 13 Feb 2012 21:26:28 +0000</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                    </customfields>
    </item>
</channel>
</rss>