<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:37:52 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-3897] Hang up in ldlm_pools_shrink under OOM</title>
                <link>https://jira.whamcloud.com/browse/LU-3897</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Hi,&lt;/p&gt;

&lt;p&gt;Several Bull customers running Lustre 2.1.x hit a hang in ldlm_pools_shrink on a Lustre client (login node).&lt;br/&gt;
The system was hung and a dump was initiated from the BMC by sending an NMI.&lt;br/&gt;
The dump shows there was no more activity on the system. The 12 CPUs are idle (swapper).&lt;br/&gt;
A lot of processes are in page_fault(), blocked in ldlm_pools_shrink().&lt;br/&gt;
I have attached the output of the &quot;foreach bt&quot; crash command. Let me know if you need the vmcore file.&lt;/p&gt;

&lt;p&gt;Each time, we can see a lot of OOM messages in the syslog of the dump files.&lt;/p&gt;

&lt;p&gt;This issue looks like &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-2468&quot; title=&quot;MDS out of memory, blocked in ldlm_pools_shrink()&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-2468&quot;&gt;&lt;del&gt;LU-2468&lt;/del&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Thanks,&lt;br/&gt;
Sebastien.&lt;/p&gt;</description>
                <environment></environment>
        <key id="20825">LU-3897</key>
            <summary>Hang up in ldlm_pools_shrink under OOM</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                <statusCategory id="3" key="done" colorName="success"/>
                <resolution id="6">Not a Bug</resolution>
                <assignee username="bobijam">Zhenyu Xu</assignee>
                <reporter username="sebastien.buisson">Sebastien Buisson</reporter>
                <labels>
                </labels>
                <created>Fri, 6 Sep 2013 14:02:33 +0000</created>
                <updated>Thu, 5 Dec 2013 18:11:05 +0000</updated>
                <resolved>Thu, 5 Dec 2013 18:11:04 +0000</resolved>
                <version>Lustre 2.1.3</version>
                <due></due>
                <votes>0</votes>
                <watches>6</watches>
                    <comments>
                            <comment id="65939" author="pjones" created="Fri, 6 Sep 2013 14:07:26 +0000"  >&lt;p&gt;Bobijam&lt;/p&gt;

&lt;p&gt;What do you recommend here?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="66025" author="bobijam" created="Mon, 9 Sep 2013 01:40:11 +0000"  >&lt;p&gt;Oleg, &lt;/p&gt;

&lt;p&gt;Would you consider &lt;a href=&quot;http://review.whamcloud.com/#/c/4954/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/4954/&lt;/a&gt;? I don&apos;t know whether Vitaly has addressed the issue somewhere.&lt;/p&gt;</comment>
                            <comment id="66032" author="green" created="Mon, 9 Sep 2013 04:40:27 +0000"  >&lt;p&gt;I doubt that patch would help.&lt;/p&gt;

&lt;p&gt;Sebastien, can we get the kernel dmesg log please? I wonder what&apos;s there.&lt;/p&gt;

&lt;p&gt;It seems that something forgot to release the namespace lock on the client (at least I don&apos;t see anything running in the area guarded by this lock; everybody seems to be waiting to acquire it). Possibly some error exit path forgets to release it, and hopefully there&apos;s a clue in the dmesg.&lt;/p&gt;</comment>
                            <comment id="66044" author="sebastien.buisson" created="Mon, 9 Sep 2013 09:18:11 +0000"  >&lt;p&gt;Hi,&lt;/p&gt;

&lt;p&gt;This is the dmesg requested by Oleg, taken from the crash dump.&lt;/p&gt;

&lt;p&gt;Sebastien.&lt;/p&gt;</comment>
                            <comment id="71106" author="bobijam" created="Fri, 8 Nov 2013 06:54:56 +0000"  >&lt;p&gt;Hmm, why does the dmesg not contain any Lustre log information? I only see the call trace and mem-info report in it.&lt;/p&gt;</comment>
                            <comment id="71123" author="sebastien.buisson" created="Fri, 8 Nov 2013 15:13:47 +0000"  >&lt;p&gt;Probably so much information was dumped to the dmesg regarding the OOM issue that it displaced the Lustre logs.&lt;/p&gt;

&lt;p&gt;From the dump, do you know how we can access the memory containing the Lustre debug logs?&lt;/p&gt;</comment>
                            <comment id="71412" author="bobijam" created="Wed, 13 Nov 2013 13:52:08 +0000"  >&lt;p&gt;I haven&apos;t found relevant info to determine the root cause.&lt;/p&gt;

&lt;p&gt;What&apos;s the memory usage on the clients? I wonder whether the client is using too much memory for locks or for the data cache.&lt;/p&gt;

&lt;p&gt;What are the &quot;lctl get_param ldlm.namespaces.*.lru_size&quot; values?&lt;/p&gt;</comment>
                            <comment id="71513" author="sebastien.buisson" created="Thu, 14 Nov 2013 10:01:47 +0000"  >&lt;p&gt;Hi,&lt;/p&gt;

&lt;p&gt;Not too easy to get the lru_size from the crash dump.&lt;br/&gt;
We dumped the &apos;struct obd_device&apos; in the obd_devs array, then the &apos;struct obd_namespace&apos; to reach the &apos;ns_nr_unused&apos; field corresponding to the lru_size information.&lt;/p&gt;

&lt;p&gt;All values are 0.&lt;/p&gt;

&lt;p&gt;HTH,&lt;br/&gt;
Sebastien.&lt;/p&gt;</comment>
                            <comment id="72388" author="bobijam" created="Wed, 27 Nov 2013 14:15:25 +0000"  >&lt;p&gt;If lru_size is 0, it means locks do not take much memory. What is the &quot;lctl get_param llite.*.max_cached_mb&quot; output? Can you set it to a smaller value to keep less data cache on the client side?&lt;/p&gt;</comment>
                            <comment id="72521" author="sebastien.buisson" created="Fri, 29 Nov 2013 14:03:23 +0000"  >&lt;p&gt;Hi,&lt;/p&gt;

&lt;p&gt;In the crash dump I looked at the value of ll_async_page_max, which is 12365796, so max_cached_mb is around 48 GB, i.e. 3/4 of the total node memory.&lt;/p&gt;

&lt;p&gt;But in any case, remember there is a regression in CLIO in 2.1 causing this max_cached_mb value to never be used!&lt;/p&gt;

&lt;p&gt;So there is no way to limit the Lustre data cache &lt;img class=&quot;emoticon&quot; src=&quot;https://jira.whamcloud.com/images/icons/emoticons/sad.png&quot; height=&quot;16&quot; width=&quot;16&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/p&gt;

&lt;p&gt;Sebastien.&lt;/p&gt;</comment>
                            <comment id="72683" author="jay" created="Tue, 3 Dec 2013 07:28:08 +0000"  >&lt;p&gt;Hi Sebastien, can you please show me the output of `slabtop&apos;?&lt;/p&gt;</comment>
                            <comment id="72689" author="sebastien.buisson" created="Tue, 3 Dec 2013 10:23:57 +0000"  >&lt;p&gt;Hi,&lt;/p&gt;

&lt;p&gt;I have no idea how I can run slabtop inside crash :/&lt;br/&gt;
I have uploaded a tarball to the ftp server, under uploads/&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3897&quot; title=&quot;Hang up in ldlm_pools_shrink under OOM&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3897&quot;&gt;&lt;del&gt;LU-3897&lt;/del&gt;&lt;/a&gt;. It contains the vmcore, the vmlinux, and the Lustre modules.&lt;/p&gt;

&lt;p&gt;Maybe you will be able to get the information you are looking for.&lt;/p&gt;

&lt;p&gt;Thanks,&lt;br/&gt;
Sebastien.&lt;/p&gt;</comment>
                            <comment id="72728" author="jay" created="Tue, 3 Dec 2013 18:50:57 +0000"  >&lt;p&gt;`kmem -s&apos; will print out slabinfo in this case. Thanks for the crashdump, and I will take a look.&lt;/p&gt;</comment>
                            <comment id="72731" author="jay" created="Tue, 3 Dec 2013 19:07:01 +0000"  >&lt;p&gt;The system ran out of memory, as shown by the output of `kmem -i&apos;:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;crash&amp;gt; kmem -i 
              PAGES        TOTAL      PERCENTAGE
 TOTAL MEM  16487729      62.9 GB         ----
      FREE    37266     145.6 MB    0% of TOTAL MEM
      USED  16450463      62.8 GB   99% of TOTAL MEM
    SHARED      401       1.6 MB    0% of TOTAL MEM
   BUFFERS       99       396 KB    0% of TOTAL MEM
    CACHED       57       228 KB    0% of TOTAL MEM
      SLAB    22513      87.9 MB    0% of TOTAL MEM

TOTAL SWAP   255857     999.4 MB         ----
 SWAP USED   255856     999.4 MB   99% of TOTAL SWAP
 SWAP FREE        1         4 KB    0% of TOTAL SWAP
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Both system memory and swap space were used up. Page cache and slab cache were normal, no excessive usage at all.&lt;/p&gt;

&lt;p&gt;It turned out most of the memory was used for anonymous memory mappings.&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;crash&amp;gt; kmem -V
  VM_STAT:
          NR_FREE_PAGES: 37266
       NR_INACTIVE_ANON: 956162
         NR_ACTIVE_ANON: 15158596
       NR_INACTIVE_FILE: 0
         NR_ACTIVE_FILE: 99
         NR_UNEVICTABLE: 0
               NR_MLOCK: 0
          NR_ANON_PAGES: 16115037
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;There were 16115037 NR_ANON_PAGES, which is 61 GB in size. I guess the application is probably using too much memory or is badly written. I don&apos;t think we can help in this case.&lt;/p&gt;</comment>
                            <comment id="72908" author="jay" created="Thu, 5 Dec 2013 18:11:05 +0000"  >&lt;p&gt;Please reopen this ticket if you have more questions.&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                            <attachment id="13436" name="UJF_crash_foreach_bt.gz" size="36465" author="sebastien.buisson" created="Fri, 6 Sep 2013 14:02:33 +0000"/>
                            <attachment id="13440" name="UJF_dmesg.zip" size="45884" author="sebastien.buisson" created="Mon, 9 Sep 2013 09:18:11 +0000"/>
                    </attachments>
                <subtasks>
                </subtasks>
                <customfields>
                    <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzw0nr:</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>10187</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>