<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:18:27 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-1645] shrinker not shrinking/taking too long to shrink?</title>
                <link>https://jira.whamcloud.com/browse/LU-1645</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;We&apos;re seeing high load average on some Lustre clients accompanied by processes that are potentially stuck in the ldlm shrinker. Here&apos;s a sample stack trace:&lt;/p&gt;

&lt;p&gt;thread_return+0x38/0x34c&lt;br/&gt;
wake_affine+0x357/0x3b0&lt;br/&gt;
enqueue_sleeper+0x178/0x1c0&lt;br/&gt;
enqueue_entity+0x158/0x1c0&lt;br/&gt;
cfs_hash_bd_lookup_intent+0x27/0x110 &lt;span class=&quot;error&quot;&gt;&amp;#91;libcfs&amp;#93;&lt;/span&gt;&lt;br/&gt;
cfs_hash_dual_bd_unlock+0x2c/0x80 &lt;span class=&quot;error&quot;&gt;&amp;#91;libcfs&amp;#93;&lt;/span&gt;&lt;br/&gt;
cfs_hash_lookup+0x7a/0xa0 &lt;span class=&quot;error&quot;&gt;&amp;#91;libcfs&amp;#93;&lt;/span&gt;&lt;br/&gt;
ldlm_pool_shrink+0x31/0xf0 &lt;span class=&quot;error&quot;&gt;&amp;#91;ptlrpc&amp;#93;&lt;/span&gt;&lt;br/&gt;
cl_env_fetch+0x1d/0x60 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
cl_env_reexit+0xe/0x130 &lt;span class=&quot;error&quot;&gt;&amp;#91;obdclass&amp;#93;&lt;/span&gt;&lt;br/&gt;
ldlm_pools_shrink+0x1d2/0x310 &lt;span class=&quot;error&quot;&gt;&amp;#91;ptlrpc&amp;#93;&lt;/span&gt;&lt;br/&gt;
zone_watermark_ok+0x1b/0xd0&lt;br/&gt;
get_page_from_freelist+0x17a/0x720&lt;br/&gt;
apic_timer_interrupt+0xe/0x20&lt;br/&gt;
smp_call_function_many+0x1c0/0x250&lt;br/&gt;
drain_local_pages+0x0/0x10&lt;br/&gt;
smp_call_function+0x20/0x30&lt;br/&gt;
on_each_cpu+0x1d/0x40&lt;br/&gt;
__alloc_pages_slowpath+0x278/0x5f0&lt;br/&gt;
__alloc_pages_nodemask+0x13a/0x140&lt;br/&gt;
__get_free_pages+0x9/0x50&lt;br/&gt;
dup_task_struct+0x42/0x150&lt;br/&gt;
copy_process+0xb4/0xe50&lt;br/&gt;
do_fork+0x8c/0x3c0&lt;br/&gt;
sys_rt_sigreturn+0x222/0x2a0&lt;br/&gt;
stub_clone+0x13/0x20&lt;br/&gt;
system_call_fastpath+0x16/0x1b&lt;/p&gt;

&lt;p&gt;FWIW, some of the traces have cfs_hash_bd_lookup_intent+0x27 as the top line. All of them &lt;/p&gt;

&lt;p&gt;About 3/4 of the memory is inactive:&lt;/p&gt;

&lt;p&gt;pfe11 ~ # cat /proc/meminfo&lt;br/&gt;
MemTotal:       16333060 kB&lt;br/&gt;
MemFree:          344568 kB&lt;br/&gt;
Buffers:           86844 kB&lt;br/&gt;
Cached:          1488340 kB&lt;br/&gt;
SwapCached:         4864 kB&lt;br/&gt;
Active:          1523184 kB&lt;br/&gt;
Inactive:       12045612 kB&lt;br/&gt;
Active(anon):       9152 kB&lt;br/&gt;
Inactive(anon):     7012 kB&lt;br/&gt;
Active(file):    1514032 kB&lt;br/&gt;
Inactive(file): 12038600 kB&lt;br/&gt;
Unevictable:        3580 kB&lt;br/&gt;
Mlocked:            3580 kB&lt;br/&gt;
SwapTotal:      10388652 kB&lt;br/&gt;
SwapFree:       10136240 kB&lt;br/&gt;
Dirty:               244 kB&lt;br/&gt;
Writeback:           976 kB&lt;br/&gt;
AnonPages:         15600 kB&lt;br/&gt;
Mapped:            20296 kB&lt;br/&gt;
Shmem:                 0 kB&lt;br/&gt;
Slab:             870808 kB&lt;br/&gt;
SReclaimable:      64868 kB&lt;br/&gt;
SUnreclaim:       805940 kB&lt;br/&gt;
KernelStack:        4312 kB&lt;br/&gt;
PageTables:        14840 kB&lt;br/&gt;
NFS_Unstable:          0 kB&lt;br/&gt;
Bounce:                0 kB&lt;br/&gt;
WritebackTmp:          0 kB&lt;br/&gt;
CommitLimit:    18555180 kB&lt;br/&gt;
Committed_AS:    1074912 kB&lt;br/&gt;
VmallocTotal:   34359738367 kB&lt;br/&gt;
VmallocUsed:      544012 kB&lt;br/&gt;
VmallocChunk:   34343786784 kB&lt;br/&gt;
HardwareCorrupted:     0 kB&lt;br/&gt;
HugePages_Total:       0&lt;br/&gt;
HugePages_Free:        0&lt;br/&gt;
HugePages_Rsvd:        0&lt;br/&gt;
HugePages_Surp:        0&lt;br/&gt;
Hugepagesize:       2048 kB&lt;br/&gt;
DirectMap4k:        7168 kB&lt;br/&gt;
DirectMap2M:    16769024 kB&lt;/p&gt;

&lt;p&gt;We&apos;ve seen this on two clients in the last two days, and I think we have several other undiagnosed cases in the recent past. The client that did it yesterday was generating OOM messages at the time; today&apos;s client did not.&lt;/p&gt;

&lt;p&gt;I have a crash dump, but I&apos;m having trouble getting good stack traces out of it. I&apos;ll attach the output from sysrq-t to start. I can&apos;t share the crash dump due to our security policies, but I can certainly run commands against it for you, as necessary.&lt;/p&gt;

&lt;p&gt;If there&apos;s more information I can gather from a running system before we reboot it, let me know - I imagine we&apos;ll have another one soon.&lt;/p&gt;</description>
                <environment>SLES 11SP1 Kernel 2.6.32.54-0.3.1.20120223-nasa</environment>
        <key id="15257">LU-1645</key>
            <summary>shrinker not shrinking/taking too long to shrink?</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="5">Cannot Reproduce</resolution>
                                        <assignee username="bogl">Bob Glossman</assignee>
                                    <reporter username="rappleye">jason.rappleye@nasa.gov</reporter>
                        <labels>
                    </labels>
                <created>Wed, 18 Jul 2012 15:43:34 +0000</created>
                <updated>Wed, 27 Mar 2013 14:53:14 +0000</updated>
                            <resolved>Wed, 27 Mar 2013 14:53:13 +0000</resolved>
                                    <version>Lustre 2.1.2</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>3</watches>
                                                                            <comments>
                            <comment id="41986" author="pjones" created="Wed, 18 Jul 2012 16:53:19 +0000"  >&lt;p&gt;Bob will look into this one&lt;/p&gt;</comment>
                            <comment id="42020" author="jaylan" created="Thu, 19 Jul 2012 17:47:28 +0000"  >&lt;p&gt;I uploaded bt-a.txt, containing the stack traces from when the crash dump was taken.&lt;br/&gt;
Note that CPU2 was in shrink_slab and CPU4 and CPU5 were in shrink_zone.&lt;/p&gt;</comment>
                            <comment id="42025" author="bogl" created="Thu, 19 Jul 2012 19:20:24 +0000"  >&lt;p&gt;There&apos;s a suspicion here that this may be an instance of a known bug, &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-1576&quot; title=&quot;client sluggish after running lpurge&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-1576&quot;&gt;&lt;del&gt;LU-1576&lt;/del&gt;&lt;/a&gt;. If you can reproduce the problem, you can try dropping caches with:&lt;/p&gt;

&lt;p&gt;echo 3 &amp;gt; /proc/sys/vm/drop_caches&lt;/p&gt;

&lt;p&gt;If that raises the MemFree amount a lot and eliminates the OOMs then it&apos;s probably the known bug.&lt;br/&gt;
If so, the patch in &lt;a href=&quot;http://review.whamcloud.com/#change,3255&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#change,3255&lt;/a&gt; may help.&lt;/p&gt;</comment>
                            <comment id="42030" author="rappleye" created="Thu, 19 Jul 2012 22:37:10 +0000"  >&lt;p&gt;That looks promising. I&apos;ve asked our operations staff to try that and collect /proc/meminfo before and after. I&apos;ll report back with the results after the next incident. Thanks!&lt;/p&gt;</comment>
                            <comment id="42159" author="rappleye" created="Mon, 23 Jul 2012 20:18:42 +0000"  >&lt;p&gt;lflush + drop caches doesn&apos;t work. What&apos;s the next step in debugging this problem? I have a crash dump or two that might help, but you&apos;ll need to let me know what you need - as per our security policies, I can&apos;t send them to you.&lt;/p&gt;</comment>
                            <comment id="54916" author="pjones" created="Wed, 27 Mar 2013 14:53:14 +0000"  >&lt;p&gt;NASA reports that this no longer seems to be a problem, so it was quite possibly a duplicate, as the issue mentioned is fixed in the release that is now in production.&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                            <attachment id="11703" name="bt-a.txt" size="18186" author="jaylan" created="Thu, 19 Jul 2012 17:47:28 +0000"/>
                            <attachment id="11699" name="service31-sysrq-t.txt" size="335989" author="rappleye" created="Wed, 18 Jul 2012 15:43:34 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                    <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzvkef:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>7034</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>