<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:45:26 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
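The field restriction described above can be exercised from the command line. This is a minimal sketch, not taken from this file: the /si/jira.issueviews:issue-xml/ view path is the conventional JIRA URL pattern for this XML representation and is assumed here, as is the use of curl.

```shell
# Build the issue-xml URL for this issue, restricted to key and summary.
# The /si/jira.issueviews:issue-xml/<KEY>/<KEY>.xml path is the usual JIRA
# pattern for this view (an assumption; verify against your instance).
KEY="LU-4740"
BASE="https://jira.whamcloud.com"
URL="${BASE}/si/jira.issueviews:issue-xml/${KEY}/${KEY}.xml?field=key&field=summary"
echo "$URL"
# curl -s "$URL"   # uncomment to actually fetch (requires network access)
```

The same ?field=... query string works on any issue key served by this view.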
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-4740] MDS - buffer cache not freed</title>
                <link>https://jira.whamcloud.com/browse/LU-4740</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;On our MDS, we seem to have a memory leak related to buffer cache that is&lt;br/&gt;
unreclaimable. Our workload is extremely metadata intensive, so that the MDS is under constant heavy load.&lt;/p&gt;

&lt;p&gt;After a fresh reboot the buffer cache fills up quickly. After a while RAM is used up and the machine starts swapping, basically bringing Lustre to a halt&lt;br/&gt;
(clients disconnect, lock failures, etc.).&lt;/p&gt;

&lt;p&gt;The strange thing is that&lt;/p&gt;

&lt;p&gt;$ echo 3 &amp;gt; /proc/sys/vm/drop_caches&lt;/p&gt;

&lt;p&gt;only frees part of the allocated buffer cache and after a while the unreclaimable part fills up RAM completely leading to the swap disaster.&lt;/p&gt;

&lt;p&gt;Setting /proc/sys/vm/vfs_cache_pressure &amp;gt; 100 doesn&apos;t help and&lt;br/&gt;
a large value of /proc/sys/vm/min_free_kbytes is happily ignored.&lt;/p&gt;

&lt;p&gt;Also strange: after unmounting all Lustre targets and even unloading the Lustre kernel modules, the kernel still shows the amount of previously allocated buffer cache as used memory, even though the amount of buffer cache is then shown as close to zero. So it seems we have a big memory leak.&lt;/p&gt;</description>
                <environment>vanilla 2.6.32.61&lt;br/&gt;
lustre 2.4.2&lt;br/&gt;
Hardware: Dual Xeon L5640 / 24G RAM&lt;br/&gt;
</environment>
        <key id="23545">LU-4740</key>
            <summary>MDS - buffer cache not freed</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="2" iconUrl="https://jira.whamcloud.com/images/icons/priorities/critical.svg">Critical</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="5">Cannot Reproduce</resolution>
                                        <assignee username="wc-triage">WC Triage</assignee>
                                    <reporter username="rfehren">Roland Fehrenbacher</reporter>
                        <labels>
                    </labels>
                <created>Sat, 8 Mar 2014 16:24:51 +0000</created>
                <updated>Sat, 9 Oct 2021 06:34:05 +0000</updated>
                            <resolved>Sat, 9 Oct 2021 06:34:05 +0000</resolved>
                                    <version>Lustre 2.4.2</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>8</watches>
                                                                            <comments>
                            <comment id="79030" author="rfehren" created="Tue, 11 Mar 2014 19:33:50 +0000"  >&lt;p&gt;I have attached the output of /proc/meminfo and &quot;slabtop -o -s c&quot;. Additionally, I&apos;ve compiled in&lt;br/&gt;
support for kmemleak, but &quot;cat /sys/kernel/debug/kmemleak&quot; is empty.&lt;/p&gt;</comment>
                            <comment id="79040" author="aakef" created="Tue, 11 Mar 2014 20:46:33 +0000"  >&lt;p&gt;According to meminfo swap is not used at all. Are you sure the logs are from the time when the issue comes up?&lt;/p&gt;</comment>
                            <comment id="79042" author="rfehren" created="Tue, 11 Mar 2014 21:20:12 +0000"  >&lt;p&gt;No, these files were from about 1 hour after I started Lustre. &lt;br/&gt;
To make things clearer, I deleted the old files and attached 3 pairs of meminfo and slabtop output. The first pair *.1&lt;br/&gt;
is taken after a reboot before mounting Lustre. The second pair was taken 9 1/2 hours later right after a drop_caches and&lt;br/&gt;
before I had to unmount Lustre and move things to the HA peer node (we don&apos;t want downtime during working hours).&lt;br/&gt;
Note that the swapping deadlock stage hadn&apos;t yet been reached at this point, but one can clearly see that the RAM has disappeared:&lt;/p&gt;

&lt;p&gt;MemTotal - MemFree = MemUsed = 18GB&lt;br/&gt;
Buffers + Cache + SLAB + misc = 14GB&lt;/p&gt;

&lt;p&gt;Also, and most fatally, the buffer cache is not cleared by drop_caches.&lt;/p&gt;

&lt;p&gt;Finally the last pair is right after I unmounted Lustre. The buffer cache is cleared,&lt;br/&gt;
but Active(file) remains at 12GB (pretty much the amount of unreclaimable buffer cache&lt;br/&gt;
from before the umount).&lt;/p&gt;</comment>
                            <comment id="79150" author="adilger" created="Wed, 12 Mar 2014 17:17:52 +0000"  >&lt;p&gt;Roland, is it possible for you to test this with the current master branch?  There have been a number of fixes related to memory usage on both the client and server.&lt;/p&gt;

&lt;p&gt;In particular, &lt;a href=&quot;http://review.whamcloud.com/9223&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/9223&lt;/a&gt; fixed an allocation problem in 2.4 and 2.5, and there are others that are linked under &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4053&quot; title=&quot;client leaking objects/locks during IO&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4053&quot;&gt;&lt;del&gt;LU-4053&lt;/del&gt;&lt;/a&gt;.&lt;/p&gt;</comment>
                            <comment id="79215" author="rfehren" created="Thu, 13 Mar 2014 06:31:16 +0000"  >&lt;p&gt;Andreas, I&apos;ll try. Can a 2.5.56 (git master) MDS/MDT work with 2.4.2 OSS/OSTs or do I need to update&lt;br/&gt;
the whole cluster? Also note that on the client side, we&apos;re running a patched in-kernel client from 3.14-rc. Any expected compatibility problems with this combo? &lt;/p&gt;</comment>
                            <comment id="79973" author="adilger" created="Fri, 21 Mar 2014 10:12:50 +0000"  >&lt;p&gt;The 2.5 MDS is tested with 2.4 clients.  We haven&apos;t tested with 3.14 clients, but I expect those to be similar to 2.4.x.&lt;/p&gt;</comment>
                            <comment id="81475" author="rfehren" created="Sat, 12 Apr 2014 13:02:45 +0000"  >&lt;p&gt;Andreas,&lt;/p&gt;

&lt;p&gt;the issue is still there with 2.5.1 (where &lt;a href=&quot;http://review.whamcloud.com/9223&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/9223&lt;/a&gt; is included) on the servers.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://review.whamcloud.com/#/c/7942/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/7942/&lt;/a&gt;, which is also referenced under &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4053&quot; title=&quot;client leaking objects/locks during IO&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4053&quot;&gt;&lt;del&gt;LU-4053&lt;/del&gt;&lt;/a&gt;, is not in 2.5.1; in fact it is not retrievable&lt;br/&gt;
in a cloned git repo. Some questions:&lt;/p&gt;

&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;Why is it not retrievable in a local clone even though it&apos;s visible on the git web interface ( &lt;a href=&quot;http://git.whamcloud.com/?p=fs/lustre-release.git;a=commitdiff;h=99bdfc4c8e87d7a6038cb4856281302ecfe8ad34&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://git.whamcloud.com/?p=fs/lustre-release.git;a=commitdiff;h=99bdfc4c8e87d7a6038cb4856281302ecfe8ad34&lt;/a&gt; )?&lt;/li&gt;
	&lt;li&gt;What&apos;s the status of this patch? Is it advisable to apply it?&lt;/li&gt;
	&lt;li&gt;Do you have any other suggestions? We still need to reboot servers every couple of days.&lt;/li&gt;
&lt;/ul&gt;
</comment>
                            <comment id="81492" author="adilger" created="Sat, 12 Apr 2014 22:28:34 +0000"  >&lt;p&gt;I would say that, given the two -1 reviews, the 7942 patch should NOT be used in production.  If the memory is not listed in slabs or in /proc/sys/lnet/memused then it is not being allocated directly by Lustre (which would print a &quot;memory leaked&quot; message at unmount).  If the memory is not allocated directly by Lustre then the Lustre memory debugging code would not be useful for debugging this.  It might be a bug in the ldiskfs code but it is hard to know.&lt;/p&gt;

&lt;p&gt;Are there unusual uses of the filesystem (e.g. many create/delete cycles in large directories) or some other way that this could be reproduced easily?  Have you tried running with a server kernel using the in-kernel memory leak checking?  I don&apos;t know much of the details, but I think this can be compiled for RHEL kernels.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="21245">LU-4053</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="14272" name="meminfo.1" size="1014" author="rfehren" created="Wed, 12 Mar 2014 08:50:14 +0000"/>
                            <attachment id="14273" name="meminfo.2" size="1014" author="rfehren" created="Wed, 12 Mar 2014 08:50:14 +0000"/>
                            <attachment id="14274" name="meminfo.3" size="1014" author="rfehren" created="Wed, 12 Mar 2014 08:50:14 +0000"/>
                            <attachment id="14275" name="slabtop.1" size="1655" author="rfehren" created="Wed, 12 Mar 2014 08:50:14 +0000"/>
                            <attachment id="14276" name="slabtop.2" size="1666" author="rfehren" created="Wed, 12 Mar 2014 08:50:14 +0000"/>
                            <attachment id="14277" name="slabtop.3" size="1660" author="rfehren" created="Wed, 12 Mar 2014 08:50:14 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                    <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzwh7r:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>13030</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10023"><![CDATA[4]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>