<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:53:59 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-5726] MDS buffer not freed when deleting files</title>
                <link>https://jira.whamcloud.com/browse/LU-5726</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;When deleting large numbers of files, memory usage on the MDS server grows significantly.  Attempts to reclaim memory by dropping caches only results in some of the memory being freed.  The buffer usage continues to grow until eventually the MDS server starts OOMing.&lt;/p&gt;

&lt;p&gt;The rate at which the buffer usage grows seems to vary but looks like it might be based on the number of clients that are deleting files and the speed at which the files are deleted.&lt;/p&gt;</description>
                <environment>CentOS 6.5&lt;br/&gt;
Kernel 2.6.32-358.23.2</environment>
        <key id="26970">LU-5726</key>
            <summary>MDS buffer not freed when deleting files</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="1" iconUrl="https://jira.whamcloud.com/images/icons/priorities/blocker.svg">Blocker</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="niu">Niu Yawei</assignee>
                                    <reporter username="rmohr">Rick Mohr</reporter>
                        <labels>
                    </labels>
                <created>Sat, 11 Oct 2014 06:39:34 +0000</created>
                <updated>Tue, 3 Mar 2015 16:18:00 +0000</updated>
                            <resolved>Thu, 5 Feb 2015 18:54:47 +0000</resolved>
                                    <version>Lustre 2.4.3</version>
                                    <fixVersion>Lustre 2.7.0</fixVersion>
                    <fixVersion>Lustre 2.5.4</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>22</watches>
                                                                            <comments>
                            <comment id="96176" author="rmohr" created="Sat, 11 Oct 2014 06:42:03 +0000"  >&lt;p&gt;Below are some test results that I originally sent to the hpdd-discuss mailing list:&lt;/p&gt;

&lt;p&gt;In these tests, I created directory trees populated with empty files (stripe_count=1) and then used various methods to delete the files.  Before and after each test, I ran &quot;echo 1 &amp;gt; /proc/sys/vm/drop_caches&quot; on the MDS and recorded the &quot;base&quot; buffer usage.  Lustre servers are running 2.4.3.&lt;/p&gt;

&lt;p&gt;Test #1) On Lustre 2.5.0 client, used &quot;rm -rf&quot; to remove approx 700K files.  Buffer usage: before = 14.3 GB,  after = 15.4 GB&lt;/p&gt;

&lt;p&gt;Test #2) On Lustre 1.8.9 client, used &quot;rm -rf&quot; to remove approx 730K files.  Buffer usage: before = 15.4 GB, after = 15.8 GB&lt;/p&gt;

&lt;p&gt;Test #3) On Lustre 1.8.9 client, used &quot;clean_dirs.sh&quot; script (thanks to Steve Ayers) to remove approx 650K files.  Buffer usage: before = 15.8 GB, after = 16.0 GB&lt;/p&gt;

&lt;p&gt;Test #4) On Lustre 2.5.0 client, used &quot;clean_dirs.sh&quot; script to remove approx 730K files.  Buffer usage: before = 16.0 GB, after = 16.25 GB&lt;/p&gt;

&lt;p&gt;Test #5) On one Lustre 2.5.0 client and two Lustre 1.8.9 clients, used &quot;rm -rf&quot; to delete approx 330K files on each host simultaneously.  Buffer usage: before = 16.26 GB,  after = 17.63 GB&lt;/p&gt;

&lt;p&gt;Test #6) Similar to test #5, but used &quot;rm -rf&quot; to delete files in groups of approx 110K.  The deletion of these groups was staggered in time across the three nodes.  At some points two nodes were deleting simultaneously and at other times only one node was deleting files.  Buffer usage: before = 17.63 GB,  after = 17.8 GB.&lt;/p&gt;

&lt;p&gt;Test #7) On Lustre 2.5.0 client, deleted 9 groups of 110K files each.  The groups were deleted sequentially with some pauses between groups.  Buffer usage: before = 17.8 GB, after = 17.9 GB&lt;/p&gt;

&lt;p&gt;Test #8) On Lustre 1.8.9 client, used &quot;find $DIRNAME -delete&quot; to remove approx 1M files.  Buffer usage: before = 17.9 GB, after = 19.4 GB&lt;/p&gt;

&lt;p&gt;The tests showed quite a bit of variance between nodes and also the tools used to delete the files.  The lowest increase in buffer usage seemed to occur when files were deleted sequentially in smaller batches.  I don&apos;t see any consistent pattern except for the fact that the base buffer usage always seems to increase when files are deleted in large numbers. The setup was not completely ideal since other users were actively using the file system at the same time.  However, there are a couple of things to note:&lt;/p&gt;

&lt;p&gt;1) Over the course of the week, the MDS base buffer usage increased from 14.3 GB to 19.4 GB.  These increases only occurred during my file removal tests, and there was never any decrease in base buffer usage at any point.&lt;/p&gt;

&lt;p&gt;2) Other file system activity did not seem to contribute to the base buffer usage increase.  During the nights/weekend when I did not do testing, the overall buffer usage did increase. However, when I would drop the caches to measure the base buffer usage, it always returned to the same (or at least very nearly the same) value as it was the day before.  I also observed an application doing millions of file open/read/close operations, and none of this increased the base buffer usage.&lt;/p&gt;</comment>
                            <comment id="96177" author="rmohr" created="Sat, 11 Oct 2014 06:47:17 +0000"  >&lt;p&gt;At the suggestion of Andreas, I ran another test with the lustre debug flag +malloc enabled and captured the debug log on the MDS server.  The test involved running &quot;rm -rf&quot; to remove a directory tree containing approx 1 million files.  I have attached this log file along with the contents of the slabinfo and meminfo files captured before and after the test.  The slabinfo/meminfo was gathered after running &quot;echo 1 &amp;gt; /proc/sys/vm/drop_caches&quot; so I could see how much memory was unreclaimable.&lt;/p&gt;</comment>
                            <comment id="96377" author="niu" created="Wed, 15 Oct 2014 07:55:28 +0000"  >&lt;p&gt;Did you sync the filesystem after the &quot;rm -rf&quot; and before drop_caches? I think maybe the buffers are still being held by uncommitted requests.&lt;/p&gt;</comment>
                            <comment id="96598" author="dmiter" created="Fri, 17 Oct 2014 17:17:58 +0000"  >&lt;p&gt;The command &quot;echo 1 &amp;gt; /proc/sys/vm/drop_caches&quot; frees the page cache only. But most of the memory allocated by Lustre is associated with dentries and inodes. It can be freed by the command &quot;echo &lt;b&gt;3&lt;/b&gt; &amp;gt; /proc/sys/vm/drop_caches&quot;.&lt;/p&gt;</comment>
                            <comment id="96740" author="rmohr" created="Mon, 20 Oct 2014 19:05:27 +0000"  >&lt;p&gt;I tried using a sync before dropping caches, and it did not make a difference.  I also tried echoing &quot;3&quot; into the drop_caches file, and that did not make a difference either.&lt;/p&gt;</comment>
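The reclaim sequence discussed in the comments above can be sketched as a short root shell session on the MDS (an illustrative ops fragment, not part of the original ticket):

```shell
# Flush dirty data first so buffers held by uncommitted transactions
# are not counted, then drop the page, dentry, and inode caches.
sync
echo 3 > /proc/sys/vm/drop_caches
# Record the "base" buffer usage afterwards:
grep -E '^(MemFree|Buffers|Inactive\(file\)):' /proc/meminfo
```

As the thread notes, on the affected servers this sequence still left the "base" buffer usage elevated.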
                            <comment id="96932" author="rmohr" created="Tue, 21 Oct 2014 21:12:28 +0000"  >&lt;p&gt;Tried some tests using Lustre 2.5.3, and the problem still exists there.  Also, a coworker was deleting a larger number of files across 10 different Lustre 2.5.3 client nodes, and this caused the MDS (w/ 64 GB RAM) to OOM in less than an hour.  So this bug is easy to trigger, and in principle, any user could crash the file system just by deleting enough files.&lt;/p&gt;</comment>
                            <comment id="97009" author="rmohr" created="Wed, 22 Oct 2014 14:22:36 +0000"  >&lt;p&gt;Some more info:  If we unmount the MDT, all of the unreclaimable buffer memory is freed.  We don&apos;t need to unload any kernel modules in order to get the memory back.&lt;/p&gt;</comment>
                            <comment id="97010" author="niu" created="Wed, 22 Oct 2014 14:23:11 +0000"  >&lt;p&gt;I&apos;m wondering if it&apos;s related to &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4053&quot; title=&quot;client leaking objects/locks during IO&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4053&quot;&gt;&lt;del&gt;LU-4053&lt;/del&gt;&lt;/a&gt; (client will acquire a layout lock when unlink file and that lock will be cached on client)&lt;/p&gt;

&lt;p&gt;Hi, Rick&lt;br/&gt;
Is the meminfo.before &amp;amp; slabinfo.before captured before unlink files? I didn&apos;t see much difference between before &amp;amp; after.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Tried some tests using Lustre 2.5.3, and the problem still exists there. Also, a coworker was deleting a larger number of files across 10 different Lustre 2.5.3 client nodes, and this caused the MDS (w/ 64 GB RAM) to OOM in less than an hour. So this bug is easy to trigger, and in principle, any user could crash the file system just by deleting enough files.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;How many files were deleted in total?&lt;/p&gt;</comment>
                            <comment id="97021" author="rmohr" created="Wed, 22 Oct 2014 15:50:21 +0000"  >&lt;p&gt;I deleted about 1 million files.  The *.before files were collected after I dropped caches but before the files were deleted.  The *.after files were collected after the files were deleted and the cache had been dropped.&lt;/p&gt;

&lt;p&gt;For this small test, the difference is not huge.  From the meminfo files, the main difference is that Buffers increased by about 0.5 GB and that increase also roughly equals the increase in &quot;Inactive(file)&quot;.  Other memory numbers stayed about the same, and a few even decreased.  But this trend continues as more and more files are deleted so that the buffer usage keeps growing.  At first I thought that there might be a lot of locks consuming memory, but the slab usage doesn&apos;t seem to increase.  In fact, if you look at the before/after slabinfo and sort by the number of active objects, you&apos;ll see that several of those categories (ldlm_locks, size-192, selinux_inode_security, ldiskfs_inode_cache, lod_obj, mdt_obj, mdd_obj, osp_obj, ldlm_resources, size-32) have roughly 1 million fewer active objects after the files were deleted.  So the memory usage keeps increasing, but I can&apos;t seem to tie it to any particular slab structures.&lt;/p&gt;

&lt;p&gt;I had come across &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4053&quot; title=&quot;client leaking objects/locks during IO&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4053&quot;&gt;&lt;del&gt;LU-4053&lt;/del&gt;&lt;/a&gt; before filing this ticket thinking that it might be related, but upon looking at the details, I am not sure they are the same. That ticket seemed to indicate that the memory increase was due to increased slab usage (which is not the case here).  That ticket also mentioned that dropping caches released the memory, which does not work in my case.&lt;/p&gt;

&lt;p&gt;I should note that in some other tests, we tried dropping locks and caches on the client side to see if that would free up memory on the MDS.  We also tried unmounting the lustre file system.  None of those approaches freed up any MDS memory.&lt;/p&gt;</comment>
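The before/after slab comparison described above is easy to script. This sketch diffs the active-object count (second column of /proc/slabinfo) between two snapshots; the sample data below is fabricated for illustration, standing in for real `cat /proc/slabinfo > slabinfo.before` / `slabinfo.after` captures:

```shell
# Fabricated sample snapshots (name, active_objs, num_objs).
cat > slabinfo.before <<'EOF'
ldlm_locks 1000000 1100000
buffer_head 216313 220000
EOF
cat > slabinfo.after <<'EOF'
ldlm_locks 1000 1100000
buffer_head 347257 350000
EOF
# Print caches whose active-object count changed by more than 10000.
awk 'NR==FNR { before[$1] = $2; next }
     $1 in before { d = $2 - before[$1]
                    if (d > 10000 || d < -10000) printf "%s %+d\n", $1, d }' \
    slabinfo.before slabinfo.after
```

Sorting real snapshots this way is what exposed the roughly one-million-object drop in ldlm_locks, mdt_obj, and friends mentioned in the comment.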
                            <comment id="97097" author="niu" created="Thu, 23 Oct 2014 05:57:05 +0000"  >&lt;blockquote&gt;
&lt;p&gt;We also tried unmounting the lustre file system. None of those approaches freed up any MDS memory.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;You mean umount MDT or client?&lt;/p&gt;</comment>
                            <comment id="97292" author="rmohr" created="Thu, 23 Oct 2014 17:35:26 +0000"  >&lt;p&gt;Sorry for not being clear.  Unmounting the file system on the client did not free up the MDS memory.  However, unmounting the MDT did free up the memory.&lt;/p&gt;</comment>
                            <comment id="97461" author="rmohr" created="Fri, 24 Oct 2014 20:37:42 +0000"  >&lt;p&gt;After doing some more testing yesterday and looking back over the slabinfo/meminfo data, a coworker and I found what appears to be a connection with the buffer_head objects.  Prior to the test, the &quot;Buffer(Inactive file)&quot; usage was 867,196 kB and the number of active buffer_head objects was 216,313.  After the test, &quot;Buffer(Inactive file)&quot; was 1,386,604 kB and the active buffer_head objects was 347,257.  If the buffer_head objects accounted for the increased buffer usage, this would work out to about 4KB per buffer_head (basically, one page of memory).&lt;/p&gt;

&lt;p&gt;I don&apos;t know a lot about the Linux buffer_head structure or how it is used in I/O, but based on some online reading I was wondering if the file deletions were possibly creating a bunch of small disk I/O requests which results in buffer_head structures that point to small 4KB I/O buffers.  If those I/O requests get completed, but for some reason the buffer_head structures aren&apos;t released, then maybe there is a continuous increase in memory usage in 4KB chunks. This wouldn&apos;t matter so much for a small number of file deletions, but when several million files are deleted, it starts to consume significant amounts of memory.&lt;/p&gt;

&lt;p&gt;Is something like that possible?&lt;/p&gt;</comment>
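The per-buffer_head estimate above can be checked directly from the quoted numbers:

```shell
# Growth in Buffers/Inactive(file) divided by growth in active
# buffer_head objects, using the figures from the comment above.
awk 'BEGIN {
    kb = 1386604 - 867196    # buffer usage growth, in kB
    bh = 347257  - 216313    # new active buffer_head objects
    printf "%.2f kB per buffer_head\n", kb / bh
}'
# Prints: 3.97 kB per buffer_head, i.e. about one 4 KB page each
```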
                            <comment id="97524" author="niu" created="Mon, 27 Oct 2014 03:40:09 +0000"  >&lt;p&gt;Deleting a file will definitely use buffer_heads (for deleting the file itself and for writing the unlink log record); however, I think those buffers should be reclaimed by &quot;echo 3 &amp;gt; drop_caches&quot; once the unlink operation is committed (synced) to disk, otherwise I think the MDT won&apos;t umount cleanly.&lt;/p&gt;</comment>
                            <comment id="97533" author="niu" created="Mon, 27 Oct 2014 09:16:11 +0000"  >&lt;p&gt;BTW, I see lots of &quot;mdt_mfd_new()&quot; and &quot;mdt_mfd_set_mode()&quot; in the debug log. I think these should belong to open operations; are there lots of ongoing open operations while you do &quot;rm&quot;?&lt;/p&gt;</comment>
                            <comment id="97548" author="rmohr" created="Mon, 27 Oct 2014 13:57:36 +0000"  >&lt;p&gt;I was capturing the data from our production file system, so it was actively being used by our user community.  It is entirely possible that some user was issuing a lot of open calls.&lt;/p&gt;</comment>
                            <comment id="97799" author="niu" created="Wed, 29 Oct 2014 03:20:50 +0000"  >&lt;p&gt;I did some local testing against the master branch. It looks like the &quot;buffer memory&quot; did increase after an &quot;rm 5000 files&quot; operation, and it can&apos;t be reclaimed by &quot;echo 3 &amp;gt; /proc/sys/vm/drop_caches&quot;. I haven&apos;t found where the &quot;buffer memory&quot; is used so far, but it looks like &quot;chown 5000 files&quot; or &quot;create 5000 files&quot; doesn&apos;t have the problem.&lt;/p&gt;</comment>
                            <comment id="97817" author="niu" created="Wed, 29 Oct 2014 13:14:17 +0000"  >&lt;p&gt;Hi, Rick&lt;br/&gt;
The memory shown as &quot;Buffer&quot;/&quot;Inactive(file)&quot; is available and can be reclaimed for other purposes, so I tend to think it&apos;s not the real cause of the OOM. Though I don&apos;t know how to force the kernel to immediately reclaim it as &quot;MemFree&quot;, I believe it could be reclaimed when necessary.&lt;/p&gt;

&lt;p&gt;I think it could be the same problem as &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5333&quot; title=&quot;rm cause MDS to complain hung tasks and disconnecting clients&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5333&quot;&gt;&lt;del&gt;LU-5333&lt;/del&gt;&lt;/a&gt; &amp;amp; &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5503&quot; title=&quot;MDS (2.4.2) are getting &amp;quot;Service thread ... inactive&amp;quot; and file-system times out&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5503&quot;&gt;&lt;del&gt;LU-5503&lt;/del&gt;&lt;/a&gt;. Did you ever observe stack traces similar to those reported in those two tickets?&lt;/p&gt;</comment>
                            <comment id="97868" author="rmohr" created="Wed, 29 Oct 2014 18:25:43 +0000"  >&lt;p&gt;Niu,&lt;/p&gt;

&lt;p&gt;There are some similarities with &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5333&quot; title=&quot;rm cause MDS to complain hung tasks and disconnecting clients&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5333&quot;&gt;&lt;del&gt;LU-5333&lt;/del&gt;&lt;/a&gt; and &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5503&quot; title=&quot;MDS (2.4.2) are getting &amp;quot;Service thread ... inactive&amp;quot; and file-system times out&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5503&quot;&gt;&lt;del&gt;LU-5503&lt;/del&gt;&lt;/a&gt; but also some differences.  Our stack traces do seem to point to the cause being memory related.  Before one of the first crashes, we saw errors like this:&lt;/p&gt;

&lt;p&gt;Sep 13 08:03:40 medusa-mds1 kernel: LNetError: 2702:0:(o2iblnd_cb.c:3075:kiblnd_check_conns()) Timed out RDMA with 172.16.30.1@o2ib (141): c: 62, oc: 0, rc: 63&lt;/p&gt;

&lt;p&gt;This was followed immediately with a stack trace like this:&lt;/p&gt;

&lt;p&gt;Sep 13 08:03:40 medusa-mds1 kernel: INFO: task kswapd0:178 blocked for more than 120 seconds.&lt;br/&gt;
Sep 13 08:03:40 medusa-mds1 kernel: &quot;echo 0 &amp;gt; /proc/sys/kernel/hung_task_timeout_secs&quot; disables this message.&lt;br/&gt;
Sep 13 08:03:40 medusa-mds1 kernel: kswapd0       D 0000000000000006     0   178      2 0x00000000&lt;br/&gt;
Sep 13 08:03:40 medusa-mds1 kernel: ffff8808322a1a80 0000000000000046 ffffea0001cc0f98 ffff8808322a1a50&lt;br/&gt;
Sep 13 08:03:40 medusa-mds1 kernel: ffffea0000fe4ed8 ffff8808322a1b50 0000000000000020 000000000000001f&lt;br/&gt;
Sep 13 08:03:40 medusa-mds1 kernel: ffff88083226fab8 ffff8808322a1fd8 000000000000fb88 ffff88083226fab8&lt;br/&gt;
Sep 13 08:03:40 medusa-mds1 kernel: Call Trace:&lt;br/&gt;
Sep 13 08:03:40 medusa-mds1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8109708e&amp;gt;&amp;#93;&lt;/span&gt; ? prepare_to_wait+0x4e/0x80&lt;br/&gt;
Sep 13 08:03:40 medusa-mds1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0b7518a&amp;gt;&amp;#93;&lt;/span&gt; start_this_handle+0x27a/0x4a0 &lt;span class=&quot;error&quot;&gt;&amp;#91;jbd2&amp;#93;&lt;/span&gt;&lt;br/&gt;
Sep 13 08:03:40 medusa-mds1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81096da0&amp;gt;&amp;#93;&lt;/span&gt; ? autoremove_wake_function+0x0/0x40&lt;br/&gt;
Sep 13 08:03:40 medusa-mds1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0b755b0&amp;gt;&amp;#93;&lt;/span&gt; jbd2_journal_start+0xd0/0x110 &lt;span class=&quot;error&quot;&gt;&amp;#91;jbd2&amp;#93;&lt;/span&gt;&lt;br/&gt;
Sep 13 08:03:40 medusa-mds1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0c36546&amp;gt;&amp;#93;&lt;/span&gt; ldiskfs_journal_start_sb+0x56/0xe0 &lt;span class=&quot;error&quot;&gt;&amp;#91;ldiskfs&amp;#93;&lt;/span&gt;&lt;br/&gt;
Sep 13 08:03:40 medusa-mds1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffffa0c368c4&amp;gt;&amp;#93;&lt;/span&gt; ldiskfs_dquot_drop+0x34/0x80 &lt;span class=&quot;error&quot;&gt;&amp;#91;ldiskfs&amp;#93;&lt;/span&gt;&lt;br/&gt;
Sep 13 08:03:40 medusa-mds1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff811e04e2&amp;gt;&amp;#93;&lt;/span&gt; vfs_dq_drop+0x52/0x60&lt;br/&gt;
Sep 13 08:03:40 medusa-mds1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8119d363&amp;gt;&amp;#93;&lt;/span&gt; clear_inode+0x93/0x140&lt;br/&gt;
Sep 13 08:03:40 medusa-mds1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8119d450&amp;gt;&amp;#93;&lt;/span&gt; dispose_list+0x40/0x120&lt;br/&gt;
Sep 13 08:03:40 medusa-mds1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8119d7a4&amp;gt;&amp;#93;&lt;/span&gt; shrink_icache_memory+0x274/0x2e0&lt;br/&gt;
Sep 13 08:03:40 medusa-mds1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81131ffa&amp;gt;&amp;#93;&lt;/span&gt; shrink_slab+0x12a/0x1a0&lt;br/&gt;
Sep 13 08:03:40 medusa-mds1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff811351ea&amp;gt;&amp;#93;&lt;/span&gt; balance_pgdat+0x59a/0x820&lt;br/&gt;
Sep 13 08:03:40 medusa-mds1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff811355a4&amp;gt;&amp;#93;&lt;/span&gt; kswapd+0x134/0x3c0&lt;br/&gt;
Sep 13 08:03:40 medusa-mds1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81096da0&amp;gt;&amp;#93;&lt;/span&gt; ? autoremove_wake_function+0x0/0x40&lt;br/&gt;
Sep 13 08:03:40 medusa-mds1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81135470&amp;gt;&amp;#93;&lt;/span&gt; ? kswapd+0x0/0x3c0&lt;br/&gt;
Sep 13 08:03:40 medusa-mds1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81096a36&amp;gt;&amp;#93;&lt;/span&gt; kthread+0x96/0xa0&lt;br/&gt;
Sep 13 08:03:40 medusa-mds1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8100c0ca&amp;gt;&amp;#93;&lt;/span&gt; child_rip+0xa/0x20&lt;br/&gt;
Sep 13 08:03:40 medusa-mds1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff810969a0&amp;gt;&amp;#93;&lt;/span&gt; ? kthread+0x0/0xa0&lt;br/&gt;
Sep 13 08:03:40 medusa-mds1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8100c0c0&amp;gt;&amp;#93;&lt;/span&gt; ? child_rip+0x0/0x20&lt;/p&gt;

&lt;p&gt;There were some more of these (I will attach a more complete log later), and while they weren&apos;t all identical, they did all seem to contain the same two lines:&lt;/p&gt;

&lt;p&gt;Sep 13 08:03:40 medusa-mds1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff8119d7a4&amp;gt;&amp;#93;&lt;/span&gt; shrink_icache_memory+0x274/0x2e0&lt;br/&gt;
Sep 13 08:03:40 medusa-mds1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81131ffa&amp;gt;&amp;#93;&lt;/span&gt; shrink_slab+0x12a/0x1a0&lt;/p&gt;

&lt;p&gt;So I would assume that all of those instances were caused by the system trying to free up memory but being unable to do so.&lt;/p&gt;

&lt;p&gt;The main difference I see with the stack traces from &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5333&quot; title=&quot;rm cause MDS to complain hung tasks and disconnecting clients&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5333&quot;&gt;&lt;del&gt;LU-5333&lt;/del&gt;&lt;/a&gt;/&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5503&quot; title=&quot;MDS (2.4.2) are getting &amp;quot;Service thread ... inactive&amp;quot; and file-system times out&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5503&quot;&gt;&lt;del&gt;LU-5503&lt;/del&gt;&lt;/a&gt; is that those stack traces contain this line:&lt;/p&gt;

&lt;p&gt; &lt;span class=&quot;error&quot;&gt;&amp;#91;&amp;lt;ffffffff81134b99&amp;gt;&amp;#93;&lt;/span&gt; ? zone_reclaim+0x349/0x400&lt;/p&gt;

&lt;p&gt;Shortly after we upgraded from 1.8 to 2.4 in May, we saw cases where the MDS would seem to grind to a halt, and based on those stack traces, it looked like the system was having problems allocating memory.  Our early stack traces in May contained that &quot;zone_reclaim&quot; line.  However, we soon realized that we forgot to reapply the sysctl setting &quot;vm.zone_reclaim_mode = 0&quot; on the MDS server after the upgrade.  Once we did that, we didn&apos;t have any further problems until just recently.&lt;/p&gt;

&lt;p&gt;If the systems in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5333&quot; title=&quot;rm cause MDS to complain hung tasks and disconnecting clients&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5333&quot;&gt;&lt;del&gt;LU-5333&lt;/del&gt;&lt;/a&gt;/&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5503&quot; title=&quot;MDS (2.4.2) are getting &amp;quot;Service thread ... inactive&amp;quot; and file-system times out&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5503&quot;&gt;&lt;del&gt;LU-5503&lt;/del&gt;&lt;/a&gt; have zone_reclaim_mode enabled, perhaps those problems are the same as mine.  My &quot;solution&quot; to disable zone_reclaim_mode may have just made it easier for the MDS server to find free memory, and so it just took longer for the same underlying problem to become evident again.&lt;/p&gt;</comment>
                            <comment id="97872" author="rmohr" created="Wed, 29 Oct 2014 18:50:42 +0000"  >&lt;p&gt;Attached log messages around the time of first incident.&lt;/p&gt;</comment>
                            <comment id="97874" author="rmohr" created="Wed, 29 Oct 2014 19:08:42 +0000"  >&lt;p&gt;Niu,&lt;/p&gt;

&lt;p&gt;There are a couple of things about the log that I should point out.  The first events were around 8AM on Sept 13 and you&apos;ll notice that the oom-killer fired to kill off the nslcd process.  However, the server seems to have remained functional for at least some time after that.  Our Nagios monitoring checks did not report any problems until almost 5PM that same day, and it wasn&apos;t until about 6 or 7PM that the system became completely unresponsive.  So it seems like it was a slow decline.&lt;/p&gt;

&lt;p&gt;Also, we ran into the same (or at least we assume the same) problem on Sep 15 around 1PM.  We noticed the memory usage kept increasing until the oom-killer fired and killed several processes, after which point  the system became unresponsive.  However, in that incident, the oom-killer was not preceded by any lustre stack traces in the log file.&lt;/p&gt;</comment>
                            <comment id="97908" author="niu" created="Thu, 30 Oct 2014 03:15:06 +0000"  >&lt;blockquote&gt;
&lt;p&gt;There are a couple of things about the log that I should point out. The first events were around 8AM on Sept 13 and you&apos;ll notice that the oom-killer fired to kill off the nslcd process. However, the server seems to have remained functional for at least some time after that. Our Nagios monitoring checks did not report any problems until almost 5PM that same day, and it wasn&apos;t until about 6 or 7PM that the system became completely unresponsive. So it seems like it was a slow decline.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;Is it possible that the min_free_kbytes is too low for your MDS? Following is quoted from RH manual:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;Setting min_free_kbytes too low prevents the system from reclaiming memory, This can result in system hangs and OOM-killing multiple processes.
However, setting &lt;span class=&quot;code-keyword&quot;&gt;this&lt;/span&gt; parameter too high (5% - 10% of total system memory) will cause your system out-of-memory immediately.
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;I&apos;m wondering if you can try to increase the min_free_kbytes and see if it can alleviate the situation?&lt;/p&gt;</comment>
                            <comment id="97974" author="rmohr" created="Thu, 30 Oct 2014 18:55:52 +0000"  >&lt;p&gt;We have vm.min_free_kbytes set to 131072.  This is about 0.2% of the total memory.  Is there a recommended value for this?&lt;/p&gt;</comment>
                            <comment id="98148" author="niu" created="Mon, 3 Nov 2014 00:06:24 +0000"  >&lt;p&gt;I don&apos;t have experience on tuning these parameters, so I&apos;m afraid that I can&apos;t give any suggestions on this. Probably you can increase it little by little and see how it works.&lt;/p&gt;</comment>
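As a quick check of the 0.2% figure above (assuming the 64 GB of MDS RAM mentioned earlier in this ticket), 131072 kB works out to:

```shell
# 131072 kB relative to 64 GB of RAM, expressed as a percentage.
awk 'BEGIN { printf "%.2f%%\n", 131072 / (64 * 1024 * 1024) * 100 }'
# To raise the setting incrementally, as suggested (root; value in kB):
# sysctl -w vm.min_free_kbytes=262144
```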
                            <comment id="98164" author="rjh" created="Mon, 3 Nov 2014 08:29:36 +0000"  >&lt;p&gt;a couple of thoughts... maybe not relevant, but hey...&lt;/p&gt;

&lt;p&gt;modern lustre sets memory affinity for a lot of its threads (which IMHO is an over-optimisation). forced affinity means that if allocations are imbalanced then it is possible for one zone (eg. normal1) on a server to run out of ram, whilst there is still heaps of memory free in another zone (eg. normal0). we have seen OOM&apos;s on our OSS&apos;s due to this (although only when aggressive inode caching is turned on with vfs_cache_pressure=0, which probably isn&apos;t a common setting) and now set &quot;options libcfs cpu_npartitions=1&quot; on our OSS&apos;s for this reason. my testing hasn&apos;t shown such a severe problem with affinity on MDS&apos;s, but still, you could check &quot;egrep -i &apos;zone|slab|present|free&apos; /proc/zoneinfo&quot; and make sure the big zones have approximately equal free pages and equal used slab.&lt;/p&gt;

&lt;p&gt;also, have you tried dropping ldlm locks on the client before drop_caches? eg. &quot;lctl set_param ldlm.namespaces.fsName-&apos;*&apos;.lru_size=clear&quot;. without ldlm locks being dropped on the client, I suspect drop_caches on the client won&apos;t drop many inodes at all, and in turn that may mean that drop_caches on the MDS can&apos;t free much either?&lt;/p&gt;</comment>
                            <comment id="98438" author="adilger" created="Wed, 5 Nov 2014 18:23:18 +0000"  >&lt;p&gt;I would agree with Robin that I think the majority of the memory pressure is coming from MDT/MDD/LOD/OSD objects-&amp;gt;DLM locks-&amp;gt;ldiskfs inodes being cached/pinned in memory on the MDT.  I wonder if we are caching the locks too aggressively?  In theory, the LDLM pools should be pushing memory pressure from the lock server back to the clients, so the clients cancel DLM locks to relieve memory pressure on the server, but it is entirely possible that this isn&apos;t working as well as it should.&lt;/p&gt;</comment>
                            <comment id="98450" author="mdiep" created="Wed, 5 Nov 2014 19:48:57 +0000"  >&lt;p&gt;Do we know if 2.5.3 server have this issue?&lt;/p&gt;</comment>
                            <comment id="98463" author="rmohr" created="Wed, 5 Nov 2014 20:04:56 +0000"  >&lt;p&gt;Niu:&lt;/p&gt;

&lt;p&gt;I can try increasing vm.min_free_kbytes a little, but since the documentation says that there can be problems if this value is too high, I am reluctant to play around with it too much.&lt;/p&gt;

&lt;p&gt;Robin:&lt;/p&gt;

&lt;p&gt;We ran into some issues with our MDS server slowly grinding to a halt earlier this year and resolved those by setting vm.zone_reclaim_mode=0.  I have not looked at /proc/zoneinfo, but I will check that out to see if I notice anything unusual.  We do not set vfs_cache_pressure=0, but I may look into &quot;options libcfs cpu_npartitions&quot; since I am not familiar with that option.  As far as ldlm locks go, that is one of the things I suspected too.  However, we ran a test where we dropped all client locks on the system that was deleting files.  Unfortunately, even after dropping caches on the server, it did not seem to have any effect at all on the MDS memory usage growth.  We did of course see a decrease in memory used by the locks, but in the course of our testing, the total amount of memory used by locks seemed to pretty consistently hover around 2-3 GB.  So while the MDS memory usage steadily grew, the ldlm lock memory usage did not.  I took this to mean that locks were not really the cause of the memory pressure.  Nonetheless, we have applied some limits to lru_size and lru_max_age on our Lustre clients on the off chance that it might help.  Unfortunately, this attempt is somewhat complicated by bug &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5727&quot; title=&quot;MDS OOMs with 2.5.3 clients and lru_size != 0&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5727&quot;&gt;&lt;del&gt;LU-5727&lt;/del&gt;&lt;/a&gt; so the limits we apply on the client are not actually being honored.&lt;/p&gt;</comment>
                            <comment id="98464" author="rmohr" created="Wed, 5 Nov 2014 20:07:34 +0000"  >&lt;p&gt;Minh: In our testing, we have seen this on 2.5.3 servers.&lt;/p&gt;</comment>
                            <comment id="98477" author="adilger" created="Wed, 5 Nov 2014 21:59:08 +0000"  >&lt;p&gt;If you are hitting this problem of the available memory running out on the MDS, it would be useful to run &lt;tt&gt;lctl set&amp;#95;param ldlm.namespaces.&lt;b&gt;MDT&lt;/b&gt;.lru&amp;#95;size=clear&lt;/tt&gt; on all (or some subset) of clients, and then &lt;tt&gt;sysctl -w vm.drop&amp;#95;caches=3&lt;/tt&gt; on the MDS to drop the page and inode caches, and see whether this reduces the MDS memory usage.  Without dropping the locks on the clients, the MDS inodes cannot be freed from cache (this includes one of each of { mdt&amp;#95;obj, mdd&amp;#95;obj, lod&amp;#95;obj, osp&amp;#95;obj, ldlm&amp;#95;lock * N, ldlm&amp;#95;resource, ldiskfs&amp;#95;inode&amp;#95;cache } per inode), so it definitely adds up (about 3KB per inode by my quick calculations).&lt;/p&gt;

&lt;p&gt;If this helps reduce MDS memory usage, the next question is why the LDLM pool is not shrinking the client DLM lock cache under memory pressure.&lt;/p&gt;</comment>
                            <comment id="98501" author="niu" created="Thu, 6 Nov 2014 02:10:17 +0000"  >&lt;p&gt;Hi, Andreas&lt;/p&gt;

&lt;p&gt;I don&apos;t think the LDLM pool is able to shrink client dlm locks when the MDS is under memory pressure: the server ldlm shrinker is called when the MDS is under memory pressure, but it can&apos;t reclaim memory directly; it can only bump the SLV and wait for clients to notice the increased SLV and then start cancelling locks. I think there are several problems with this scheme:&lt;/p&gt;
&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;The time span of &lt;span class=&quot;error&quot;&gt;&amp;#91;server under memory pressure -- clients start cancelling locks&amp;#93;&lt;/span&gt; could be very long (an idle client will only get the bumped SLV on its next ping).&lt;/li&gt;
	&lt;li&gt;The SLV isn&apos;t increased much in this situation (see ldlm_srv_pool_shrink()); I&apos;m afraid the client probably won&apos;t cancel locks as expected even if it receives the new SLV in time.&lt;/li&gt;
	&lt;li&gt;The SLV could be overwritten by SLV recalculation thread immediately after it&apos;s bumped by shrinker.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Probably it&apos;s time to reconsider the whole ldlm pool mechanism?&lt;/p&gt;

&lt;p&gt;Hi, Rick&lt;/p&gt;

&lt;p&gt;Could you try applying the fix from &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5727&quot; title=&quot;MDS OOMs with 2.5.3 clients and lru_size != 0&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5727&quot;&gt;&lt;del&gt;LU-5727&lt;/del&gt;&lt;/a&gt; to your 2.5 clients, use lru_size on the clients, and then retry your test to see if there is any difference? Thanks.&lt;/p&gt;</comment>
                            <comment id="98723" author="adilger" created="Sat, 8 Nov 2014 21:43:23 +0000"  >&lt;p&gt;Niu, I think we need to make the ldlm pool kick in much earlier to avoid memory pressure on the MDS. It doesn&apos;t make sense to have so many locks on the MDS that it is running out of memory. Since each lock is also keeping lots of other memory pinned (inode, buffer, MDT, MDD, LOD, OSD objects) we need to start shrinking the ldlm pool sooner. &lt;/p&gt;

&lt;p&gt;I haven&apos;t looked at this code in a long time, but is there an upper limit that can be imposed on the number of locks on the server?  What is used to calculate this limit, and is it reasonable?&lt;/p&gt;</comment>
                            <comment id="98738" author="niu" created="Mon, 10 Nov 2014 02:33:07 +0000"  >&lt;blockquote&gt;
&lt;p&gt;Niu, I think we need to make the ldlm pool kick in much earlier to avoid memory pressure on the MDS. It doesn&apos;t make sense to have so many locks on the MDS that it is running out of memory. Since each lock is also keeping lots of other memory pinned (inode, buffer, MDT, MDD, LOD, OSD objects) we need to start shrinking the ldlm pool sooner.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;Indeed, the kernel always assumes a shrinker can reclaim memory immediately, so having the shrinker not reclaim memory but only adjust the SLV (and then hope that clients start cancelling locks once they receive the new SLV) looks inappropriate to me. The SLV recalculation thread is supposed to kick off lock cancellation at an early stage, so the server shrinker looks unnecessary to me.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I haven&apos;t looked at this code in a long time, but is there an upper limit that can be imposed on the number of locks on the server? What is used to calculate this limit, and is it reasonable?&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;The upper limit is 50 locks per 1MB of MDS memory.&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;/*
 * 50 ldlm locks &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; 1MB of RAM.
 */
#define LDLM_POOL_HOST_L ((NUM_CACHEPAGES &amp;gt;&amp;gt; (20 - PAGE_CACHE_SHIFT)) * 50)
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;But this isn&apos;t a hard limit; it&apos;s just a factor used in the SLV calculation.&lt;/p&gt;

&lt;p&gt;I ran some local unlink testing; it showed that &quot;Buffers&quot;/&quot;Inactive(file)&quot; grows after unlink (the same as Rick&apos;s test result), and that this doesn&apos;t happen on create or when unlinking from ldiskfs directly. I haven&apos;t figured out the reason yet, and I&apos;m not sure if it&apos;s really related to the OOM problem. While investigating this further, I&apos;m also trying to find machines with enough memory to reproduce the OOM problem.&lt;/p&gt;

&lt;p&gt;Hi, Rick&lt;br/&gt;
I forgot to ask: did you observe how many locks were cached on the clients when the MDS was running out of memory?&lt;/p&gt;</comment>
                            <comment id="98898" author="rmohr" created="Tue, 11 Nov 2014 17:20:26 +0000"  >&lt;p&gt;(Sorry for taking a while to respond. I have been busy putting out some other fires.)&lt;/p&gt;

&lt;p&gt;In response to some of the comments:&lt;/p&gt;

&lt;p&gt;1) I have not tried increasing vm.min_free_kbytes yet, but will do so this week.  (Although I noticed that in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5841&quot; title=&quot;Lustre 2.4.2 MDS, hitting OOM errors &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5841&quot;&gt;&lt;del&gt;LU-5841&lt;/del&gt;&lt;/a&gt; a similar thing was tried and didn&apos;t seem to have any effect.)&lt;/p&gt;

&lt;p&gt;2) We have already tried dropping client locks before clearing the MDS server cache.  This did not help free up any memory.&lt;/p&gt;

&lt;p&gt;3) A coworker and I did some digging into how the MDS server determines how many locks it can support (and how many locks a client can cache).  We came across the SLV, but struggled to understand how it was supposed to work.  We were also unable to find any way to cap the total number of locks an MDS server would grant.  (We tried setting ldlm.namespaces.&amp;lt;mdt&amp;gt;.pool.limit but that didn&apos;t work.)  If such a mechanism does not exist, it would be a very good thing to add.  It would be handy to have a simple way to ensure that the server&apos;s memory usage for locks can be controlled.&lt;/p&gt;

&lt;p&gt;4) I can try to apply the &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5727&quot; title=&quot;MDS OOMs with 2.5.3 clients and lru_size != 0&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5727&quot;&gt;&lt;del&gt;LU-5727&lt;/del&gt;&lt;/a&gt; patch to a client, but is there a reason to expect that this should make a difference? Since we have manually dropped client locks and seen no effect, I am not sure how this patch would change anything.&lt;/p&gt;

&lt;p&gt;5) I did not capture the numbers of locks on our Lustre clients when we had the MDS crashes.  We did look at the number of locks on one client that was purging large numbers of files while we observed the MDS mem usage increasing.  That client had over a million locks.  (We had placed a limit of 2000 locks on the client, and this behavior is what brought our attention to &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5727&quot; title=&quot;MDS OOMs with 2.5.3 clients and lru_size != 0&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5727&quot;&gt;&lt;del&gt;LU-5727&lt;/del&gt;&lt;/a&gt;.)&lt;/p&gt;</comment>
                            <comment id="98951" author="niu" created="Wed, 12 Nov 2014 07:35:41 +0000"  >&lt;blockquote&gt;
&lt;p&gt;4) I can try to apply the &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5727&quot; title=&quot;MDS OOMs with 2.5.3 clients and lru_size != 0&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5727&quot;&gt;&lt;del&gt;LU-5727&lt;/del&gt;&lt;/a&gt; patch to a client, but is there a reason to expect that this should make a difference? Since we have manually dropped client locks and seen no effect, I am not sure how this patch would change anything.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;I was wondering whether the OOM is caused by the growing &quot;Buffers&quot; (which, as you mentioned, won&apos;t be decreased by cancelling locks). That&apos;s why I want to know how many locks were cached on the clients when the OOM happened.&lt;/p&gt;</comment>
                            <comment id="99218" author="rmohr" created="Fri, 14 Nov 2014 20:37:43 +0000"  >&lt;p&gt;Niu,&lt;/p&gt;

&lt;p&gt;Just wanted to let you know that I increased vm.min_free_kbytes by a factor of 10 to 1310720 (~1.2 GB) which is about 2% of total memory.  I will let you know what happens.&lt;/p&gt;</comment>
                            <comment id="99231" author="rmohr" created="Fri, 14 Nov 2014 22:03:35 +0000"  >&lt;p&gt;Niu,&lt;/p&gt;

&lt;p&gt;If I want to know how many client locks are cached on the MDS server when it OOMs, is that info in the /proc/fs/lustre/ldlm/namespaces/mdt-&amp;lt;fsname&amp;gt;-MDT0000_UUID/lock_count file?  Since it isn&apos;t really feasible to query the lock counts on each client individually, I wanted to verify that I could get the same info from the server-side. (Although even that might not work if the MDS server becomes unresponsive when it OOMs.)&lt;/p&gt;

&lt;p&gt;I also wanted to let you know that we have started to run some purges from Lustre 2.4.3 clients, and it looks like maybe the server memory usage doesn&apos;t grow as fast compared to using a Lustre 2.5 client.  We don&apos;t have quantitative info yet, but if we are able to run some tests and gather numbers, I will pass them along to you. &lt;/p&gt;</comment>
                            <comment id="99315" author="niu" created="Mon, 17 Nov 2014 02:41:47 +0000"  >&lt;blockquote&gt;
&lt;p&gt;If I want to know how many client locks are cached on the MDS server when it OOMs, is that info in the /proc/fs/lustre/ldlm/namespaces/mdt-&amp;lt;fsname&amp;gt;-MDT0000_UUID/lock_count file? Since it isn&apos;t really feasible to query the lock counts on each client individually, I wanted to verify that I could get the same info from the server-side. (Although even that might not work if the MDS server becomes unresponsive when it OOMs.)&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;Perhaps you could write a script to read the proc file from all clients remotely? As you mentioned, the MDS will be unresponsive during an OOM, so reading the proc file on the MDS isn&apos;t practical.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I also wanted to let you know that we have started to run some purges from Lustre 2.4.3 clients, and it looks like maybe the server memory usage doesn&apos;t grow as fast compared to using a Lustre 2.5 client. We don&apos;t have quantitative info yet, but if we are able to run some tests and gather numbers, I will pass them along to you.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;Could it be because of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5727&quot; title=&quot;MDS OOMs with 2.5.3 clients and lru_size != 0&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5727&quot;&gt;&lt;del&gt;LU-5727&lt;/del&gt;&lt;/a&gt;?&lt;/p&gt;</comment>
                            <comment id="100049" author="rmohr" created="Tue, 25 Nov 2014 16:52:37 +0000"  >&lt;p&gt;Niu,&lt;/p&gt;

&lt;p&gt;We had another MDS crash last Friday while running purges.  The vm.min_free_kbytes value had been increased to 1.2 GB from 128 MB, but this did not prevent the crash (and most likely caused it to crash slightly sooner because it was trying to keep more memory in reserve).  As a result, we have reverted to using our previous value for that parameter.&lt;/p&gt;

&lt;p&gt;Prior to the crash, I had run some tests to see if there was a difference in the rate of mem usage between clients running Lustre 2.5.3 and 2.4.3.  Based on my tests of deleting ~1M files, it looked like the 2.5.3 client caused the buffer usage to grow by about 2.7 GB and the 2.4.3 client caused usage to grow by about 2 GB.  (It should be noted that these tests were run on a file system that was in production. This might skew the absolute value of the numbers, but I think the relative difference is still reasonably accurate.)  Based on this info, we started using Lustre 2.4.3 clients to run our file system purges.  Since these clients also do not suffer from &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5727&quot; title=&quot;MDS OOMs with 2.5.3 clients and lru_size != 0&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5727&quot;&gt;&lt;del&gt;LU-5727&lt;/del&gt;&lt;/a&gt;, we were able to apply lower limits on the number of cached locks (2000 per client) and verified that the clients were honoring the new limits.  Nevertheless, buffer usage on the MDS continued to grow while the purges were happening, and this resulted in the latest MDS crash.&lt;/p&gt;

&lt;p&gt;The MDS was rebooted after the crash 4 days ago.  Since then we have not run any more purges.  When I checked the MDS this morning, the memory usage looked like this:&lt;/p&gt;

&lt;p&gt;MemTotal:       66053640 kB&lt;br/&gt;
MemFree:        41155360 kB&lt;br/&gt;
Buffers:        14507900 kB&lt;br/&gt;
Cached:           402160 kB&lt;br/&gt;
SwapCached:            0 kB&lt;br/&gt;
Active:          5331216 kB&lt;br/&gt;
Inactive:        9637736 kB&lt;br/&gt;
Active(anon):     149548 kB&lt;br/&gt;
Inactive(anon):    42128 kB&lt;br/&gt;
Active(file):    5181668 kB&lt;br/&gt;
Inactive(file):  9595608 kB&lt;/p&gt;

&lt;p&gt;After running &quot;echo 1 &amp;gt; /proc/sys/vm/drop_caches&quot;, the memory stats were like this:&lt;/p&gt;

&lt;p&gt;MemTotal:       66053640 kB&lt;br/&gt;
MemFree:        56045260 kB&lt;br/&gt;
Buffers:          249556 kB&lt;br/&gt;
Cached:           144876 kB&lt;br/&gt;
SwapCached:            0 kB&lt;br/&gt;
Active:            66128 kB&lt;br/&gt;
Inactive:         387976 kB&lt;br/&gt;
Active(anon):      59276 kB&lt;br/&gt;
Inactive(anon):   132400 kB&lt;br/&gt;
Active(file):       6852 kB&lt;br/&gt;
Inactive(file):   255576 kB&lt;/p&gt;

&lt;p&gt;So during normal usage, the buffer usage grew to about 14.5 GB (as would be expected) but dropping caches easily reclaimed the memory and buffer usage dropped to 0.25 GB.  As far as I can tell, it is only when we delete large numbers of files that we get into situations where the buffer usage will not drop.&lt;/p&gt;

&lt;p&gt;This pretty much guarantees that the MDS will crash whenever we run purges.  For the moment, we are closely monitoring mem usage during purges, and if it grows too much, we will preemptively unmount/remount the MDT to free up memory.  This makes it a little easier for clients to recover, but we always run the risk that some client(s) will not handle the MDT disappearance well and end up getting evicted (which can cause IO errors for end users).  This is definitely a regression from Lustre 1.8.9 where we could go months without any sort of Lustre failure.&lt;/p&gt;</comment>
                            <comment id="100051" author="rmohr" created="Tue, 25 Nov 2014 17:00:58 +0000"  >&lt;p&gt;Is this issue related to &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4740&quot; title=&quot;MDS - buffer cache not freed&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4740&quot;&gt;&lt;del&gt;LU-4740&lt;/del&gt;&lt;/a&gt;?  I came across that bug report when I was first researching my problem.  The failure looked almost identical.  It looked like someone came up with a potential patch at one point, but then it was suggested that the patch was not ready for production use.  Has there been any progress made on a new version of that patch?&lt;/p&gt;</comment>
                            <comment id="100108" author="niu" created="Wed, 26 Nov 2014 03:42:58 +0000"  >&lt;p&gt;Thank you for the information, Rick. I have been investigating the problem of the growing &quot;Buffers&quot;, but have had no luck so far. I&apos;ll keep investigating.&lt;/p&gt;

&lt;p&gt;As for &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4740&quot; title=&quot;MDS - buffer cache not freed&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4740&quot;&gt;&lt;del&gt;LU-4740&lt;/del&gt;&lt;/a&gt;, it looks like the same problem, and the patch mentioned in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4740&quot; title=&quot;MDS - buffer cache not freed&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4740&quot;&gt;&lt;del&gt;LU-4740&lt;/del&gt;&lt;/a&gt; is actually aimed at &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4053&quot; title=&quot;client leaking objects/locks during IO&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4053&quot;&gt;&lt;del&gt;LU-4053&lt;/del&gt;&lt;/a&gt; (which was introduced in 2.5 and is not present in 2.4).&lt;/p&gt;</comment>
                            <comment id="101823" author="rmohr" created="Wed, 17 Dec 2014 16:50:05 +0000"  >&lt;p&gt;Is there any update on this issue?&lt;/p&gt;</comment>
                            <comment id="101824" author="haisong" created="Wed, 17 Dec 2014 17:04:46 +0000"  >&lt;p&gt;SDSC, too, is desperately waiting for a fix for this problem. We have had several filesystem downtimes caused by it recently.&lt;/p&gt;

&lt;p&gt;thank you,&lt;br/&gt;
Haisong&lt;/p&gt;</comment>
                            <comment id="101938" author="niu" created="Thu, 18 Dec 2014 13:35:48 +0000"  >&lt;blockquote&gt;
&lt;p&gt;Is there any update on this issue?&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;Rick, I&apos;m still working on this but haven&apos;t made a dent so far. As a temporary workaround, could you modify your purge program to delete files in batches instead of deleting them all in one shot? You mentioned earlier in this ticket that deleting a small number of files doesn&apos;t increase the buffers footprint; I&apos;m wondering if that would help in your situation.&lt;/p&gt;</comment>
                            <comment id="102007" author="haisong" created="Thu, 18 Dec 2014 22:00:20 +0000"  >&lt;p&gt;Would like to add 2 comments:&lt;/p&gt;

&lt;p&gt;1) Filesystem deletion often happens in userland. If a user chooses to delete 1 million files with one &quot;rm&quot; command, there is little a sysadmin can do to control it. This is true at least in our environment.&lt;br/&gt;
2) My experience tells me that slow deletion only slows down the dying process.&lt;/p&gt;

&lt;p&gt;thanks,&lt;br/&gt;
Haisong&lt;/p&gt;</comment>
                            <comment id="102026" author="rmohr" created="Fri, 19 Dec 2014 02:43:20 +0000"  >&lt;p&gt;I think Haisong&apos;s comments are spot on.  This problem was originally brought to our attention because a user triggered the OOM by deleting files.  Slowing down the deletions only postpones the inevitable, and we still are forced to carefully monitor the system and eventually do a controlled MDS reboot.  Plus, when we have a user that pushes the file system&apos;s inode usage up to 80%, we need to delete files faster, not slower.&lt;/p&gt;

&lt;p&gt;--Rick&lt;/p&gt;</comment>
                            <comment id="102718" author="haisong" created="Wed, 7 Jan 2015 04:36:14 +0000"  >&lt;p&gt;Over the last 2 weeks, we have had 3 MDS crashes because of this bug - the pattern is easily distinguishable and the problem is reproducible. We request that Intel engineers please provide a fix for the bug.&lt;/p&gt;

&lt;p&gt;thanks,&lt;br/&gt;
Haisong   &lt;/p&gt;</comment>
                            <comment id="102817" author="mdiep" created="Wed, 7 Jan 2015 22:08:23 +0000"  >&lt;p&gt;Niu, have you been able to reproduce this problem?  According to SDSC, it can be very easy to reproduce: all you do is create and remove over 1 million files. Please let me know if you need any help.&lt;/p&gt;</comment>
                            <comment id="102837" author="niu" created="Thu, 8 Jan 2015 03:08:01 +0000"  >&lt;blockquote&gt;
&lt;p&gt;Niu, have you been able to reproduce this problem. according to SDSC, this can be very easy to reproduce. All you do is create and remove over 1 million files. Please let me know if you need any help.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;I didn&apos;t reproduce the OOM problem; I reproduced the problem of &quot;Buffers&quot; not shrinking after &quot;rm&quot; with fewer files, but unfortunately I haven&apos;t found the reason yet. I&apos;m now wondering whether that is the real cause of the OOM. If you could help me book some real machines to verify this (rm 1 million files to trigger the OOM), that would be helpful. (All my tests were done in my local VM.) Thank you.&lt;/p&gt;</comment>
                            <comment id="102838" author="mdiep" created="Thu, 8 Jan 2015 04:54:47 +0000"  >&lt;p&gt;I could be wrong, but if the buffer is not freed after &apos;rm&apos;, it will eventually cause the system to OOM, so I think the OOM is the result of the buffer not being freed.&lt;/p&gt;</comment>
                            <comment id="102855" author="niu" created="Thu, 8 Jan 2015 11:58:14 +0000"  >&lt;blockquote&gt;
&lt;p&gt;I could be wrong but if the buffer is not freed after &apos;rm&apos;, eventually, it will cause the system to OOM. so I think OOM is the result from &apos;buffer not freed&apos;&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;Well, I&apos;m not 100 percent sure about this. As far as I can see (in my local VM testing), those &quot;Buffers&quot; should be the buffer_heads against the MDT block device, and I&apos;m not sure whether they can be reclaimed for other purposes once the buffer size reaches some point.&lt;/p&gt;</comment>
                            <comment id="102877" author="haisong" created="Thu, 8 Jan 2015 15:45:16 +0000"  >&lt;p&gt;Hi Niu,&lt;/p&gt;

&lt;p&gt;I also think that the OOM is a byproduct of the buffer memory not being freed up. On our MDS servers, we have reserved 10% of total physical RAM and set an aggressive policy for flushing the dirty cache (i.e. vm.dirty_background_ratio=5 &amp;amp; vm.dirty_ratio=5), but none of these measures helped. We watch the buffer memory increase to the point where MDS threads start to hang, kernel traces are dumped, clients disconnect, and eventually the MDS itself becomes unresponsive to its own shell commands. Sometimes it ends with a kernel panic and sometimes with a complete lockup of the machine. In most cases we don&apos;t see an OOM.&lt;/p&gt;

&lt;p&gt;Haisong &lt;/p&gt;</comment>
                            <comment id="102922" author="rmohr" created="Thu, 8 Jan 2015 21:18:41 +0000"  >&lt;p&gt;Niu,&lt;/p&gt;

&lt;p&gt;A couple of things I wanted to clarify:&lt;/p&gt;

&lt;p&gt;1) Removing 1M files is not necessarily enough to cause an OOM.  I just used that many files in my test because it was large enough that there was a noticeable change in the MDS memory usage, and I could verify that the memory never seemed to be reclaimed.  In production, the MDS memory usage seems to grow continuously over time until it eventually reaches a point where it causes problems.&lt;/p&gt;

&lt;p&gt;2) When I initially created this ticket, we had observed an OOM event on the MDS.  However, in subsequent events, this has not always been the case.  Several times, the situation was more like Haisong described:  The MDS server &quot;loses&quot; memory until it reaches a point where Lustre threads slow to a crawl and eventually the system becomes completely unresponsive.&lt;/p&gt;

&lt;p&gt;So I don&apos;t think reproducing the OOM is necessarily needed to investigate the issue.&lt;/p&gt;

&lt;p&gt;--Rick&lt;/p&gt;</comment>
                            <comment id="102990" author="niu" created="Fri, 9 Jan 2015 12:59:06 +0000"  >&lt;p&gt;Thank you for the information, Haisong &amp;amp; Rick.&lt;/p&gt;</comment>
                            <comment id="103479" author="mdiep" created="Wed, 14 Jan 2015 16:16:39 +0000"  >&lt;p&gt;Niu, do you have any update on this?&lt;/p&gt;</comment>
                            <comment id="103680" author="minyard" created="Thu, 15 Jan 2015 22:00:08 +0000"  >&lt;p&gt;We&apos;ve been watching this ticket at TACC, as we&apos;ve noticed similar behavior with the Lustre 2.5.2 MDS for our /scratch filesystem, where we have to perform occasional purges.  We have also had it crash with what looks like an OOM condition, especially after we&apos;ve run a purge removing millions of files.  I mentioned it to Peter Jones during a call yesterday and he may have relayed some additional details.&lt;/p&gt;

&lt;p&gt;We took the opportunity during our maintenance on Tuesday to try a few things and have some additional information that might help track down this issue.  From what we found, it appears that something in the kernel is not allowing the Inactive(file) portion of memory to be released and reused when needed, which is what the kernel should do.  Before we did anything to the MDS during the maintenance, we had 95GB in Buffers (according to /proc/meminfo; the MDS box has 128GB total) and also 94GB in the Inactive(file) portion of memory.  To see if it could release this buffer cache, we issued vm.drop_caches=3, and while that released some cached file memory, it did not release the buffer memory like it usually does.  We then unmounted the MDT and removed the Lustre modules, and the Buffers portion of memory dropped to a very low value, but there was still 94GB in Inactive(file).  We then tried to run some programs that would use the memory; however, none of them could ever get back any of the 94GB used by Inactive(file).  The only way we found to recover this memory was to reboot the server.  So even though the usage is shown in Buffers, it seems that Inactive(file) is the portion that the kernel cannot recover after many files have been removed.
Not sure if you have noticed the same behavior, but we thought this might help in tracking down this issue.&lt;/p&gt;

&lt;p&gt;We&apos;re running some tests on another testbed filesystem so if there is some additional information you would like to have, let us know.  We definitely need to get this resolved as it is requiring us to reboot the MDS after every purge to prevent it from running out of memory.&lt;/p&gt;</comment>
                            <comment id="103856" author="niu" created="Mon, 19 Jan 2015 15:59:54 +0000"  >&lt;p&gt;After quite a lot of testing &amp;amp; debugging with Lai, we found that a brelse() is missing in the ldiskfs large EA patch; I&apos;ll post a patch soon.&lt;/p&gt;</comment>
                            <comment id="103857" author="gerrit" created="Mon, 19 Jan 2015 16:03:58 +0000"  >&lt;p&gt;Niu Yawei (yawei.niu@intel.com) uploaded a new patch: &lt;a href=&quot;http://review.whamcloud.com/13452&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/13452&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5726&quot; title=&quot;MDS buffer not freed when deleting files&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5726&quot;&gt;&lt;del&gt;LU-5726&lt;/del&gt;&lt;/a&gt; ldiskfs: missed brelse() in large EA patch&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 1eb46ffbec85016db1054594094abde6d09a3616&lt;/p&gt;</comment>
                            <comment id="103858" author="niu" created="Mon, 19 Jan 2015 16:06:43 +0000"  >&lt;p&gt;patch to master: &lt;a href=&quot;http://review.whamcloud.com/13452&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/13452&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="103955" author="adilger" created="Tue, 20 Jan 2015 00:36:49 +0000"  >&lt;p&gt;Niu, Lai, excellent work finding and fixing this bug.&lt;/p&gt;

&lt;p&gt;A question for the users hitting this problem - is the &lt;tt&gt;ea_inode&lt;/tt&gt; (also named &lt;tt&gt;large_xattr&lt;/tt&gt;) feature enabled on the MDT filesystem?  Running &lt;tt&gt;dumpe2fs -h /dev/{mdtdev} | grep features&lt;/tt&gt; on the MDT device would list &lt;tt&gt;ea_inode&lt;/tt&gt; in the &lt;tt&gt;Filesystem features:&lt;/tt&gt; output.  This feature is needed if there are more than 160 OSTs in the filesystem, or if many and/or large xattrs are being stored (e.g. lots of ACLs, user xattrs, etc).&lt;/p&gt;

&lt;p&gt;While I hope that is the case and we can close this bug, if the &lt;tt&gt;ea_inode&lt;/tt&gt; feature is not enabled on your MDT, then this patch is unlikely to solve your problem.&lt;/p&gt;</comment>
                            <comment id="103965" author="haisong" created="Tue, 20 Jan 2015 02:29:50 +0000"  >
&lt;p&gt;We are running 2.4.3 and 2.5.3 with default MDT settings, so ea_inode is not enabled (here is the output from one of our MDTs):&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;root@puma-mds-10-5 ~&amp;#93;&lt;/span&gt;# dumpe2fs -h /dev/md0 | grep features&lt;br/&gt;
dumpe2fs 1.42.7.wc1 (12-Apr-2013)&lt;br/&gt;
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery mmp flex_bg sparse_super large_file huge_file uninit_bg dir_nlink quota  &lt;br/&gt;
Journal features:         journal_incompat_revoke&lt;/p&gt;

&lt;p&gt;In addition, all of our filesystems that hit this bug have fewer than 160 OSTs.&lt;/p&gt;

&lt;p&gt;Haisong&lt;/p&gt;
</comment>
                            <comment id="103969" author="niu" created="Tue, 20 Jan 2015 03:17:04 +0000"  >&lt;p&gt;Andreas, ea_inode/large_xattr isn&apos;t enabled in my testing either, but I still observed the &quot;growing buffers&quot; problem; I think this bug will be triggered as long as the inode has an EA in the inode body.&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;&lt;span class=&quot;code-object&quot;&gt;int&lt;/span&gt;
ldiskfs_xattr_delete_inode(handle_t *handle, struct inode *inode,
                        struct ldiskfs_xattr_ino_array **lea_ino_array)
{
        struct buffer_head *bh = NULL;
        struct ldiskfs_xattr_ibody_header *header;
        struct ldiskfs_inode *raw_inode;
        struct ldiskfs_iloc iloc;
        struct ldiskfs_xattr_entry *entry;
        &lt;span class=&quot;code-object&quot;&gt;int&lt;/span&gt; error = 0;

        &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; (!ldiskfs_test_inode_state(inode, LDISKFS_STATE_XATTR))
                &lt;span class=&quot;code-keyword&quot;&gt;goto&lt;/span&gt; delete_external_ea;

        error = ldiskfs_get_inode_loc(inode, &amp;amp;iloc);
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;As long as the LDISKFS_STATE_XATTR is set on inode, it&apos;ll get the bh.&lt;/p&gt;</comment>
                            <comment id="103975" author="gerrit" created="Tue, 20 Jan 2015 09:54:29 +0000"  >&lt;p&gt;Niu Yawei (yawei.niu@intel.com) uploaded a new patch: &lt;a href=&quot;http://review.whamcloud.com/13464&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/13464&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5726&quot; title=&quot;MDS buffer not freed when deleting files&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5726&quot;&gt;&lt;del&gt;LU-5726&lt;/del&gt;&lt;/a&gt; ldiskfs: missed brelse() in large EA patch&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_5&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 516a0cf6020fa169b0890ba6a51dc8295c1a44cd&lt;/p&gt;</comment>
                            <comment id="103976" author="niu" created="Tue, 20 Jan 2015 09:55:15 +0000"  >&lt;p&gt;Port to b2_5: &lt;a href=&quot;http://review.whamcloud.com/13464&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/13464&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="104352" author="gerrit" created="Thu, 22 Jan 2015 17:53:22 +0000"  >&lt;p&gt;Oleg Drokin (oleg.drokin@intel.com) merged in patch &lt;a href=&quot;http://review.whamcloud.com/13452/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/13452/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5726&quot; title=&quot;MDS buffer not freed when deleting files&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5726&quot;&gt;&lt;del&gt;LU-5726&lt;/del&gt;&lt;/a&gt; ldiskfs: missed brelse() in large EA patch&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: ffd42ff529f5823b5a04529e1db2ea3b32a9f59f&lt;/p&gt;</comment>
                            <comment id="104382" author="rmohr" created="Thu, 22 Jan 2015 20:08:22 +0000"  >&lt;p&gt;In response to Andreas&apos; question:&lt;/p&gt;

&lt;p&gt;dumpe2fs 1.42.12.wc1 (15-Sep-2014)&lt;br/&gt;
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery extent mmp flex_bg dirdata sparse_super large_file huge_file uninit_bg dir_nlink extra_isize quota&lt;br/&gt;
Journal features:         journal_incompat_revoke&lt;/p&gt;

&lt;p&gt;Our file system has 90 OSTs.&lt;/p&gt;</comment>
                            <comment id="104460" author="niu" created="Fri, 23 Jan 2015 02:55:23 +0000"  >&lt;p&gt;Rick, could you verify whether the patch fixes your problem? It works for me; after applying the patch, I didn&apos;t see the &quot;growing buffers&quot; problem anymore.&lt;/p&gt;</comment>
                            <comment id="104574" author="rmohr" created="Fri, 23 Jan 2015 21:27:26 +0000"  >&lt;p&gt;Niu,&lt;/p&gt;

&lt;p&gt;Our testbed is currently down, but we are trying to get it back up and running again.  Once that is done, we will work on applying your patch and testing it.  (Although this might not happen for another week.)&lt;/p&gt;</comment>
                            <comment id="104734" author="gerrit" created="Mon, 26 Jan 2015 19:24:22 +0000"  >&lt;p&gt;Oleg Drokin (oleg.drokin@intel.com) merged in patch &lt;a href=&quot;http://review.whamcloud.com/13464/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/13464/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5726&quot; title=&quot;MDS buffer not freed when deleting files&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5726&quot;&gt;&lt;del&gt;LU-5726&lt;/del&gt;&lt;/a&gt; ldiskfs: missed brelse() in large EA patch&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_5&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 8f7181ee5553aa22ecfe51202f3db1a4162361e7&lt;/p&gt;</comment>
                            <comment id="105280" author="rmohr" created="Fri, 30 Jan 2015 23:04:52 +0000"  >&lt;p&gt;Niu,&lt;/p&gt;

&lt;p&gt;We were able to get a test file system setup and applied the patch.  Initial testing seems to show that the problem is fixed.  We have applied the patch to Lustre 2.4.3 and plan to roll it out to our production file system next week.  After that, we will run some further tests and let you know if there are any problems.&lt;/p&gt;</comment>
                            <comment id="105282" author="pjones" created="Fri, 30 Jan 2015 23:21:56 +0000"  >&lt;p&gt;That&apos;s great news - thanks Rick!&lt;/p&gt;</comment>
                            <comment id="105747" author="rmohr" created="Wed, 4 Feb 2015 22:43:16 +0000"  >&lt;p&gt;Niu,&lt;/p&gt;

&lt;p&gt;We just applied the patch today to our production file system (Lustre 2.4.3) and are running some heavy purges right now.  I collected some info about the memory usage. Prior to the patch, it seemed like the memory growth was dominated by the &quot;Inactive(file)&quot; value in /proc/meminfo.  I dropped the caches on the MDS server (echo 3 &amp;gt; /proc/sys/vm/drop_caches) and then collected the Inactive(file) usage every minute:&lt;/p&gt;

&lt;p&gt;Inactive(file):  1146656 kB&lt;br/&gt;
Inactive(file):  3426128 kB&lt;br/&gt;
Inactive(file):  5510484 kB&lt;br/&gt;
Inactive(file):  6634728 kB&lt;br/&gt;
Inactive(file):  7514500 kB&lt;br/&gt;
Inactive(file):  8075948 kB&lt;br/&gt;
Inactive(file):  8662528 kB&lt;br/&gt;
Inactive(file):  9210796 kB&lt;br/&gt;
Inactive(file):  9576412 kB&lt;br/&gt;
Inactive(file):  9974336 kB&lt;br/&gt;
Inactive(file): 10400772 kB&lt;br/&gt;
Inactive(file): 10710464 kB&lt;br/&gt;
Inactive(file): 10964180 kB&lt;br/&gt;
Inactive(file): 11280900 kB&lt;br/&gt;
Inactive(file): 11591336 kB&lt;br/&gt;
Inactive(file): 11731164 kB&lt;br/&gt;
Inactive(file): 11817340 kB&lt;br/&gt;
Inactive(file): 11920016 kB&lt;br/&gt;
Inactive(file): 12040800 kB&lt;br/&gt;
Inactive(file): 12196232 kB&lt;br/&gt;
Inactive(file): 12148272 kB&lt;br/&gt;
Inactive(file): 12269224 kB&lt;br/&gt;
Inactive(file): 12251768 kB&lt;br/&gt;
Inactive(file): 12263596 kB&lt;/p&gt;

&lt;p&gt;The number initially ramped up fast, but then leveled off a bit.  Just to double-check, I dropped the caches again:&lt;/p&gt;

&lt;p&gt;Inactive(file):   401152 kB&lt;br/&gt;
Inactive(file):  2724788 kB&lt;br/&gt;
Inactive(file):  4409916 kB&lt;br/&gt;
Inactive(file):  6003208 kB&lt;br/&gt;
Inactive(file):  6532220 kB&lt;br/&gt;
Inactive(file):  7319768 kB&lt;br/&gt;
Inactive(file):  8154560 kB&lt;br/&gt;
Inactive(file):  8769084 kB&lt;br/&gt;
Inactive(file):  9271760 kB&lt;br/&gt;
Inactive(file):  9650020 kB&lt;br/&gt;
Inactive(file):  9918932 kB&lt;br/&gt;
Inactive(file): 10170456 kB&lt;br/&gt;
Inactive(file): 10303404 kB&lt;br/&gt;
Inactive(file): 10602256 kB&lt;br/&gt;
Inactive(file): 10972760 kB&lt;br/&gt;
Inactive(file): 11509680 kB&lt;br/&gt;
Inactive(file): 11986980 kB&lt;br/&gt;
Inactive(file): 12436528 kB&lt;br/&gt;
Inactive(file): 12770672 kB&lt;br/&gt;
Inactive(file): 13195352 kB&lt;br/&gt;
Inactive(file): 13463276 kB&lt;br/&gt;
Inactive(file): 13807816 kB&lt;br/&gt;
Inactive(file): 14029160 kB&lt;br/&gt;
Inactive(file): 14749976 kB&lt;br/&gt;
Inactive(file): 14879704 kB&lt;br/&gt;
Inactive(file): 14908984 kB&lt;br/&gt;
Inactive(file): 14988196 kB&lt;br/&gt;
Inactive(file): 15123316 kB&lt;br/&gt;
Inactive(file): 15240824 kB&lt;br/&gt;
Inactive(file): 15341328 kB&lt;br/&gt;
Inactive(file): 15464332 kB&lt;/p&gt;

&lt;p&gt;We got the same behavior, and more importantly, we seem to be reclaiming the memory from Inactive(file).  I also checked MemFree and Buffers before/after dropping caches:&lt;/p&gt;

&lt;p&gt;(Before)&lt;br/&gt;
MemTotal:       66053640 kB&lt;br/&gt;
MemFree:        51291028 kB&lt;br/&gt;
Buffers:        10685976 kB&lt;/p&gt;

&lt;p&gt;(After)&lt;br/&gt;
MemTotal:       66053640 kB&lt;br/&gt;
MemFree:        63239432 kB&lt;br/&gt;
Buffers:          198148 kB&lt;/p&gt;

&lt;p&gt;Buffer usage dropped below 200 MB.  Given the rate at which we are purging, that never would have happened prior to applying the patch.&lt;/p&gt;

&lt;p&gt;I feel 90% confident that this patch solved the problem.  If we can continue purging at this rate over the next couple of days without increased memory usage, then I think I will be 100% confident.&lt;/p&gt;</comment>
                            <comment id="105755" author="gerrit" created="Wed, 4 Feb 2015 23:17:59 +0000"  >&lt;p&gt;Minh Diep (minh.diep@intel.com) uploaded a new patch: &lt;a href=&quot;http://review.whamcloud.com/13655&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/13655&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5726&quot; title=&quot;MDS buffer not freed when deleting files&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5726&quot;&gt;&lt;del&gt;LU-5726&lt;/del&gt;&lt;/a&gt; ldiskfs: missed brelse() in large EA patch&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_4&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 8e29ef136f426fef66b6008d379afc5e5ddc4ab5&lt;/p&gt;</comment>
                            <comment id="105851" author="rmohr" created="Thu, 5 Feb 2015 16:21:18 +0000"  >&lt;p&gt;Niu,&lt;/p&gt;

&lt;p&gt;I checked the MDS mem usage again this morning:&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;MemTotal:       66053640 kB
MemFree:         5568288 kB
Buffers:        55504980 kB
Active:         22374284 kB
Inactive:       33260116 kB
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;After I dropped caches:&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;MemTotal:       66053640 kB
MemFree:        63146420 kB
Buffers:           59788 kB
Active:            57960 kB
Inactive:          93452 kB
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Looks like the patch is successful.&lt;/p&gt;</comment>
                            <comment id="105904" author="pjones" created="Thu, 5 Feb 2015 18:54:47 +0000"  >&lt;p&gt;Great news. Landed for 2.5.4 and 2.7&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="25546">LU-5333</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="27409">LU-5841</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="15935" name="lustre-debug-malloc.gz" size="239" author="rmohr" created="Sat, 11 Oct 2014 07:13:39 +0000"/>
                            <attachment id="16276" name="mds-crash-log-20140913" size="48262" author="rmohr" created="Wed, 29 Oct 2014 18:50:42 +0000"/>
                            <attachment id="15934" name="meminfo.after" size="1241" author="rmohr" created="Sat, 11 Oct 2014 07:10:55 +0000"/>
                            <attachment id="15933" name="meminfo.before" size="1241" author="rmohr" created="Sat, 11 Oct 2014 07:10:55 +0000"/>
                            <attachment id="15932" name="slabinfo.after" size="26338" author="rmohr" created="Sat, 11 Oct 2014 07:10:55 +0000"/>
                            <attachment id="15931" name="slabinfo.before" size="26339" author="rmohr" created="Sat, 11 Oct 2014 07:10:55 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzwybj:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>16083</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>