<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:44:34 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary, append 'field=key&field=summary' to the URL of your request.
-->
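<!--
As a usage note on the field restriction described above: the issue-xml view can be
fetched with the field parameters appended. The URL path below follows the usual JIRA
issue-xml view pattern and is shown as an assumption, not taken from this file.

    # fetch only the key and summary fields of LU-4641 (view path assumed)
    curl 'https://jira.whamcloud.com/si/jira.issueviews:issue-xml/LU-4641/LU-4641.xml?field=key&field=summary'
-->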
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-4641] ldiskfs_inode_cache slab high usage</title>
                <link>https://jira.whamcloud.com/browse/LU-4641</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;We are seeing high usage from the ldiskfs_inode_cache slab. This filesystem has ~430M files in it, and currently we are utilizing ~90GB of ldiskfs_inode_cache.&lt;/p&gt;

&lt;p&gt;We are currently running a test to create 500M files in a test filesystem; we wanted to break this out from &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4570&quot; title=&quot;Metadata slowdowns on production filesystem at ORNL&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4570&quot;&gt;&lt;del&gt;LU-4570&lt;/del&gt;&lt;/a&gt;. More data to come.&lt;/p&gt;</description>
                <environment>RHEL 6.4, kernel  2.6.32_358.23.2.el6, including patch 9127 from &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4579&quot; title=&quot;Timeout system horribly broken&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4579&quot;&gt;&lt;strike&gt;LU-4579&lt;/strike&gt;&lt;/a&gt;, and &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4006&quot; title=&quot;LNET Messages staying in Queue&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4006&quot;&gt;&lt;strike&gt;LU-4006&lt;/strike&gt;&lt;/a&gt;.</environment>
        <key id="23181">LU-4641</key>
            <summary>ldiskfs_inode_cache slab high usage</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="6" iconUrl="https://jira.whamcloud.com/images/icons/statuses/closed.png" description="The issue is considered finished, the resolution is correct. Issues which are closed can be reopened.">Closed</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="6">Not a Bug</resolution>
                                        <assignee username="niu">Niu Yawei</assignee>
                                    <reporter username="hilljjornl">Jason Hill</reporter>
                        <labels>
                    </labels>
                <created>Mon, 17 Feb 2014 18:31:48 +0000</created>
                <updated>Thu, 3 Jul 2014 20:33:01 +0000</updated>
                            <resolved>Thu, 3 Jul 2014 20:33:01 +0000</resolved>
                                    <version>Lustre 2.4.2</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>10</watches>
                                                                            <comments>
                            <comment id="77200" author="pjones" created="Mon, 17 Feb 2014 19:06:33 +0000"  >&lt;p&gt;Niu&lt;/p&gt;

&lt;p&gt;Could you please advise on this ticket as the data comes in?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="77201" author="hilljjornl" created="Mon, 17 Feb 2014 19:23:02 +0000"  >&lt;p&gt;Update:&lt;/p&gt;

&lt;p&gt;echo 2 &amp;gt; /proc/sys/vm/drop_caches has helped immensely. Slab usage down to ~20GB, not growing highly.&lt;/p&gt;

&lt;p&gt;On test system where we are trying to create large number of files we see slab size growing at 1MB/s. Currently have 8M files created, target is 500M. Watching via collectl for slab usage on the production MDS.&lt;/p&gt;

&lt;p&gt;Trying to determine if something within the center is running mlocate via cron and causing a full filesystem walk to occur. After the MDS was rebooted on Friday the cache grew and grew until we started having issues allocating memory.&lt;/p&gt;

&lt;p&gt;Stay tuned. Thanks.&lt;/p&gt;</comment>
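<!--
A minimal sketch of the workflow described in the comment above: dropping reclaimable
slab objects and then watching the ldiskfs inode slab. The grep pattern and the
60-second interval are illustrative assumptions, not taken from the ticket.

    # value 2 frees reclaimable slab objects (dentries and inodes)
    echo 2 > /proc/sys/vm/drop_caches

    # watch the ldiskfs inode slab; /proc/slabinfo lists active objects,
    # total objects, and object size per cache
    while true; do
        grep ldiskfs_inode_cache /proc/slabinfo
        sleep 60
    done
-->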
                            <comment id="77205" author="hilljjornl" created="Mon, 17 Feb 2014 20:27:03 +0000"  >&lt;p&gt;We think we have the culprit. Found 4 Cray nodes that had updatedb enabled and the last update to the db was 2/16. I will ask to resolve this once we&apos;ve verified this keeps the inode_cache slab usage down.&lt;/p&gt;</comment>
                            <comment id="77209" author="ezell" created="Mon, 17 Feb 2014 23:52:17 +0000"  >&lt;p&gt;Do you have any suggestions to prevent this from happening in the future?  Obviously, we want to keep updatedb from running against Lustre mounts, but will the MDS eventually cache enough inodes from normal usage?&lt;/p&gt;

&lt;p&gt;Should we set vm.zone_reclaim_mode or vm.min_free_kbytes ?  To my knowledge (Jason, please confirm), we leave these at the default.&lt;/p&gt;</comment>
                            <comment id="77210" author="niu" created="Tue, 18 Feb 2014 00:55:40 +0000"  >&lt;p&gt;What&apos;s the value of vfs_cache_pressure? I think you&apos;d tune it to a higher value to keep kernel reclaiming inode cache harder. Following is copied from kernel doc:&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre&gt;vfs_cache_pressure
------------------

Controls the tendency of the kernel to reclaim the memory which is used for
caching of directory and inode objects.

At the default value of vfs_cache_pressure=100 the kernel will attempt to
reclaim dentries and inodes at a &quot;fair&quot; rate with respect to pagecache and
swapcache reclaim.  Decreasing vfs_cache_pressure causes the kernel to prefer
to retain dentry and inode caches. When vfs_cache_pressure=0, the kernel will
never reclaim dentries and inodes due to memory pressure and this can easily
lead to out-of-memory conditions. Increasing vfs_cache_pressure beyond 100
causes the kernel to prefer to reclaim dentries and inodes.
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
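<!--
A sketch of the tuning Niu suggests above; the value 200 is an illustrative assumption,
since any value above the default of 100 biases reclaim toward dentries and inodes.

    # check the current value (the default is 100)
    sysctl vm.vfs_cache_pressure

    # raise it for the running system
    sysctl -w vm.vfs_cache_pressure=200

    # persist across reboots (file location assumed; distributions vary)
    echo 'vm.vfs_cache_pressure = 200' >> /etc/sysctl.conf
-->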
                            <comment id="77211" author="hilljjornl" created="Tue, 18 Feb 2014 01:16:22 +0000"  >&lt;p&gt;Matt, correct we do not set these parameters specifically:&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;root@atlas-mds1 ~&amp;#93;&lt;/span&gt;# cat /proc/sys/vm/zone_reclaim_mode &lt;br/&gt;
0&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;root@atlas-mds1 ~&amp;#93;&lt;/span&gt;# cat /proc/sys/vm/min_free_kbytes   &lt;br/&gt;
90112&lt;/p&gt;

&lt;p&gt;Niu:&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;root@atlas-mds1 ~&amp;#93;&lt;/span&gt;# cat /proc/sys/vm/vfs_cache_pressure &lt;br/&gt;
100&lt;/p&gt;</comment>
                            <comment id="77215" author="niu" created="Tue, 18 Feb 2014 04:38:16 +0000"  >&lt;p&gt;100 is the default value, I think you&apos;d increase it to a larger value to see if it helps.&lt;/p&gt;</comment>
                            <comment id="79305" author="jfc" created="Fri, 14 Mar 2014 01:36:16 +0000"  >&lt;p&gt;Jason or Matt,&lt;br/&gt;
Any further progress/action on this issue?&lt;br/&gt;
Thanks,&lt;br/&gt;
~ jfc&lt;/p&gt;</comment>
                            <comment id="80543" author="jfc" created="Sat, 29 Mar 2014 01:16:22 +0000"  >&lt;p&gt;Looks like initial issue was resolved, and suggestions made to prevent a recurrence.&lt;br/&gt;
~ jfc.&lt;/p&gt;</comment>
                            <comment id="81293" author="jamesanunez" created="Wed, 9 Apr 2014 15:18:26 +0000"  >&lt;p&gt;I&apos;d like to get feedback from ORNL on if the larger vfs_cache_pressure solves their problem before we close this ticket. &lt;/p&gt;</comment>
                            <comment id="82994" author="jamesanunez" created="Thu, 1 May 2014 15:17:55 +0000"  >&lt;p&gt;Per ORNL, they made some configuration changes and this issue has not been seen since. Thus, they are not able to test if changes to vfs_cache_pressure help this issue. &lt;/p&gt;

&lt;p&gt;Please reopen the ticket if this problem is seen again.&lt;/p&gt;</comment>
                            <comment id="85159" author="blakecaldwell" created="Thu, 29 May 2014 17:58:04 +0000"  >&lt;p&gt;My original question when discussing this with James and Peter Jones was regarding a reasonable value for vfs_cache_pressure. However, I believe that our current issue may not be appropriate for tuning vfs_cache_pressure and more suitable for min_free_kbytes. We are still page allocation failures, but not from high ldiskfs_inode_cache usage. The catalyst to the page allocation failures is a process that reads inodes from the MDT device, so I would expect the buffer cache is being used now. Lustre processes are unable to allocate memory in these circumstances. Sample error messages on the mds:&lt;/p&gt;

&lt;p&gt;May 29 10:27:43 atlas-mds1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;1451338.151860&amp;#93;&lt;/span&gt; LustreError: 12106:0:(lvfs_lib.c:151:lprocfs_stats_alloc_one()) LNET: out of memory at /data/buildsystem/jsimmons-atlas/rpmbuild/BUILD/lustre-2.4.3/lustre/lvfs/lvfs_lib.c:151 (tried to alloc &apos;(stats-&amp;gt;ls_percpu&lt;span class=&quot;error&quot;&gt;&amp;#91;cpuid&amp;#93;&lt;/span&gt;)&apos; = 4224)&lt;br/&gt;
May 29 10:27:43 atlas-mds1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;1451338.187813&amp;#93;&lt;/span&gt; LustreError: 12106:0:(lvfs_lib.c:151:lprocfs_stats_alloc_one()) LNET: 1493692160 total bytes allocated by lnet&lt;br/&gt;
May 29 10:30:01 atlas-mds1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;1451476.211959&amp;#93;&lt;/span&gt; swapper: page allocation failure. order:2, mode:0x20&lt;/p&gt;


&lt;p&gt;order:2, mode:0x20 is a GFP_ATOMIC allocation, so it can be satisfied with reserved pages, so I believe increasing vm.min_free_kbytes would be better? If I&apos;m off base here, please let me know, otherwise, we will plan on increasing vm.min_free_kbytes to 131072 from its current value 90112.&lt;/p&gt;

&lt;p&gt;Sar output showing the increase in number pages stolen from the caches.&lt;/p&gt;

&lt;pre&gt;08:40:01     pgpgin/s pgpgout/s   fault/s  majflt/s  pgfree/s pgscank/s pgscand/s pgsteal/s    %vmeff

07:10:01       186.65   9765.13   1416.66      0.00   2098.86      0.00      0.00      0.00      0.00
07:20:01       210.79  10055.52   1397.76      0.00   2861.71      0.00      0.00      0.00      0.00
07:30:01       131.14  10974.77   1235.35      0.00   1652.99      0.00      0.00      0.00      0.00
07:40:01    426522.51   6263.64   2756.01      0.00 478545.36      0.00      0.00      0.00      0.00
07:50:01    361767.26   8035.57  14631.19      0.00  48198.88      0.00      0.00      0.00      0.00
08:00:01    204183.64   9792.66   1260.39      0.00   1829.13      0.00      0.00      0.00      0.00
08:10:01    285855.99   9190.99    891.64      0.00   1629.84      0.00      0.00      0.00      0.00
08:20:01    605935.81   8329.74   2458.08      0.00  42271.84      0.00      0.00      0.00      0.00
08:30:01    350884.42   9249.91    882.08      0.00   1874.32      0.00      0.00      0.00      0.00
08:40:01    182957.52  12157.11    881.74      0.00   1645.76      0.00      0.00      0.00      0.00
08:50:01    116249.49  10314.71    869.47      0.00   1584.80     25.73      0.00     22.22     86.37
09:00:01    162919.34   9482.70    877.63      0.13  30862.31   9237.61      0.00   9176.72     99.34
09:10:01    192086.91   9473.00    910.78      0.00   6749.49    193.89      7.57    122.82     60.96
09:20:01    163713.12  10507.92    872.73      0.00  20656.07   6495.97      3.82   6459.04     99.37
09:30:01    143792.88  10704.30    873.22      0.00   6633.74   1056.53    200.97   1242.53     98.81
09:40:01    104059.86  10166.70    886.59      0.12 102152.89   6556.49    359.32   6890.73     99.64
09:50:02    104577.50  10842.55    886.01      0.02  14661.68   2384.46    353.41   2734.60     99.88
10:00:01    180071.92  10456.90    805.04      0.00  62215.50   4995.73    223.59   5219.43    100.00
10:10:01    136018.15  11376.03   1009.02      0.00  19493.03   7271.04    552.33   7823.36    100.00
10:20:01    150911.00  10693.11    872.92      0.12  15671.27   8858.82    778.53   9637.33    100.00
10:30:01    117841.52  13424.69    898.84      0.00   8208.61   4252.03    638.23   4890.13    100.00
10:40:01    143435.88  10189.68    900.52      0.01  39258.78   4963.34   1007.84   5971.28    100.00
10:50:01       124.18  10771.05    849.31      0.00  42637.89      0.00      0.00      0.00      0.00
11:00:01       432.96  12335.04    899.87      0.00    911.12      0.00      0.00      0.00      0.00&lt;/pre&gt;</comment>
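<!--
A sketch of the change proposed in the comment above, enlarging the reserved pool that
GFP_ATOMIC allocations can draw from; 131072 is the value named in the comment.

    # current reserve (the comment reports 90112 kB)
    cat /proc/sys/vm/min_free_kbytes

    # raise the reserve to 128 MB as proposed
    sysctl -w vm.min_free_kbytes=131072
-->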
                            <comment id="85865" author="pjones" created="Thu, 5 Jun 2014 17:08:25 +0000"  >&lt;p&gt;Reopening to track follow on question&lt;/p&gt;</comment>
                            <comment id="85955" author="niu" created="Fri, 6 Jun 2014 01:54:12 +0000"  >&lt;p&gt;I don&apos;t have objection on this, if current problem is short of lowmem, we&apos;d consider increasing the value of min_free_kbytes and decreasing the value of lowmem_reserve_ratio.&lt;/p&gt;</comment>
                            <comment id="86464" author="blakecaldwell" created="Thu, 12 Jun 2014 20:08:38 +0000"  >&lt;p&gt;Increasing min_free_kbytes to reserve 128MB did not prevent these allocation failures. We just saw more. Next step is to decrease lowmem_reserve_ratio. Do you have a recommended value? The current value is &lt;br/&gt;
vm.lowmem_reserve_ratio = 256   256     32&lt;/p&gt;</comment>
                            <comment id="86515" author="niu" created="Fri, 13 Jun 2014 02:06:36 +0000"  >&lt;p&gt;The middle value (256) is for normal zone, if you decrease this value, kernel will defending this zone more aggressively, however, I&apos;m not sure why what value is proper.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;We are still page allocation failures, but not from high ldiskfs_inode_cache usage. The catalyst to the page allocation failures is a process that reads inodes from the MDT device, so I would expect the buffer cache is being used now.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;I think reading inodes will populate ldiskfs_inode_cache, won&apos;t tuning vfs_cache_pressure help?&lt;/p&gt;</comment>
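<!--
A sketch of the step discussed in the two comments above. lowmem_reserve_ratio holds one
ratio per zone, and a smaller ratio reserves more pages in that zone; the replacement
value 64 is purely an illustrative assumption, since no recommended value is given in
the ticket.

    # current ratios (the comment reports "256   256     32")
    cat /proc/sys/vm/lowmem_reserve_ratio

    # decrease the middle entry so the kernel defends that zone harder (value assumed)
    sysctl -w vm.lowmem_reserve_ratio="256 64 32"
-->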
                            <comment id="86693" author="blakecaldwell" created="Mon, 16 Jun 2014 15:12:43 +0000"  >&lt;p&gt;I captured some more stats while the issue was present this morning.&lt;/p&gt;

&lt;p&gt;Jun 16 10:00:08 atlas-mds1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;515727.290363&amp;#93;&lt;/span&gt; ptlrpcd_18: page allocation failure. order:1, mode:0x20&lt;br/&gt;
Jun 16 10:00:08 atlas-mds1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;515727.290537&amp;#93;&lt;/span&gt; ptlrpcd_4: page allocation failure. order:1, mode:0x20&lt;br/&gt;
Jun 16 10:00:08 atlas-mds1 kernel: &lt;span class=&quot;error&quot;&gt;&amp;#91;515727.290567&amp;#93;&lt;/span&gt; ptlrpcd_12: page allocation failure. order:1, mode:0x20&lt;/p&gt;

&lt;p&gt;I&apos;m attaching the kernel logs with page stats that were dumped at the time of the failed allocations above. The normal zone looks to have plenty of pages available.  It appears DMA zones are much tighter on pages. Could these be the source of allocation failures?&lt;/p&gt;

&lt;p&gt;Since it is reading directly from block device (not through lustre), the page cache is used rather than filesystem (buffer) caches. Since the inode/dentry cache is not of concern, I don&apos;t think tuning vfs_cache_pressure will help.&lt;/p&gt;

&lt;p&gt;System wide usage:&lt;br/&gt;
MEM |  tot   252.2G |  free  672.5M |  cache 107.7G  | dirty   5.2M  | buff   50.4G  | slab   51.4G |&lt;/p&gt;

&lt;p&gt;Userspace program has RSS of 38.6G. Out of 51G of slab usage, 28G are from the size-512. Only 4G is is used by ldiskfs_inode_cache.&lt;/p&gt;

&lt;p&gt;/proc/zoneinfo currently:&lt;/p&gt;

&lt;pre&gt;Node 0, zone      DMA
  pages free     3935
        min      1
        low      1
        high     1
        protection: (0, 1931, 129191, 129191)

Node 0, zone    DMA32
  pages free     96690
        min      244
        low      305
        high     366
        protection: (0, 0, 127260, 127260)

Node 0, zone   Normal
  pages free     25763
        min      16132
        low      20165
        high     24198
        protection: (0, 0, 0, 0)

Node 1, zone   Normal
  pages free     23517
        min      16388
        low      20485
        high     24582
        protection: (0, 0, 0, 0)&lt;/pre&gt;</comment>
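<!--
A sketch for reproducing the per-zone check described above: comparing each zone's free
pages against its min watermark, below which only allocations that may dip into the
reserves (e.g. GFP_ATOMIC) succeed. The awk program is an illustrative assumption.

    # print each zone's free pages next to its min watermark
    awk '/^Node/      { zone = $0 }
         /pages free/ { free = $3 }
         /^ *min/     { print zone, "free:", free, "min:", $2 }' /proc/zoneinfo
-->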
                            <comment id="86695" author="blakecaldwell" created="Mon, 16 Jun 2014 15:14:09 +0000"  >&lt;p&gt;page allocation failure log messages&lt;/p&gt;</comment>
                            <comment id="87016" author="niu" created="Thu, 19 Jun 2014 14:12:34 +0000"  >&lt;blockquote&gt;
&lt;p&gt;Since it is reading directly from block device (not through lustre), the page cache is used rather than filesystem (buffer) caches. Since the inode/dentry cache is not of concern, I don&apos;t think tuning vfs_cache_pressure will help.&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;When reclaiming inode/dentry, the pagecache associated with the inode will be reclaimed too. Reading block device would consume lots of pagepache, I think it&apos;s worth a try that tuning the vfs_cache_pressure. (BTW: will drop_cache relieve the situation immediately?)&lt;/p&gt;</comment>
                            <comment id="88139" author="jamesanunez" created="Thu, 3 Jul 2014 20:33:01 +0000"  >&lt;p&gt;Per a conversation with ORNL, we can close this ticket. &lt;/p&gt;

&lt;p&gt;Please reopen if more work or information is needed.&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                            <attachment id="15152" name="atlas-mds1_page_allocation_failures" size="48557" author="blakecaldwell" created="Mon, 16 Jun 2014 15:14:09 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzwf8v:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>12692</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>