<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 03:23:32 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-16047] cache contention in &quot;.lustre/fid/</title>
                <link>https://jira.whamcloud.com/browse/LU-16047</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;This issue was observed with robinhood clients:&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;robinhood becomes slower over time at syncing the filesystem from the changelog&lt;/li&gt;
	&lt;li&gt;robinhood becomes even slower when the changelog reader falls behind (more negative dentries are generated).&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;&quot;strace&quot; on the reader threads reveals that FID stat() calls can take several seconds.&lt;br/&gt;
Writing 2 or 3 to /proc/sys/vm/drop_caches temporarily fixes the issue.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Reproducer&lt;/b&gt;&lt;br/&gt;
I was able to reproduce the issue with a &quot;dumb&quot; executable that generates a lot of &quot;negative entries&quot; by running parallel stats on &quot;&amp;lt;fs&amp;gt;/.lustre/fid/&amp;lt;non_existent_fid&amp;gt;&quot;.&lt;br/&gt;
The attached &lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/attachment/44815/44815_perf_fid_cont.svg&quot; title=&quot;perf_fid_cont.svg attached to LU-16047&quot;&gt;perf_fid_cont.svg&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.whamcloud.com/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt; is a flamegraph of the threads of the test process (fid_rand).&lt;/p&gt;
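A minimal reproducer along those lines could look like this (a sketch only: the mount point, the fake FID sequence, and the worker counts are invented; scale the counts up to hit the contention):

```shell
#!/bin/sh
# Sketch: create many negative dentries by stat-ing distinct,
# non-existent FIDs under .lustre/fid/ from parallel workers.
# MNT is an assumed mount point; the FIDs are fabricated so every
# lookup misses and leaves a negative dentry behind.
MNT=/mnt/lustre
WORKERS=4   # increase (e.g. to 100) to reproduce the i_mutex contention
# 100 distinct fake FIDs, 25 per worker, WORKERS workers in parallel
seq 1 100 | xargs -P "$WORKERS" -n 25 sh -c '
  for n in "$@"; do
    stat "'"$MNT"'/.lustre/fid/[0x200000400:0x$n:0x0]" 2>/dev/null
  done
  true' sh
echo "negative-dentry stat loop finished"
```

On an idle client each stat fails quickly; as the negative dentries accumulate, the per-lookup latency on the shared parent inode mutex is what the flamegraph captures.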

&lt;p&gt;Most of the fid_rand threads wait on the i_mutex of the &quot;.lustre/fid&quot; inode in:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-c&quot;&gt;
&lt;span class=&quot;code-keyword&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;code-keyword&quot;&gt;&lt;span class=&quot;code-object&quot;&gt;int&lt;/span&gt;&lt;/span&gt; lookup_slow(&lt;span class=&quot;code-keyword&quot;&gt;struct&lt;/span&gt; nameidata *nd, &lt;span class=&quot;code-keyword&quot;&gt;struct&lt;/span&gt; path *path)
{                                                              
        &lt;span class=&quot;code-keyword&quot;&gt;struct&lt;/span&gt; dentry *dentry, *parent;                        
        &lt;span class=&quot;code-keyword&quot;&gt;&lt;span class=&quot;code-object&quot;&gt;int&lt;/span&gt;&lt;/span&gt; err;                                               
                                                               
        parent = nd-&amp;gt;path.dentry;                              
        BUG_ON(nd-&amp;gt;inode != parent-&amp;gt;d_inode);                  
                                                               
        mutex_lock(&amp;amp;parent-&amp;gt;d_inode-&amp;gt;i_mutex);                            &amp;lt;--- contention here
        dentry = __lookup_hash(&amp;amp;nd-&amp;gt;last, parent, nd-&amp;gt;flags);  
        mutex_unlock(&amp;amp;parent-&amp;gt;d_inode-&amp;gt;i_mutex); 
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;b&gt;Workarounds&lt;/b&gt;&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;a crontab entry running &quot;echo 2 &amp;gt; /proc/sys/vm/drop_caches&quot;&lt;/li&gt;
	&lt;li&gt;set &quot;/proc/sys/fs/negative-dentry-limit&quot; on the 3.10.0-1160 kernel&lt;/li&gt;
&lt;/ul&gt;
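The first workaround can be expressed as a cron entry (hypothetical file path; requires root; note the units and existence of the sysctl in the second workaround depend on the kernel build):

```shell
# Hypothetical /etc/cron.d/lustre-drop-caches: drop dentry/inode
# caches every minute so negative dentries under .lustre/fid/ cannot
# accumulate (heavy-handed: it also evicts useful cached entries).
* * * * * root echo 2 > /proc/sys/vm/drop_caches
```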
</description>
                <environment>VMs + 2.12.8 + 3.10.0-1160.59.1&lt;br/&gt;
robinhood v3 + 2.12.8 + 3.10.0-1062</environment>
        <key id="71494">LU-16047</key>
            <summary>cache contention in &quot;.lustre/fid/</summary>
                <type id="4" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11310&amp;avatarType=issuetype">Improvement</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="eaujames">Etienne Aujames</assignee>
                                    <reporter username="eaujames">Etienne Aujames</reporter>
                        <labels>
                            <label>performance</label>
                            <label>robinhood</label>
                    </labels>
                <created>Mon, 25 Jul 2022 19:23:52 +0000</created>
                <updated>Wed, 27 Jul 2022 10:44:34 +0000</updated>
                                                                                <due></due>
                            <votes>0</votes>
                                    <watches>5</watches>
                                                                            <comments>
                            <comment id="341500" author="eaujames" created="Mon, 25 Jul 2022 20:33:29 +0000"  >&lt;p&gt;With:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@client ~]# cat /proc/sys/fs/dentry-state
8499096 8483804 45      0       8470127 0     
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And 100 threads doing stat on non-existent FIDs plus 20 threads doing stat on an existing FID, I get the following latencies:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@client ~]#  for i in {1..10}; do  time stat /media/lustrefs/client/.lustre/fid/[0x200000402:0x66:0x0] ; done |&amp;amp; grep real
real    0m0.333s
real    0m0.352s
real    0m0.370s
real    0m0.003s
real    0m0.311s
real    0m0.296s
real    0m0.172s
real    0m0.345s
real    0m0.383s
real    0m0.330s
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="341505" author="adilger" created="Mon, 25 Jul 2022 21:41:52 +0000"  >&lt;p&gt;Do you have any stats on how many negative dentries need to accumulate in the &lt;tt&gt;.lustre/fid&lt;/tt&gt; directory for this to become a problem, and how long it takes for that many negative dentries to accumulate?&#160; That would allow setting &lt;tt&gt;negative-dentry-limit&lt;/tt&gt; to a reasonable default value (e.g. via &lt;tt&gt;/usr/lib/sysctl.d/lustre.conf&lt;/tt&gt;) at startup.&lt;/p&gt;

&lt;p&gt;One possible fix is to not cache negative dentries for the &lt;tt&gt;.lustre/fid&lt;/tt&gt; directory at all, but this might increase loading on the MDS due to repeated negative FID lookup RPCs being sent to the MDS. That depends heavily on how often the same non-existent FID is being looked up multiple times, and unfortunately I have no idea whether that is common or not. It likely also depends heavily on whether the Changelog reader itself will discard repeated records for the same FID (essentially implementing its own negative FID cache in userspace).&lt;/p&gt;

&lt;p&gt;Alternately, having some kind of periodic purging of old negative dentries on this directory would be possible. Something like dropping negative dentries after 30s (tunable?) of inactivity seems like a reasonable starting point. I couldn&apos;t find the &lt;tt&gt;/proc/sys/fs/negative-dentry-limit&lt;/tt&gt; tunable on my RHEL8 server, &lt;a href=&quot;https://access.redhat.com/solutions/5777081&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;but it appears that the newer kernel handles negative dentries better and does not need this tunable&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;However, if the &lt;tt&gt;negative-dentry-limit&lt;/tt&gt; parameter is working reasonably well for el7.9 kernels, and el8.x kernels don&apos;t have a problem, then maybe there isn&apos;t a need for a Lustre-specific patch?&#160; I do recall a number of patches being sent to &lt;tt&gt;linux-fsdevel&lt;/tt&gt; related to limiting the negative dentry count, but I don&apos;t know if any of those patches landed.&#160; It seems highly unlikely that they were landed upstream in time for the el8.x kernel, but &lt;a href=&quot;https://lwn.net/Articles/814535/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;maybe one of those patches was backported to el8.x while they were still making up their minds about the upstream solution&lt;/a&gt;.&lt;/p&gt;</comment>
                            <comment id="341556" author="eaujames" created="Tue, 26 Jul 2022 07:56:42 +0000"  >&lt;p&gt;The robinhood node runs 3.10.0-1062; the negative-dentry-limit parameter does not exist in this kernel. So we run &quot;echo 2 &amp;gt; /proc/sys/vm/drop_caches&quot; every minute to keep the number of changelog records dequeued by robinhood at around 10k.&lt;/p&gt;

&lt;p&gt;robinhood already has a de-duplication mechanism for changelog records to limit the number of &quot;stat&quot; calls on the filesystem, so it does not need the negative dentry cache.&lt;br/&gt;
So far I have not been able to reproduce this with Lustre 2.15 (at the same scale, dentries are freed regularly), so maybe &lt;a href=&quot;https://review.whamcloud.com/39685/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/39685/&lt;/a&gt; (&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-13909&quot; title=&quot;release invalid dentries proactively on client&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-13909&quot;&gt;&lt;del&gt;LU-13909&lt;/del&gt;&lt;/a&gt; llite: prune invalid dentries) could help.&lt;/p&gt;

&lt;p&gt;The CEA will try to upgrade the robinhood nodes to a RHEL8 kernel to benefit from the cache improvements.&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                            <attachment id="44815" name="perf_fid_cont.svg" size="61401" author="eaujames" created="Mon, 25 Jul 2022 19:24:58 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i02vgf:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                </customfields>
    </item>
</channel>
</rss>