<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:49:05 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-5164] Limit lu_object cache (ZFS and osd-zfs)</title>
                <link>https://jira.whamcloud.com/browse/LU-5164</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;For OSDs like ZFS to perform optimally it&apos;s important that they be allowed to manage their own cache.  This maximizes the likelihood that the ARC will prefetch and cache the right buffers.  In the existing ZFS OSD code a cached LU object pins buffers in the ARC, preventing them from being dropped.  As the LU cache grows it can consume the entire ARC, preventing buffers for other objects, such as the OIs, from being cached and severely impacting the performance of FID lookups.&lt;/p&gt;

&lt;p&gt;The proposed patch addresses this by limiting the size of the lu_cache, but alternate approaches are welcome.  We are carrying this patch in LLNL&apos;s tree and it helps considerably.&lt;/p&gt;</description>
                <environment></environment>
        <key id="25071">LU-5164</key>
            <summary>Limit lu_object cache (ZFS and osd-zfs)</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="utopiabound">Nathaniel Clark</assignee>
                                    <reporter username="behlendorf">Brian Behlendorf</reporter>
                        <labels>
                            <label>patch</label>
                            <label>performance</label>
                            <label>server</label>
                            <label>zfs</label>
                    </labels>
                <created>Mon, 9 Jun 2014 17:24:30 +0000</created>
                <updated>Sat, 11 Oct 2014 04:42:07 +0000</updated>
                            <resolved>Wed, 18 Jun 2014 13:08:24 +0000</resolved>
                                    <version>Lustre 2.6.0</version>
                    <version>Lustre 2.4.2</version>
                    <version>Lustre 2.5.3</version>
                                    <fixVersion>Lustre 2.6.0</fixVersion>
                    <fixVersion>Lustre 2.5.4</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>10</watches>
                <comments>
                            <comment id="86125" author="behlendorf" created="Mon, 9 Jun 2014 17:33:03 +0000"  >&lt;p&gt;&lt;a href=&quot;http://review.whamcloud.com/#/c/10237/1&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/10237/1&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="86126" author="pjones" created="Mon, 9 Jun 2014 17:39:30 +0000"  >&lt;p&gt;Hi Nathaniel&lt;/p&gt;

&lt;p&gt;Could you please review this patch?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="86763" author="isaac" created="Tue, 17 Jun 2014 00:29:26 +0000"  >&lt;p&gt;Hi Brian, do you have some data to share about &quot;it does help considerably&quot;?&lt;/p&gt;</comment>
                            <comment id="86827" author="behlendorf" created="Tue, 17 Jun 2014 17:50:08 +0000"  >&lt;p&gt;Unfortunately, I&apos;ve been swamped and haven&apos;t been able to collect any actual before and after test results.  However, without this patch we would clearly see virtually all the ARC buffers which back the OI ZAPs get forced out of the ARC cache.  That meant at least one physical IO for every lookup.  With the patch the active sections of the OIs now stay cached in the ARC and we see a much better hit rate, which has got to help performance considerably, but I just haven&apos;t collected the data.  I think this patch really should go into 2.6; we&apos;re running with it in our tree and have seen no issues.&lt;/p&gt;</comment>
                            <comment id="86909" author="pjones" created="Wed, 18 Jun 2014 13:08:24 +0000"  >&lt;p&gt;Landed for 2.6&lt;/p&gt;</comment>
                            <comment id="87404" author="bzzz" created="Tue, 24 Jun 2014 19:21:47 +0000"  >&lt;p&gt;we discussed this yet another time on the call today and it seems I missed the important thing in the original patch. I don&apos;t think making LU cache tiny is a good idea - it means we&apos;ll have to lookup in OI very often and initialize SA handler very often. I do understand the original reason and that we want more flexibility for ARC, but I&apos;d think even many thousands of objects in LU won&apos;t make it worse at all, rather better - because we don&apos;t need to lookup in OI and do expensive SA initialization nearly every RPC.&lt;/p&gt;</comment>
                            <comment id="87437" author="behlendorf" created="Tue, 24 Jun 2014 23:32:06 +0000"  >&lt;p&gt;It sounds like we&apos;re going to need to run some benchmarks to get a handle on the real performance implications of this.  When the cache is very small (effectively zero) you have concerns about the cost of initializing an SA for nearly every RPC.  Conversely, when the LU cache is allowed to grow to fill memory it forces all the OI ZAP blocks out of the ARC, meaning nearly every FID lookup must go to disk.  There&apos;s perhaps some reasonable middle ground we can settle on for the short term based on the benchmark results.&lt;/p&gt;

&lt;p&gt;Longer term we should think about how to restructure the OSD to avoid both of these problems.  The POSIX layer avoids this issue by only keeping a hold on the dnode for the duration of the relevant system call.  Arguably the OSD should be doing something analogous and only holding the dnode for the length of the RPC.&lt;/p&gt;</comment>
                            <comment id="87471" author="bzzz" created="Wed, 25 Jun 2014 07:06:09 +0000"  >&lt;p&gt;yes, I think we need some golden middle here. it&apos;s not just SA, it&apos;s also OI lookups themselves. they can&apos;t be free, especially when OI is huge.&lt;/p&gt;

&lt;p&gt;as for ZFS/posix, I&apos;m not sure I agree - dnodes are cached and you don&apos;t need to go through metadnode, initialize dnode structures again and again. it&apos;s just that ARC &lt;b&gt;knows&lt;/b&gt; how to deal with payload properly?&lt;/p&gt;</comment>
                            <comment id="87746" author="behlendorf" created="Sat, 28 Jun 2014 00:05:44 +0000"  >&lt;p&gt;&amp;gt; they can&apos;t be free, especially when OI is huge.&lt;/p&gt;

&lt;p&gt;Right, they can&apos;t be free.  And we knew adding this layer of indirection would cost us a lookup.  But we can strive to make them as cheap as possible.  In fact, if the configuration contains a respectable number of OSTs (&amp;gt;10) it does become very reasonable to cache the entire OI.&lt;/p&gt;

&lt;p&gt;&amp;gt; dnodes are cached and you don&apos;t need to go through metadnode&lt;/p&gt;

&lt;p&gt;It seems to me that the Lustre LU object cache is directly analogous to the VFS inode cache.  A lu_object in the cache should be able to behave just like an inode/znode.  That means a few things.&lt;/p&gt;

&lt;p&gt;1) The number of objects in the cache should be allowed to grow and will be pruned under memory pressure.&lt;br/&gt;
2) Each object in the cache can have a long-lived shared SA handle (znodes do).&lt;br/&gt;
3) Each cached object may only reference its associated dnode by object number.&lt;br/&gt;
4) All holds for a dnode must be dropped before returning from the system call or RPC.&lt;/p&gt;

&lt;p&gt;Correct me if I&apos;m wrong, but it looks to me like the Lustre code does 1) and 2) today.  If we update the OSD so the lu_object only references its dnode by object number when needed, then I don&apos;t think we&apos;d need to impose any artificial limits on the cache.  The key bit is that a cached but inactive object must not have any outstanding holds.  This would allow the ARC to evict whatever buffers it needed to, regardless of what lu_objects are cached.  This is exactly how the POSIX layer works.&lt;/p&gt;</comment>
                            <comment id="87769" author="bzzz" created="Mon, 30 Jun 2014 07:07:05 +0000"  >&lt;p&gt;actually I do have a patch which doesn&apos;t pin the dnode&apos;s dbuf, but I&apos;m still concerned about SA overhead. in contrast with POSIX we have to modify many objects every operation (parent, child, last_rcvd, logs). IMHO, ideally we shouldn&apos;t pin the SA for rarely used objects, but for frequently accessed ones (like logs, last_rcvd, shared directories) it&apos;d be better to have the SA ready. this is why I agree it&apos;s probably better to limit the LU cache - frequently accessed objects are there and cheap to use.&lt;/p&gt;

&lt;p&gt;also, notice the VFS does pin the inode with the dentry. literally meaning once you have resolved a path to a specific dentry you have the inode found. for sure this isn&apos;t free - the dentry pins an amount of data and the MM algorithms have to deal with this.&lt;/p&gt;

&lt;p&gt;having said that, I&apos;m fine with experimenting with the approach of holding neither the dbuf nor the SA handle.&lt;/p&gt;</comment>
                            <comment id="88056" author="behlendorf" created="Wed, 2 Jul 2014 23:46:26 +0000"  >&lt;p&gt;&amp;gt; IMHO, ideally we shouldn&apos;t pin SA for rarely used objects, but for frequently accessed ones&lt;/p&gt;

&lt;p&gt;That sounds reasonable to me.  Do we have an easy way to tell the difference between frequently accessed objects which should keep their SA cached and rarely accessed objects where it&apos;s less critical?  I don&apos;t want to cache more than we have to.&lt;/p&gt;

&lt;p&gt;&amp;gt; also, notice VFS does pin inode with dentry.&lt;/p&gt;

&lt;p&gt;Sure, but the MM system has code to deal with this.  The dentry cache is always pruned before the inode cache which ensures some number of inodes can always be freed.&lt;/p&gt;</comment>
                            <comment id="88245" author="bzzz" created="Mon, 7 Jul 2014 10:33:22 +0000"  >&lt;p&gt;&amp;gt; That sounds reasonable to me. Do we have an easy way to tell the difference between frequently accessed objects which should keep their SA cached and rarely accessed objects where it&apos;s less critical? I don&apos;t want to cache more than we have to.&lt;/p&gt;

&lt;p&gt;lu_object_put() calls -&amp;gt;loo_object_release() when the last reference to the object is gone, but this is not what we need, I guess:&lt;br/&gt;
this won&apos;t work for a client accessing a directory exclusively, as every time an RPC completes we&apos;ll be getting -&amp;gt;loo_object_release()&lt;br/&gt;
while a few cycles later we get another RPC to the same directory.&lt;/p&gt;

&lt;p&gt;we could introduce yet another method, probably, to release resources from the objects at the tail of the LRU. but this is yet more&lt;br/&gt;
complexity in the algorithm and additional overhead. this is why I like the idea of limiting the cache. but the limit I had in mind was in the millions&lt;br/&gt;
(so the memory footprint isn&apos;t enormous), rather than literally a few objects.&lt;/p&gt;

&lt;p&gt;&amp;gt; Sure, but the MM system has code to deal with this. The dentry cache is always pruned before the inode cache which ensures some number of inodes can always be freed.&lt;/p&gt;

&lt;p&gt;well, we do register lu_cache_shrink(), which is the way the MM recycles memory? very similar, if not the same?&lt;/p&gt;</comment>
                            <comment id="89803" author="isaac" created="Tue, 22 Jul 2014 23:15:04 +0000"  >&lt;p&gt;My knowledge of the Lustre server stack is very limited, so I&apos;m not sure whether it&apos;s feasible or not. But here are my thoughts:&lt;/p&gt;

&lt;p&gt;1. Get rid of the LRU completely. Objects are freed once the last reference is dropped. Then it&apos;d be equivalent to the ZPL way of holding on to DMU objects/buffers only for the duration of system calls. This also gives the ARC the freedom to decide which buffers to keep or evict. After all, the ARC is supposed to do a better job than a simple LRU.&lt;/p&gt;

&lt;p&gt;2. When osd-zfs has the knowledge that certain objects are frequently used or will be used soon, hold references to those objects proactively. For example:&lt;/p&gt;
&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;If last_rcvd is used for most RPCs, hold a ref for the lifetime of the MDS kernel module.&lt;/li&gt;
	&lt;li&gt;When a RPC is queued, do some preprocessing, look at the objects that will be needed, and look them up in the lu_site cache:
	&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
		&lt;li&gt;If it&apos;s already there, add a ref to it so that it stays in the cache.&lt;/li&gt;
		&lt;li&gt;If it&apos;s not there already, we may do nothing if cache size is near a threshold, or load the object into the cache aggressively.&lt;/li&gt;
	&lt;/ul&gt;
	&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;This way the ARC has the freedom it needs, and osd-zfs also contributes when it knows better what to cache. It should be able to handle the case Alex outlined where a client accesses a directory exclusively, because the queued RPCs will keep objects used by the current RPC in the cache.&lt;/p&gt;</comment>
                            <comment id="89804" author="isaac" created="Tue, 22 Jul 2014 23:19:36 +0000"  >&lt;blockquote&gt;&lt;p&gt;lu_object_put() calls -&amp;gt;loo_object_release() when the last reference to the object is gone, but this is not what we need, I guess: this won&apos;t work for a client accessing a directory exclusively, as every time an RPC completes we&apos;ll be getting -&amp;gt;loo_object_release() while a few cycles later we get another RPC to the same directory.&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;In this case, if the ARC is doing a decent job, the buffers should still be cached and no IO will be needed to create the object again. Is there any expensive operation we want to avoid when creating an object from ARC buffers (i.e. no disk IO)?&lt;/p&gt;</comment>
                            <comment id="89814" author="bzzz" created="Wed, 23 Jul 2014 02:24:32 +0000"  >&lt;p&gt;&amp;gt; Is there any expensive operation we want to avoid when creating an object from ARC buffers (i.e. no disk IO)?&lt;/p&gt;

&lt;p&gt;OI lookup, SA initialization.&lt;/p&gt;</comment>
                            <comment id="94576" author="yujian" created="Sat, 20 Sep 2014 06:54:03 +0000"  >&lt;p&gt;Here is the back-ported patch for Lustre b2_5 branch: &lt;a href=&quot;http://review.whamcloud.com/12001&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/12001&lt;/a&gt;&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="25256">LU-5240</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                    <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzwnzj:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>14237</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>