<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:16:05 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-1376] ldlm_poold noise on clients significantly reduces application performance</title>
                <link>https://jira.whamcloud.com/browse/LU-1376</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Our users found that their application was scaling very poorly on our &quot;zin&quot; cluster.  It is a Sandy Bridge cluster, 16 cores per node, roughly 3000 nodes.  At relatively low node counts (512 nodes), they found that their performance on zin now that it is on the secure network is 1/4 of what it was when zin was on the open network.&lt;/p&gt;

&lt;p&gt;One of the few differences is that zin now talks to 3000+ OSTs on the secure network, whereas it only talked to a few hundred OSTs while it was being shaken down on the open network.  One of our engineers noted that the ldlm_poold was frequently using 0.3% of CPU time on zin.&lt;/p&gt;

&lt;p&gt;The application in question is HIGHLY sensitive to system daemons and other CPU noise on the compute nodes because it is highly MPI-coordinated.  I created the attached patch (ldlm_poold_period.patch) that allows me to change the sleep interval used by the ldlm_poold.  Sure enough, if I change the sleep time to 300 seconds, the application&apos;s performance immediately improves by 4X.&lt;/p&gt;

&lt;p&gt;The ldlm_poold walking a list of 3000+ namespaces every second and doing nothing most of the time (because client namespaces are only actually &quot;recalculated&quot; every 10s) is a very bad design.  The patch was just to determine if that was really the cause.&lt;/p&gt;

&lt;p&gt;I will now work on a real fix.&lt;/p&gt;

&lt;p&gt;I think instead of making the ldlm_poold&apos;s sleep time configurable, I will make both the LDLM_POOL_SRV_DEF_RECALC_PERIOD and LDLM_POOL_CLI_DEF_RECALC_PERIOD tunables.  Then I will have the ldlm_poold dynamically sleep based on the next period in the list of namespaces...although I probably don&apos;t want each namespace to have its own starting time.&lt;/p&gt;
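&lt;p&gt;To make the idea concrete, a rough sketch (the function name and the flat array of per-namespace deadlines are illustrative, not the actual Lustre data structures): the daemon sleeps until the earliest pending recalc instead of waking every second.&lt;/p&gt;

```c
/* Rough sketch of a dynamic sleep interval for ldlm_poold.  The
 * next_recalc[] array of absolute per-namespace recalc deadlines
 * (in seconds) and the function name are illustrative stand-ins,
 * not the real Lustre structures. */
long ldlm_poold_next_wakeup(const long *next_recalc, int nr_ns, long now)
{
        long earliest = now + 300;      /* cap the sleep at 300s */
        int i;

        /* find the earliest deadline among all namespaces */
        for (i = nr_ns - 1; i >= 0; i--)
                if (earliest > next_recalc[i])
                        earliest = next_recalc[i];

        /* a deadline already in the past means wake up immediately */
        if (now > earliest)
                earliest = now;

        return earliest;
}
```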

&lt;p&gt;I probably want to keep the server and client namespace periods in sync with the namespaces of the same type, and then perhaps order the list as well to avoid walking the entire list unnecessarily.&lt;/p&gt;

&lt;p&gt;No work needed by Whamcloud right now, except perhaps to comment on my approach if you think there is something that I should be doing differently (or if there is already work in this area that I haven&apos;t found).&lt;/p&gt;</description>
                <environment>&lt;a href=&quot;https://github.com/chaos/lustre/commits/2.1.1-10chaos&quot;&gt;https://github.com/chaos/lustre/commits/2.1.1-10chaos&lt;/a&gt;, 3000+ OSTs across 4 filesystems</environment>
        <key id="14274">LU-1376</key>
            <summary>ldlm_poold noise on clients significantly reduces application performance</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="green">Oleg Drokin</assignee>
                                    <reporter username="morrone">Christopher Morrone</reporter>
                        <labels>
                            <label>jitter</label>
                            <label>llnl</label>
                    </labels>
                <created>Fri, 4 May 2012 17:25:20 +0000</created>
                <updated>Fri, 8 Nov 2019 02:53:28 +0000</updated>
                                            <version>Lustre 2.3.0</version>
                    <version>Lustre 2.1.2</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>10</watches>
                                                                            <comments>
                            <comment id="38185" author="adilger" created="Fri, 4 May 2012 18:13:30 +0000"  >&lt;p&gt;Fujitsu implemented a patch similar to this for the K computer FEFS, which has 10000 OSTs in total.  AFAIK, they would walk only a fraction of the namespaces with each wakeup to reduce the amount of ongoing jitter introduced by the ldlm pool thread.&lt;/p&gt;

&lt;p&gt;It may be that Oleg has a version of their 1.8 patch that could be used as a starting point for your 2.x patch, since I don&apos;t think this code has changed dramatically.&lt;/p&gt;</comment>
                            <comment id="38189" author="morrone" created="Fri, 4 May 2012 19:38:00 +0000"  >&lt;p&gt;Yes, I had compared the 2.1 code to 1.8 to see if this was a regression, and I agree that the changes do not look very significant.  Word from the users is that this bug has probably been around for quite some time, but folks finally had the time and inclination to begin investigating.&lt;/p&gt;

&lt;p&gt;I would love to see Fujitsu&apos;s patch.&lt;/p&gt;</comment>
                            <comment id="38222" author="pjones" created="Mon, 7 May 2012 09:27:17 +0000"  >&lt;p&gt;Oleg&lt;/p&gt;

&lt;p&gt;Could you please comment?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="38685" author="morrone" created="Fri, 11 May 2012 20:38:11 +0000"  >&lt;p&gt;I took some time to look at the code for this today.  The more I read, the worse things look.&lt;/p&gt;

&lt;p&gt;First of all, I think the use of these globals makes the code less readable than it could be:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;extern cfs_atomic_t ldlm_srv_namespace_nr;
extern cfs_atomic_t ldlm_cli_namespace_nr;
extern cfs_semaphore_t ldlm_srv_namespace_lock;
extern cfs_list_t ldlm_srv_namespace_list;
extern cfs_semaphore_t ldlm_cli_namespace_lock;
extern cfs_list_t ldlm_cli_namespace_list;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;If we just made a struct with the three variables, and then had two global variables of that type, I think we could reduce the number of special accessor functions to one.&lt;/p&gt;
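&lt;p&gt;Something like this (a sketch only; the struct and accessor names are mine, and the typedefs merely stand in for the real cfs_* kernel types):&lt;/p&gt;

```c
/* Stand-in typedefs so the sketch is self-contained; the real code
 * would use the existing cfs_* kernel types. */
typedef int cfs_atomic_t;
typedef int cfs_semaphore_t;
typedef struct { void *next, *prev; } cfs_list_t;
typedef enum { LDLM_NAMESPACE_SERVER, LDLM_NAMESPACE_CLIENT } ldlm_side_t;

/* One struct gathers the three per-side globals. */
struct ldlm_ns_head {
        cfs_atomic_t    nr;     /* was ldlm_{srv,cli}_namespace_nr */
        cfs_semaphore_t lock;   /* was ldlm_{srv,cli}_namespace_lock */
        cfs_list_t      list;   /* was ldlm_{srv,cli}_namespace_list */
};

/* Two instances, one per side. */
struct ldlm_ns_head ldlm_ns_heads[2];

/* A single accessor replaces the per-side helper functions. */
struct ldlm_ns_head *ldlm_ns_head_get(ldlm_side_t side)
{
        return ldlm_ns_heads + (side == LDLM_NAMESPACE_CLIENT ? 1 : 0);
}
```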

&lt;p&gt;Next, having a thread regularly poll all of the namespaces seems like a poor approach to handling SLV on the client side.  I would think that only when a changed SLV comes in over the network would we want to put the involved pool on a work list and signal a sleeping thread to handle it.&lt;/p&gt;

&lt;p&gt;Also, each pool tracks its own last recalculation time in pl_recalc_time, and has its own pl_recalc_period.  Why?  I see no reason at all for these.  If not for the time it takes to walk the namespace list, these recalc times would stay exactly in sync.  So why not just have a single value that can actually be adjusted?&lt;/p&gt;

&lt;p&gt;On the client side, we wake every second and shuffle the entire client list of namespaces.  9 times out of 10 there will be nothing to do because the pl_recalc_period is 10 seconds in each ns.  That is pretty silly.&lt;/p&gt;

&lt;p&gt;We could perhaps order the list based on time remaining until the next recalc is needed...but that doesn&apos;t seem to be worthwhile.&lt;/p&gt;

&lt;p&gt;I can perhaps see walking only a limited number of namespaces each time, as Andreas mentioned above.  This means that each namespace will only have its pool recalculated relatively rarely.  So then the question is whether it is better to have frequent small bits of noise, or rarer but larger amounts of noise.&lt;/p&gt;

&lt;p&gt;In either case though, we make SLV less effective as a tool to reduce server lock load.  Of course, SLV seems to currently be a pretty ineffective way of reducing lock usage when the server is under memory pressure (&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-1128&quot; title=&quot;Complete investigation of the LDLM pool shrinker and SLV handling&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-1128&quot;&gt;&lt;del&gt;LU-1128&lt;/del&gt;&lt;/a&gt;), so maybe I just shouldn&apos;t worry about that for now.&lt;/p&gt;

&lt;p&gt;I think I will just make the ldlm_poold&apos;s polling interval configurable.  I will also probably synchronize the polling intervals with the wall clock, because it is much better for all of the nodes to see the noise at the same time.&lt;/p&gt;
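&lt;p&gt;The wall-clock synchronization amounts to rounding each wakeup up to the next multiple of the polling interval, e.g. (sketch only; the function name is illustrative):&lt;/p&gt;

```c
/* Sketch of wall-clock-aligned wakeups: every node rounds up to the
 * next multiple of the interval, so the noise lands at the same
 * absolute time cluster-wide instead of drifting per node. */
long ldlm_poold_aligned_wakeup(long now, long interval)
{
        return (now / interval + 1) * interval;
}
```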

&lt;p&gt;But I think we need to come up with a plan for overhauling this in a future version.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="17790">LU-2924</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="11300" name="ldlm_poold_period.patch" size="1264" author="morrone" created="Fri, 4 May 2012 17:25:20 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10490" key="com.atlassian.jira.plugin.system.customfieldtypes:datepicker">
                        <customfieldname>End date</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>Fri, 27 Jun 2014 17:25:20 +0000</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                            <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzv3an:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>4033</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                        <customfield id="customfield_10493" key="com.atlassian.jira.plugin.system.customfieldtypes:datepicker">
                        <customfieldname>Start date</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>Fri, 4 May 2012 17:25:20 +0000</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                    </customfields>
    </item>
</channel>
</rss>