<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:13:03 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-7920] hsm coordinator request_count and max_requests not used consistently   </title>
                <link>https://jira.whamcloud.com/browse/LU-7920</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;I&apos;ve always wondered why sometimes max_requests has an effect and other times it appears to be completely ignored. &lt;/p&gt;

&lt;p&gt;The hsm coordinator receives new actions from user space in action lists, and there can be up to ~50 actions in a list. As the coordinator sends the actions (aka &quot;requests&quot;) to agents, it tracks how many are being processed in cdt-&amp;gt;cdt_request_count. There is also a tunable parameter, cdt-&amp;gt;cdt_max_requests, that is presumably intended to limit the number of requests sent to the agents. The count is compared against max_requests in the main cdt loop prior to processing each action list:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;	/* still room for work ? */
	if (atomic_read(&amp;amp;cdt-&amp;gt;cdt_request_count) ==
	    cdt-&amp;gt;cdt_max_requests)
		break;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Note it is checking for equality there.&lt;/p&gt;

&lt;p&gt;Since this check occurs prior to processing an action list, and there can be multiple requests per list, request_count can easily exceed max_requests by a wide margin. When that happens, the coordinator continues to send all available requests to the agents, which might actually be the right thing to do anyway. &lt;/p&gt;
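&lt;p&gt;As a hypothetical illustration (the numbers here are made up), suppose max_requests is 10 and a 50-action list arrives while 9 requests are in flight:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;	max_requests  = 10
	request_count = 9     /* 9 != 10, so the check passes */
	/* coordinator processes the entire 50-action list */
	request_count = 59    /* limit exceeded by 49 */
	/* next iteration: 59 != 10, the equality check never fires */
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;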

&lt;p&gt;If we really want a limit, then a simple workaround here is to change &quot;==&quot; to &quot;&amp;gt;=&quot;, but it seems wrong to have a single global limit regardless of the number of agents and archives. Ideally the agents should be able to maximize the throughput to each of the archives they are handling. I believe there have been some discussions on how best to do this, but I don&apos;t think we&apos;ve reached a consensus yet.  &lt;/p&gt;
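&lt;p&gt;For reference, a sketch of the workaround described above (only &quot;==&quot; changed to &quot;&amp;gt;=&quot;; the count can still overshoot within a single list, but the next iteration of the loop stops):&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;	/* still room for work ? */
	if (atomic_read(&amp;amp;cdt-&amp;gt;cdt_request_count) &amp;gt;=
	    cdt-&amp;gt;cdt_max_requests)
		break;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;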
</description>
                <environment></environment>
        <key id="35610">LU-7920</key>
            <summary>hsm coordinator request_count and max_requests not used consistently   </summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="utopiabound">Nathaniel Clark</assignee>
                                    <reporter username="rread">Robert Read</reporter>
                        <labels>
                            <label>cea</label>
                    </labels>
                <created>Fri, 25 Mar 2016 17:38:25 +0000</created>
                <updated>Fri, 24 Mar 2017 17:11:11 +0000</updated>
                            <resolved>Tue, 14 Jun 2016 15:16:37 +0000</resolved>
                                    <version>Lustre 2.8.0</version>
                                    <fixVersion>Lustre 2.9.0</fixVersion>
                                        <due></due>
                            <votes>1</votes>
                                    <watches>11</watches>
                                                                            <comments>
                            <comment id="147315" author="jcl" created="Wed, 30 Mar 2016 06:23:45 +0000"  >&lt;p&gt;We need to keep a global max request to avoid a storm of requests being sent to the agents (e.g. find . -exec hsm_archive ...). Today this is broken. You are right that a better limit would be a max_request per archive backend; we then assume any agent can serve the same number of requests. If we want a limit for each agent, it has to be defined by the agent at registration.&lt;/p&gt;</comment>
                            <comment id="147343" author="rread" created="Wed, 30 Mar 2016 15:38:19 +0000"  >&lt;p&gt;That is an interesting idea for the agent to set this at registration, as I would like to see the agent more involved. My preference would be to add flow control between agent and CDT so the rate can adapt to different situations. Perhaps it would be sufficient if the agent could adjust its per-archive limits at any time and not just at registration. This could also be used to shut down the agent cleanly by stopping new requests and allowing existing ones to drain. &lt;/p&gt;</comment>
                            <comment id="148150" author="jhammond" created="Thu, 7 Apr 2016 16:51:18 +0000"  >&lt;p&gt;Fr&#233;d&#233;rick,&lt;/p&gt;

&lt;p&gt;I believe that this is the issue that you described in your LUG talk. A patch to change == to &amp;gt;= is forthcoming.&lt;/p&gt;</comment>
                            <comment id="148159" author="gerrit" created="Thu, 7 Apr 2016 17:25:39 +0000"  >&lt;p&gt;Nathaniel Clark (nathaniel.l.clark@intel.com) uploaded a new patch: &lt;a href=&quot;http://review.whamcloud.com/19382&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/19382&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7920&quot; title=&quot;hsm coordinator request_count and max_requests not used consistently   &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7920&quot;&gt;&lt;del&gt;LU-7920&lt;/del&gt;&lt;/a&gt; hsm: Account for decreasing max request count&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: d6f83d318368a197dc68c0dad68add4e6857b712&lt;/p&gt;</comment>
                            <comment id="148168" author="fredlefebvre" created="Thu, 7 Apr 2016 18:05:58 +0000"  >&lt;p&gt;John,&lt;/p&gt;

&lt;p&gt;It sure does look like the same issue. I also agree with Jacques-Charles that we need to keep a global limit. It makes sense to be able to limit the number of parallel HSM operations a specific filesystem can safely handle. The HSM agent / copytool should handle the logic of figuring out how many parallel requests each client can handle. &lt;/p&gt;</comment>
                            <comment id="148280" author="rread" created="Fri, 8 Apr 2016 20:07:37 +0000"  >&lt;p&gt;I agree we need to limit the number of operations the movers can do, but this isn&apos;t the right place to do it. We need to separate the number of actions that have been submitted to the mover from the number of parallel operations allowed per client. The parallel IO limit should be implemented by the mover, and the coordinator should simply send as many requests to the mover as the movers ask for. &lt;/p&gt;

&lt;p&gt;Consider the use case of restoring data from a cold storage archive such as AWS Glacier. Typically it will take around 4 hours for an object to be retrieved from the archive and made available to download.  If a user needs to restore  many thousands (or millions?) of files, the mover should be able to submit as many as possible for retrieval all at once, and not be limited by the number of parallel operations the filesystem supports. Once the files are available to download, then the mover will limit how many are copied in parallel. &lt;/p&gt;

&lt;p&gt;Likewise, tape archive solutions such as DMF are able to manage the tape IO more efficiently if all requests are sent directly to DMF as quickly as possible.&lt;/p&gt;

</comment>
                            <comment id="149965" author="sthiell" created="Sun, 24 Apr 2016 04:17:49 +0000"  >&lt;p&gt;Indeed max_requests only works as expected when you start from scratch (no active_requests and all started copytool agents must be able to handle max_requests). Also, max_requests is broken when you try to decrease its value when active requests are running. In that particular case I&#8217;ve seen the same behavior Fr&#233;d&#233;rick reported at LUG. During my initial tests, I did often play with active_requests_timeout to clean things as a workaround. As it is clearly a defect, it would be nice to fix this in the Lustre 2.5 Intel version too. Please keep a global max_requests to limit system resources.&lt;/p&gt;

&lt;p&gt;I&#8217;m using Google as the Lustre/HSM backend for one of our filesystems, and like other cloud storage, I am constrained by service quotas like requests limit per period of time&#8230; During the initial archival process, my main issue was that the global to-the-cloud request rate, implied by all running copytools, depends on the size of the files being archived. Ideally I would like to be able to set a max_requests per period of time (and per archive_id) in a flexible way, like Robert described, done in user space by the copytool agent itself, with a more advanced CDT/agent protocol to control the global requests/x_secs of all running copytools. Google recommended to implement an exponential backoff error handling strategy in the copytool (&lt;a href=&quot;https://developers.google.com/drive/v2/web/handle-errors#exponential-backoff&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://developers.google.com/drive/v2/web/handle-errors#exponential-backoff&lt;/a&gt; ), and this is what I did. While I am still seeing wasteful network requests, it&#8217;s working just fine, so maybe it&#8217;s not worth the hassle after all.&lt;/p&gt;</comment>
                            <comment id="155616" author="gerrit" created="Tue, 14 Jun 2016 03:53:34 +0000"  >&lt;p&gt;Oleg Drokin (oleg.drokin@intel.com) merged in patch &lt;a href=&quot;http://review.whamcloud.com/19382/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/19382/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7920&quot; title=&quot;hsm coordinator request_count and max_requests not used consistently   &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7920&quot;&gt;&lt;del&gt;LU-7920&lt;/del&gt;&lt;/a&gt; hsm: Account for decreasing max request count&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 5bfc22a47debfd5a6103862424546c100b3ad94e&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10010">
                    <name>Duplicate</name>
                                                                <inwardlinks description="is duplicated by">
                                        <issuelink>
            <issuekey id="35961">LU-7995</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                                                <inwardlinks description="is related to">
                                                        </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzy5qf:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>