<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:19:11 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-8626] limit number of items in HSM action queue</title>
                <link>https://jira.whamcloud.com/browse/LU-8626</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Several presentations at RUG&apos;16 mentioned that Lustre has poor performance when there are very large numbers of HSM actions outstanding on the coordinator.&lt;/p&gt;

&lt;p&gt;Firstly, having a /proc file that exposes the number of entries currently in the HSM action list would allow RBH and monitoring scripts to easily monitor the number of entries.&lt;/p&gt;</description>
                <environment></environment>
        <key id="39907">LU-8626</key>
            <summary>limit number of items in HSM action queue</summary>
                <type id="4" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11310&amp;avatarType=issuetype">Improvement</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="4" iconUrl="https://jira.whamcloud.com/images/icons/statuses/reopened.png" description="This issue was once resolved, but the resolution was deemed incorrect. From here issues are either marked assigned or resolved.">Reopened</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="bougetq">Quentin Bouget</assignee>
                                    <reporter username="adilger">Andreas Dilger</reporter>
                        <labels>
                            <label>hsm</label>
                    </labels>
                <created>Mon, 19 Sep 2016 15:37:47 +0000</created>
                <updated>Wed, 4 Apr 2018 07:24:30 +0000</updated>
                                                                                <due></due>
                            <votes>2</votes>
                                    <watches>16</watches>
                                                                            <comments>
                            <comment id="166397" author="adilger" created="Mon, 19 Sep 2016 16:23:01 +0000"  >&lt;p&gt;I don&apos;t know enough about the details to write a good description of what needs to be done here.  There is a &lt;tt&gt;max_requests&lt;/tt&gt; tunable, but that appears to control the number of requests outstanding from the coordinator to a single copytool (&lt;tt&gt;cdt_max_requests&lt;/tt&gt;).&lt;/p&gt;

&lt;p&gt;Stephane, could you please add some info here about what variable(s) should be exposed via /proc, and what should be used to limit the size of the queue?  What error should be returned by the coordinator if the action queue size limit is exceeded, -EFBIG?&lt;/p&gt;</comment>
                            <comment id="166499" author="bfaccini" created="Tue, 20 Sep 2016 08:26:18 +0000"  >&lt;p&gt;Andreas, are you sure that &lt;span class=&quot;error&quot;&gt;&amp;#91;cdt_&amp;#93;&lt;/span&gt;max_requests only concerns requests from the coordinator to a single copytool? I thought (and my reading of the related code seems to confirm) that it is used to account for and limit the whole set of &quot;active&quot; requests being handled by a CDT, regardless of the Agent(s) concerned. &lt;/p&gt;

&lt;p&gt;About the performance of accessing the hsm_actions LLOG content when there is a huge backlog, I think some work has already been done, and not only in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7988&quot; title=&quot;HSM: high lock contention for cdt_llog_lock&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7988&quot;&gt;&lt;del&gt;LU-7988&lt;/del&gt;&lt;/a&gt;. I will try to gather this information and add it to this ticket.&lt;/p&gt;

&lt;p&gt;Also, as a first step, would it be acceptable to implement a simple limit on the maximum number of requests (active or not) handled by a single CDT, and simply return an error (-EFBIG?) when it is reached?&lt;/p&gt;</comment>
                            <comment id="166683" author="adilger" created="Wed, 21 Sep 2016 08:43:12 +0000"  >&lt;p&gt;Hopefully St&#233;phane or Patrick can answer your question here.  I was just trying to record the issue raised during RUG.  It appears this is related to &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7988&quot; title=&quot;HSM: high lock contention for cdt_llog_lock&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7988&quot;&gt;&lt;del&gt;LU-7988&lt;/del&gt;&lt;/a&gt;, which is also caused by having a large number of items in the action list.&lt;/p&gt;</comment>
                            <comment id="166701" author="sthiell" created="Wed, 21 Sep 2016 13:27:54 +0000"  >&lt;p&gt;max_requests is the maximum number of active requests at the same time per coordinator, so it has nothing to do with the HSM action queue. By the way, the Lustre documentation for max_requests is correct.&lt;/p&gt;

&lt;p&gt;actions is used to dump the action queue&lt;br/&gt;
(Note: it is also accessible through &quot;lctl get_param mdt.lustre-MDT0000.hsm.actions&quot;)&lt;/p&gt;

&lt;p&gt;Typical entries in actions look like:&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;lrh=[type=10680000 len=136 idx=342/13652] fid=[0x200034e87:0x8700:0x0] dfid=[0x200034e87:0x8700:0x0] compound/cookie=0x57d8a6d9/0x57d89f3e action=ARCHIVE archive#=1 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=STARTED data=[]
lrh=[type=10680000 len=136 idx=342/13655] fid=[0x20002a3ec:0xf909:0x0] dfid=[0x20002a3ec:0xf909:0x0] compound/cookie=0x57d8a6dc/0x57d89f41 action=ARCHIVE archive#=1 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=SUCCEED data=[]
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;To retrieve statistics from this file to put them into Graphite/Grafana, I use the following script that extracts the number of lines by grouping them by &quot;action&quot; and also &quot;status&quot;:&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;prefix=&lt;span class=&quot;code-quote&quot;&gt;&quot;$CARBON_PREFIX.$MDT.hsm.actions&quot;&lt;/span&gt;

lctl get_param mdt.$MDT.hsm.actions | awk -v prefix=$prefix &lt;span class=&quot;code-quote&quot;&gt;&apos;BEGIN { now = systime() } / status=/ { action=gensub(/action=(.+)/, &lt;span class=&quot;code-quote&quot;&gt;&quot;\\1&quot;&lt;/span&gt;, &lt;span class=&quot;code-quote&quot;&gt;&quot;g&quot;&lt;/span&gt;, $7); status=gensub(/status=(.+)/, &lt;span class=&quot;code-quote&quot;&gt;&quot;\\1&quot;&lt;/span&gt;, &lt;span class=&quot;code-quote&quot;&gt;&quot;g&quot;&lt;/span&gt;, $(NF-1)); arr[action&lt;span class=&quot;code-quote&quot;&gt;&quot;.&quot;&lt;/span&gt;status] += 1   } END { &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; (s in arr) { printf &lt;span class=&quot;code-quote&quot;&gt;&quot;%s.%s %s %d\n&quot;&lt;/span&gt;, prefix, s, arr[s], now} }&apos;&lt;/span&gt;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;So I get lines that are valid for Graphite that look like:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;srcc.sherlock.lustre.mdt.regal-MDT0000.hsm.actions.ARCHIVE.SUCCEED 3 1474463707
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
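
&lt;p&gt;A minimal Python sketch of the same grouping, assuming the line format shown in the sample entries above. The names count_actions and graphite_lines and the prefix value are hypothetical, and the fields are matched by key name rather than by position:&lt;/p&gt;

```python
import re
from collections import Counter

# The two sample entries shown earlier in this comment.
SAMPLE = """\
lrh=[type=10680000 len=136 idx=342/13652] fid=[0x200034e87:0x8700:0x0] dfid=[0x200034e87:0x8700:0x0] compound/cookie=0x57d8a6d9/0x57d89f3e action=ARCHIVE archive#=1 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=STARTED data=[]
lrh=[type=10680000 len=136 idx=342/13655] fid=[0x20002a3ec:0xf909:0x0] dfid=[0x20002a3ec:0xf909:0x0] compound/cookie=0x57d8a6dc/0x57d89f41 action=ARCHIVE archive#=1 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=SUCCEED data=[]
"""

ACTION_RE = re.compile(r"\baction=(\S+)")
STATUS_RE = re.compile(r"\bstatus=(\S+)")


def count_actions(text):
    """Count entries per (action, status) pair; lines without both keys are skipped."""
    counts = Counter()
    for line in text.splitlines():
        action = ACTION_RE.search(line)
        status = STATUS_RE.search(line)
        if action and status:
            counts[(action.group(1), status.group(1))] += 1
    return counts


def graphite_lines(counts, prefix, now):
    """Format counts as Graphite plaintext lines: path, value, timestamp."""
    return [f"{prefix}.{a}.{s} {n} {now}" for (a, s), n in sorted(counts.items())]


if __name__ == "__main__":
    # The prefix and timestamp are placeholders for illustration only.
    for out in graphite_lines(count_actions(SAMPLE), "example.hsm.actions", 1474463707):
        print(out)
```

&lt;p&gt;This still reads the whole actions file, so it does not avoid the cost of dumping a very large queue.&lt;/p&gt;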

&lt;p&gt;I am sure you do something similar for IML as you have a graph for HSM, but it is also broken when too many hsm actions are present.&lt;/p&gt;

&lt;p&gt;The problem is that when the actions file reaches hundreds of thousands of entries, reading it takes so long that it can no longer be parsed in a timely manner.&lt;/p&gt;

&lt;p&gt;I think having counters per &quot;action&quot; and &quot;status&quot; could be very useful, perhaps something like:&lt;/p&gt;

&lt;p&gt;actions_stats:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;action=archive status=STARTED count=23
action=archive status=SUCCEED count=1234
action=restore status=STARTED count=0
action=restore status=SUCCEED count=1
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
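
&lt;p&gt;To illustrate the idea, here is a rough Python sketch of counters that are updated on each state change instead of being recomputed by parsing the whole queue. The ActionStats class and its methods are hypothetical and are not existing Lustre code:&lt;/p&gt;

```python
from collections import Counter


class ActionStats:
    """Hypothetical per-(action, status) counters kept up to date incrementally."""

    def __init__(self):
        self.counts = Counter()

    def transition(self, action, old_status, new_status):
        # Move one request from old_status to new_status.
        # old_status is None when the request is newly queued.
        if old_status is not None:
            self.counts[(action, old_status)] -= 1
        self.counts[(action, new_status)] += 1

    def render(self):
        # One line per counter; reading this costs O(number of counters),
        # not O(number of queued actions).
        return "\n".join(
            f"action={a} status={s} count={n}"
            for (a, s), n in sorted(self.counts.items())
        )


if __name__ == "__main__":
    stats = ActionStats()
    stats.transition("archive", None, "STARTED")
    stats.transition("archive", "STARTED", "SUCCEED")
    stats.transition("restore", None, "STARTED")
    print(stats.render())
```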

&lt;p&gt;As for limiting the size of the queue, a tunable like max_actions that returns -EFBIG sounds good, but there might still be a problem if restore actions are triggered when the action queue is full. Maybe the best option would be a max_actions per action type (archive, restore, remove)?&lt;/p&gt;</comment>
                            <comment id="166707" author="adilger" created="Wed, 21 Sep 2016 13:53:47 +0000"  >&lt;p&gt;I think we should stick with a single value per file, since this is required when moving stats into /sys/fs/lustre, so something like &lt;tt&gt;action_archive_started_count&lt;/tt&gt;, &lt;tt&gt;action_archive_succeed_count&lt;/tt&gt;, &lt;tt&gt;action_restore_started_count&lt;/tt&gt;, &lt;tt&gt;action_restore_succeed_count&lt;/tt&gt;.&lt;/p&gt;</comment>
                            <comment id="206230" author="gerrit" created="Thu, 24 Aug 2017 11:06:03 +0000"  >&lt;p&gt;Quentin Bouget (quentin.bouget@cea.fr) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/28677&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/28677&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-8626&quot; title=&quot;limit number of items in HSM action queue&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-8626&quot;&gt;LU-8626&lt;/a&gt; hsm: count the number of started requests of each type&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 4e84ba8aa669554b2d1b77459ebe79770aa4ad37&lt;/p&gt;</comment>
                            <comment id="215120" author="gerrit" created="Fri, 1 Dec 2017 12:13:17 +0000"  >&lt;p&gt;Quentin Bouget (quentin.bouget@cea.fr) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/30336&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/30336&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-8626&quot; title=&quot;limit number of items in HSM action queue&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-8626&quot;&gt;LU-8626&lt;/a&gt; hsm: expose the number of active hsm requests per type&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 294a0e4ff77fb09ac643b5e15af224027ded4aee&lt;/p&gt;</comment>
                            <comment id="216500" author="gerrit" created="Sun, 17 Dec 2017 06:18:48 +0000"  >&lt;p&gt;Oleg Drokin (oleg.drokin@intel.com) merged in patch &lt;a href=&quot;https://review.whamcloud.com/28677/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/28677/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-8626&quot; title=&quot;limit number of items in HSM action queue&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-8626&quot;&gt;LU-8626&lt;/a&gt; hsm: count the number of started requests of each type&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 973759d1ff3bbcb217754bd9942fdf670dec2d96&lt;/p&gt;</comment>
                            <comment id="217428" author="gerrit" created="Thu, 4 Jan 2018 02:48:22 +0000"  >&lt;p&gt;Oleg Drokin (oleg.drokin@intel.com) merged in patch &lt;a href=&quot;https://review.whamcloud.com/30336/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/30336/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-8626&quot; title=&quot;limit number of items in HSM action queue&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-8626&quot;&gt;LU-8626&lt;/a&gt; hsm: expose the number of active hsm requests per type&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 42e40555f250b83730d233dc5e22fd1f9396ccfe&lt;/p&gt;</comment>
                            <comment id="217473" author="pjones" created="Thu, 4 Jan 2018 13:30:30 +0000"  >&lt;p&gt;Landed for 2.11&lt;/p&gt;</comment>
                            <comment id="217476" author="bougetq" created="Thu, 4 Jan 2018 13:36:39 +0000"  >&lt;p&gt;Hi Peter,&lt;/p&gt;

&lt;p&gt;I am not sure this issue should be marked as resolved yet.&lt;br/&gt;
The patches that landed only provide information about how many requests the coordinator is currently handling; there is no built-in limit yet.&lt;/p&gt;</comment>
                            <comment id="217477" author="pjones" created="Thu, 4 Jan 2018 13:41:05 +0000"  >&lt;p&gt;ok&lt;/p&gt;</comment>
                            <comment id="225010" author="tl-cea" created="Tue, 3 Apr 2018 11:00:48 +0000"  >&lt;p&gt;About already landed &lt;a href=&quot;https://review.whamcloud.com/30336/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/30336/&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;Given that the implemented counters refer to the contents of the &quot;active_requests&quot; list, they should rather be named &quot;active_archive_count&quot;, &quot;active_restore_count&quot;, etc. instead of &quot;archive_count&quot;, etc., to be more explicit and avoid any confusion with the contents of hsm/actions, which contains all requested actions.&lt;/p&gt;

&lt;p&gt;This change should be done before releasing 2.11 to avoid changing names in /proc later after the feature is released.&lt;/p&gt;</comment>
                            <comment id="225086" author="adilger" created="Wed, 4 Apr 2018 07:24:30 +0000"  >&lt;p&gt;2.11 has already been released. &lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="35819">LU-7988</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzyown:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                </customfields>
    </item>
</channel>
</rss>