<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 03:04:37 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-13834] HSM requests lost after retries (NoRetryAction disabled)</title>
                <link>https://jira.whamcloud.com/browse/LU-13834</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;A bit of setup first: our HPSS lhsm agents have a configuration that tells them to accept up to x concurrent restores, y archives, and so on.&lt;/p&gt;

&lt;p&gt;On the other hand, the number of requests the server can send out is greater than either of these limits (because there is no reason to make restores wait just because the servers are busy archiving). However, there is no knob at the coordinator level to set a maximum per type of operation, so the maximum number of concurrent requests is bigger than what the agents can handle for any single type of operation.&lt;/p&gt;
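
&lt;p&gt;For reference, the only coordinator-level cap I am aware of is the global one (parameter name from memory, so take with a grain of salt):&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# global cap on concurrent HSM requests the coordinator will dispatch
lctl get_param mdt.*.hsm.max_requests
# there is no per-action-type equivalent, hence the mismatch described above
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;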

&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;Back in 2.10, we noticed that disabling NoRetryAction was buggy: the request was dropped when the agent asked to try again, but the coordinator would keep the lock on the file, which was pretty horrible... So we kept the setting at its default, and when the coordinator sends such a request the agents refuse it and the request is simply dropped. In 2.10, restore requests (e.g. a client read) would just keep retrying, our own HSM helpers also retried, and archives would simply be retried later, so all was fine.&lt;/p&gt;

&lt;p&gt;Upon upgrading to 2.12, users frequently complained of seeing &quot;no data available&quot; when reading released files. We noticed that, apparently, in 2.12, if all agents are busy and a request is refused, the client behaviour changed from a transparent retry to returning the error to userspace, and user codes aren&apos;t ready to handle that (despite our efforts to tell them to use our helper...)&lt;/p&gt;

&lt;p&gt;This led us to re-enable the retry (disabling NoRetryAction): after an audit we were convinced that the problem we had in 2.10 is no longer present in 2.12 (that&apos;s why we never opened a ticket for it back then; that issue IS fixed).&lt;/p&gt;
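
&lt;p&gt;(For context, the toggle in question is the coordinator policy flag; this is roughly what we ran, commands from memory:)&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# check the current coordinator policy flags
lctl get_param mdt.*.hsm.policy
# disable NoRetryAction so that refused requests are retried by the coordinator
lctl set_param mdt.*.hsm.policy=-NoRetryAction
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;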

&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;Now I am seeing some retries happening, but we still experience some troubles:&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;after a few retries, the request is stuck as &quot;STARTED&quot; in &lt;tt&gt;lctl get_param mdt.*.hsm.actions&lt;/tt&gt;:
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;lrh=[type=10680000 len=144 idx=163/35484] fid=[0xc0002e5a8:0x3ed:0x0] dfid=[0xc0002e5a8:0x3ed:0x0] compound/cookie=0x0/0x5efd18e3 action=ARCHIVE archive#=1 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=6 status=STARTED data=[636F733D32]
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;ul&gt;
	&lt;li&gt;I see no such request on the lhsm agents; looking at the logs, all requests have been bounced as busy, for example:
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;2020/07/30 03:51:24 lhsmd_hpss[26995]: Request 0x5efd18e3: action=ARCHIVE, len=78, fid=0xc0002e5a8:0x3ed:0x0 dfid=0xc0002e5a8:0x3ed:0x0
2020/07/30 03:51:24 lhsmd_hpss[26995]: Too many simultaneous &apos;archive&apos; operations (8), telling coordinator to retry later
--
2020/07/30 03:51:25 lhsmd_hpss[26995]: Request 0x5efd18e3: action=ARCHIVE, len=78, fid=0xc0002e5a8:0x3ed:0x0 dfid=0xc0002e5a8:0x3ed:0x0
2020/07/30 03:51:25 lhsmd_hpss[26995]: Too many simultaneous &apos;archive&apos; operations (8), telling coordinator to retry later
--
2020/07/30 03:51:26 lhsmd_hpss[26995]: Request 0x5efd18e3: action=ARCHIVE, len=78, fid=0xc0002e5a8:0x3ed:0x0 dfid=0xc0002e5a8:0x3ed:0x0
2020/07/30 03:51:26 lhsmd_hpss[26995]: Too many simultaneous &apos;archive&apos; operations (8), telling coordinator to retry later
--
2020/07/30 03:51:27 lhsmd_hpss[26995]: Request 0x5efd18e3: action=ARCHIVE, len=78, fid=0xc0002e5a8:0x3ed:0x0 dfid=0xc0002e5a8:0x3ed:0x0
2020/07/30 03:51:27 lhsmd_hpss[26995]: Too many simultaneous &apos;archive&apos; operations (8), telling coordinator to retry later
--
2020/07/30 03:51:28 lhsmd_hpss[26995]: Request 0x5efd18e3: action=ARCHIVE, len=78, fid=0xc0002e5a8:0x3ed:0x0 dfid=0xc0002e5a8:0x3ed:0x0
2020/07/30 03:51:28 lhsmd_hpss[26995]: Too many simultaneous &apos;archive&apos; operations (8), telling coordinator to retry later
--
2020/07/30 03:51:29 lhsmd_hpss[26995]: Request 0x5efd18e3: action=ARCHIVE, len=78, fid=0xc0002e5a8:0x3ed:0x0 dfid=0xc0002e5a8:0x3ed:0x0
2020/07/30 03:51:29 lhsmd_hpss[26995]: Too many simultaneous &apos;archive&apos; operations (8), telling coordinator to retry later
--
2020/07/30 03:51:30 lhsmd_hpss[26995]: Request 0x5efd18e3: action=ARCHIVE, len=78, fid=0xc0002e5a8:0x3ed:0x0 dfid=0xc0002e5a8:0x3ed:0x0
2020/07/30 03:51:30 lhsmd_hpss[26995]: Too many simultaneous &apos;archive&apos; operations (8), telling coordinator to retry later
--
2020/07/30 03:51:31 lhsmd_hpss[26995]: Request 0x5efd18e3: action=ARCHIVE, len=78, fid=0xc0002e5a8:0x3ed:0x0 dfid=0xc0002e5a8:0x3ed:0x0
2020/07/30 03:51:31 lhsmd_hpss[26995]: Too many simultaneous &apos;archive&apos; operations (8), telling coordinator to retry later
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&#160;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;ul&gt;
	&lt;li&gt;Also, the &lt;tt&gt;cdt_request_list&lt;/tt&gt; is completely empty on the server:
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;crash&amp;gt; struct coordinator 0xffff8875b86b3a38
# coherent data
crash&amp;gt; struct -o coordinator | grep cdt_request_list
 [0x1a0] struct list_head cdt_request_list;
crash&amp;gt; p/x 0xffff8875b86b3a38 + 0x1a0
 $2 = 0xffff8875b86b3bd8
crash&amp;gt; list -H 0xffff8875b86b3bd8
 (empty)&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;So there are no actions on the server, yet the action list is full of seemingly started actions?&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;ul&gt;
	&lt;li&gt;After the request timeout, the requests that were started do get CANCELLED; so there must be some reference somewhere that I&apos;m missing? Or does the coordinator get it back from the catalog at some point? I&apos;m not sure...&lt;/li&gt;
	&lt;li&gt;While the request is STARTED there, any further lfs hsm_archive/hsm_restore issued for the file is ignored completely. This is particularly annoying for restores, as a read will then be stuck for however long the HSM timeout is (8 hours for us), and then probably fail with &quot;no data available&quot;? (A process I had left stuck there did eventually terminate, but I didn&apos;t get to see how.)&lt;/li&gt;
&lt;/ul&gt;
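
&lt;p&gt;(The timeout I am referring to is the active request timeout; something like this, parameter name from memory:)&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# how long a STARTED request may stay active before being cancelled, in seconds
lctl get_param mdt.*.hsm.active_request_timeout
# we have it set to 8 hours
lctl set_param mdt.*.hsm.active_request_timeout=28800
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;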


&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;I&apos;ve just enabled HSM debug logs on the MDS and will provide more info if I find something.&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;</description>
                <environment></environment>
        <key id="60189">LU-13834</key>
            <summary>HSM requests lost after retries (NoRetryAction disabled)</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="martinetd">Dominique Martinet</assignee>
                                    <reporter username="martinetd">Dominique Martinet</reporter>
                        <labels>
                            <label>CEA</label>
                    </labels>
                <created>Thu, 30 Jul 2020 10:28:03 +0000</created>
                <updated>Thu, 30 Jul 2020 13:11:55 +0000</updated>
                                            <version>Lustre 2.12.4</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>3</watches>
                                                                            <comments>
                            <comment id="276378" author="pjones" created="Thu, 30 Jul 2020 13:01:01 +0000"  >&lt;p&gt;Dominique&lt;/p&gt;

&lt;p&gt;Is this something that you plan to investigate?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="276380" author="martinetd" created="Thu, 30 Jul 2020 13:10:45 +0000"  >&lt;p&gt;Hi Peter,&lt;/p&gt;

&lt;p&gt;I will at least look at the dk log tomorrow and report on that, but not sure I will have time to look further.&lt;/p&gt;

&lt;p&gt;FYI I am leaving CEA next week (!!), so don&apos;t expect too much !&lt;/p&gt;

&lt;p&gt;Dominique&lt;/p&gt;</comment>
                            <comment id="276381" author="pjones" created="Thu, 30 Jul 2020 13:11:55 +0000"  >&lt;p&gt;Ok. All the best in your future endeavours!&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i016j3:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>