<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:38:48 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-4004] CLONE - flow control of HSM requests</title>
                <link>https://jira.whamcloud.com/browse/LU-4004</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;In a stress test I did today, I created 40K files and archived them with 2 clients. The requests were queued on the MDT successfully, but this caused other problems.&lt;/p&gt;

&lt;p&gt;the first problem is the lprocfs implementation of agent_action. The symptom is:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@mds01 ~]# lctl get_param mdt.*.hsm.agent_actions
error: get_param: read(&apos;/proc/fs/lustre/mdt/hsm-MDT0000/hsm/agent_actions&apos;) failed: Cannot allocate memory
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Though I haven&apos;t looked into it yet, I think the root cause is that the llog is too long, so reading it runs into a problem.&lt;/p&gt;

&lt;p&gt;I think the more severe problem is flow control. It&apos;s not good to keep requests queued for so long; at the least, we should have a parameter to control the maximum length of the queue.&lt;/p&gt;

&lt;p&gt;Another problem I saw in the test:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;LustreError: 27319:0:(mdt_coordinator.c:1418:mdt_hsm_update_request_state()) hsm-MDT0000: Cannot find running request for cookie 0x5226bb27 on fid=[0x200000400:0xee5:0x0]
LustreError: 27319:0:(mdt_coordinator.c:1418:mdt_hsm_update_request_state()) Skipped 74 previous similar messages
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;There were a huge number of these warnings. I will dig into it tomorrow.&lt;/p&gt;</description>
                <environment></environment>
        <key id="21109">LU-4004</key>
            <summary>CLONE - flow control of HSM requests</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="6" iconUrl="https://jira.whamcloud.com/images/icons/statuses/closed.png" description="The issue is considered finished, the resolution is correct. Issues which are closed can be reopened.">Closed</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="2">Won&apos;t Fix</resolution>
                                        <assignee username="jay">Jinshan Xiong</assignee>
                                    <reporter username="jay">Jinshan Xiong</reporter>
                        <labels>
                            <label>HSM</label>
                    </labels>
                <created>Tue, 24 Sep 2013 20:43:25 +0000</created>
                <updated>Wed, 17 Sep 2014 08:51:25 +0000</updated>
                            <resolved>Wed, 13 Nov 2013 16:07:08 +0000</resolved>
                                                                        <due></due>
                            <votes>0</votes>
                                    <watches>5</watches>
                                                                            <comments>
                            <comment id="67457" author="jlevi" created="Tue, 24 Sep 2013 20:44:26 +0000"  >&lt;p&gt;Jinshan Xiong added a comment - 09/Sep/13 6:07 PM - edited&lt;br/&gt;
patch is at: &lt;a href=&quot;http://review.whamcloud.com/7589&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/7589&lt;/a&gt;&lt;br/&gt;
This just fixes the ENOMEM problem. More work will be needed to add flow control.&lt;br/&gt;
John Hammond added a comment - 09/Sep/13 6:09 PM&lt;br/&gt;
From the autotest logs I have also seen this file return -EIO, causing sanity-hsm test 40 to pass when it should have failed. Does anyone have any idea why it might do so?&lt;br/&gt;
Jinshan Xiong added a comment - 18/Sep/13 9:34 AM&lt;br/&gt;
In 2.5, we&apos;re only going to fix the problem of dumping a huge number of agent_actions. The real flow control will be fixed in 2.6 due to limited resources.&lt;/p&gt;</comment>
                            <comment id="67458" author="jlevi" created="Tue, 24 Sep 2013 20:44:54 +0000"  >&lt;p&gt;This ticket is being created for the follow on work to be completed in the 2.6 Release.&lt;/p&gt;</comment>
                            <comment id="71429" author="jlevi" created="Wed, 13 Nov 2013 16:07:08 +0000"  >&lt;p&gt;Determined that no follow on work is needed.&lt;/p&gt;</comment>
                            <comment id="83328" author="jamesanunez" created="Tue, 6 May 2014 17:30:04 +0000"  >&lt;p&gt;During scale testing on IEEL build #26, I see the second problem reported above in the MDT dmesg:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;LustreError: 23286:0:(mdt_coordinator.c:1465:mdt_hsm_update_request_state()) scratch-MDT0000: Cannot find running request for cookie 0x53682fe2 on fid=[0x200000405:0x1fc8:0x0]
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;On the client node running the test, I see the following many times on the console:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Cannot send HSM request (use of /lustre/scratch/tdir/e.00456): Operation not permitted
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The script being run has 4 processes, each writing a file and computing its checksum; the file is then archived and released. At the same time, three other processes check whether files are released and, if so, restore each file and verify its MD5 checksum. This is running on the OpenSFS cluster with one Lustre file system storing the data, another Lustre file system for the archive, and a single agent.&lt;/p&gt;

&lt;p&gt;I don&apos;t have enough information to say which operation was not permitted, but it was either hsm_state, hsm_archive or hsm_release.&lt;/p&gt;

&lt;p&gt;Should I open a new ticket for this?&lt;/p&gt;</comment>
                            <comment id="94224" author="mkukat" created="Wed, 17 Sep 2014 08:51:25 +0000"  >&lt;p&gt;I didn&apos;t find a new ticket for this, and this one looks very related to a problem we saw yesterday. I hope I&apos;m not wrong here.&lt;br/&gt;
There still seems to be a flow control issue with HSM requests, and I have something that seems to reliably reproduce it:&lt;/p&gt;

&lt;p&gt;Environment: 1 MDS, 2 OSSs, 2 clients, RHEL 6.5, Lustre 2.6.0, all VMs hosted on KVM.&lt;br/&gt;
Boot the machines and start the Lustre services (MDS, OSS1, OSS2, mount on both clients).&lt;/p&gt;

&lt;p&gt;Start copytool on client 2:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;lhsmtool_posix -A 3 -A 4 -p /tmp /lustre
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Start test script on client 1:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;arch=3
rm -f /lustre/mytest 
echo thisfile &amp;gt;/lustre/mytest
lfs hsm_archive -a $arch /lustre/mytest
sleep 3
lfs hsm_state /lustre/mytest
sleep 3
lfs hsm_release /lustre/mytest
sleep 3
lfs hsm_state /lustre/mytest
sleep 3
cat /etc/passwd &amp;gt;&amp;gt;/lustre/mytest
sleep 3
lfs hsm_state /lustre/mytest
sleep 3
lfs hsm_archive -a $arch /lustre/mytest
sleep 3
lfs hsm_state /lustre/mytest
sleep 5
lfs hsm_state /lustre/mytest
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Run the test script; its output:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;/lustre/mytest: (0x00000009) exists archived, archive_id:3
/lustre/mytest: (0x0000000d) released exists archived, archive_id:3
/lustre/mytest: (0x0000000b) exists dirty archived, archive_id:3
/lustre/mytest: (0x0000000b) exists dirty archived, archive_id:3
/lustre/mytest: (0x0000000b) exists dirty archived, archive_id:3
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;while watching the MDS syslog:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Sep 17 10:39:06 lustre-mds1 kernel: LustreError: 1396:0:(mdt_coordinator.c:1463:mdt_hsm_update_request_state()) lsttest-MDT0000: Cannot find running request for cookie 0x54194821 on fid=[0x200003ab0:0x1:0x0]
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;and the copytool:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;lhsmtool_posix[1230]: action=0 src=(null) dst=(null) mount_point=/lustre
lhsmtool_posix[1230]: waiting for message from kernel
lhsmtool_posix[1230]: copytool fs=lsttest archive#=3 item_count=1
lhsmtool_posix[1230]: waiting for message from kernel
lhsmtool_posix[1231]: &apos;[0x200003ab0:0x1:0x0]&apos; action ARCHIVE reclen 72, cookie=0x5419481f
lhsmtool_posix[1231]: processing file &apos;mytest&apos;
lhsmtool_posix[1231]: archiving &apos;/lustre/.lustre/fid/0x200003ab0:0x1:0x0&apos; to &apos;/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0_tmp&apos;
lhsmtool_posix[1231]: saving stripe info of &apos;/lustre/.lustre/fid/0x200003ab0:0x1:0x0&apos; in /tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0_tmp.lov
lhsmtool_posix[1231]: going to copy data from &apos;/lustre/.lustre/fid/0x200003ab0:0x1:0x0&apos; to &apos;/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0_tmp&apos;
lhsmtool_posix[1231]: data archiving for &apos;/lustre/.lustre/fid/0x200003ab0:0x1:0x0&apos; to &apos;/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0_tmp&apos; done
lhsmtool_posix[1231]: attr file for &apos;/lustre/.lustre/fid/0x200003ab0:0x1:0x0&apos; saved to archive &apos;/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0_tmp&apos;
lhsmtool_posix[1231]: fsetxattr of &apos;trusted.hsm&apos; on &apos;/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0_tmp&apos; rc=0 (Success)
lhsmtool_posix[1231]: fsetxattr of &apos;trusted.link&apos; on &apos;/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0_tmp&apos; rc=0 (Success)
lhsmtool_posix[1231]: fsetxattr of &apos;trusted.lov&apos; on &apos;/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0_tmp&apos; rc=0 (Success)
lhsmtool_posix[1231]: fsetxattr of &apos;trusted.lma&apos; on &apos;/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0_tmp&apos; rc=0 (Success)
lhsmtool_posix[1231]: fsetxattr of &apos;lustre.lov&apos; on &apos;/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0_tmp&apos; rc=-1 (Operation not supported)
lhsmtool_posix[1231]: xattr file for &apos;/lustre/.lustre/fid/0x200003ab0:0x1:0x0&apos; saved to archive &apos;/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0_tmp&apos;
lhsmtool_posix[1231]: symlink &apos;/tmp/shadow/mytest&apos; to &apos;../0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0&apos; done
lhsmtool_posix[1231]: Action completed, notifying coordinator cookie=0x5419481f, FID=[0x200003ab0:0x1:0x0], hp_flags=0 err=0
lhsmtool_posix[1231]: llapi_hsm_action_end() on &apos;/lustre/.lustre/fid/0x200003ab0:0x1:0x0&apos; ok (rc=0)
lhsmtool_posix[1230]: copytool fs=lsttest archive#=3 item_count=1
lhsmtool_posix[1230]: waiting for message from kernel
lhsmtool_posix[1232]: &apos;[0x200003ab0:0x1:0x0]&apos; action RESTORE reclen 72, cookie=0x54194820
lhsmtool_posix[1232]: processing file &apos;mytest&apos;
lhsmtool_posix[1232]: reading stripe rules from &apos;/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0.lov&apos; for &apos;/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0&apos;
lhsmtool_posix[1232]: restoring data from &apos;/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0&apos; to &apos;{VOLATILE}=[0x200003ab1:0x1:0x0]&apos;
lhsmtool_posix[1232]: going to copy data from &apos;/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0&apos; to &apos;{VOLATILE}=[0x200003ab1:0x1:0x0]&apos;
lhsmtool_posix[1232]: data restore from &apos;/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0&apos; to &apos;{VOLATILE}=[0x200003ab1:0x1:0x0]&apos; done
lhsmtool_posix[1232]: Action completed, notifying coordinator cookie=0x54194820, FID=[0x200003ab0:0x1:0x0], hp_flags=0 err=0
lhsmtool_posix[1232]: llapi_hsm_action_end() on &apos;/lustre/.lustre/fid/0x200003ab0:0x1:0x0&apos; ok (rc=0)
lhsmtool_posix[1230]: copytool fs=lsttest archive#=3 item_count=1
lhsmtool_posix[1230]: waiting for message from kernel
lhsmtool_posix[1233]: &apos;[0x200003ab0:0x1:0x0]&apos; action ARCHIVE reclen 72, cookie=0x54194821
lhsmtool_posix[1233]: processing file &apos;mytest&apos;
lhsmtool_posix[1233]: archiving &apos;/lustre/.lustre/fid/0x200003ab0:0x1:0x0&apos; to &apos;/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0_tmp&apos;
lhsmtool_posix[1233]: saving stripe info of &apos;/lustre/.lustre/fid/0x200003ab0:0x1:0x0&apos; in /tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0_tmp.lov
lhsmtool_posix[1233]: going to copy data from &apos;/lustre/.lustre/fid/0x200003ab0:0x1:0x0&apos; to &apos;/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0_tmp&apos;
lhsmtool_posix[1233]: progress ioctl for copy &apos;/lustre/.lustre/fid/0x200003ab0:0x1:0x0&apos;-&amp;gt;&apos;/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0_tmp&apos; failed: No such file or directory (2)
lhsmtool_posix[1233]: data copy failed from &apos;/lustre/.lustre/fid/0x200003ab0:0x1:0x0&apos; to &apos;/tmp/0001/0000/3ab0/0000/0002/0000/0x200003ab0:0x1:0x0_tmp&apos;: No such file or directory (2)
lhsmtool_posix[1233]: Action completed, notifying coordinator cookie=0x54194821, FID=[0x200003ab0:0x1:0x0], hp_flags=0 err=2
lhsmtool_posix[1233]: llapi_hsm_action_end() on &apos;/lustre/.lustre/fid/0x200003ab0:0x1:0x0&apos; failed: No such file or directory (2)
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Immediately after the test, mdt/*/hsm/actions shows the following:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;lrh=[type=10680000 len=136 idx=11/3] fid=[0x200003ab0:0x1:0x0] dfid=[0x200003ab0:0x1:0x0] compound/cookie=0x5419481f/0x5419481f action=ARCHIVE archive#=3 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=SUCCEED data=[]
lrh=[type=10680000 len=136 idx=11/6] fid=[0x200003ab0:0x1:0x0] dfid=[0x200003ab0:0x1:0x0] compound/cookie=0x54194820/0x54194820 action=RESTORE archive#=3 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=SUCCEED data=[]
lrh=[type=10680000 len=136 idx=11/9] fid=[0x200003ab0:0x1:0x0] dfid=[0x200003ab0:0x1:0x0] compound/cookie=0x54194821/0x54194821 action=ARCHIVE archive#=3 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=FAILED data=[]
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;A second, later retry of the archive request works fine, so there are ways to work around this issue, but it would still be nice if archiving a changed file worked on the first try.&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                    <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzw3u7:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>10716</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>