<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Fri Feb 09 23:54:17 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
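For example (a hedged sketch: the 'jira.issueviews:issue-xml' view path below follows JIRA's standard issue-XML URL pattern and is an assumption, not taken from this export):

```shell
# Sketch, assuming JIRA's standard issue-XML view path for this instance.
base="https://jira.whamcloud.com/si/jira.issueviews:issue-xml/LMR-3/LMR-3.xml"
# Append repeated 'field' parameters to restrict the returned fields
# to the issue key and summary:
echo "${base}?field=key&field=summary"
```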
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LMR-3] lhsm archive of more than 15k small files ends up in errors</title>
                <link>https://jira.whamcloud.com/browse/LMR-3</link>
                <project id="11910" key="LMR">Lemur</project>
                    <description>&lt;p&gt;1) I have created 30k 10KB files using dd. I have the lhsmd posix plugin running in debug mode, where I monitor the progress of the archival job.&lt;/p&gt;

&lt;p&gt;2) When I issue the command &quot;lhsm archive *.bin&quot; in the directory where the 30k files are located, I see ALERTS in the debug logs saying that some handlers were unable to find the files, although they exist. However, archival of other files by other handlers still proceeds.&lt;/p&gt;

&lt;p&gt;3) At the end of the archival, when I check the MDT, I see that not all 30k files were archived:&lt;br/&gt;
lctl get_param -n mdt.*.hsm.agents&lt;br/&gt;
uuid=f9ee32b4-d8fa-821d-e19c-9b0700d1e276 archive_id=ANY requests=&lt;span class=&quot;error&quot;&gt;&amp;#91;current:0 ok:4241 *errors:25759*&amp;#93;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;4) However, 15k files of the same size were all successfully archived; starting at 30k and up to 1M, archival ends up in errors, and our test case calls for successful archival of 1M files.&lt;/p&gt;

&lt;p&gt;lctl get_param -n mdt.*.hsm.agents --&amp;gt; successful 15k&lt;br/&gt;
uuid=7fb34125-b8fd-bbc4-1632-007ceaa3df78 archive_id=ANY requests=&lt;span class=&quot;error&quot;&gt;&amp;#91;current:0 ok:15000 errors:0&amp;#93;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;5) Attachments&lt;br/&gt;
 a) agent conf file&lt;br/&gt;
 b) lhsm posix conf file&lt;br/&gt;
 c) Alerts seen on the lemur archival logs&lt;br/&gt;
 d) Lemur rpms installed&lt;/p&gt;


&lt;p&gt;Please let us know if you need any more information.&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;</description>
                <environment>IEEL3.0 lustre client : lustre: 2.7.16.10&lt;br/&gt;
CentOS7.2  3.10.0-327.36.2.el7.x86_64&lt;br/&gt;
Lemur&lt;br/&gt;
Interconnect Intel Omnipath</environment>
        <key id="44002">LMR-3</key>
            <summary>lhsm archive of more than 15k small files ends up in errors</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="4">Incomplete</resolution>
                                        <assignee username="mjmac">Michael MacDonald</assignee>
                                    <reporter username="jyothi">Mangala Jyothi Bhaskar</reporter>
                        <labels>
                    </labels>
                <created>Tue, 21 Feb 2017 16:29:48 +0000</created>
                <updated>Thu, 8 Feb 2024 01:40:40 +0000</updated>
                            <resolved>Thu, 8 Feb 2024 01:40:40 +0000</resolved>
                                                                        <due></due>
                            <votes>0</votes>
                                    <watches>8</watches>
                                                                            <comments>
                            <comment id="185666" author="mjmac" created="Tue, 21 Feb 2017 17:11:23 +0000"  >&lt;p&gt;Hi, I&apos;ll take a look at this. At first glance, it seems that you&apos;ve given us everything we need to try to reproduce this, so thanks for that. I&apos;ll post updates or questions as warranted.&lt;/p&gt;</comment>
                            <comment id="185667" author="mjmac" created="Tue, 21 Feb 2017 17:13:10 +0000"  >&lt;p&gt;Oh, one question: Am I correct in assuming that you are also running IEEL 3.0 on your MDS?&lt;/p&gt;</comment>
                            <comment id="185668" author="jyothi" created="Tue, 21 Feb 2017 17:14:45 +0000"  >&lt;p&gt;Great! Thank you. Looking forward to it, and yes, IEEL3.0. To be more precise, IEEL3.0 for CentOS7.2, kernel &quot;3.10.0-327.el7_lustre.g993c615.x86_64&quot; and lustre &quot;2.7.15.3-3.10.0_327.el7&quot;&lt;/p&gt;

&lt;p&gt;Client has a slightly different kernel compared to the lustre servers. Looks like lustre versions are also different.&#160;&lt;/p&gt;

&lt;p&gt;Client ( like mentioned in description) has lustre&#160;2.7.16.10&#160;&lt;/p&gt;

&lt;p&gt;Server has lustre 2.7.15.3&lt;/p&gt;</comment>
                            <comment id="186004" author="mjmac" created="Thu, 23 Feb 2017 18:04:37 +0000"  >&lt;p&gt;Hi Jyothi.&lt;/p&gt;

&lt;p&gt;Sorry, it took me a bit to get to this. Rather than trying to mirror your environment exactly, I set up a Lustre 2.9.0 filesystem. With this configuration, I was unable to reproduce this problem. This leads me to suspect a problem in the version of lustre shipped in IEEL3.0 rather than in Lemur.&lt;/p&gt;

&lt;p&gt;I will have to defer to Lustre support on this, and I will get them looped in.&lt;/p&gt;

&lt;p&gt;For reference, here is what I did to test this out:&lt;/p&gt;

&lt;p&gt;Installed Lustre 2.9.0 from &lt;a href=&quot;https://build.hpdd.intel.com/job/lustre-b2_9/2/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://build.hpdd.intel.com/job/lustre-b2_9/2/&lt;/a&gt; on 4 nodes (MDS, MGS, OSS, client) as usual.&lt;/p&gt;

&lt;p&gt;On my client, I installed Lemur RPMs from &lt;a href=&quot;http://lemur-release.s3-website-us-east-1.amazonaws.com/release/0.5.2/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://lemur-release.s3-website-us-east-1.amazonaws.com/release/0.5.2/&lt;/a&gt; and configured it with settings similar to those attached to this ticket. I used a 2GB tmpfs as my archive root. I started the agent in debug mode with the following command:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;lhsmd -debug 2&amp;gt;&amp;amp;1 | tee /tmp/lhsmd.log

&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Next, I created 30k files using the following command:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;for i in $(seq 1 30000); do dd if=/dev/urandom of=$i.bin bs=10k count=1; done 

&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Next, I archived the files using:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt; lhsm archive *.bin

&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Finally, I verified the files&apos; state using:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;lhsm status *.bin | grep archived | wc -l

&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;... and the output was 30000 as expected.&lt;/p&gt;

&lt;p&gt;I did not observe any errors or other unusual log output in the agent log.&lt;/p&gt;</comment>
                            <comment id="186007" author="mjmac" created="Thu, 23 Feb 2017 18:14:22 +0000"  >&lt;p&gt;One other thing does occur to me actually. I notice that you appear to be using Lemur RPMs from a non-release build. The version is 0.5.1_2_g885da1d, which corresponds to &lt;a href=&quot;https://github.com/intel-hpdd/lemur/commit/885da1d4f93e4e8f09181812d0a211a3cd544a63&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://github.com/intel-hpdd/lemur/commit/885da1d4f93e4e8f09181812d0a211a3cd544a63&lt;/a&gt;. Did you build this locally or pull it from the devel section of our release site? While it shouldn&apos;t be a problem to build locally, I suggest that you try the 0.5.2 RPMs I used in my test. There are no lemur code changes between the version you&apos;re using and the 0.5.2 release (just some packaging housekeeping), but I suppose it&apos;s possible that there could be some difference in build environments which contribute to the problem.&lt;/p&gt;</comment>
                            <comment id="186145" author="jyothi" created="Fri, 24 Feb 2017 20:24:27 +0000"  >&lt;p&gt;Sorry for the delay in getting back. Did you say you tested it on Lustre 2.9.0? We use IEEL, and even with the latest IEEL release, which is IEEL 3.1.0.2, we have lustre rpms at version &lt;b&gt;2.7.19.8&lt;/b&gt;; for example, you will see this in a lustre rpm that was built for us: &quot;lustre-osd-ldiskfs-&lt;b&gt;2.7.19.8&lt;/b&gt;-3.10.0_514.el7_lustre.g0afcb1e.x86_64_g0afcb1e.x86_64.rpm&quot;. What release of IEEL would have 2.9.0?&lt;/p&gt;

&lt;p&gt;Yes, I built the rpms locally. It was about a month ago that I checked out the Lemur code from the git master branch and made my own rpms, since we have a specific kernel and lustre client to adhere to. On the client node where we run Lemur, we have a requirement to stick to kernel &quot;3.10.0-327.36.2.el7&quot; and lustre client &quot;lustre: 2.7.16.10&quot;. We can, however, upgrade the lustre servers to IEEL3.1.0.2, which would be lustre 2.7.19.8.&lt;/p&gt;

&lt;p&gt;So if I directly install the 0.5.2 RPMs from where you pointed, would I be able to run them on the above kernel and lustre client versions? If yes, that would be great, and I can check out the new lemur rpms, since I also had issues connecting to non-AWS S3 (on a separate note), and the link suggests there could be a fix for the S3 region covered in those rpms.&lt;/p&gt;</comment>
                            <comment id="186155" author="mjmac" created="Fri, 24 Feb 2017 21:11:34 +0000"  >&lt;p&gt;Hi. Yes, I tested it against 2.9.0. I wanted to see if the problem you reported still manifests in the most recent release of Lustre, which happens to be that community release. I do not know when IEEL will be rebased on 2.9.x.&lt;/p&gt;

&lt;p&gt;The Lemur RPMs we build should work with any version of Lustre released since 2.6.0. They are not tied to any particular kernel or Lustre release. We haven&apos;t tested specifically against your version of Lustre, but I&apos;m not aware of any reason that it wouldn&apos;t work just fine.&lt;/p&gt;

&lt;p&gt;If the problem repeats with the 0.5.2 RPMs we&apos;ve provided, then I think we&apos;ll have to dig into the Lustre side of things and get Lustre support involved. Out of curiosity, have you previously worked with the in-tree Lustre copytool (lhsmtool_posix)? I ask because it would be helpful to understand if Lemur is replacing an existing solution or if this is completely new.&lt;/p&gt;</comment>
                            <comment id="186317" author="jyothi" created="Mon, 27 Feb 2017 18:24:14 +0000"  >&lt;p&gt;At first look at the logs and behavior, do you mostly suspect lustre or the lemur plugin?&lt;/p&gt;

&lt;p&gt;Good to know the Lemur rpms are not tied to any kernel or lustre version. I will get the rpms from the link you pointed to and test against our current lustre, or one version later than that, which would be 2.7.19.8. If the problem persists, maybe we can take it up with lustre support.&lt;/p&gt;

&lt;p&gt;As far as I know, we have not worked extensively with the in-tree copytool either. We might have tested some HSM features years ago, so I would say this is completely new; we haven&apos;t released any copy tools as part of our solutions yet.&lt;/p&gt;</comment>
                            <comment id="186351" author="mjmac" created="Mon, 27 Feb 2017 21:39:10 +0000"  >&lt;p&gt;Hi.&lt;/p&gt;

&lt;p&gt;Well, as the co-developer of Lemur, my inclination is to say it must be the other software&apos;s fault. &lt;img class=&quot;emoticon&quot; src=&quot;https://jira.whamcloud.com/images/icons/emoticons/wink.png&quot; height=&quot;16&quot; width=&quot;16&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/p&gt;

&lt;p&gt;In all seriousness, though, I don&apos;t see anything in the logs you&apos;ve posted that indicates a problem in Lemur. This error seems to occur a lot:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;ALERT 2017/02/03 21:12:41 /root/rpmbuild/BUILD/lemur-0.5.1_2_g885da1d/src/github.com/intel-hpdd/lemur/cmd/lhsmd/agent/agent.go:161: handler-19: begin failed: no such file or directory: AI: 58956d9a ARCHIVE [0x200001ca2:0x14bc5:0x0] 0,EOF []
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Looking through our code, it appears that this error is coming from within liblustreapi rather than the lemur code. It may be helpful to look at the MDS logs as well to see if there are any relevant error messages appearing there. You&apos;re only using 1 MDS, correct? If you are using &amp;gt; 1 MDS, then it&apos;s possible that there is a problem with Lemur&apos;s support for that, but I know that was tested in the past.&lt;/p&gt;

&lt;p&gt;So, that was a longish way of saying that I suspect the version of Lustre in IEEL more than Lemur at this point.&lt;/p&gt;</comment>
                            <comment id="186363" author="jyothi" created="Mon, 27 Feb 2017 22:08:31 +0000"  >&lt;p&gt;I see. As of now I do not have MDS logs. I will have to reproduce this to get more relevant MDS logs, since this test was about 2-3 weeks ago. We are in the middle of a big benchmark, and I still don&apos;t have my hands on the resources to test this again.&lt;/p&gt;

&lt;p&gt;I am thinking that once I have access to the resources again, I will first use the Lemur rpms you used, then reproduce the issue and send some MDS logs your way.&lt;/p&gt;

&lt;p&gt;No, this is not a DNE setup. There is one MDT and one primary MDS at a time. However, we have a pair of MDS servers (configured in High Availability); still, we would have &lt;b&gt;one&lt;/b&gt; MDS primarily managing the MDT at any given point in time. Not sure if this is what you were asking about.&lt;/p&gt;

</comment>
                            <comment id="186365" author="mjmac" created="Mon, 27 Feb 2017 22:21:03 +0000"  >&lt;p&gt;Yes, I was asking if this was a DNE setup. As I was reading through the code surrounding the error message I referenced, I was looking for potential sources of the ENOENT error.&lt;/p&gt;

&lt;p&gt;As you are not running with DNE, and this same code works fine with Lustre 2.9.0, I am again led to suspect that the problem is with the version of Lustre in IEEL. I will see if I can get some resources together to reproduce it on our side, but I think it will be faster in your environment.&lt;/p&gt;</comment>
                            <comment id="189480" author="jyothi" created="Thu, 23 Mar 2017 18:40:30 +0000"  >&lt;p&gt;I re-ran the 30K test with the Lemur 0.6 rpms and I still see the issue.&lt;/p&gt;

&lt;p&gt;lctl get_param -n mdt.*.hsm.agents&lt;br/&gt;
 uuid=7035a1ec-a1bf-36e9-83f5-9847469a03ea archive_id=ANY requests=&lt;span class=&quot;error&quot;&gt;&amp;#91;current:0 ok:5760 errors:24240&amp;#93;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;It ended up not archiving about 24K files. The best case I have seen so far is 15k files, and all 15k are archived with 0 errors. Like you said, it could be the lustre version. Even the latest IEEL build we have has lustre version 2.7.19.8. Is there a way we can find out if this has been a known issue with HSM, or if any fixes have gone in since 2.7 (since you said you didn&apos;t see this in 2.9), or could it be something specific to the IEEL distribution? Could some kind of logs help? I have attached &quot;dmesg -T&quot; output from the MDS server while this archival job was running, as &quot;mdslog.txt&quot;; hope this helps.&lt;/p&gt;

&lt;p&gt;On the lemur side I see the same kind of error that I mentioned before: &quot;ALERT 2017/03/23 18:26:03 /tmp/rpmbuild/BUILD/lemur-0.6.0/src/github.com/intel-hpdd/lemur/cmd/lhsmd/agent/agent.go:161: handler-34: begin failed: no such file or directory: AI: 58ccb253 ARCHIVE &lt;span class=&quot;error&quot;&gt;&amp;#91;0x2000088d1:0x15a3:0x0&amp;#93;&lt;/span&gt; 0,EOF []&lt;br/&gt;
DEBUG 13:26:03.248862 agent.go:152: handler-34: incoming: AI: 58ccb254 ARCHIVE &lt;span class=&quot;error&quot;&gt;&amp;#91;0x2000088d1:0x15a4:0x0&amp;#93;&lt;/span&gt; 0,EOF []&quot; &#160;I remember you mentioned this is a common error.&lt;/p&gt;

&lt;p&gt;For now, at most I could try the latest IEEL3.1 servers and the 2.7.19.8 lustre version. If I still see the same issue, I might have to find exclusive hardware and set up a different lustre.&lt;/p&gt;</comment>
                            <comment id="189487" author="mjmac" created="Thu, 23 Mar 2017 19:09:13 +0000"  >&lt;p&gt;Hmm, that&apos;s strange. I&apos;ll check with the Lustre engineering folks to see if they have any insights. The error messages in your mds log sure look like a smoking gun, to me...&lt;/p&gt;</comment>
                            <comment id="189757" author="jyothi" created="Mon, 27 Mar 2017 16:03:12 +0000"  >&lt;p&gt;Michael, did you get a chance to discuss this with Lustre engineering?&#160;&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                            <attachment id="25525" name="30k_test2_fail.rtf" size="1978794" author="jyothi" created="Tue, 21 Feb 2017 16:30:26 +0000"/>
                            <attachment id="25524" name="agent.txt" size="839" author="jyothi" created="Tue, 21 Feb 2017 16:30:20 +0000"/>
                            <attachment id="25527" name="lemur_rpms.txt" size="177" author="jyothi" created="Tue, 21 Feb 2017 16:30:36 +0000"/>
                            <attachment id="25526" name="lhsmd posix conf.txt" size="604" author="jyothi" created="Tue, 21 Feb 2017 16:30:33 +0000"/>
                            <attachment id="25977" name="mdslog.txt" size="5378" author="jyothi" created="Thu, 23 Mar 2017 18:42:19 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzz4i7:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>