<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:33:50 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-10302] hsm: obscure bug with multi-mountpoints and ldlm</title>
                <link>https://jira.whamcloud.com/browse/LU-10302</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;I do not have much to share except the attached reproducer.&lt;/p&gt;

&lt;p&gt;The key elements of the reproducer &lt;b&gt;seem&lt;/b&gt; to be:&lt;/p&gt;
&lt;ol&gt;
	&lt;li&gt;set up Lustre with two mountpoints;&lt;/li&gt;
	&lt;li&gt;create a file;&lt;/li&gt;
	&lt;li&gt;launch a copytool &lt;b&gt;on mountpoint A&lt;/b&gt;;&lt;/li&gt;
	&lt;li&gt;suspend the copytool;&lt;/li&gt;
	&lt;li&gt;archive the file created at step 2 &lt;b&gt;from mountpoint A&lt;/b&gt;*;&lt;/li&gt;
	&lt;li&gt;delete the file &lt;b&gt;on mountpoint B&lt;/b&gt;;&lt;/li&gt;
	&lt;li&gt;&lt;tt&gt;sync&lt;/tt&gt;;&lt;/li&gt;
	&lt;li&gt;resume the copytool (its output should indicate that &lt;tt&gt;llapi_hsm_action_begin()&lt;/tt&gt; failed with EIO, not ENOENT);&lt;/li&gt;
	&lt;li&gt;umount =&amp;gt; the process hangs in an unkillable state.&lt;/li&gt;
&lt;/ol&gt;
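
&lt;p&gt;The steps above can be sketched as an ordered command plan. This is an illustrative sketch, not the attached reproducer: the function name, the default paths, the archive id and the copytool PID placeholder are all assumptions, and step 1 (setup) and step 9 (umount) are left out because this report does not give their exact commands.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;
```python
def reproducer_plan(mnt_a="/mnt/lustre", mnt_b="/mnt/lustre2",
                    name="file", hsm_root="/tmp/arc", copytool_pid="PID"):
    """Build the command sequence for steps 2 to 8 as argv lists.

    Nothing is executed here. All defaults are placeholders, and
    copytool_pid stands for the PID of the running lhsmtool_posix.
    """
    path_a = mnt_a + "/" + name
    path_b = mnt_b + "/" + name
    return [
        ["touch", path_a],                          # 2. create a file
        ["lhsmtool_posix", "--daemon", "--hsm-root", hsm_root,
         "--archive", "1", mnt_a],                  # 3. copytool on A
        ["kill", "-STOP", copytool_pid],            # 4. suspend it
        ["lfs", "hsm_archive", path_a],             # 5. archive from A
        ["rm", path_b],                             # 6. delete on B
        ["sync"],                                   # 7. sync
        ["kill", "-CONT", copytool_pid],            # 8. resume it
    ]


if __name__ == "__main__":
    # Print the plan only; running it needs a Lustre setup with HSM enabled.
    for argv in reproducer_plan():
        print(" ".join(argv))
```
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;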


&lt;p&gt;*&lt;em&gt;You can use mountpoint B at step 5, but only if you created the file from mountpoint A.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I added some debug output to the reproducer; it should be logged in /tmp.&lt;/p&gt;

&lt;p&gt;I suspect those two lines in the &lt;tt&gt;dmesg&lt;/tt&gt; are related to this issue (they are logged at umount time):&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[  143.575078] LustreError: 3703:0:(ldlm_resource.c:1094:ldlm_resource_complain()) filter-lustre-OST0000_UUID: namespace resource [0x2:0x0:0x0].0x0 (ffff8806ab7b6900) refcount nonzero (1) after lock cleanup; forcing cleanup.
[  143.578233] LustreError: 3703:0:(ldlm_resource.c:1676:ldlm_resource_dump()) --- Resource: [0x2:0x0:0x0].0x0 (ffff8806ab7b6900) refcount = 2
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&lt;em&gt;Note: the title should probably be updated once we figure out exactly what the issue is.&lt;/em&gt;&lt;/p&gt;</description>
                <environment></environment>
        <key id="49492">LU-10302</key>
            <summary>hsm: obscure bug with multi-mountpoints and ldlm</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="jhammond">John Hammond</assignee>
                                    <reporter username="cealustre">CEA</reporter>
                        <labels>
                    </labels>
                <created>Thu, 30 Nov 2017 15:29:39 +0000</created>
                <updated>Wed, 5 Aug 2020 13:50:22 +0000</updated>
                            <resolved>Fri, 22 Dec 2017 12:50:59 +0000</resolved>
                                                    <fixVersion>Lustre 2.11.0</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>7</watches>
                                                                            <comments>
                            <comment id="215049" author="adilger" created="Thu, 30 Nov 2017 18:49:18 +0000"  >&lt;p&gt;Quentin, it isn&#8217;t clear from your bug report what actual problem you are hitting. Does the client unmount fail, or are the error messages unexpected but otherwise harmless? Is this problem hit in normal usage?&lt;/p&gt;

&lt;p&gt;It does look like the copytool is holding a lock reference on the OST object longer than it should, but such references should be cleaned up at unmount.&lt;/p&gt;

</comment>
                            <comment id="215116" author="bougetq" created="Fri, 1 Dec 2017 09:46:13 +0000"  >&lt;p&gt;My bad, I updated the description: the client unmount hangs.&lt;/p&gt;

&lt;p&gt;&amp;gt; Is this problem hit in normal usage?&lt;/p&gt;

&lt;p&gt;The reproducer I provided works on a single node setup but you can also reproduce on a multi-node setup (copytool on one node, client doing the &lt;tt&gt;rm&lt;/tt&gt; on another node), so this definitely impacts production setups.&lt;/p&gt;</comment>
                            <comment id="215118" author="bougetq" created="Fri, 1 Dec 2017 10:13:06 +0000"  >&lt;p&gt;Letting the HSM request time out is not required to reproduce the bug; what matters is syncing data/metadata.&lt;/p&gt;

&lt;p&gt;I updated the description (once again) and the reproducer accordingly.&lt;/p&gt;</comment>
                            <comment id="215144" author="pjones" created="Fri, 1 Dec 2017 18:22:09 +0000"  >&lt;p&gt;Bruno&lt;/p&gt;

&lt;p&gt;Can you look into this one?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="215209" author="bougetq" created="Mon, 4 Dec 2017 13:26:56 +0000"  >&lt;p&gt;The condition to trigger the bug is a bit more complex than I first thought: &lt;tt&gt;lhsmtool_posix != rm &amp;amp;&amp;amp; !(create == lfs hsm_archive == rm)&lt;/tt&gt;&lt;/p&gt;

&lt;p&gt;The more verbose version: lhsmtool_posix and rm are run on different mountpoints, and the file is not created, archived and deleted from the same mountpoint.&lt;/p&gt;
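
&lt;p&gt;As a rough illustration, the condition can be written as a small Python predicate. This is only a sketch: the function name and the mountpoint labels such as "A" and "B" are hypothetical, and each argument simply names the mountpoint on which that operation ran.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;
```python
def triggers_bug(create, archive, copytool, rm):
    """Return True when the condition above says the hang should reproduce.

    Each argument is the (hypothetical) label of the mountpoint on which
    the file was created, lfs hsm_archive was run, lhsmtool_posix was
    running, and rm was run.
    """
    # The copytool and rm must be on different mountpoints ...
    different_mounts = copytool != rm
    # ... and create, archive and rm must not all share one mountpoint.
    same_mount_for_all = create == archive == rm
    return different_mounts and not same_mount_for_all


if __name__ == "__main__":
    # One layout consistent with the description: only rm runs on B.
    print(triggers_bug(create="A", archive="A", copytool="A", rm="B"))
    # Create, archive and rm all on B, copytool on A.
    print(triggers_bug(create="B", archive="B", copytool="A", rm="B"))
```
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;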

&lt;p&gt;I am not sure how useful this is. I am putting it here... just in case.&lt;/p&gt;</comment>
                            <comment id="215835" author="jhammond" created="Fri, 8 Dec 2017 20:36:51 +0000"  >&lt;p&gt;You are seeing the fact that the lock and resource reference counting in LDLM is intolerant of some lvbo init errors. In particular, if &lt;tt&gt;ofd_lvbo_init()&lt;/tt&gt; fails because the object could not be found, then a reference on the resource is somehow leaked.&lt;/p&gt;</comment>
                            <comment id="215836" author="jhammond" created="Fri, 8 Dec 2017 20:37:53 +0000"  >&lt;p&gt;BTW, the copytool (CT) is able to hit this because it calls &lt;tt&gt;search_inode_for_lustre()&lt;/tt&gt; to get the data version, so it does not see that the file has been deleted.&lt;/p&gt;</comment>
                            <comment id="215902" author="bougetq" created="Mon, 11 Dec 2017 08:23:48 +0000"  >&lt;p&gt;I cannot reproduce the bug anymore when I apply the patch you proposed for &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-10357&quot; title=&quot;ll_ioc_copy_{start,end}() depend on search_inode_for_lustre() which is bad&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-10357&quot;&gt;&lt;del&gt;LU-10357&lt;/del&gt;&lt;/a&gt;. Thank you!&lt;/p&gt;

&lt;p&gt;Maybe we can keep this LU to fix &lt;tt&gt;search_inode_for_lustre()&lt;/tt&gt; or &lt;tt&gt;ofd_lvbo_init()&lt;/tt&gt;... or both, depending on what makes more sense. =)&lt;/p&gt;</comment>
                            <comment id="215969" author="gerrit" created="Mon, 11 Dec 2017 19:10:53 +0000"  >&lt;p&gt;John L. Hammond (john.hammond@intel.com) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/30477&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/30477&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-10302&quot; title=&quot;hsm: obscure bug with multi-mountpoints and ldlm&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-10302&quot;&gt;&lt;del&gt;LU-10302&lt;/del&gt;&lt;/a&gt; ldlm: destroy lock if LVB init fails&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 0be0459c0b1409c790a214a73735673ed9907b57&lt;/p&gt;</comment>
                            <comment id="217051" author="gerrit" created="Fri, 22 Dec 2017 06:49:37 +0000"  >&lt;p&gt;Oleg Drokin (oleg.drokin@intel.com) merged in patch &lt;a href=&quot;https://review.whamcloud.com/30477/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/30477/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-10302&quot; title=&quot;hsm: obscure bug with multi-mountpoints and ldlm&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-10302&quot;&gt;&lt;del&gt;LU-10302&lt;/del&gt;&lt;/a&gt; ldlm: destroy lock if LVB init fails&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: c91cb6ee81e7751b719228efa58dc32fdea836e5&lt;/p&gt;</comment>
                            <comment id="217112" author="pjones" created="Fri, 22 Dec 2017 12:50:59 +0000"  >&lt;p&gt;Landed for 2.11&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="49661">LU-10357</issuekey>
        </issuelink>
                            </outwardlinks>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="50959">LU-10723</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="28819" name="reproducer-lu-10302.sh" size="1075" author="bougetq" created="Fri, 1 Dec 2017 10:10:01 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzzohb:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>