<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:38:47 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-4002] HSM restore vs unlink deadlock </title>
                <link>https://jira.whamcloud.com/browse/LU-4002</link>
                <project id="10000" key="LU">Lustre</project>
                <description>&lt;p&gt;In 23c197908902183d5f88d3f431da6cde9c290e07 &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3811&quot; title=&quot;non-root users cannot archive files, root cannot archive non-root users&amp;#39; files&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3811&quot;&gt;&lt;del&gt;LU-3811&lt;/del&gt;&lt;/a&gt; hsm: handle file ownership and timestamps, I added a stat() of the file being restored to the copytool&apos;s (CT&apos;s) restore path. This ensures that the volatile file is given the correct ownership and timestamps before the restore, which is required for the layout swap to succeed. However, this introduces a potential deadlock with unlink() and other operations. Consider the following sequence of operations on a single file:&lt;/p&gt;

&lt;ol&gt;
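	&lt;li&gt;A client sends a restore request; the coordinator (CDT) takes and holds an EX LAYOUT lock.&lt;/li&gt;
	&lt;li&gt;A client sends an unlink; the handler sleeps on an EX FULL lock.&lt;/li&gt;
	&lt;li&gt;The CDT sends the restore action to the CT.&lt;/li&gt;
	&lt;li&gt;The CT begins the restore and sends a getattr (from stat()); the handler sleeps on a PR LOOKUP,UPDATE,PERM lock.&lt;/li&gt;
	&lt;li&gt;The getattr waits behind the conflicting EX FULL request, the EX FULL request waits for the EX LAYOUT lock, and the CDT will not release that lock until the restore (and therefore the getattr) completes, so none of them can make progress.&lt;/li&gt;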
&lt;/ol&gt;
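
&lt;p&gt;The sequence above can be replayed in a small, hypothetical Python model of a single MDT inode resource. This is a sketch, not Lustre&apos;s ldlm code, and it makes three assumptions. First, two requests conflict only when their inode bits overlap and at least one of them is EX; other lock modes are not modeled. Second, a new request must be compatible with both the granted locks and the requests already waiting, so it queues behind a conflicting waiter. Third, FULL is approximated by the bits named in this ticket. Request, Resource, run and DEADLOCK_STEPS are illustrative names.&lt;/p&gt;

&lt;pre&gt;
```python
from dataclasses import dataclass, field

# Hypothetical approximation of a FULL inodebits lock: only the bits named in this ticket.
FULL = frozenset({"LOOKUP", "UPDATE", "PERM", "LAYOUT"})


@dataclass(frozen=True)
class Request:
    owner: str
    mode: str  # only "EX" and "PR" are modeled
    bits: frozenset


def conflicts(a, b):
    """Disjoint bit sets never conflict; overlapping ones conflict unless both are PR."""
    if a.bits.isdisjoint(b.bits):
        return False
    return "EX" in (a.mode, b.mode)


@dataclass
class Resource:
    granted: list = field(default_factory=list)
    waiting: list = field(default_factory=list)

    def enqueue(self, req):
        # Assumed fairness rule: check the granted locks AND the already-waiting requests.
        blockers = [r.owner for r in self.granted + self.waiting if conflicts(req, r)]
        if blockers:
            self.waiting.append(req)
            return f"{req.owner}: WAIT (blocked by {', '.join(blockers)})"
        self.granted.append(req)
        return f"{req.owner}: GRANTED"


def run(steps):
    """Enqueue each request in order on a fresh resource; nothing is ever released."""
    res = Resource()
    return [res.enqueue(req) for req in steps]


DEADLOCK_STEPS = [
    Request("CDT restore", "EX", frozenset({"LAYOUT"})),
    Request("unlink", "EX", FULL),
    Request("CT getattr", "PR", frozenset({"LOOKUP", "UPDATE", "PERM"})),
]

if __name__ == "__main__":
    for line in run(DEADLOCK_STEPS):
        print(line)
```
&lt;/pre&gt;

&lt;p&gt;The script shows three outcomes:&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;The CDT restore is granted.&lt;/li&gt;
	&lt;li&gt;Unlink waits on the CDT restore.&lt;/li&gt;
	&lt;li&gt;The CT getattr waits on unlink.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In the real sequence, the CDT lock is not released until the restore finishes, and the restore needs the getattr, so the waits never resolve. If the same steps are replayed with unlink requesting only EX LOOKUP and UPDATE, the model grants unlink immediately, because those bits do not overlap the held LAYOUT bit.&lt;/p&gt;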


&lt;p&gt;We have a similar deadlock with rename-onto.&lt;/p&gt;

&lt;p&gt;I think the simplest way out of this mess would be to lock fewer bits in the unlink handler. Can anyone say why unlink should invalidate the cached layout? An open, unlinked file is still valid for I/O.&lt;/p&gt;</description>
                <environment></environment>
        <key id="21107">LU-4002</key>
            <summary>HSM restore vs unlink deadlock </summary>
                <type id="7" iconUrl="https://jira.whamcloud.com/images/icons/issuetypes/task_agile.png">Technical task</type>
                            <parent id="20020">LU-3647</parent>
                                    <priority id="1" iconUrl="https://jira.whamcloud.com/images/icons/priorities/blocker.svg">Blocker</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="jhammond">John Hammond</assignee>
                                    <reporter username="jhammond">John Hammond</reporter>
                        <labels>
                            <label>HSM</label>
                    </labels>
                <created>Tue, 24 Sep 2013 18:41:24 +0000</created>
                <updated>Tue, 25 Jan 2022 20:56:30 +0000</updated>
                            <resolved>Wed, 9 Oct 2013 08:22:39 +0000</resolved>
                                    <version>Lustre 2.5.0</version>
                                    <fixVersion>Lustre 2.5.0</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>6</watches>
                                                                            <comments>
                            <comment id="67557" author="green" created="Wed, 25 Sep 2013 16:14:24 +0000"  >&lt;p&gt;I would agree that unlink does not need to invalidate the layout.&lt;/p&gt;</comment>
                            <comment id="67559" author="jay" created="Wed, 25 Sep 2013 16:32:29 +0000"  >&lt;p&gt;Is it possible for unlink to grab LOOKUP only? The only side effect I can think of now is that some locks cached on the client side won&apos;t be revoked, but this can be easily fixed.&lt;/p&gt;</comment>
                            <comment id="67609" author="jhammond" created="Wed, 25 Sep 2013 20:05:27 +0000"  >&lt;p&gt;In general, it won&apos;t be enough just to remove LAYOUT from unlink. Assume restore takes EX LAYOUT, and some other operation tries to take PR LAYOUT | LOOKUP | ... and waits. Then unlink tries to take EX LOOKUP | ... and waits behind that request. Then stat (from the CT) tries to take LOOKUP | UPDATE | ..., waits behind unlink, and deadlocks.&lt;/p&gt;
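
&lt;p&gt;The chain in this comment can be checked as a wait-for graph, in which a deadlock is a cycle. Below is a minimal Python sketch. The node names and edges are hand-derived from the comment, assuming that a new lock request also waits behind any conflicting request already queued on the resource. find_cycle is an ordinary depth-first search, not anything in Lustre, and WAITS_FOR and NO_INTERMEDIATE are illustrative names.&lt;/p&gt;

&lt;pre&gt;
```python
# Hypothetical wait-for graph derived by hand from the scenario in this comment.
# An edge A -> B means "A cannot proceed until B finishes or drops its lock".
WAITS_FOR = {
    # The CDT keeps EX LAYOUT until the restore ends, and the restore
    # cannot end until the copytool's stat() returns.
    "restore": ["ct_stat"],
    # PR LAYOUT|LOOKUP|... conflicts with the granted EX LAYOUT.
    "other_op": ["restore"],
    # EX LOOKUP|... overlaps the waiting PR ...|LOOKUP request, so it queues behind it.
    "unlink": ["other_op"],
    # LOOKUP|UPDATE|... overlaps the waiting EX LOOKUP request from unlink.
    "ct_stat": ["unlink"],
}

# Same idea without the intermediate operation: unlink does not ask for
# LAYOUT, nothing blocks it, and it simply runs to completion.
NO_INTERMEDIATE = {
    "restore": ["ct_stat"],
    "ct_stat": ["unlink"],
    "unlink": [],
}


def find_cycle(graph):
    """Return one cycle as a node list (first node repeated at the end), or None."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the DFS path, finished
    color = {node: WHITE for node in graph}
    path = []

    def visit(node):
        color[node] = GRAY
        path.append(node)
        for nxt in graph.get(node, []):
            state = color.get(nxt, WHITE)
            if state == GRAY:  # back edge: the cycle is the tail of the current path
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for node in graph:
        if color[node] == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


if __name__ == "__main__":
    print(find_cycle(WAITS_FOR))
    print(find_cycle(NO_INTERMEDIATE))
```
&lt;/pre&gt;

&lt;p&gt;For the first graph, the search returns the cycle restore, ct_stat, unlink, other_op, restore. For the second, it returns None. This matches the point of the comment: dropping LAYOUT from unlink breaks the cycle in the original sequence, but one extra waiting request is enough to close a new one.&lt;/p&gt;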

&lt;p&gt;As I keep saying, any operation that requires two locks is dangerous.&lt;/p&gt;

&lt;p&gt;Note that unlink (and rename onto) should take more than just LOOKUP, since it modifies link count and timestamps.&lt;/p&gt;

&lt;p&gt;Could we do something more sensible, like having the MDT send the attributes that the CT currently gets from stat()?&lt;/p&gt;</comment>
                            <comment id="67905" author="jhammond" created="Sat, 28 Sep 2013 00:05:30 +0000"  >&lt;p&gt;Please see &lt;a href=&quot;http://review.whamcloud.com/7792&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/7792&lt;/a&gt;.&lt;/p&gt;</comment>
                            <comment id="68056" author="jhammond" created="Tue, 1 Oct 2013 15:26:14 +0000"  >&lt;p&gt;I suggest that this be considered a blocker for 2.5.0. It is easy to imagine situations where users will trigger this deadlock. Faced with a long-running restore on a file (which to the user may just look like an unresponsive console or file system), the user may log out, log in, and unlink the released file (perhaps because it is easy to regenerate anyway).&lt;/p&gt;</comment>
                            <comment id="68227" author="jlevi" created="Thu, 3 Oct 2013 12:52:10 +0000"  >&lt;p&gt;Patch landed to Master. If more work is needed in this ticket, please let me know and I will reopen this ticket.&lt;/p&gt;</comment>
                            <comment id="68655" author="adilger" created="Wed, 9 Oct 2013 08:03:26 +0000"  >&lt;p&gt;Per recent comments in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4053&quot; title=&quot;client leaking objects/locks during IO&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4053&quot;&gt;&lt;del&gt;LU-4053&lt;/del&gt;&lt;/a&gt;, this is causing a lot of layout locks to be left on the client after an inode is unlinked.  Ideally, the layout lock would be revoked if this is the last reference to the inode (the last link is being removed and the file is not open).  The main question is whether it is possible to know this in advance.  Unlike the open-unlinked handling, the worst case here is revoking one extra lock, if another thread opens the file just before it is unlinked, so I think it is better to handle the common case more efficiently.&lt;/p&gt;

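&lt;p&gt;The policy proposed here takes the layout lock on unlink only when the last reference is going away. It can be sketched as a small Python decision function, with three caveats:&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;InodeState and unlink_lock_bits are hypothetical names.&lt;/li&gt;
	&lt;li&gt;The base LOOKUP and UPDATE bits follow the remark above that unlink modifies the link count and timestamps.&lt;/li&gt;
	&lt;li&gt;The link and open counts are a snapshot, so a racing open can cost one extra layout revocation.&lt;/li&gt;
&lt;/ul&gt;

&lt;pre&gt;
```python
from dataclasses import dataclass


@dataclass
class InodeState:
    """Hypothetical snapshot of what the MDT knows when the unlink arrives."""
    nlink: int       # link count before this unlink
    open_count: int  # open handles on the file


def unlink_lock_bits(state):
    """Return the inode bits a (hypothetical) unlink handler would request."""
    bits = {"LOOKUP", "UPDATE"}  # name removal plus nlink/timestamp changes
    last_reference = state.nlink == 1 and state.open_count == 0
    if last_reference:
        # The inode is about to be destroyed, so revoking cached layout locks
        # is useful. A racing open after this check only costs one extra revoke.
        bits.add("LAYOUT")
    return frozenset(bits)


if __name__ == "__main__":
    print(sorted(unlink_lock_bits(InodeState(nlink=1, open_count=0))))
    print(sorted(unlink_lock_bits(InodeState(nlink=2, open_count=0))))
```
&lt;/pre&gt;

&lt;p&gt;On its own, this check could still request LAYOUT for a released file that is being restored, if that file has a single link and no open handles.&lt;/p&gt;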
&lt;p&gt;Is there any way to know in advance whether HSM is processing this file, so that we do not try to revoke the layout lock in that case?&lt;/p&gt;</comment>
                            <comment id="68656" author="adilger" created="Wed, 9 Oct 2013 08:22:39 +0000"  >&lt;p&gt;Might have jumped the gun on this. Closing it again until we know it is the culprit. &lt;/p&gt;</comment>
                            <comment id="68677" author="jay" created="Wed, 9 Oct 2013 17:17:06 +0000"  >&lt;p&gt;Hi Andreas, this is a known issue but we still decided to land the patch because the deadlock issue is more severe.&lt;/p&gt;

&lt;p&gt;The problem is that when the LOOKUP lock is revoked, we don&apos;t know whether this is because the file is being unlinked or renamed. However, leaving some locks in cache may not be a problem, because if the system is active, the locks will be discarded by the LRU sooner or later.&lt;/p&gt;</comment>
                            <comment id="68682" author="jhammond" created="Wed, 9 Oct 2013 17:26:28 +0000"  >&lt;p&gt;I believe that we could revert this patch if the HSM coordinator sent the ownership and timestamps to use on the volatile file along with the restore action. Then the copytool would not need to stat() the original file. The restoring layout swap will still check that the ownerships agree, protecting us from a TOCTTOU (time-of-check to time-of-use) issue.&lt;/p&gt;</comment>
                            <comment id="68835" author="jhammond" created="Fri, 11 Oct 2013 15:47:18 +0000"  >&lt;p&gt;Please see &lt;a href=&quot;http://review.whamcloud.com/7927&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/7927&lt;/a&gt; for a sketch of this approach.&lt;/p&gt;</comment>
                            <comment id="323898" author="jhammond" created="Tue, 25 Jan 2022 20:56:30 +0000"  >&lt;p&gt;Note to self. After &lt;a href=&quot;http://review.whamcloud.com/13750&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/13750&lt;/a&gt; (&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-4727&quot; title=&quot;Lhsmtool_posix process stuck in ll_layout_refresh() when restoring&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-4727&quot;&gt;&lt;del&gt;LU-4727&lt;/del&gt;&lt;/a&gt; hsm: use IOC_MDC_GETFILEINFO in restore), the copytool uses IOC_MDC_GETFILEINFO/MDS_GETATTR_NAME to get the attributes of the released file, which bypasses LDLM and does not have the issue described above.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="21245">LU-4053</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="23513">LU-4727</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                    <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzw3tr:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>10714</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>