<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:24:39 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-9266] Mount hung due to double HSM RESTORE records</title>
                <link>https://jira.whamcloud.com/browse/LU-9266</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Usually, when an agent sends several RESTORE requests for the same FID, the MDT processes only the first one.&lt;br/&gt;
 When the 2nd request arrives, the MDT doesn&apos;t add a new action because it sees the first one in the llog.&lt;br/&gt;
 But there is a chance that the 1st RESTORE action has not yet been written into the llog when the 2nd request reaches the MDT:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;
int mdt_hsm_add_actions(struct mdt_thread_info *mti, 
                        struct hsm_action_list *hal, __u64 *compound_id) 
{
...
       rc = hsm_find_compatible(mti-&amp;gt;mti_env, mdt, hal);
...

                /* test result of hsm_find_compatible()
                 * if request redundant or cancel of nothing
                 * do not record
                 */
                /* redundant case */
                if (hai-&amp;gt;hai_action != HSMA_CANCEL &amp;amp;&amp;amp; hai-&amp;gt;hai_cookie != 0)
                        continue;
...
                        /* take LAYOUT lock so that accessing the layout will
                         * be blocked until the restore is finished */
                        mdt_lock_reg_init(&amp;amp;crh-&amp;gt;crh_lh, LCK_EX);
                        rc = mdt_object_lock(mti, obj, &amp;amp;crh-&amp;gt;crh_lh,
...
                /* record request */
                rc = mdt_agent_record_add(mti-&amp;gt;mti_env, mdt, *compound_id,
                                          archive_id, flags, hai);


&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
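&lt;p&gt;The excerpt checks for a compatible request first and writes the record later. A minimal sketch of that check-then-record order, in plain C with hypothetical names (the sketch_* types and functions are illustrative and not part of Lustre), might look as follows. It leaves out the LAYOUT lock entirely and only shows that two look-ups completed before either record is written both come back empty:&lt;/p&gt;
<![CDATA[
```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the llog and a FID;
 * none of these names exist in Lustre. */
#define SKETCH_LOG_MAX 8

struct sketch_fid {
	unsigned long long seq;
	unsigned int oid;
	unsigned int ver;
};

struct sketch_log {
	struct sketch_fid restores[SKETCH_LOG_MAX];
	size_t count;
};

static bool sketch_fid_eq(const struct sketch_fid *a,
			  const struct sketch_fid *b)
{
	return a->seq == b->seq && a->oid == b->oid && a->ver == b->ver;
}

/* Step 1: look for an already recorded RESTORE on the same FID. */
static bool sketch_find_compatible(const struct sketch_log *log,
				   const struct sketch_fid *fid)
{
	for (size_t i = 0; i < log->count; i++)
		if (sketch_fid_eq(&log->restores[i], fid))
			return true;
	return false;
}

/* Step 2: append the record; this happens separately from the look-up. */
static void sketch_record_add(struct sketch_log *log,
			      const struct sketch_fid *fid)
{
	assert(log->count < SKETCH_LOG_MAX);
	log->restores[log->count++] = *fid;
}

/* The second request arrives after the first record has been written. */
static size_t sketch_sequential(void)
{
	struct sketch_log log = { .count = 0 };
	struct sketch_fid fid = { 0x200000402ULL, 0x1, 0x0 };

	if (!sketch_find_compatible(&log, &fid))
		sketch_record_add(&log, &fid);
	if (!sketch_find_compatible(&log, &fid))
		sketch_record_add(&log, &fid);
	return log.count;
}

/* Both requests finish their look-up before either record is written. */
static size_t sketch_interleaved(void)
{
	struct sketch_log log = { .count = 0 };
	struct sketch_fid fid = { 0x200000402ULL, 0x1, 0x0 };
	bool first_redundant = sketch_find_compatible(&log, &fid);
	bool second_redundant = sketch_find_compatible(&log, &fid);

	if (!first_redundant)
		sketch_record_add(&log, &fid);
	if (!second_redundant)
		sketch_record_add(&log, &fid);
	return log.count;
}
```
]]>
&lt;p&gt;Under these assumptions sketch_sequential() leaves one record in the log and sketch_interleaved() leaves two.&lt;/p&gt;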
&lt;p&gt;Even if the MDT doesn&apos;t find a compatible request in the llog, it still tries to take the LAYOUT lock. This lock is already held by the 1st RESTORE request.&lt;br/&gt;
 Normally the 2nd RESTORE request can take the LAYOUT lock only AFTER the 1st RESTORE action has finished. In that case the 2nd request finds that the object is already RESTORED and does nothing.&lt;br/&gt;
 But ldlm_resource_clean, called from umount, breaks this ordering, so the 2nd request may add the same action to the llog:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;lrh=[type=10680000 len=136 idx=1/1] fid=[0x200000402:0x1:0x0] dfid=[0x200000402:0x1:0x0] compound/cookie=0x58d95f65/0x58d95f65 action=ARCHIVE archive#=2 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=SUCCEED data=[]
lrh=[type=10680000 len=136 idx=1/2] fid=[0x200000402:0x1:0x0] dfid=[0x200000402:0x1:0x0] compound/cookie=0x58d95f66/0x58d95f66 action=RESTORE archive#=2 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=WAITING data=[]
lrh=[type=10680000 len=136 idx=1/3] fid=[0x200000402:0x1:0x0] dfid=[0x200000402:0x1:0x0] compound/cookie=0x58d95f67/0x58d95f67 action=RESTORE archive#=2 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=WAITING data=[]


&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
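&lt;p&gt;One way to spot such a pair is to look for a RESTORE record in WAITING state whose FID matches an earlier RESTORE record that is also WAITING. The sketch below assumes a simplified in-memory record (hypothetical sketch_* names, not the Lustre llog format) and only illustrates the comparison:&lt;/p&gt;
<![CDATA[
```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced view of an HSM action record; the real llog
 * record carries more fields (cookie, archive id, extent, ...). */
enum sketch_action { SKETCH_ARCHIVE, SKETCH_RESTORE };
enum sketch_status { SKETCH_WAITING, SKETCH_SUCCEED };

struct sketch_record {
	unsigned long long fid_seq;
	unsigned int fid_oid;
	unsigned int fid_ver;
	enum sketch_action action;
	enum sketch_status status;
};

static bool sketch_is_pending_restore(const struct sketch_record *r)
{
	return r->action == SKETCH_RESTORE && r->status == SKETCH_WAITING;
}

static bool sketch_same_fid(const struct sketch_record *a,
			    const struct sketch_record *b)
{
	return a->fid_seq == b->fid_seq && a->fid_oid == b->fid_oid &&
	       a->fid_ver == b->fid_ver;
}

/* Return the index of the first record that repeats an earlier pending
 * RESTORE on the same FID, or -1 if there is none. */
static int sketch_first_duplicate(const struct sketch_record *recs, size_t n)
{
	assert(recs != NULL || n == 0);

	for (size_t i = 1; i < n; i++) {
		if (!sketch_is_pending_restore(&recs[i]))
			continue;
		for (size_t j = 0; j < i; j++) {
			if (sketch_is_pending_restore(&recs[j]) &&
			    sketch_same_fid(&recs[i], &recs[j]))
				return (int)i;
		}
	}
	return -1;
}

/* The three records from the dump above, reduced to the fields used here. */
static const struct sketch_record sketch_dump[] = {
	{ 0x200000402ULL, 0x1, 0x0, SKETCH_ARCHIVE, SKETCH_SUCCEED },
	{ 0x200000402ULL, 0x1, 0x0, SKETCH_RESTORE, SKETCH_WAITING },
	{ 0x200000402ULL, 0x1, 0x0, SKETCH_RESTORE, SKETCH_WAITING },
};
```
]]>
&lt;p&gt;For the three records shown above, sketch_first_duplicate(sketch_dump, 3) returns 2, the index of the second RESTORE record.&lt;/p&gt;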
&lt;p&gt;Such records cause the mount to hang when starting HSM:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt; D: 15524 TASK: ffff880068b5b540 CPU: 4 COMMAND: &quot;lctl&quot;
 #0 [ffff8800bacd9728] schedule at ffffffff81525d30
 #1 [ffff8800bacd97f0] ldlm_completion_ast at ffffffffa08527f5 [ptlrpc]
 #2 [ffff8800bacd9890] ldlm_cli_enqueue_local at ffffffffa0851b8e [ptlrpc]
 #3 [ffff8800bacd9910] mdt_object_lock0 at ffffffffa0e4ec4c [mdt]
 #4 [ffff8800bacd99c0] mdt_object_lock at ffffffffa0e4f694 [mdt]
 #5 [ffff8800bacd99d0] mdt_object_find_lock at ffffffffa0e4f9c1 [mdt]
 #6 [ffff8800bacd9a00] hsm_restore_cb at ffffffffa0e9b533 [mdt]
 #7 [ffff8800bacd9a50] llog_process_thread at ffffffffa05fd699 [obdclass]
 #8 [ffff8800bacd9b10] llog_process_or_fork at ffffffffa05fdbaf [obdclass]
 #9 [ffff8800bacd9b60] llog_cat_process_cb at ffffffffa0601250 [obdclass]
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</description>
                <environment></environment>
        <key id="45057">LU-9266</key>
            <summary>Mount hung due to double HSM RESTORE records</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="hongchao.zhang">Hongchao Zhang</assignee>
                                    <reporter username="scherementsev">Sergey Cheremencev</reporter>
                        <labels>
                    </labels>
                <created>Mon, 27 Mar 2017 20:57:48 +0000</created>
                <updated>Wed, 16 Aug 2017 21:49:59 +0000</updated>
                            <resolved>Wed, 9 Aug 2017 04:52:05 +0000</resolved>
                                    <version>Lustre 2.9.0</version>
                                    <fixVersion>Lustre 2.10.1</fixVersion>
                    <fixVersion>Lustre 2.11.0</fixVersion>
                                        <due></due>
                            <votes>1</votes>
                                    <watches>6</watches>
                                                                            <comments>
                            <comment id="189797" author="gerrit" created="Mon, 27 Mar 2017 21:04:51 +0000"  >&lt;p&gt;Sergey Cheremencev (sergey.cheremencev@seagate.com) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/26215&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/26215&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-9266&quot; title=&quot;Mount hung due to double HSM RESTORE records&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-9266&quot;&gt;&lt;del&gt;LU-9266&lt;/del&gt;&lt;/a&gt; hsm: don&apos;t add request when cdt is stopped&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 0587c27ddf9ed41cb57b6959370656696295f954&lt;/p&gt;</comment>
                            <comment id="190345" author="pjones" created="Fri, 31 Mar 2017 23:00:41 +0000"  >&lt;p&gt;Hongchao&lt;/p&gt;

&lt;p&gt;Could you please review this proposed change?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="193012" author="bfaccini" created="Fri, 21 Apr 2017 14:15:34 +0000"  >&lt;p&gt;Sergei, Hongchao,&lt;br/&gt;
Do I correctly understand that the conditions required to fall into this situation are very unlikely and racy?&lt;br/&gt;
I mean, 2 MDT request handler threads handling 2 restore requests for the same FID/file, and a concurrent MDT umount/stop, leading to 1st restore request to have granted layout-lock and recorded a 1st restore action to llog, but this lock has been canceled as part of umount process allowing the 2nd restore request to grant it and thus be able to add a 2nd restore action to llog, finally leading to a hang during next MDT mount/start when replaying all the layout-locks for all recorded restores.&lt;br/&gt;
Am I right?&lt;br/&gt;
And if yes, why don&apos;t you fix this very specific case in hsm_restore_cb() by finding/discarding (EALREADY ?) any duplicates ? We may also want to add a hashing mechanism for the cdt_restore_handle structs, so that crh_list does not have to be browsed when a huge number of restores must be handled.&lt;/p&gt;</comment>
                            <comment id="193686" author="sergey" created="Wed, 26 Apr 2017 22:23:52 +0000"  >&lt;p&gt;Thanks for the feedback.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Do I correctly understand that the conditions required to fall into this situation are very unlikely and racy?&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;Yes, at first glance it is very unlikely. On the other hand, a Seagate customer has hit this problem.&lt;br/&gt;
 So I guess it is not so unlikely on systems with high HSM activity.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I mean, 2 MDT request handler threads handling 2 restore requests for the same FID/file, and a concurrent MDT umount/stop, leading to 1st restore request to have granted layout-lock and recorded a 1st restore action to llog, but this lock has been canceled as part of umount process allowing the 2nd restore request to grant it and thus be able to add a 2nd restore action to llog, finally leading to a hang during next MDT mount/start when replaying all the layout-locks for all recorded restores.&lt;br/&gt;
 Am I right?&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;Correct.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;And if yes, why don&apos;t you fix this very specific case in hsm_restore_cb() by finding/discarding (EALREADY ?) any duplicates ?&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;Because it is easier not to add new requests when the cdt is stopped than to parse the llog later during the mount. Furthermore, mdt_hsm_add_actions has the same condition (cdt-&amp;gt;cdt_state == CDT_STOPPED) at the beginning, so ideally we shouldn&apos;t serve any requests when the coordinator is stopped.&lt;/p&gt;</comment>
                            <comment id="204863" author="gerrit" created="Wed, 9 Aug 2017 04:18:23 +0000"  >&lt;p&gt;Oleg Drokin (oleg.drokin@intel.com) merged in patch &lt;a href=&quot;https://review.whamcloud.com/26215/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/26215/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-9266&quot; title=&quot;Mount hung due to double HSM RESTORE records&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-9266&quot;&gt;&lt;del&gt;LU-9266&lt;/del&gt;&lt;/a&gt; hsm: don&apos;t add request when cdt is stopped&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 37a5157b84bce367e31743cb8648a15618492531&lt;/p&gt;</comment>
                            <comment id="204874" author="pjones" created="Wed, 9 Aug 2017 04:52:05 +0000"  >&lt;p&gt;Landed for 2.11&lt;/p&gt;</comment>
                            <comment id="204919" author="gerrit" created="Wed, 9 Aug 2017 16:24:32 +0000"  >&lt;p&gt;Minh Diep (minh.diep@intel.com) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/28441&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/28441&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-9266&quot; title=&quot;Mount hung due to double HSM RESTORE records&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-9266&quot;&gt;&lt;del&gt;LU-9266&lt;/del&gt;&lt;/a&gt; hsm: don&apos;t add request when cdt is stopped&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_10&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 04269245ce43e879869326c8ae5950b300fb318e&lt;/p&gt;</comment>
                            <comment id="205547" author="gerrit" created="Wed, 16 Aug 2017 20:43:12 +0000"  >&lt;p&gt;John L. Hammond (john.hammond@intel.com) merged in patch &lt;a href=&quot;https://review.whamcloud.com/28441/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/28441/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-9266&quot; title=&quot;Mount hung due to double HSM RESTORE records&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-9266&quot;&gt;&lt;del&gt;LU-9266&lt;/del&gt;&lt;/a&gt; hsm: don&apos;t add request when cdt is stopped&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_10&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: d488337c04b52392e11a784617d902e1f12c7cba&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10010">
                    <name>Duplicate</name>
                                                                <inwardlinks description="is duplicated by">
                                                        </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzz8fj:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>