<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:17:56 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-8481] MDT hung in recovery</title>
                <link>https://jira.whamcloud.com/browse/LU-8481</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;On our DNE testbed I have an MDT hung in recovery.  This is LLNL lustre tag 2.8.0_0.0.llnlpreview.30 (see the lustre-release-fe-llnl repo).  This tag includes patches and debugging for issues &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-8370&quot; title=&quot;ASSERTION( lur-&amp;gt;lur_hdr.lrh_len &amp;lt;= ctxt-&amp;gt;loc_chunk_size )&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-8370&quot;&gt;&lt;del&gt;LU-8370&lt;/del&gt;&lt;/a&gt;, &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-8422&quot; title=&quot;llog_osd.c:165:llog_osd_pad()) ASSERTION( len &amp;gt;= (24) &amp;amp;&amp;amp; (len &amp;amp; 0x7) == 0 )&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-8422&quot;&gt;&lt;del&gt;LU-8422&lt;/del&gt;&lt;/a&gt;, and &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7800&quot; title=&quot;Panic during recovery of soak-test.&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7800&quot;&gt;&lt;del&gt;LU-7800&lt;/del&gt;&lt;/a&gt;.  See the tag for specifics.&lt;/p&gt;

&lt;p&gt;One of the MDTs is hanging in recovery after the remaining recovery time reached 0, even though all of the other MDTs are alive and well.  Trying to abort recovery does not work.&lt;/p&gt;

&lt;p&gt;Checking backtraces, I see the following thread that looks stuck:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;PID: 14316  TASK: ffff883f1cb26780  CPU: 6   COMMAND: &quot;tgt_recover_15&quot;
 #0 [ffff883e760b37e8] __schedule+0x295 at ffffffff81651da5
 #1 [ffff883e760b3850] schedule+0x29 at ffffffff81652479
 #2 [ffff883e760b3860] ldlm_completion_ast+0x62d at ffffffffa0deb1cd [ptlrpc]
 #3 [ffff883e760b3900] ldlm_cli_enqueue_fini+0x938 at ffffffffa0dec958 [ptlrpc]
 #4 [ffff883e760b39a8] ldlm_cli_enqueue+0x2aa at ffffffffa0ded07a [ptlrpc]
 #5 [ffff883e760b3a50] osp_md_object_lock+0x154 at ffffffffa129b5c4 [osp]
 #6 [ffff883e760b3ad0] lod_object_lock+0xf0 at ffffffffa11d8310 [lod]
 #7 [ffff883e760b3b80] mdd_object_lock+0x3b at ffffffffa124070b [mdd]
 #8 [ffff883e760b3b90] mdt_remote_object_lock+0x1cf at ffffffffa10f563f [mdt]
 #9 [ffff883e760b3be8] mdt_object_lock_internal+0x15e at ffffffffa10f683e [mdt]
#10 [ffff883e760b3c30] mdt_reint_object_lock+0x20 at ffffffffa10f6b50 [mdt]
#11 [ffff883e760b3c40] mdt_reint_link+0x7e4 at ffffffffa110bd94 [mdt]
#12 [ffff883e760b3cc8] mdt_reint_rec+0x80 at ffffffffa110e470 [mdt]
#13 [ffff883e760b3cf0] mdt_reint_internal+0x5e1 at ffffffffa10f1971 [mdt]
#14 [ffff883e760b3d28] mdt_reint+0x67 at ffffffffa10fb0d7 [mdt]
#15 [ffff883e760b3d58] tgt_request_handle+0x915 at ffffffffa0e7d695 [ptlrpc]
#16 [ffff883e760b3da0] handle_recovery_req+0x8b at ffffffffa0dda95b [ptlrpc]
#17 [ffff883e760b3dc8] replay_request_or_update+0x4aa at ffffffffa0de499a [ptlrpc]
#18 [ffff883e760b3e40] target_recovery_thread+0x617 at ffffffffa0de53c7 [ptlrpc]
#19 [ffff883e760b3ec8] kthread+0xcf at ffffffff810a99bf
#20 [ffff883e760b3f50] ret_from_fork+0x58 at ffffffff8165d9d8
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Here is the recovery_status:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@jet16:lquake-MDT000f]# cat recovery_status
status: RECOVERING
recovery_start: 1470427157
time_remaining: 0
connected_clients: 202/203
req_replay_clients: 1
lock_repay_clients: 1
completed_clients: 201
evicted_clients: 1
replayed_requests: 0
queued_requests: 0
next_transno: 64483674941
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
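
&lt;p&gt;As a rough illustration of how the output above could be checked mechanically, here is a minimal Python sketch. The field names are taken from the recovery_status output; the helper names (parse_recovery_status, looks_stuck) and the rule that RECOVERING with time_remaining at 0 counts as stuck are assumptions made for illustration, not Lustre behavior.&lt;/p&gt;

```python
# Sketch only: field names come from the recovery_status output quoted above.
# The "stuck" heuristic is an assumption, not Lustre's own logic.

SAMPLE = """\
status: RECOVERING
recovery_start: 1470427157
time_remaining: 0
connected_clients: 202/203
req_replay_clients: 1
lock_repay_clients: 1
completed_clients: 201
evicted_clients: 1
replayed_requests: 0
queued_requests: 0
next_transno: 64483674941
"""


def parse_recovery_status(text):
    """Turn 'key: value' lines into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def looks_stuck(fields):
    """Heuristic: still RECOVERING although the countdown has hit zero."""
    return (fields.get("status") == "RECOVERING"
            and fields.get("time_remaining") == "0")


if __name__ == "__main__":
    status = parse_recovery_status(SAMPLE)
    print(looks_stuck(status), status["connected_clients"])
```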

&lt;p&gt;The console is regularly noting the passed recovery deadline:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[ 4472.930991] Lustre: lquake-MDT000f: Recovery already passed deadline 47:36, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</description>
                <environment></environment>
        <key id="38641">LU-8481</key>
            <summary>MDT hung in recovery</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="5" iconUrl="https://jira.whamcloud.com/images/icons/priorities/trivial.svg">Trivial</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="5">Cannot Reproduce</resolution>
                                        <assignee username="yong.fan">nasf</assignee>
                                    <reporter username="morrone">Christopher Morrone</reporter>
                        <labels>
                            <label>llnl</label>
                    </labels>
                <created>Fri, 5 Aug 2016 21:00:05 +0000</created>
                <updated>Sat, 12 Jan 2019 04:03:52 +0000</updated>
                            <resolved>Sat, 12 Jan 2019 04:03:52 +0000</resolved>
                                                                        <due></due>
                            <votes>0</votes>
                                    <watches>2</watches>
                                                                            <comments>
                            <comment id="160988" author="morrone" created="Fri, 5 Aug 2016 21:01:35 +0000"  >&lt;p&gt;Interestingly, /proc/fs/lustre/health_check says &quot;healthy&quot;.&lt;/p&gt;</comment>
                            <comment id="160991" author="morrone" created="Fri, 5 Aug 2016 21:18:32 +0000"  >&lt;p&gt;On reboot the MDT was able to complete recovery the second time.  That may indicate that this hang is racy.&lt;/p&gt;</comment>
                            <comment id="161140" author="pjones" created="Mon, 8 Aug 2016 17:17:15 +0000"  >&lt;p&gt;Fan Yong&lt;/p&gt;

&lt;p&gt;Could you please assist with this one?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="161242" author="yong.fan" created="Tue, 9 Aug 2016 08:46:29 +0000"  >&lt;p&gt;The stack trace shows that MDT000f was replaying a &apos;link&apos; operation from a client. The parent object that holds the target name entry resides on another MDT; call it MDT000X. The replay triggered an ldlm enqueue RPC from MDT000f to MDT000X. At that time the parent object&apos;s lock was held by someone else, so MDT000X did not grant the lock to MDT000f immediately. The lock requester &quot;tgt_recover_15&quot; on MDT000f therefore waited inside &quot;ldlm_completion_ast()&quot;, as follows:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;int ldlm_completion_ast(struct ldlm_lock *lock, __u64 flags, void *data)
{
...
        if (ldlm_is_no_timeout(lock)) {
                LDLM_DEBUG(lock, &quot;waiting indefinitely because of NO_TIMEOUT&quot;);
                lwi = LWI_INTR(interrupted_completion_wait, &amp;amp;lwd);
        } else {
                lwi = LWI_TIMEOUT_INTR(cfs_time_seconds(timeout),
                                       ldlm_expired_completion_wait,
                                       interrupted_completion_wait, &amp;amp;lwd);
        }
...
}
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;But for some unknown reason, the lock was not granted in time. Possible causes:&lt;/p&gt;

&lt;p&gt;1) The conflicting lock holder did not release the lock in time. That should have triggered an eviction on MDT000X because of the blocking_ast timeout. Please check the log messages on MDT000X.&lt;/p&gt;

&lt;p&gt;2) The conflicting lock holder released the lock in time, but MDT000X failed to send the completion_ast to MDT000f. If that happened, there should be lock callback timeout logs on MDT000X. Please check the log messages on MDT000X.&lt;/p&gt;
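
&lt;p&gt;The log check for both cases might be sketched in Python as below. This is only an illustration: the pattern follows the eviction message wording quoted elsewhere in this ticket, real console lines may carry extra prefixes and fields, and the host names and sample lines are made up.&lt;/p&gt;

```python
import re

# Assumption: the message wording is the one quoted in this ticket
# ("lock callback timer expired after xxxx: evicting client at ...").
PATTERN = re.compile(
    r"lock callback timer expired after (\S+): evicting client at (\S+)"
)

# Made-up sample lines for illustration only, not taken from a real console log.
SAMPLE_LOGS = {
    "mdt-a": ["Lustre: recovery complete"],
    "mdt-b": [
        "LustreError: lock callback timer expired after 100s: "
        "evicting client at 10.0.0.1@tcp"
    ],
}


def find_evictions(logs_by_host):
    """Return (host, elapsed, client) for every line matching PATTERN."""
    hits = []
    for host, lines in sorted(logs_by_host.items()):
        for line in lines:
            match = PATTERN.search(line)
            if match:
                hits.append((host, match.group(1), match.group(2)))
    return hits


if __name__ == "__main__":
    for hit in find_evictions(SAMPLE_LOGS):
        print(hit)
```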

&lt;p&gt;Anyway, it also depends on how long you saw MDT000f hung there. If it was long enough, MDT000X should give us more information; otherwise, if MDT000X was not aware of the timeout (lock blocking ast or completion ast), then we cannot know why &quot;tgt_recover_15&quot; on MDT000f was not woken up.&lt;/p&gt;</comment>
                            <comment id="161342" author="morrone" created="Tue, 9 Aug 2016 21:05:16 +0000"  >&lt;p&gt;From the &quot;Recovery already passed deadline 47:36&quot; message, we know that it was stuck for at least 47 minutes.  I think I left it sitting like that for over an hour before I rebooted it.&lt;/p&gt;

&lt;p&gt;How do I identify MDT000X?  And what am I looking for in the sea of Lustre console messages?&lt;/p&gt;</comment>
                            <comment id="161371" author="yong.fan" created="Wed, 10 Aug 2016 01:49:02 +0000"  >&lt;blockquote&gt;
&lt;p&gt;How do I identify MDT000X? And what am I looking for in the sea of Lustre console messages?&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;Honestly, from the given logs alone I cannot tell which MDT &apos;X&apos; is. You have to check every other MDT, one by one, for a message like &quot;lock callback timer expired after xxxx: evicting client at ...&quot;&lt;/p&gt;</comment>
                            <comment id="163059" author="yong.fan" created="Wed, 24 Aug 2016 18:15:57 +0000"  >&lt;p&gt;Any further input? Thanks!&lt;/p&gt;</comment>
                            <comment id="163064" author="morrone" created="Wed, 24 Aug 2016 18:24:41 +0000"  >&lt;p&gt;Not at this time.&lt;/p&gt;</comment>
                            <comment id="229110" author="yong.fan" created="Tue, 5 Jun 2018 16:47:24 +0000"  >&lt;p&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=morrone&quot; class=&quot;user-hover&quot; rel=&quot;morrone&quot;&gt;morrone&lt;/a&gt;,&lt;/p&gt;

&lt;p&gt;Do you have more input for this ticket, or can we close it? I do not know whether the issue is still present, because several DNE recovery related patches have landed in the past 20 months.&lt;/p&gt;</comment>
                            <comment id="239866" author="pjones" created="Sat, 12 Jan 2019 04:03:52 +0000"  >&lt;p&gt;closing ancient ticket&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="43326">LU-9049</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzyjnb:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>