<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:12:27 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-7848] Recovery process on MDS stalled</title>
                <link>https://jira.whamcloud.com/browse/LU-7848</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;An error occurred during soak testing of build &apos;20160302&apos; (b2_8 RC4) (see also: &lt;a href=&quot;https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160302&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160302&lt;/a&gt;). DNE is enabled. MDTs had been formatted using ldiskfs, OSTs using zfs. MDS nodes are configured in an active-active HA failover configuration. (For the test set-up configuration see &lt;a href=&quot;https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-Configuration&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-Configuration&lt;/a&gt;.)&lt;/p&gt;

&lt;p&gt;The following effects can be observed:&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;After a restart or failover it takes 0.5 - 3 hours for recovery to complete on all MDSes (this seems to be correlated with the uptime of the MDS)&lt;/li&gt;
	&lt;li&gt;Sometimes only one MDT finishes recovery&lt;/li&gt;
	&lt;li&gt;Often the recovery never completes&lt;/li&gt;
	&lt;li&gt;This is true for all MDSes&lt;/li&gt;
	&lt;li&gt;A high rate of clients is evicted, leading to a large number of job crashes (up to ~25%)&lt;/li&gt;
	&lt;li&gt;Interestingly, the recovery of secondary MDTs takes only a couple of minutes and always completes on the failover partner node.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Here are the failover and restart events for MDS node &lt;tt&gt;lola-11&lt;/tt&gt;; the same pattern can be found on the other nodes:&lt;br/&gt;
Recovery for secondary MDTs on &lt;tt&gt;lola-11&lt;/tt&gt;&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;mds_failover     : 2016-03-03 10:24:12,345 - 2016-03-03 10:32:12,647    lola-10
Mar  3 10:31:58 lola-11 kernel: Lustre: soaked-MDT0005: Recovery over after 2:14, of 16 clients 0 recovered and 16 were evicted.
Mar  3 10:32:06 lola-11 kernel: Lustre: soaked-MDT0004: Recovery over after 0:20, of 16 clients 0 recovered and 16 were evicted.

mds_failover     : 2016-03-03 18:11:42,958 - 2016-03-03 18:18:17,112    lola-10
Mar  3 18:18:03 lola-11 kernel: Lustre: soaked-MDT0005: Recovery over after 1:03, of 16 clients 0 recovered and 16 were evicted.
Mar  3 18:18:10 lola-11 kernel: Lustre: soaked-MDT0004: Recovery over after 0:08, of 16 clients 0 recovered and 16 were evicted.

mds_failover     : 2016-03-03 22:04:51,554 - 2016-03-03 22:12:03,652    lola-10
Mar  3 22:11:50 lola-11 kernel: Lustre: soaked-MDT0005: Recovery over after 1:36, of 16 clients 0 recovered and 16 were evicted.
Mar  3 22:11:57 lola-11 kernel: Lustre: soaked-MDT0004: Recovery over after 0:22, of 16 clients 0 recovered and 16 were evicted.

mds_failover     : 2016-03-04 00:11:27,161 - 2016-03-04 00:18:36,686    lola-10
Mar  4 00:18:23 lola-11 kernel: Lustre: soaked-MDT0005: Recovery over after 1:23, of 5 clients 0 recovered and 5 were evicted.
Mar  4 00:18:30 lola-11 kernel: Lustre: soaked-MDT0004: Recovery over after 0:23, of 6 clients 0 recovered and 6 were evicted.

mds_failover     : 2016-03-04 01:51:11,775 - 2016-03-04 01:58:40,927    lola-10
Mar  4 01:58:27 lola-11 kernel: Lustre: soaked-MDT0005: Recovery over after 1:41, of 16 clients 0 recovered and 16 were evicted.
Mar  4 01:58:34 lola-11 kernel: Lustre: soaked-MDT0004: Recovery over after 0:23, of 16 clients 0 recovered and 16 were evicted.

mds_failover     : 2016-03-04 02:54:18,928 - 2016-03-04 03:01:00,519    lola-10
Mar  4 03:00:47 lola-11 kernel: Lustre: soaked-MDT0005: Recovery over after 1:05, of 16 clients 0 recovered and 16 were evicted.
Mar  4 03:00:54 lola-11 kernel: Lustre: soaked-MDT0004: Recovery over after 0:09, of 16 clients 0 recovered and 16 were evicted.
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;------------------&lt;br/&gt;
Recovery for primary MDTs on &lt;tt&gt;lola-11&lt;/tt&gt;&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;mds_failover     : 2016-03-03 09:36:44,457 - 2016-03-03 09:43:43,316    lola-11
Mar  3 09:50:42 lola-11 kernel: Lustre: soaked-MDT0007: Recovery over after 6:59, of 16 clients 16 recovered and 0 were evicted.
Mar  3 09:51:14 lola-11 kernel: Lustre: soaked-MDT0006: Recovery over after 7:31, of 16 clients 8 recovered and 8 were evicted.

mds_failover     : 2016-03-03 13:06:05,210 - 2016-03-03 13:13:33,003    lola-11
Mar  3 14:13:46 lola-11 kernel: Lustre: soaked-MDT0006: Recovery over after 40:56, of 16 clients 16 recovered and 0 were evicted.
Mar  3 14:13:50 lola-11 kernel: Lustre: soaked-MDT0007: Recovery over after 41:50, of 16 clients 16 recovered and 0 were evicted.

mds_restart      : 2016-03-03 13:26:05,005 - 2016-03-03 13:32:48,359    lola-11
Mar  3 14:13:46 lola-11 kernel: Lustre: soaked-MDT0006: Recovery over after 40:56, of 16 clients 16 recovered and 0 were evicted.
Mar  3 14:13:50 lola-11 kernel: Lustre: soaked-MDT0007: Recovery over after 41:50, of 16 clients 16 recovered and 0 were evicted.

mds_restart      : 2016-03-03 20:14:23,309 - 2016-03-03 20:24:56,044    lola-11
Mar  3 20:37:51 lola-11 kernel: Lustre: soaked-MDT0006: Recovery over after 12:50, of 16 clients 16 recovered and 0 were evicted.
 ---&amp;gt; MDT0007 never recovered

mds_failover     : 2016-03-03 22:15:27,654 - 2016-03-03 22:23:34,982    lola-11
Mar  4 01:03:03 lola-11 kernel: Lustre: soaked-MDT0007: Recovery over after 159:29, of 16 clients 14 recovered and 2 were evicted.
Mar  4 01:03:05 lola-11 kernel: Lustre: soaked-MDT0006: Recovery over after 159:30, of 16 clients 14 recovered and 2 were evicted.

mds_failover     : 2016-03-04 05:10:37,638 - 2016-03-04 05:17:48,193    lola-11
 ---&amp;gt; MDT0006 never recovered
 ---&amp;gt; MDT0007 never recovered

mds_failover     : 2016-03-04 05:35:12,194 - 2016-03-04 05:41:56,320    lola-11
 ---&amp;gt; MDT0006 never recovered
 ---&amp;gt; MDT0007 never recovered

mds_restart      : 2016-03-04 06:53:30,098 - 2016-03-04 07:03:06,783    lola-11
 ---&amp;gt; MDT0006 never recovered
 ---&amp;gt; MDT0007 never recovered
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
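
&lt;p&gt;(The per-target recovery summaries above come from the MDS syslogs; lines like these can be pulled out with a simple grep. A sketch; the log path is an assumption:)&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# Extract per-target recovery results from an MDS syslog (path assumed)
grep &quot;Recovery over after&quot; /var/log/messages
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;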

&lt;p&gt;Attached message, console and debug log files (with mask &apos;-1&apos;) of all MDS nodes (&lt;tt&gt;lola-&amp;#91;8-11&amp;#93;&lt;/tt&gt;).&lt;/p&gt;

&lt;p&gt;The same situation once ended with the oom-killer starting (see &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7836&quot; title=&quot;MDSes crashed with oom-killer&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7836&quot;&gt;&lt;del&gt;LU-7836&lt;/del&gt;&lt;/a&gt;).&lt;/p&gt;</description>
                <environment>lola&lt;br/&gt;
build: &lt;a href=&quot;https://build.hpdd.intel.com/job/lustre-b2_8/11/&quot;&gt;https://build.hpdd.intel.com/job/lustre-b2_8/11/&lt;/a&gt;</environment>
        <key id="35161">LU-7848</key>
            <summary>Recovery process on MDS stalled</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="1" iconUrl="https://jira.whamcloud.com/images/icons/priorities/blocker.svg">Blocker</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="di.wang">Di Wang</assignee>
                                    <reporter username="heckes">Frank Heckes</reporter>
                        <labels>
                            <label>dne2</label>
                            <label>soak</label>
                    </labels>
                <created>Fri, 4 Mar 2016 16:20:14 +0000</created>
                <updated>Tue, 19 Mar 2019 15:41:02 +0000</updated>
                            <resolved>Tue, 31 May 2016 12:53:18 +0000</resolved>
                                    <version>Lustre 2.8.0</version>
                                    <fixVersion>Lustre 2.9.0</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>7</watches>
                                                                            <comments>
                            <comment id="144644" author="heckes" created="Fri, 4 Mar 2016 16:28:28 +0000"  >&lt;p&gt;The recovery process can be aborted. The command &lt;tt&gt;lctl --device 4  abort_recovery&lt;/tt&gt; will be blocked on IO.&lt;/p&gt;</comment>
                            <comment id="144674" author="di.wang" created="Fri, 4 Mar 2016 19:40:19 +0000"  >&lt;p&gt;It looks like the update recovery thread is stuck inside retrieving update log. &lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;lod0007_rec00 S 0000000000000016     0  4584      2 0x00000080
 ffff880408c0f900 0000000000000046 ffff880408c0f8f0 ffffffffa0a5ad2c
 000001fc91878c1b 0000000000000000 0000000000000000 0000000000000286
 ffff880408c0f8a0 ffffffff8108742c ffff880408c065f8 ffff880408c0ffd8
Call Trace:
 [&amp;lt;ffffffffa0a5ad2c&amp;gt;] ? ptlrpc_unregister_reply+0x6c/0x810 [ptlrpc]
 [&amp;lt;ffffffff8108742c&amp;gt;] ? lock_timer_base+0x3c/0x70
 [&amp;lt;ffffffff8152b222&amp;gt;] schedule_timeout+0x192/0x2e0
 [&amp;lt;ffffffff81087540&amp;gt;] ? process_timeout+0x0/0x10
 [&amp;lt;ffffffffa0a5db61&amp;gt;] ptlrpc_set_wait+0x321/0x960 [ptlrpc]
 [&amp;lt;ffffffffa0a533f0&amp;gt;] ? ptlrpc_interrupted_set+0x0/0x120 [ptlrpc]
 [&amp;lt;ffffffff81064c00&amp;gt;] ? default_wake_function+0x0/0x20
 [&amp;lt;ffffffffa0a69e55&amp;gt;] ? lustre_msg_set_jobid+0xf5/0x130 [ptlrpc]
 [&amp;lt;ffffffffa0a5e221&amp;gt;] ptlrpc_queue_wait+0x81/0x220 [ptlrpc]
 [&amp;lt;ffffffffa13378d9&amp;gt;] osp_remote_sync+0x129/0x190 [osp]
 [&amp;lt;ffffffffa131ba07&amp;gt;] osp_attr_get+0x417/0x700 [osp]
 [&amp;lt;ffffffffa131d457&amp;gt;] osp_object_init+0x1c7/0x330 [osp]
 [&amp;lt;ffffffffa084c1e8&amp;gt;] lu_object_alloc+0xd8/0x320 [obdclass]
 [&amp;lt;ffffffffa084d5d1&amp;gt;] lu_object_find_try+0x151/0x260 [obdclass]
 [&amp;lt;ffffffffa084d791&amp;gt;] lu_object_find_at+0xb1/0xe0 [obdclass]
 [&amp;lt;ffffffff81174450&amp;gt;] ? cache_alloc_refill+0x1c0/0x240
 [&amp;lt;ffffffffa084e52c&amp;gt;] dt_locate_at+0x1c/0xa0 [obdclass]
 [&amp;lt;ffffffffa0813dce&amp;gt;] llog_osd_get_cat_list+0x8e/0xcd0 [obdclass]
 [&amp;lt;ffffffffa1255750&amp;gt;] lod_sub_prep_llog+0x110/0x7a0 [lod]
 [&amp;lt;ffffffffa12309c0&amp;gt;] ? lod_sub_recovery_thread+0x0/0xaf0 [lod]
 [&amp;lt;ffffffff8105bd83&amp;gt;] ? __wake_up+0x53/0x70
 [&amp;lt;ffffffffa1230c04&amp;gt;] lod_sub_recovery_thread+0x244/0xaf0 [lod]
 [&amp;lt;ffffffffa12309c0&amp;gt;] ? lod_sub_recovery_thread+0x0/0xaf0 [lod]
 [&amp;lt;ffffffff8109e78e&amp;gt;] kthread+0x9e/0xc0
 [&amp;lt;ffffffff8100c28a&amp;gt;] child_rip+0xa/0x20
 [&amp;lt;ffffffff8109e6f0&amp;gt;] ? kthread+0x0/0xc0
 [&amp;lt;ffffffff8100c280&amp;gt;] ? child_rip+0x0/0x20
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
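
&lt;p&gt;(Traces like the one above can be captured on the hung MDS without a crash dump; a sketch, where PID 4584 is the lod0007_rec00 thread shown above:)&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# Kernel stack of the stuck recovery thread
cat /proc/4584/stack
# Or dump the state of all tasks to the console (requires sysrq enabled)
echo t &amp;gt; /proc/sysrq-trigger
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;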

&lt;p&gt;Hmm, this kind of RPC should be allowed during recovery. Sigh, there are not enough logs to tell what happened; will check more.&lt;/p&gt;</comment>
                            <comment id="144686" author="di.wang" created="Fri, 4 Mar 2016 22:50:56 +0000"  >&lt;p&gt;ah, it seems after we disable the MDT-MDT eviction by the recent patch &lt;a href=&quot;http://review.whamcloud.com/#/c/18676/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/18676/&lt;/a&gt;, then the new connection from the restart MDT will always be denied because of this. So the retrieving update log will be blocked. I will cook a patch.&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00010000:02000400:27.0:1457125392.284909:0:6144:0:(ldlm_lib.c:751:target_handle_reconnect()) soaked-MDT0004: already connected client soaked-MDT0003-mdtlov_UUID (at 192.168.1.109@o2ib10) with handle 0xb3e905baff54ad2d. Rejecting client with the same UUID trying to reconnect with handle 0x91819165c59714eb
00010000:00000001:27.0:1457125392.284913:0:6144:0:(ldlm_lib.c:756:target_handle_reconnect()) Process leaving (rc=18446744073709551502 : -114 : ffffffffffffff8e)
00010000:00000001:27.0:1457125392.284916:0:6144:0:(ldlm_lib.c:1190:target_handle_connect()) Process leaving via out (rc=18446744073709551502 : -114 : 0xffffffffffffff8e)
00000020:00000040:27.0:1457125392.284920:0:6144:0:(genops.c:817:class_export_put()) PUTting export ffff8807f3ca9c00 : new refcount 5
00000020:00000040:27.0:1457125392.284922:0:6144:0:(obd_config.c:740:class_decref()) Decref soaked-MDT0004 (ffff880834f44078) now 67
00010000:00000001:27.0:1457125392.284924:0:6144:0:(ldlm_lib.c:1410:target_handle_connect()) Process leaving (rc=18446744073709551502 : -114 : ffffffffffffff8e)
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="144703" author="gerrit" created="Sat, 5 Mar 2016 05:35:43 +0000"  >&lt;p&gt;wangdi (di.wang@intel.com) uploaded a new patch: &lt;a href=&quot;http://review.whamcloud.com/18800&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/18800&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7848&quot; title=&quot;Recovery process on MDS stalled&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7848&quot;&gt;&lt;del&gt;LU-7848&lt;/del&gt;&lt;/a&gt; target: Do not fail MDT-MDT connection&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 190f34e30d3aa1002fd4cbe71e48732a296c65b7&lt;/p&gt;</comment>
                            <comment id="144773" author="gerrit" created="Mon, 7 Mar 2016 17:53:45 +0000"  >&lt;p&gt;wangdi (di.wang@intel.com) uploaded a new patch: &lt;a href=&quot;http://review.whamcloud.com/18813&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/18813&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7848&quot; title=&quot;Recovery process on MDS stalled&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7848&quot;&gt;&lt;del&gt;LU-7848&lt;/del&gt;&lt;/a&gt; target: Do not fail MDT-MDT connection&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_8&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 8b42a380c37ac2ae4a24431f42984acf88b2a546&lt;/p&gt;</comment>
                            <comment id="144871" author="heckes" created="Tue, 8 Mar 2016 14:58:12 +0000"  >&lt;p&gt;The problem persist after installing build associated with change #18813.&lt;br/&gt;
Still there long recovery times with most all clients recovered or short ones were no client recovers:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;lola-10.log:Mar  8 02:28:45 lola-10 kernel: Lustre: soaked-MDT0005: Recovery over after 183:52, of 16 clients 16 recovered and 0 were evicted.
lola-10.log:Mar  8 02:28:46 lola-10 kernel: Lustre: soaked-MDT0004: Recovery over after 183:04, of 16 clients 16 recovered and 0 were evicted.
lola-10.log:Mar  8 05:01:17 lola-10 kernel: Lustre: soaked-MDT0004: Recovery over after 14:17, of 16 clients 13 recovered and 3 were evicted.
lola-11.log:Mar  8 02:33:20 lola-11 kernel: Lustre: soaked-MDT0006: Recovery over after 112:56, of 16 clients 16 recovered and 0 were evicted.
lola-11.log:Mar  8 02:33:21 lola-11 kernel: Lustre: soaked-MDT0007: Recovery over after 113:44, of 16 clients 16 recovered and 0 were evicted.
lola-8.log:Mar  8 02:27:59 lola-8 kernel: Lustre: soaked-MDT0003: Recovery over after 0:45, of 16 clients 0 recovered and 16 were evicted.
lola-8.log:Mar  8 02:28:10 lola-8 kernel: Lustre: soaked-MDT0002: Recovery over after 0:26, of 16 clients 0 recovered and 16 were evicted.
lola-8.log:Mar  8 02:37:39 lola-8 kernel: Lustre: soaked-MDT0000: Recovery over after 63:43, of 16 clients 16 recovered and 0 were evicted.
lola-9.log:Mar  8 01:32:31 lola-9 kernel: Lustre: soaked-MDT0001: Recovery over after 1:03, of 16 clients 0 recovered and 16 were evicted.
lola-9.log:Mar  8 01:33:23 lola-9 kernel: Lustre: soaked-MDT0000: Recovery over after 1:10, of 16 clients 0 recovered and 16 were evicted.
lola-9.log:Mar  8 02:41:56 lola-9 kernel: Lustre: soaked-MDT0003: Recovery over after 13:41, of 16 clients 16 recovered and 0 were evicted.
lola-9.log:Mar  8 02:41:58 lola-9 kernel: Lustre: soaked-MDT0002: Recovery over after 13:43, of 16 clients 16 recovered and 0 were evicted.
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="144979" author="di.wang" created="Wed, 9 Mar 2016 06:45:38 +0000"  >&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;3/8/16, 8:05:50 PM&amp;#93;&lt;/span&gt; wangdi: it seems the old code is running&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;3/8/16, 8:05:53 PM&amp;#93;&lt;/span&gt; wangdi: 00010000:00000001:13.0:1457452176.471578:0:4691:0:(ldlm_lib.c:756:target_handle_reconnect()) Process leaving (rc=18446744073709551502 : -114 : ffffffffffffff8e)&lt;br/&gt;
00010000:02000400:13.0:1457452176.471579:0:4691:0:(ldlm_lib.c:1175:target_handle_connect()) soaked-MDT0003: Client soaked-MDT0006-mdtlov_UUID (at 192.168.1.111@o2ib10) refused connection, still busy with 474 references&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;3/8/16, 8:07:39 PM&amp;#93;&lt;/span&gt; wangdi: The debug line number seems not matching the patch &lt;a href=&quot;http://review.whamcloud.com/#/c/18813/2/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/18813/2/&lt;/a&gt;&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;3/8/16, 8:07:46 PM&amp;#93;&lt;/span&gt; wangdi: it seems it matches b2_8&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;3/8/16, 8:08:43 PM&amp;#93;&lt;/span&gt; wangdi: Cliff, Frank: could you please check what is running on lola-8. I hope to use this build &lt;a href=&quot;https://build.hpdd.intel.com/job/lustre-reviews/37662/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://build.hpdd.intel.com/job/lustre-reviews/37662/&lt;/a&gt;&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;3/8/16, 8:08:44 PM&amp;#93;&lt;/span&gt; wangdi: thanks&lt;/p&gt;</comment>
                            <comment id="144988" author="heckes" created="Wed, 9 Mar 2016 11:20:22 +0000"  >&lt;ul&gt;
	&lt;li&gt;The MDS nodes weren&apos;t updated with the correct builds. I don&apos;t know the reason as I didn&apos;t execute the update myself. Anyway, I&apos;m very sorry for that.&lt;/li&gt;
	&lt;li&gt;All nodes (MDSes) have been updated and the soak session restarted using build #37662&lt;/li&gt;
&lt;/ul&gt;
</comment>
                            <comment id="144997" author="heckes" created="Wed, 9 Mar 2016 14:11:43 +0000"  >&lt;p&gt;The effect still remains:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;mds_restart      : 2016-03-09 03:44:21,690 - 2016-03-09 03:55:17,693    lola-8
lola-8.log:Mar  9 03:56:08 lola-8 kernel: Lustre: soaked-MDT0000: Recovery over after 0:49, of 16 clients 16 recovered and 0 were evicted.
lola-8.log:Mar  9 03:56:09 lola-8 kernel: Lustre: soaked-MDT0001: Recovery over after 2:29, of 16 clients 16 recovered and 0 were evicted.

mds_restart      : 2016-03-09 04:54:33,695 - 2016-03-09 05:02:29,259    lola-10
lola-10.log:Mar  9 05:16:28 lola-10 kernel: Lustre: soaked-MDT0007: Recovery over after 1:42, of 13 clients 0 recovered and 13 were evicted.
lola-10.log:Mar  9 05:16:44 lola-10 kernel: Lustre: soaked-MDT0006: Recovery over after 0:32, of 7 clients 0 recovered and 7 were evicted.

mds_failover     : 2016-03-09 05:08:46,261 - 2016-03-09 05:16:50,083    lola-11
lola-11.log:Mar  9 05:47:15 lola-11 kernel: Lustre: soaked-MDT0006: Recovery over after 30:24, of 10 clients 10 recovered and 0 were evicted.
lola-11.log:Mar  9 05:47:23 lola-11 kernel: Lustre: soaked-MDT0007: Recovery over after 30:33, of 13 clients 12 recovered and 1 was evicted.
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;This is for build &lt;a href=&quot;https://build.hpdd.intel.com/job/lustre-reviews/37662/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://build.hpdd.intel.com/job/lustre-reviews/37662/&lt;/a&gt;&lt;br/&gt;
I&apos;ll add message and debug log files asap.&lt;/p&gt;</comment>
                            <comment id="145021" author="heckes" created="Wed, 9 Mar 2016 17:06:52 +0000"  >&lt;p&gt;All files (messages, debug logs) have been copied to &lt;tt&gt;lhn.lola.hpdd.intel.com:/scratch/crashdumps/lu-7848&lt;/tt&gt;. The test session were executed between Mar, 9th 02:52 &amp;#8211; 06:19.&lt;/p&gt;</comment>
                            <comment id="145425" author="heckes" created="Mon, 14 Mar 2016 15:33:11 +0000"  >&lt;p&gt;The problem also occurs during soak testing of b2_8 RC5 (see &lt;a href=&quot;https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160309&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160309&lt;/a&gt;).&lt;/p&gt;

&lt;ul&gt;
	&lt;li&gt;MDS &lt;tt&gt;lola-11&lt;/tt&gt; was restarted at random time 2016-03-14 06:09:10,71&lt;/li&gt;
	&lt;li&gt;Recovery of MDT0007 didn&apos;t complete even after more than 2 hours&lt;/li&gt;
	&lt;li&gt;Uploaded messages, console and debug log files: &lt;tt&gt;messages-lola-11-20160314, console-lola-11-20160314, lola-16-lustre-log-20160314-0736&lt;/tt&gt;&lt;/li&gt;
&lt;/ul&gt;
</comment>
                            <comment id="145445" author="di.wang" created="Mon, 14 Mar 2016 17:44:20 +0000"  >&lt;p&gt;Ah, recovery process on the remote target is stuck in lu_object_find().&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;mdt_out03_021 D 000000000000000e     0 121458      2 0x00000080
 ffff88081be7bb50 0000000000000046 0000000000000000 0000000000000000
 ffff880378ae49c0 ffff88081ec00118 0000cf356d708513 ffff88081ec00118
 ffff88081be7bb50 000000010d91895e ffff8808161145f8 ffff88081be7bfd8
Call Trace:
 [&amp;lt;ffffffffa088b71d&amp;gt;] lu_object_find_at+0x3d/0xe0 [obdclass]
 [&amp;lt;ffffffffa0acd342&amp;gt;] ? __req_capsule_get+0x162/0x6e0 [ptlrpc]
 [&amp;lt;ffffffff81064c00&amp;gt;] ? default_wake_function+0x0/0x20
 [&amp;lt;ffffffffa0aa2dd0&amp;gt;] ? lustre_swab_object_update_reply+0x0/0xc0 [ptlrpc]
 [&amp;lt;ffffffffa088c52c&amp;gt;] dt_locate_at+0x1c/0xa0 [obdclass]
 [&amp;lt;ffffffffa0b13970&amp;gt;] out_handle+0x1030/0x1880 [ptlrpc]
 [&amp;lt;ffffffff8105872d&amp;gt;] ? check_preempt_curr+0x6d/0x90
 [&amp;lt;ffffffff8152b83e&amp;gt;] ? mutex_lock+0x1e/0x50
 [&amp;lt;ffffffffa0b031ca&amp;gt;] ? req_can_reconstruct+0x6a/0x120 [ptlrpc]
 [&amp;lt;ffffffffa0b0ac2c&amp;gt;] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
 [&amp;lt;ffffffffa0ab7c61&amp;gt;] ptlrpc_main+0xd21/0x1800 [ptlrpc]
 [&amp;lt;ffffffff8152a39e&amp;gt;] ? thread_return+0x4e/0x7d0
 [&amp;lt;ffffffffa0ab6f40&amp;gt;] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
 [&amp;lt;ffffffff8109e78e&amp;gt;] kthread+0x9e/0xc0
 [&amp;lt;ffffffff8100c28a&amp;gt;] child_rip+0xa/0x20
 [&amp;lt;ffffffff8109e6f0&amp;gt;] ? kthread+0x0/0xc0
 [&amp;lt;ffffffff8100c280&amp;gt;] ? child_rip+0x0/0x20
mdt_out03_022 D 000000000000001c     0 121824      2 0x00000080
 ffff88081c4afb50 0000000000000046 0000000000000000 0000000000000000
 ffff880378ae49c0 ffff8807e4900118 0000d18fc8bbd702 ffff8807e4900118
 ffff88081c4afb50 000000010db905ef ffff8807e0153ad8 ffff88081c4affd8
Call Trace:
 [&amp;lt;ffffffffa088b71d&amp;gt;] lu_object_find_at+0x3d/0xe0 [obdclass]
 [&amp;lt;ffffffffa0acd342&amp;gt;] ? __req_capsule_get+0x162/0x6e0 [ptlrpc]
 [&amp;lt;ffffffff81064c00&amp;gt;] ? default_wake_function+0x0/0x20
 [&amp;lt;ffffffffa0aa2dd0&amp;gt;] ? lustre_swab_object_update_reply+0x0/0xc0 [ptlrpc]
 [&amp;lt;ffffffffa088c52c&amp;gt;] dt_locate_at+0x1c/0xa0 [obdclass]
 [&amp;lt;ffffffffa0b13970&amp;gt;] out_handle+0x1030/0x1880 [ptlrpc]
 [&amp;lt;ffffffff8105872d&amp;gt;] ? check_preempt_curr+0x6d/0x90
 [&amp;lt;ffffffff8152b83e&amp;gt;] ? mutex_lock+0x1e/0x50
 [&amp;lt;ffffffffa0b031ca&amp;gt;] ? req_can_reconstruct+0x6a/0x120 [ptlrpc]
 [&amp;lt;ffffffffa0b0ac2c&amp;gt;] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
 [&amp;lt;ffffffffa0ab7c61&amp;gt;] ptlrpc_main+0xd21/0x1800 [ptlrpc]
 [&amp;lt;ffffffff8152a39e&amp;gt;] ? thread_return+0x4e/0x7d0
 [&amp;lt;ffffffffa0ab6f40&amp;gt;] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
 [&amp;lt;ffffffff8109e78e&amp;gt;] kthread+0x9e/0xc0
 [&amp;lt;ffffffff8100c28a&amp;gt;] child_rip+0xa/0x20
 [&amp;lt;ffffffff8109e6f0&amp;gt;] ? kthread+0x0/0xc0
 [&amp;lt;ffffffff8100c280&amp;gt;] ? child_rip+0x0/0x20
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="145948" author="heckes" created="Thu, 17 Mar 2016 14:45:05 +0000"  >&lt;p&gt;Soak has been continued to execute b2_8 RC5 build with reformatted Lustre FS.&lt;br/&gt;
Now there&apos;s only 1 MDT per MDS and 5 OSTs per OSS (unchanged). MDT had&lt;br/&gt;
been formatted with ldiskfs and OSTs using zfs.&lt;/p&gt;

&lt;p&gt;The recovery process never stalls now neither for MDS restarts nor failover. All &lt;br/&gt;
recovery times are below 2 mins now. See the attached file &lt;tt&gt;recovery-times-201603-17&lt;/tt&gt;.&lt;/p&gt;</comment>
                            <comment id="146077" author="heckes" created="Fri, 18 Mar 2016 09:30:05 +0000"  >&lt;p&gt;After ~ 73 hours the recovery process stalled again and lead to an continuously increasing allocation of slabs.&lt;br/&gt;
The later effect is handled in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7836&quot; title=&quot;MDSes crashed with oom-killer&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7836&quot;&gt;&lt;del&gt;LU-7836&lt;/del&gt;&lt;/a&gt;. &lt;b&gt;NOTE&lt;/b&gt;: All kernel debugs, messages and console logs have been attached to &lt;b&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7836&quot; title=&quot;MDSes crashed with oom-killer&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7836&quot;&gt;&lt;del&gt;LU-7836&lt;/del&gt;&lt;/a&gt;&lt;/b&gt;.&lt;/p&gt;

&lt;ul&gt;
	&lt;li&gt;Configuration&lt;br/&gt;
All events between 2016-03-17 04:38 &amp;#8211; 2016-03-18 02:00 were executed with &apos;mds_restart&apos; and &apos;mds_failover + wait for recovery&apos;&lt;br/&gt;
(&apos;wait for recovery&apos; means that the recovery process needs to complete on the secondary node before failback).&lt;br/&gt;
This is mentioned because the former failback &apos;mechanism&apos; configured within the soak framework was to fail back immediately after&lt;br/&gt;
the server target was mounted successfully on the secondary node. The error actually happens after an &apos;mds_restart&apos;.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Sequence of events&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;2016-03-18 00:34:30,557:fsmgmt.fsmgmt:INFO     triggering fault mds_restart&lt;/li&gt;
	&lt;li&gt;2016-03-18 00:40:49,655:fsmgmt.fsmgmt:INFO     lola-8 is up!!&lt;/li&gt;
	&lt;li&gt;2016-03-18 00:42:25,091:fsmgmt.fsmgmt:INFO     ... soaked-MDT0000 mounted successfully on lola-8&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;The MDT stalled after time_remaining reached 0:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;----------------
lola-8
----------------
mdt.soaked-MDT0000.recovery_status=
status: RECOVERING
recovery_start: 1458286947
time_remaining: 0
connected_clients: 12/12
req_replay_clients: 2
lock_repay_clients: 4
completed_clients: 8
evicted_clients: 0
replayed_requests: 0
queued_requests: 2
next_transno: 167504104141
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
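
&lt;p&gt;(The output above is the recovery_status proc file; it can be polled on each MDS with a loop like this sketch:)&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# Watch recovery progress of all local MDTs once a minute
while true; do
    lctl get_param mdt.*.recovery_status | egrep &quot;status|time_remaining|completed_clients&quot;
    sleep 60
done
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;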

&lt;p&gt;The recovery process can&apos;t be interrupted:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@lola-8 ~]# ps aux -L | grep lctl | grep -v grep 
root       8740   8740  0.0    1  0.0  15268   732 pts/1    D+   02:38   0:00 lctl --device 5 abort_recovery
[root@lola-8 ~]# cat /proc/8740/wchan 
target_stop_recovery_thread
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;For this event the debug log file &lt;tt&gt;lola-8-lustre-log-20160318-0240&lt;/tt&gt; has been attached.&lt;/p&gt;</comment>
                            <comment id="146116" author="heckes" created="Fri, 18 Mar 2016 15:47:43 +0000"  >&lt;p&gt;Same for umount :&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@lola-8 ~]# ps aux | grep umount
root      21524  0.0  0.0 105184   776 pts/0    D+   08:45   0:00 umount /mnt/soaked-mdt0
[root@lola-8 ~]# cat /proc/21524/wchan 
target_stop_recovery_thread[root@lola-8 ~]# 
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="146536" author="di.wang" created="Tue, 22 Mar 2016 22:48:37 +0000"  >&lt;p&gt;In the newest run, MDT failover seems stuck again. Here is what happened.&lt;/p&gt;

&lt;p&gt;1. lola-9 is restarted, and MDT0001 fails over to lola-8.&lt;br/&gt;
2. Then MDT0001 is re-mounted and tries to connect to MDT0002 on lola-10, but always gets EALREADY.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;LustreError: 11-0: soaked-MDT0002-osp-MDT0001: operation mds_connect to node 192.168.1.110@o2ib10 failed: rc = -114
LustreError: 11-0: soaked-MDT0002-osp-MDT0001: operation mds_connect to node 192.168.1.110@o2ib10 failed: rc = -114
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;3. MDT0002 (lola-10) denies the new connection because of:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Lustre: soaked-MDT0002: Client soaked-MDT0001-mdtlov_UUID seen on new nid 192.168.1.108@o2ib10 when existing nid 192.168.1.109@o2ib10 is already connected
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
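
&lt;p&gt;(A sketch for spotting the stale export on the rejecting MDT; the exact parameter path is an assumption:)&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# On lola-10: which NIDs does MDT0002 still hold exports for, and under which UUIDs?
lctl get_param mdt.soaked-MDT0002.exports.*.uuid
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;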

&lt;p&gt;Since an MDT always uses the same UUID to connect to other MDTs, I think we need to remove the old export when it finds the same export coming from a different NID.&lt;/p&gt;</comment>
                            <comment id="146927" author="cliffw" created="Fri, 25 Mar 2016 16:26:49 +0000"  >&lt;p&gt;As of Friday morning, we have had 12 MDS failovers without an issues and 11 MDS restarts. Average times are low.&lt;br/&gt;
a few examples:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;lola-11.log:Mar 24 15:14:37 lola-11 kernel: Lustre: soaked-MDT0002: Recovery over after 1:11, of 12 clients 12 recovered and 0 were evicted.
lola-11.log:Mar 24 17:12:16 lola-11 kernel: Lustre: soaked-MDT0002: Recovery over after 0:55, of 12 clients 12 recovered and 0 were evicted.
lola-11.log:Mar 24 18:53:29 lola-11 kernel: Lustre: soaked-MDT0002: Recovery over after 1:09, of 12 clients 12 recovered and 0 were evicted.
lola-11.log:Mar 24 20:37:20 lola-11 kernel: Lustre: soaked-MDT0002: Recovery over after 0:48, of 12 clients 12 recovered and 0 were evicted.
lola-11.log:Mar 24 23:50:03 lola-11 kernel: Lustre: soaked-MDT0003: Recovery over after 0:55, of 12 clients 12 recovered and 0 were evicted.
lola-11.log:Mar 25 01:05:09 lola-11 kernel: Lustre: soaked-MDT0003: Recovery over after 1:09, of 12 clients 12 recovered and 0 were evicted.
lola-11.log:Mar 25 02:05:57 lola-11 kernel: Lustre: soaked-MDT0003: Recovery over after 1:11, of 12 clients 12 recovered and 0 were evicted.
lola-11.log:Mar 25 07:16:49 lola-11 kernel: Lustre: soaked-MDT0003: Recovery over after 1:05, of 12 clients 12 recovered and 0 were evicted.
lola-11.log:Mar 25 08:37:15 lola-11 kernel: Lustre: soaked-MDT0003: Recovery over after 0:20, of 12 clients 12 recovered and 0 were evicted.
lola-8.log:Mar 24 08:56:21 lola-8 kernel: Lustre: soaked-MDT0000: Recovery over after 5:00, of 12 clients 3 recovered and 9 were evicted.
lola-8.log:Mar 24 14:14:03 lola-8 kernel: Lustre: soaked-MDT0000: Recovery over after 1:25, of 12 clients 12 recovered and 0 were evicted.
lola-8.log:Mar 24 18:07:07 lola-8 kernel: Lustre: soaked-MDT0000: Recovery over after 0:24, of 12 clients 12 recovered and 0 were evicted.
lola-8.log:Mar 24 19:16:10 lola-8 kernel: Lustre: soaked-MDT0000: Recovery over after 1:06, of 12 clients 12 recovered and 0 were evicted.
lola-8.log:Mar 25 00:49:16 lola-8 kernel: Lustre: soaked-MDT0000: Recovery over after 1:24, of 12 clients 12 recovered and 0 were evicted.
lola-8.log:Mar 25 02:17:26 lola-8 kernel: Lustre: soaked-MDT0000: Recovery over after 0:28, of 12 clients 12 recovered and 0 were evicted.
lola-8.log:Mar 25 03:03:24 lola-8 kernel: Lustre: soaked-MDT0000: Recovery over after 0:18, of 12 clients 12 recovered and 0 were evicted.
lola-8.log:Mar 25 06:09:20 lola-8 kernel: Lustre: soaked-MDT0001: Recovery over after 0:45, of 12 clients 12 recovered and 0 were evicted.
lola-8.log:Mar 25 08:00:11 lola-8 kernel: Lustre: soaked-MDT0000: Recovery over after 1:13, of 12 clients 12 recovered and 0 were evicted.
lola-8.log:Mar 25 09:23:57 lola-8 kernel: Lustre: soaked-MDT0000: Recovery over after 0:28, of 12 clients 12 recovered and 0 were evicted.
lola-9.log:Mar 24 08:56:21 lola-9 kernel: Lustre: soaked-MDT0001: Recovery over after 5:00, of 12 clients 3 recovered and 9 were evicted.
lola-9.log:Mar 24 10:58:12 lola-9 kernel: Lustre: soaked-MDT0001: Recovery over after 0:48, of 12 clients 12 recovered and 0 were evicted.
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="147027" author="cliffw" created="Mon, 28 Mar 2016 14:39:31 +0000"  >&lt;p&gt;Soak hit the OOM killer again after ~48 hours  of running. &lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;2016-03-25 23:35:03,097:fsmgmt.fsmgmt:INFO     Mounting soaked-MDT0003 on lola-10 ...
2016-03-25 23:37:36,548:fsmgmt.fsmgmt:INFO     ... soaked-MDT0003 mounted successfully on lola-10
2016-03-25 23:37:36,548:fsmgmt.fsmgmt:INFO     ... soaked-MDT0003 failed over
2016-03-25 23:37:36,549:fsmgmt.fsmgmt:INFO     Wait for recovery to complete
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt; 
&lt;p&gt;Errors reported to syslog during the recovery:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;Mar 26 00:00:54 lola-10 kernel: LustreError: 4923:0:(ldlm_lib.c:2773:target_queue_recovery_request()) @@@ dropping resent queued req  req@ffff8803efc440c0 x1529699890967980/t0(270583579075) o36-&amp;gt;df764ab7-38a0-d041-b3b3-6cbdb5966d59@192.168.1.131@o2ib100:70/0 lens 768/0 e 0 to 0 dl 1458975660 ref 2 fl Interpret:/6/ffffffff rc 0/-1
Mar 26 00:00:54 lola-10 kernel: LustreError: 4923:0:(ldlm_lib.c:2773:target_queue_recovery_request()) Skipped 320 previous similar messages

Mar 26 00:02:21 lola-10 kernel: LustreError: 3413:0:(import.c:675:ptlrpc_connect_import()) already connecting
Mar 26 00:02:21 lola-10 kernel: LustreError: 3413:0:(import.c:675:ptlrpc_connect_import()) Skipped 3 previous similar messages

Mar 27 03:15:21 lola-10 kernel: Lustre: soaked-MDT0003: Recovery already passed deadline 1650:08, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.

&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
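
&lt;p&gt;(When recovery passes its deadline like this, the console message suggests aborting it; a sketch follows, though note that earlier in this ticket the abort itself blocked in target_stop_recovery_thread:)&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;lctl dl | grep mdt                    # find the device index of the stuck MDT
lctl --device &amp;lt;N&amp;gt; abort_recovery        # abort recovery with the index found above
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;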

&lt;p&gt;Due to the weekend, the recovery remained hung for quite a long time; the last console message is dated Mar 27 03:19:15.&lt;/p&gt;</comment>
                            <comment id="147508" author="cliffw" created="Thu, 31 Mar 2016 22:29:54 +0000"  >&lt;p&gt;The same failure recurred, Mar 31. Seems to be very regular, but takes a day or so. &lt;br/&gt;
Same sequence of events as above, errors from lola-9&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;Lustre: Skipped 1164 previous similar messages
Lustre: soaked-MDT0000: Received new LWP connection from 192.168.1.110@o2ib10, removing former export from same NID
Lustre: Skipped 287 previous similar messages
LustreError: 4406:0:(ldlm_lib.c:2773:target_queue_recovery_request()) @@@ dropping resent queued req  req@ffff88040c491850 x1530264857542592/t0(399435290195) o1000-&amp;gt;soaked-MDT0001-mdtlov_UUID@0@lo:409/0 lens 344/0 e 0 to 0 dl 1459462974 ref 2 fl Interpret:/6/ffffffff rc 0/-1
LustreError: 4406:0:(ldlm_lib.c:2773:target_queue_recovery_request()) Skipped 263 previous similar messages
LustreError: 4304:0:(client.c:2874:ptlrpc_replay_interpret()) @@@ request replay timed out.
  req@ffff8807008016c0 x1530264857542592/t399435290195(399435290195) o1000-&amp;gt;soaked-MDT0000-osp-MDT0001@0@lo:24/4 lens 344/4288 e 1146 to 1 dl 1459462988 ref 2 fl Interpret:EX/6/ffffffff rc -110/-1
LustreError: 4304:0:(client.c:2874:ptlrpc_replay_interpret()) Skipped 87 previous similar messages
Mar 31 15:25:01 lola-9 TIME: Time stamp for console
Lustre: soaked-MDT0003-osp-MDT0000: Connection to soaked-MDT0003 (at 192.168.1.111@o2ib10) was lost; in progress operations using this service will wait for recovery to complete
Lustre: soaked-MDT0000: Recovery already passed deadline 477:37, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.
Lustre: Skipped 875 previous similar messages
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="148890" author="heckes" created="Thu, 14 Apr 2016 10:43:07 +0000"  >&lt;p&gt;The error appears also during soak test of build &apos;20160413&apos; (see &lt;a href=&quot;https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160413&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160413&lt;/a&gt;). Test configuration is the same as specified above again.&lt;br/&gt;
Please let me know if this should be handled in a new ticket!&lt;/p&gt;

&lt;p&gt;After MDS restarts and failovers of 3 MDTs (&lt;tt&gt;lola-&amp;#91;8,10,11&amp;#93;&lt;/tt&gt;) the recovery process is stuck.&lt;/p&gt;

&lt;p&gt;Sequence of events:&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;2016-04-13 16:51:12,334:fsmgmt.fsmgmt:INFO     mds_restart &lt;tt&gt;lola-11&lt;/tt&gt; just completed&lt;br/&gt;
Shows many messages like the following as well (INTL-156?):
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Apr 13 16:51:51 lola-11 kernel: LustreError: 5331:0:(ldlm_lib.c:1900:check_for_next_transno()) soaked-MDT0003: waking for gap in transno, VBR is OFF (skip: 12890084916, ql: 5, comp: 15, conn: 20, next: 12890084917, next_update 12890084974 last_committed: 12890084882)
...
...
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;&lt;/li&gt;
	&lt;li&gt;2016-04-13 23:43:59,801:fsmgmt.fsmgmt:INFO     mds_restart &lt;tt&gt;lola-10&lt;/tt&gt; just completed&lt;/li&gt;
	&lt;li&gt;2016-04-14 00:41:58,831:fsmgmt.fsmgmt:INFO     mds_restart &lt;tt&gt;lola-8&lt;/tt&gt; just completed&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Attached files: message, console and debug_kernel files of &lt;tt&gt;lola-&amp;#91;8,10,11&amp;#93;&lt;/tt&gt;. Debug files contain&lt;br/&gt;
information gathered with the default debug mask.&lt;/p&gt;</comment>
                            <comment id="148964" author="di.wang" created="Thu, 14 Apr 2016 17:35:43 +0000"  >&lt;p&gt;You can ignore these console message for now, and this is an known issue, see &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7732&quot; title=&quot;check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7732&quot;&gt;&lt;del&gt;LU-7732&lt;/del&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;soaked-MDT0003: waking for gap in transno, VBR is OFF (skip: 12890084916, ql: 5, comp: 15, conn: 20, next: 12890084917, next_update 12890084974 last_committed: 12890084882)
...
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Hmm, I did not see the console messages from lola-9. Also, it seems lola-10 crashed around 04-14 6am because of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-8022&quot; title=&quot;LNet: BUG: unable to handle kernel NULL pointer dereference&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-8022&quot;&gt;&lt;del&gt;LU-8022&lt;/del&gt;&lt;/a&gt;. When did this recovery hang happen? Thanks. Just wondering if this is related?&lt;/p&gt;</comment>
                            <comment id="148966" author="cliffw" created="Thu, 14 Apr 2016 17:56:01 +0000"  >&lt;p&gt;Frank can confirm, but I believe the stuck recovery happened on the first mds_failover after we restarted. The restart was to recover from the lola-10 crash. So, not related. &lt;/p&gt;</comment>
                            <comment id="154012" author="gerrit" created="Tue, 31 May 2016 04:54:10 +0000"  >&lt;p&gt;Oleg Drokin (oleg.drokin@intel.com) merged in patch &lt;a href=&quot;http://review.whamcloud.com/18800/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/18800/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-7848&quot; title=&quot;Recovery process on MDS stalled&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-7848&quot;&gt;&lt;del&gt;LU-7848&lt;/del&gt;&lt;/a&gt; target: Do not fail MDT-MDT connection&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: a37b7315ad15a1012c22277a0f7c7b7c9a989b59&lt;/p&gt;</comment>
                            <comment id="154068" author="pjones" created="Tue, 31 May 2016 12:53:18 +0000"  >&lt;p&gt;Landed for 2.9&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                                        </outwardlinks>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="35775">LU-7974</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="21123" name="console-lola-10-log-20160407.bz2" size="54334" author="heckes" created="Thu, 14 Apr 2016 12:03:45 +0000"/>
                            <attachment id="20662" name="console-lola-10.log.bz2" size="518540" author="heckes" created="Fri, 4 Mar 2016 16:44:31 +0000"/>
                            <attachment id="21124" name="console-lola-11-log-20160407.bz2" size="86230" author="heckes" created="Thu, 14 Apr 2016 12:03:45 +0000"/>
                            <attachment id="20663" name="console-lola-11.log.bz2" size="576921" author="heckes" created="Fri, 4 Mar 2016 16:44:31 +0000"/>
                            <attachment id="21122" name="console-lola-8-log-20160407.bz2" size="142786" author="heckes" created="Thu, 14 Apr 2016 12:03:45 +0000"/>
                            <attachment id="20660" name="console-lola-8.log.bz2" size="740422" author="heckes" created="Fri, 4 Mar 2016 16:44:31 +0000"/>
                            <attachment id="20661" name="console-lola-9.log.bz2" size="665739" author="heckes" created="Fri, 4 Mar 2016 16:44:31 +0000"/>
                            <attachment id="20670" name="lola-10-lustre-log-20160304-0751.bz2" size="2247193" author="heckes" created="Fri, 4 Mar 2016 16:59:04 +0000"/>
                            <attachment id="21129" name="lola-10_lustre-log.20160414-0312.bz2" size="3333410" author="heckes" created="Thu, 14 Apr 2016 12:13:29 +0000"/>
                            <attachment id="20671" name="lola-11-lustre-log-20160304-0751.bz2" size="1513397" author="heckes" created="Fri, 4 Mar 2016 16:59:05 +0000"/>
                            <attachment id="21130" name="lola-11_lustre-log.20160414-0312.bz2" size="2327536" author="heckes" created="Thu, 14 Apr 2016 12:13:29 +0000"/>
                            <attachment id="20668" name="lola-8-lustre-log-20160304-0751.bz2" size="2709574" author="heckes" created="Fri, 4 Mar 2016 16:59:04 +0000"/>
                            <attachment id="21128" name="lola-8_lustre-log.20160414-0312.bz2" size="3832681" author="heckes" created="Thu, 14 Apr 2016 12:13:29 +0000"/>
                            <attachment id="20669" name="lola-9-lustre-log-20160304-0751.bz2" size="2233853" author="heckes" created="Fri, 4 Mar 2016 16:59:04 +0000"/>
                            <attachment id="20802" name="lustre-log-20160318-0240.bz2" size="1072223" author="heckes" created="Fri, 18 Mar 2016 14:03:26 +0000"/>
                            <attachment id="21126" name="messages-lola-10.log-20160414.bz2" size="86085" author="heckes" created="Thu, 14 Apr 2016 12:04:24 +0000"/>
                            <attachment id="20666" name="messages-lola-10.log.bz2" size="378659" author="heckes" created="Fri, 4 Mar 2016 16:47:43 +0000"/>
                            <attachment id="21127" name="messages-lola-11.log-20160414.bz2" size="91964" author="heckes" created="Thu, 14 Apr 2016 12:04:24 +0000"/>
                            <attachment id="20667" name="messages-lola-11.log.bz2" size="297552" author="heckes" created="Fri, 4 Mar 2016 16:47:43 +0000"/>
                            <attachment id="21125" name="messages-lola-8.log-20160414.bz2" size="113529" author="heckes" created="Thu, 14 Apr 2016 12:04:24 +0000"/>
                            <attachment id="20664" name="messages-lola-8.log.bz2" size="331516" author="heckes" created="Fri, 4 Mar 2016 16:47:43 +0000"/>
                            <attachment id="20665" name="messages-lola-9.log.bz2" size="376232" author="heckes" created="Fri, 4 Mar 2016 16:47:43 +0000"/>
                            <attachment id="20783" name="recovery-times-20160317" size="18358" author="heckes" created="Thu, 17 Mar 2016 14:46:37 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzy3nr:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>