<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:22:49 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-9049] DNE MDT Never completes recovery</title>
                <link>https://jira.whamcloud.com/browse/LU-9049</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Failover of lola-11 to lola-10&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;2017-01-24 16:44:56,220:fsmgmt.fsmgmt:INFO     Mounting soaked-MDT0003 on lola-10 ...
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Immediately prior to the failover mount, lola-10 reports hung processes.&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;Jan 24 16:41:03 lola-10 kernel: Lustre: 4554:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1485304863/real 1485304863]  req@ffff8804131dd680 x1557446666855008/t0(0) o38-&amp;gt;soaked-MDT0003-osp-MDT0002@192.168.1.111@o2ib10:24/4 lens 520/544 e 0 to 1 dl 1485304880 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
Jan 24 16:41:03 lola-10 kernel: Lustre: 4554:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Jan 24 16:41:35 lola-10 kernel: LNet: Service thread pid 6159 was inactive &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; 226.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; debugging purposes:
Jan 24 16:41:35 lola-10 kernel: Pid: 6159, comm: mdt00_005
Jan 24 16:41:35 lola-10 kernel:
Jan 24 16:41:35 lola-10 kernel: Call Trace:
Jan 24 16:41:35 lola-10 kernel: [&amp;lt;ffffffff8105e4e3&amp;gt;] ? __wake_up+0x53/0x70
Jan 24 16:41:35 lola-10 kernel: [&amp;lt;ffffffffa0a33bb2&amp;gt;] top_trans_stop+0x782/0xbd0 [ptlrpc]
Jan 24 16:41:35 lola-10 kernel: [&amp;lt;ffffffff81067650&amp;gt;] ? default_wake_function+0x0/0x20
Jan 24 16:41:35 lola-10 kernel: [&amp;lt;ffffffffa11b4aac&amp;gt;] lod_trans_stop+0x2bc/0x330 [lod]
Jan 24 16:41:35 lola-10 kernel: [&amp;lt;ffffffffa1259f31&amp;gt;] mdd_trans_stop+0x21/0xc6 [mdd]
Jan 24 16:41:35 lola-10 kernel: [&amp;lt;ffffffffa1246e63&amp;gt;] mdd_create+0x1373/0x17e0 [mdd]
Jan 24 16:41:35 lola-10 kernel: [&amp;lt;ffffffffa1105c44&amp;gt;] ? mdt_version_save+0x84/0x1a0 [mdt]
Jan 24 16:41:35 lola-10 kernel: [&amp;lt;ffffffffa110fcfc&amp;gt;] mdt_reint_create+0xbdc/0xfe0 [mdt]
Jan 24 16:41:35 lola-10 kernel: [&amp;lt;ffffffffa11013bc&amp;gt;] ? mdt_root_squash+0x2c/0x3f0 [mdt]
Jan 24 16:41:35 lola-10 kernel: [&amp;lt;ffffffffa09bb2bb&amp;gt;] ? lustre_pack_reply_v2+0x1eb/0x280 [ptlrpc]
Jan 24 16:41:35 lola-10 kernel: [&amp;lt;ffffffff81299b7a&amp;gt;] ? strlcpy+0x4a/0x60
Jan 24 16:41:35 lola-10 kernel: [&amp;lt;ffffffffa1102a2a&amp;gt;] ? old_init_ucred_common+0xda/0x2b0 [mdt]
Jan 24 16:41:35 lola-10 kernel: [&amp;lt;ffffffffa1104f5d&amp;gt;] mdt_reint_rec+0x5d/0x200 [mdt]
Jan 24 16:41:35 lola-10 kernel: [&amp;lt;ffffffffa10efd9b&amp;gt;] mdt_reint_internal+0x62b/0xa50 [mdt]
Jan 24 16:41:35 lola-10 kernel: [&amp;lt;ffffffffa10f066b&amp;gt;] mdt_reint+0x6b/0x120 [mdt]
Jan 24 16:41:35 lola-10 kernel: [&amp;lt;ffffffffa0a1f17c&amp;gt;] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
Jan 24 16:41:35 lola-10 kernel: [&amp;lt;ffffffffa09ca66b&amp;gt;] ptlrpc_server_handle_request+0x2eb/0xbd0 [ptlrpc]
Jan 24 16:41:35 lola-10 kernel: [&amp;lt;ffffffffa067c84a&amp;gt;] ? lc_watchdog_touch+0x7a/0x190 [libcfs]
Jan 24 16:41:35 lola-10 kernel: [&amp;lt;ffffffffa09c56e9&amp;gt;] ? ptlrpc_wait_event+0xa9/0x2e0 [ptlrpc]
Jan 24 16:41:35 lola-10 kernel: [&amp;lt;ffffffff81059ca9&amp;gt;] ? __wake_up_common+0x59/0x90
Jan 24 16:41:35 lola-10 kernel: [&amp;lt;ffffffffa09cba11&amp;gt;] ptlrpc_main+0xac1/0x18d0 [ptlrpc]
Jan 24 16:41:35 lola-10 kernel: [&amp;lt;ffffffffa09caf50&amp;gt;] ? ptlrpc_main+0x0/0x18d0 [ptlrpc]
Jan 24 16:41:35 lola-10 kernel: [&amp;lt;ffffffff810a138e&amp;gt;] kthread+0x9e/0xc0
Jan 24 16:41:35 lola-10 kernel: [&amp;lt;ffffffff8100c28a&amp;gt;] child_rip+0xa/0x20
Jan 24 16:41:35 lola-10 kernel: [&amp;lt;ffffffff810a12f0&amp;gt;] ? kthread+0x0/0xc0
Jan 24 16:41:35 lola-10 kernel: [&amp;lt;ffffffff8100c280&amp;gt;] ? child_rip+0x0/0x20
Jan 24 16:41:35 lola-10 kernel:
Jan 24 16:41:35 lola-10 kernel: LustreError: dumping log to /tmp/lustre-log.1485304895.6159
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Mount occurs&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;Jan 24 16:46:22 lola-10 kernel: Lustre: soaked-MDT0003: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
...
Jan 24 16:46:27 lola-10 kernel: Lustre: soaked-MDT0003: Will be in recovery &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; at least 2:30, or until 22 clients reconnect
...
Jan 24 16:47:44 lola-10 kernel: Lustre: 6025:0:(service.c:1331:ptlrpc_at_send_early_reply()) @@@ Couldn&apos;t add any time (5/5), not sending early reply
Jan 24 16:47:44 lola-10 kernel:  req@ffff88041171b980 x1557448608035648/t21476913294(0) o36-&amp;gt;57c15fb3-4c15-952f-3686-35f4c9caa941@192.168.1.126@o2ib100:-1/-1 lens 632/568 e 19 to 0 dl 1485305269 ref 2 fl Interpret:/0/0 rc 0/0
Jan 24 16:47:47 lola-10 kernel: Lustre: 6025:0:(service.c:1331:ptlrpc_at_send_early_reply()) @@@ Couldn&apos;t add any time (5/5), not sending early reply
Jan 24 16:47:47 lola-10 kernel:  req@ffff880737e13cc0 x1557448490734944/t21476914413(0) o36-&amp;gt;1e7aaa7d-f417-ebd3-eff8-7787d5dac449@192.168.1.123@o2ib100:-1/-1 lens 616/3128 e 19 to 0 dl 1485305272 ref 2 fl Interpret:/0/0 rc 0/0
Jan 24 16:47:47 lola-10 kernel: Lustre: 6025:0:(service.c:1331:ptlrpc_at_send_early_reply()) Skipped 22 previous similar messages
Jan 24 16:47:50 lola-10 kernel: Lustre: soaked-MDT0002: Client ffcd9b78-a534-1d1b-1494-f25c5b00edf1 (at 192.168.1.131@o2ib100) reconnecting
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Recovery appears to stall&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;Jan 24 16:50:18 lola-10 kernel: Lustre: 7291:0:(ldlm_lib.c:1787:extend_recovery_timer()) soaked-MDT0003: extended recovery timer reaching hard limit: 900, extend: 1
Jan 24 16:50:19 lola-10 kernel: Lustre: 7291:0:(ldlm_lib.c:1787:extend_recovery_timer()) soaked-MDT0003: extended recovery timer reaching hard limit: 900, extend: 1
Jan 24 16:50:19 lola-10 kernel: Lustre: 7291:0:(ldlm_lib.c:1787:extend_recovery_timer()) Skipped 2 previous similar messages
Jan 24 16:50:21 lola-10 kernel: Lustre: 7291:0:(ldlm_lib.c:1787:extend_recovery_timer()) soaked-MDT0003: extended recovery timer reaching hard limit: 900, extend: 1
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The system is still spinning at the present time.&lt;/p&gt;</description>
                <environment>Soak test cluster, version=lustre: 2.9.51_45_g3b3eeeb - tip of master</environment>
        <key id="43326">LU-9049</key>
            <summary>DNE MDT Never completes recovery</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="1" iconUrl="https://jira.whamcloud.com/images/icons/priorities/blocker.svg">Blocker</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="laisiyao">Lai Siyao</assignee>
                                    <reporter username="cliffw">Cliff White</reporter>
                        <labels>
                            <label>soak</label>
                    </labels>
                <created>Wed, 25 Jan 2017 16:23:03 +0000</created>
                <updated>Tue, 8 May 2018 03:20:34 +0000</updated>
                            <resolved>Thu, 22 Jun 2017 03:05:45 +0000</resolved>
                                    <version>Lustre 2.10.0</version>
                                    <fixVersion>Lustre 2.10.0</fixVersion>
                                        <due></due>
                            <votes>1</votes>
                                    <watches>13</watches>
                                                                            <comments>
                            <comment id="182097" author="cliffw" created="Wed, 25 Jan 2017 16:28:31 +0000"  >&lt;p&gt;Stack dump from lola-10 and first lustre-log dumped. The system is available if anyone wishes to have a look. &lt;/p&gt;</comment>
                            <comment id="182103" author="cliffw" created="Wed, 25 Jan 2017 16:56:09 +0000"  >&lt;p&gt;At the same time, lola-8 (MGS/MDT0) is having hangs:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;Jan 24 16:40:04 lola-8 kernel: Lustre: 5828:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1485304804/real 1485304804]  req@ffff8804e8eb1cc0 x1557447056153856/t0(0) o38-&amp;gt;soaked-MDT0003-osp-MDT0000@192.168.1.111@o2ib10:24/4 lens 520/544 e 0 to 1 dl 1485304815 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
Jan 24 16:40:04 lola-8 kernel: Lustre: 5828:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Jan 24 16:41:09 lola-8 kernel: LNet: Service thread pid 6336 was inactive &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; debugging purposes:
Jan 24 16:41:09 lola-8 kernel: Pid: 6336, comm: mdt00_014
Jan 24 16:41:09 lola-8 kernel:
Jan 24 16:41:09 lola-8 kernel: Call Trace:
Jan 24 16:41:09 lola-8 kernel: [&amp;lt;ffffffffa0baebb2&amp;gt;] top_trans_stop+0x782/0xbd0 [ptlrpc]
Jan 24 16:41:09 lola-8 kernel: [&amp;lt;ffffffff81067650&amp;gt;] ? default_wake_function+0x0/0x20
Jan 24 16:41:09 lola-8 kernel: [&amp;lt;ffffffffa134caac&amp;gt;] lod_trans_stop+0x2bc/0x330 [lod]
Jan 24 16:41:09 lola-8 kernel: [&amp;lt;ffffffffa136dc3c&amp;gt;] ? lod_attr_get+0x9c/0x1b0 [lod]
Jan 24 16:41:09 lola-8 kernel: [&amp;lt;ffffffffa13f1f31&amp;gt;] mdd_trans_stop+0x21/0xc6 [mdd]
Jan 24 16:41:09 lola-8 kernel: [&amp;lt;ffffffffa13d67f2&amp;gt;] mdd_unlink+0x532/0x1020 [mdd]
Jan 24 16:41:09 lola-8 kernel: [&amp;lt;ffffffffa129cec8&amp;gt;] mdo_unlink+0x18/0x50 [mdt]
Jan 24 16:41:09 lola-8 kernel: [&amp;lt;ffffffffa12a5fcf&amp;gt;] mdt_reint_unlink+0xcaf/0x10c0 [mdt]
Jan 24 16:41:09 lola-8 kernel: [&amp;lt;ffffffffa129cf5d&amp;gt;] mdt_reint_rec+0x5d/0x200 [mdt]
Jan 24 16:41:09 lola-8 kernel: [&amp;lt;ffffffffa1287d9b&amp;gt;] mdt_reint_internal+0x62b/0xa50 [mdt]
Jan 24 16:41:09 lola-8 kernel: [&amp;lt;ffffffffa128866b&amp;gt;] mdt_reint+0x6b/0x120 [mdt]
Jan 24 16:41:09 lola-8 kernel: [&amp;lt;ffffffffa0b9a17c&amp;gt;] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
Jan 24 16:41:09 lola-8 kernel: [&amp;lt;ffffffffa0b4566b&amp;gt;] ptlrpc_server_handle_request+0x2eb/0xbd0 [ptlrpc]
Jan 24 16:41:09 lola-8 kernel: [&amp;lt;ffffffffa082a84a&amp;gt;] ? lc_watchdog_touch+0x7a/0x190 [libcfs]
Jan 24 16:41:09 lola-8 kernel: [&amp;lt;ffffffffa0b406e9&amp;gt;] ? ptlrpc_wait_event+0xa9/0x2e0 [ptlrpc]
Jan 24 16:41:09 lola-8 kernel: [&amp;lt;ffffffff81059ca9&amp;gt;] ? __wake_up_common+0x59/0x90
Jan 24 16:41:09 lola-8 kernel: [&amp;lt;ffffffffa0b46a11&amp;gt;] ptlrpc_main+0xac1/0x18d0 [ptlrpc]
Jan 24 16:41:09 lola-8 kernel: [&amp;lt;ffffffffa0b45f50&amp;gt;] ? ptlrpc_main+0x0/0x18d0 [ptlrpc]
Jan 24 16:41:09 lola-8 kernel: [&amp;lt;ffffffff810a138e&amp;gt;] kthread+0x9e/0xc0
Jan 24 16:41:09 lola-8 kernel: [&amp;lt;ffffffff8100c28a&amp;gt;] child_rip+0xa/0x20
Jan 24 16:41:09 lola-8 kernel: [&amp;lt;ffffffff810a12f0&amp;gt;] ? kthread+0x0/0xc0
Jan 24 16:41:09 lola-8 kernel: [&amp;lt;ffffffff8100c280&amp;gt;] ? child_rip+0x0/0x20
Jan 24 16:41:09 lola-8 kernel:
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Mount occurs&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;Jan 24 16:41:44 lola-8 kernel: Lustre: 5828:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1485304904/real 1485304904]  req@ffff8804e8e9a3c0 x1557447056161856/t0(0) o38-&amp;gt;soaked-MDT0003-osp-MDT0000@192.168.1.111@o2ib10:24/4 lens 520/544 e 0 to 1 dl 1485304925 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
Jan 24 16:41:44 lola-8 kernel: Lustre: 5828:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Jan 24 16:42:49 lola-8 kernel: Lustre: soaked-MDT0000: Received &lt;span class=&quot;code-keyword&quot;&gt;new&lt;/span&gt; LWP connection from 192.168.1.110@o2ib10, removing former export from same NID
Jan 24 16:42:49 lola-8 kernel: Lustre: soaked-MDT0000: Connection restored to soaked-MDT0000-lwp-MDT0002_UUID (at 192.168.1.110@o2ib10)
Jan 24 16:44:39 lola-8 kernel: Lustre: 5828:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1485305079/real 1485305079]  req@ffff8804eb39d080 x1557447056175808/t0(0) o38-&amp;gt;soaked-MDT0003-osp-MDT0000@192.168.1.111@o2ib10:24/4 lens 520/544 e 0 to 1 dl 1485305110 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
Jan 24 16:44:39 lola-8 kernel: Lustre: 5828:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Jan 24 16:45:01 lola-8 CROND[7107]: (root) CMD (logger -p local0.notice -t TIME &lt;span class=&quot;code-quote&quot;&gt;&quot;Time stamp &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; console&quot;&lt;/span&gt;)
Jan 24 16:45:01 lola-8 TIME: Time stamp &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; console
Jan 24 16:46:22 lola-8 kernel: Lustre: soaked-MDT0000: Received LWP connection from 192.168.1.110@o2ib10, removing former export from 192.168.1.111@o2ib10
Jan 24 16:46:22 lola-8 kernel: Lustre: soaked-MDT0000: Connection restored to soaked-MDT0003-mdtlov_UUID (at 192.168.1.110@o2ib10)
Jan 24 16:47:44 lola-8 kernel: Lustre: 6298:0:(service.c:1331:ptlrpc_at_send_early_reply()) @@@ Couldn&apos;t add any time (5/5), not sending early reply
Jan 24 16:47:44 lola-8 kernel:  req@ffff8805945686c0 x1557448469206752/t21499374434(0) o36-&amp;gt;905c8d4a-b076-6b11-d52f-9bf74da1631a@192.168.1.120@o2ib100:-1/-1 lens 624/3128 e 24 to 0 dl 1485305269 ref 2 fl Interpret:/0/0 rc 0/0
Jan 24 16:47:50 lola-8 kernel: Lustre: soaked-MDT0000: Client 905c8d4a-b076-6b11-d52f-9bf74da1631a (at 192.168.1.120@o2ib100) reconnecting
Jan 24 16:47:50 lola-8 kernel: Lustre: soaked-MDT0000: Connection restored to 905c8d4a-b076-6b11-d52f-9bf74da1631a (at 192.168.1.120@o2ib100)
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;System continues to have blocked processes&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;Jan 24 16:53:38 lola-8 kernel: INFO: task mdt_out00_002:6070 blocked &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; more than 120 seconds.
Jan 24 16:53:38 lola-8 kernel:      Tainted: P           -- ------------    2.6.32-573.26.1.el6_lustre.x86_64 #1
Jan 24 16:53:38 lola-8 kernel: &lt;span class=&quot;code-quote&quot;&gt;&quot;echo 0 &amp;gt; /proc/sys/kernel/hung_task_timeout_secs&quot;&lt;/span&gt; disables &lt;span class=&quot;code-keyword&quot;&gt;this&lt;/span&gt; message.
Jan 24 16:53:38 lola-8 kernel: mdt_out00_002 D 0000000000000016     0  6070      2 0x00000080
Jan 24 16:53:38 lola-8 kernel: ffff88082230bb00 0000000000000046 0000000000000000 ffff88082230bcc0
Jan 24 16:53:38 lola-8 kernel: ffff8803c91c2f00 ffff8803eb000118 000007f09680dbc0 ffff8803eb000118
Jan 24 16:53:38 lola-8 kernel: ffff88082230bb00 000000010080a9dc ffff8808276045f8 ffff88082230bfd8
Jan 24 16:53:38 lola-8 kernel: Call Trace:
Jan 24 16:53:38 lola-8 kernel: [&amp;lt;ffffffffa0939f8d&amp;gt;] lu_object_find_at+0x3d/0xe0 [obdclass]
Jan 24 16:53:38 lola-8 kernel: [&amp;lt;ffffffffa0b5c612&amp;gt;] ? __req_capsule_get+0x162/0x6e0 [ptlrpc]
Jan 24 16:53:38 lola-8 kernel: [&amp;lt;ffffffff81067650&amp;gt;] ? default_wake_function+0x0/0x20
Jan 24 16:53:38 lola-8 kernel: [&amp;lt;ffffffffa0b31e10&amp;gt;] ? lustre_swab_object_update_reply+0x0/0xc0 [ptlrpc]
Jan 24 16:53:38 lola-8 kernel: [&amp;lt;ffffffffa093ad1c&amp;gt;] dt_locate_at+0x1c/0xa0 [obdclass]
Jan 24 16:53:38 lola-8 kernel: [&amp;lt;ffffffffa0ba2e07&amp;gt;] out_handle+0x1067/0x18d0 [ptlrpc]
Jan 24 16:53:38 lola-8 kernel: [&amp;lt;ffffffff8153afce&amp;gt;] ? mutex_lock+0x1e/0x50
Jan 24 16:53:38 lola-8 kernel: [&amp;lt;ffffffffa0b9308a&amp;gt;] ? req_can_reconstruct+0x6a/0x120 [ptlrpc]
Jan 24 16:53:38 lola-8 kernel: [&amp;lt;ffffffffa0b9a17c&amp;gt;] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
Jan 24 16:53:38 lola-8 kernel: [&amp;lt;ffffffffa0b4566b&amp;gt;] ptlrpc_server_handle_request+0x2eb/0xbd0 [ptlrpc]
Jan 24 16:53:38 lola-8 kernel: [&amp;lt;ffffffffa082a84a&amp;gt;] ? lc_watchdog_touch+0x7a/0x190 [libcfs]
Jan 24 16:53:38 lola-8 kernel: [&amp;lt;ffffffffa0b406e9&amp;gt;] ? ptlrpc_wait_event+0xa9/0x2e0 [ptlrpc]
Jan 24 16:53:38 lola-8 kernel: [&amp;lt;ffffffff81059ca9&amp;gt;] ? __wake_up_common+0x59/0x90
Jan 24 16:53:38 lola-8 kernel: [&amp;lt;ffffffffa0b46a11&amp;gt;] ptlrpc_main+0xac1/0x18d0 [ptlrpc]
Jan 24 16:53:38 lola-8 kernel: [&amp;lt;ffffffffa0b45f50&amp;gt;] ? ptlrpc_main+0x0/0x18d0 [ptlrpc]
Jan 24 16:53:38 lola-8 kernel: [&amp;lt;ffffffff810a138e&amp;gt;] kthread+0x9e/0xc0
Jan 24 16:53:38 lola-8 kernel: [&amp;lt;ffffffff8100c28a&amp;gt;] child_rip+0xa/0x20
Jan 24 16:53:38 lola-8 kernel: [&amp;lt;ffffffff810a12f0&amp;gt;] ? kthread+0x0/0xc0
Jan 24 16:53:38 lola-8 kernel: [&amp;lt;ffffffff8100c280&amp;gt;] ? child_rip+0x0/0x20
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="182109" author="di.wang" created="Wed, 25 Jan 2017 17:31:13 +0000"  >&lt;p&gt;There are a few interesting processes on lola-8. It may be a deadlock; it needs a bit of investigation. Nasf: could you please check whether this lfsck thread is suspicious here? Thanks.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;lfsck         S 0000000000000010     0  6061      2 0x00000080
 ffff880821d9f710 0000000000000046 0000000000000000 ffff880821d9f6d4
 0000000000000000 ffff88043fe83e00 00003c78463d5afe ffff880038615a00
 0000000000000200 0000000103f26d04 ffff880821d07068 ffff880821d9ffd8
Call Trace:
 [&amp;lt;ffffffff8153a9b2&amp;gt;] schedule_timeout+0x192/0x2e0
 [&amp;lt;ffffffff81089fa0&amp;gt;] ? process_timeout+0x0/0x10
 [&amp;lt;ffffffffa0b2ae61&amp;gt;] ptlrpc_set_wait+0x321/0x960 [ptlrpc]
 [&amp;lt;ffffffffa0b200d0&amp;gt;] ? ptlrpc_interrupted_set+0x0/0x120 [ptlrpc]
 [&amp;lt;ffffffff81067650&amp;gt;] ? default_wake_function+0x0/0x20
 [&amp;lt;ffffffffa0b373c5&amp;gt;] ? lustre_msg_set_jobid+0xf5/0x130 [ptlrpc]
 [&amp;lt;ffffffffa0b2b521&amp;gt;] ptlrpc_queue_wait+0x81/0x220 [ptlrpc]
 [&amp;lt;ffffffffa1458322&amp;gt;] osp_remote_sync+0xf2/0x1e0 [osp]
 [&amp;lt;ffffffffa143b677&amp;gt;] osp_attr_get+0x447/0x710 [osp]
 [&amp;lt;ffffffffa09167be&amp;gt;] ? lprocfs_counter_add+0x9e/0x140 [obdclass]
 [&amp;lt;ffffffffa143cd45&amp;gt;] osp_object_init+0x1f5/0x360 [osp]
 [&amp;lt;ffffffffa0938a58&amp;gt;] lu_object_alloc+0xd8/0x320 [obdclass]
 [&amp;lt;ffffffffa0939e41&amp;gt;] lu_object_find_try+0x151/0x260 [obdclass]
 [&amp;lt;ffffffffa0db4f3a&amp;gt;] ? fld_cache_lookup+0x3a/0x1e0 [fld]
 [&amp;lt;ffffffffa093a001&amp;gt;] lu_object_find_at+0xb1/0xe0 [obdclass]
 [&amp;lt;ffffffffa0db8cf3&amp;gt;] ? fld_server_lookup+0x53/0x330 [fld]
 [&amp;lt;ffffffffa093a06f&amp;gt;] lu_object_find_slice+0x1f/0x80 [obdclass]
 [&amp;lt;ffffffffa120f522&amp;gt;] lfsck_namespace_verify_stripe_slave+0x232/0xbf0 [lfsck]
 [&amp;lt;ffffffffa10628cb&amp;gt;] ? ldiskfs_xattr_trusted_get+0x2b/0x30 [ldiskfs]
 [&amp;lt;ffffffff8153c200&amp;gt;] ? hrtimer_nanosleep_restart+0x10/0x90
 [&amp;lt;ffffffffa11cd375&amp;gt;] lfsck_namespace_open_dir+0x175/0x1a0 [lfsck]
 [&amp;lt;ffffffffa11c3466&amp;gt;] lfsck_open_dir+0xa6/0x3a0 [lfsck]
 [&amp;lt;ffffffffa11c6267&amp;gt;] lfsck_exec_oit+0x667/0xb30 [lfsck]
 [&amp;lt;ffffffffa0db4f3a&amp;gt;] ? fld_cache_lookup+0x3a/0x1e0 [fld]
 [&amp;lt;ffffffffa11c6f55&amp;gt;] lfsck_master_oit_engine+0x825/0x15b0 [lfsck]
 [&amp;lt;ffffffff81067662&amp;gt;] ? default_wake_function+0x12/0x20
 [&amp;lt;ffffffffa11ca650&amp;gt;] lfsck_master_engine+0xb50/0x1460 [lfsck]
 [&amp;lt;ffffffff81067650&amp;gt;] ? default_wake_function+0x0/0x20
 [&amp;lt;ffffffffa11c9b00&amp;gt;] ? lfsck_master_engine+0x0/0x1460 [lfsck]
 [&amp;lt;ffffffff810a138e&amp;gt;] kthread+0x9e/0xc0
 [&amp;lt;ffffffff8100c28a&amp;gt;] child_rip+0xa/0x20
 [&amp;lt;ffffffff810a12f0&amp;gt;] ? kthread+0x0/0xc0
 [&amp;lt;ffffffff8100c280&amp;gt;] ? child_rip+0x0/0x20

mdt_out00_002 D 0000000000000016     0  6070      2 0x00000080
 ffff88082230bb00 0000000000000046 0000000000000000 ffff88082230bcc0
 ffff8803c91c2f00 ffff8803eb000118 000007f09680dbc0 ffff8803eb000118
 ffff88082230bb00 000000010080a9dc ffff8808276045f8 ffff88082230bfd8
Call Trace:
 [&amp;lt;ffffffffa0939f8d&amp;gt;] lu_object_find_at+0x3d/0xe0 [obdclass]
 [&amp;lt;ffffffffa0b5c612&amp;gt;] ? __req_capsule_get+0x162/0x6e0 [ptlrpc]
 [&amp;lt;ffffffff81067650&amp;gt;] ? default_wake_function+0x0/0x20
 [&amp;lt;ffffffffa0b31e10&amp;gt;] ? lustre_swab_object_update_reply+0x0/0xc0 [ptlrpc]
 [&amp;lt;ffffffffa093ad1c&amp;gt;] dt_locate_at+0x1c/0xa0 [obdclass]
 [&amp;lt;ffffffffa0ba2e07&amp;gt;] out_handle+0x1067/0x18d0 [ptlrpc]
 [&amp;lt;ffffffff8153afce&amp;gt;] ? mutex_lock+0x1e/0x50
 [&amp;lt;ffffffffa0b9308a&amp;gt;] ? req_can_reconstruct+0x6a/0x120 [ptlrpc]
 [&amp;lt;ffffffffa0b9a17c&amp;gt;] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
 [&amp;lt;ffffffffa0b4566b&amp;gt;] ptlrpc_server_handle_request+0x2eb/0xbd0 [ptlrpc]
 [&amp;lt;ffffffffa082a84a&amp;gt;] ? lc_watchdog_touch+0x7a/0x190 [libcfs]
 [&amp;lt;ffffffffa0b406e9&amp;gt;] ? ptlrpc_wait_event+0xa9/0x2e0 [ptlrpc]
 [&amp;lt;ffffffff81059ca9&amp;gt;] ? __wake_up_common+0x59/0x90
 [&amp;lt;ffffffffa0b46a11&amp;gt;] ptlrpc_main+0xac1/0x18d0 [ptlrpc]
 [&amp;lt;ffffffffa0b45f50&amp;gt;] ? ptlrpc_main+0x0/0x18d0 [ptlrpc]
 [&amp;lt;ffffffff810a138e&amp;gt;] kthread+0x9e/0xc0
 [&amp;lt;ffffffff8100c28a&amp;gt;] child_rip+0xa/0x20
 [&amp;lt;ffffffff810a12f0&amp;gt;] ? kthread+0x0/0xc0
 [&amp;lt;ffffffff8100c280&amp;gt;] ? child_rip+0x0/0x20

mdt00_004     S 0000000000000013     0  6077      2 0x00000080
 ffff880837067870 0000000000000046 0000000000000000 ffffffff810a1c1c
 ffff880836aaf520 ffff8808370677f0 0000240d5abd2b10 0000000000000000
 ffff880837067830 0000000102588d0f ffff880836aafad8 ffff880837067fd8
Call Trace:
 [&amp;lt;ffffffff810a1c1c&amp;gt;] ? remove_wait_queue+0x3c/0x50
 [&amp;lt;ffffffffa0b04170&amp;gt;] ? ldlm_expired_completion_wait+0x0/0x230 [ptlrpc]
 [&amp;lt;ffffffffa0b094dd&amp;gt;] ldlm_completion_ast+0x67d/0x9a0 [ptlrpc]
 [&amp;lt;ffffffff81067650&amp;gt;] ? default_wake_function+0x0/0x20
 [&amp;lt;ffffffffa0b0886e&amp;gt;] ldlm_cli_enqueue_local+0x21e/0x810 [ptlrpc]
 [&amp;lt;ffffffffa0b08e60&amp;gt;] ? ldlm_completion_ast+0x0/0x9a0 [ptlrpc]
 [&amp;lt;ffffffffa12807a0&amp;gt;] ? mdt_blocking_ast+0x0/0x2e0 [mdt]
 [&amp;lt;ffffffffa128d4e4&amp;gt;] mdt_object_local_lock+0x3a4/0xb00 [mdt]
 [&amp;lt;ffffffffa12807a0&amp;gt;] ? mdt_blocking_ast+0x0/0x2e0 [mdt]
 [&amp;lt;ffffffffa0b08e60&amp;gt;] ? ldlm_completion_ast+0x0/0x9a0 [ptlrpc]
 [&amp;lt;ffffffffa093a001&amp;gt;] ? lu_object_find_at+0xb1/0xe0 [obdclass]
 [&amp;lt;ffffffffa128e143&amp;gt;] mdt_object_lock_internal+0x63/0x320 [mdt]
 [&amp;lt;ffffffffa128e614&amp;gt;] mdt_object_lock+0x14/0x20 [mdt]
 [&amp;lt;ffffffffa1295266&amp;gt;] mdt_getattr_name_lock+0xdd6/0x1910 [mdt]
 [&amp;lt;ffffffffa12962c2&amp;gt;] mdt_intent_getattr+0x292/0x470 [mdt]
 [&amp;lt;ffffffffa128691e&amp;gt;] mdt_intent_policy+0x4be/0xc70 [mdt]
 [&amp;lt;ffffffffa0ae76d7&amp;gt;] ldlm_lock_enqueue+0x127/0x990 [ptlrpc]
 [&amp;lt;ffffffffa0b134bb&amp;gt;] ldlm_handle_enqueue0+0x98b/0x14e0 [ptlrpc]
 [&amp;lt;ffffffffa0b98fd1&amp;gt;] tgt_enqueue+0x61/0x230 [ptlrpc]
 [&amp;lt;ffffffffa0b9a17c&amp;gt;] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
 [&amp;lt;ffffffffa0b4566b&amp;gt;] ptlrpc_server_handle_request+0x2eb/0xbd0 [ptlrpc]
 [&amp;lt;ffffffffa082a84a&amp;gt;] ? lc_watchdog_touch+0x7a/0x190 [libcfs]
 [&amp;lt;ffffffffa0b406e9&amp;gt;] ? ptlrpc_wait_event+0xa9/0x2e0 [ptlrpc]
 [&amp;lt;ffffffff81059ca9&amp;gt;] ? __wake_up_common+0x59/0x90
 [&amp;lt;ffffffffa0b46a11&amp;gt;] ptlrpc_main+0xac1/0x18d0 [ptlrpc]
 [&amp;lt;ffffffffa0b45f50&amp;gt;] ? ptlrpc_main+0x0/0x18d0 [ptlrpc]
 [&amp;lt;ffffffff810a138e&amp;gt;] kthread+0x9e/0xc0
 [&amp;lt;ffffffff8100c28a&amp;gt;] child_rip+0xa/0x20
 [&amp;lt;ffffffff810a12f0&amp;gt;] ? kthread+0x0/0xc0
 [&amp;lt;ffffffff8100c280&amp;gt;] ? child_rip+0x0/0x20

mdt00_009     S 0000000000000000     0  6304      2 0x00000080
 ffff88063d177930 0000000000000046 ffffffff00000018 ffff88063d177900
 ffff88063d177890 ffffffff8153a9ba ffff88063d1778d0 0000000000000286
 0000000000000000 000000005888a5eb ffff88063d163068 ffff88063d177fd8
Call Trace:
 [&amp;lt;ffffffff8153a9ba&amp;gt;] ? schedule_timeout+0x19a/0x2e0
 [&amp;lt;ffffffffa0b04170&amp;gt;] ? ldlm_expired_completion_wait+0x0/0x230 [ptlrpc]
 [&amp;lt;ffffffffa0b094dd&amp;gt;] ldlm_completion_ast+0x67d/0x9a0 [ptlrpc]
 [&amp;lt;ffffffff81067650&amp;gt;] ? default_wake_function+0x0/0x20
 [&amp;lt;ffffffffa0b0886e&amp;gt;] ldlm_cli_enqueue_local+0x21e/0x810 [ptlrpc]
 [&amp;lt;ffffffffa0b08e60&amp;gt;] ? ldlm_completion_ast+0x0/0x9a0 [ptlrpc]
 [&amp;lt;ffffffffa12807a0&amp;gt;] ? mdt_blocking_ast+0x0/0x2e0 [mdt]
 [&amp;lt;ffffffffa128d4e4&amp;gt;] mdt_object_local_lock+0x3a4/0xb00 [mdt]
 [&amp;lt;ffffffffa12807a0&amp;gt;] ? mdt_blocking_ast+0x0/0x2e0 [mdt]
 [&amp;lt;ffffffffa0b08e60&amp;gt;] ? ldlm_completion_ast+0x0/0x9a0 [ptlrpc]
 [&amp;lt;ffffffffa128e143&amp;gt;] mdt_object_lock_internal+0x63/0x320 [mdt]
 [&amp;lt;ffffffffa128e614&amp;gt;] mdt_object_lock+0x14/0x20 [mdt]
 [&amp;lt;ffffffffa12a781a&amp;gt;] mdt_reint_create+0x6fa/0xfe0 [mdt]
 [&amp;lt;ffffffffa12993bc&amp;gt;] ? mdt_root_squash+0x2c/0x3f0 [mdt]
 [&amp;lt;ffffffffa0b362bb&amp;gt;] ? lustre_pack_reply_v2+0x1eb/0x280 [ptlrpc]
 [&amp;lt;ffffffff81299b7a&amp;gt;] ? strlcpy+0x4a/0x60
 [&amp;lt;ffffffffa12997df&amp;gt;] ? ucred_set_jobid+0x5f/0x70 [mdt]
 [&amp;lt;ffffffffa129aa2a&amp;gt;] ? old_init_ucred_common+0xda/0x2b0 [mdt]
 [&amp;lt;ffffffffa129cf5d&amp;gt;] mdt_reint_rec+0x5d/0x200 [mdt]
 [&amp;lt;ffffffffa1287d9b&amp;gt;] mdt_reint_internal+0x62b/0xa50 [mdt]
 [&amp;lt;ffffffffa128866b&amp;gt;] mdt_reint+0x6b/0x120 [mdt]
 [&amp;lt;ffffffffa0b9a17c&amp;gt;] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
 [&amp;lt;ffffffffa0b4566b&amp;gt;] ptlrpc_server_handle_request+0x2eb/0xbd0 [ptlrpc]
 [&amp;lt;ffffffffa082a84a&amp;gt;] ? lc_watchdog_touch+0x7a/0x190 [libcfs]
 [&amp;lt;ffffffffa0b406e9&amp;gt;] ? ptlrpc_wait_event+0xa9/0x2e0 [ptlrpc]
 [&amp;lt;ffffffff81059ca9&amp;gt;] ? __wake_up_common+0x59/0x90
 [&amp;lt;ffffffffa0b46a11&amp;gt;] ptlrpc_main+0xac1/0x18d0 [ptlrpc]
 [&amp;lt;ffffffffa0b45f50&amp;gt;] ? ptlrpc_main+0x0/0x18d0 [ptlrpc]
 [&amp;lt;ffffffff810a138e&amp;gt;] kthread+0x9e/0xc0
 [&amp;lt;ffffffff8100c28a&amp;gt;] child_rip+0xa/0x20
 [&amp;lt;ffffffff810a12f0&amp;gt;] ? kthread+0x0/0xc0
 [&amp;lt;ffffffff8100c280&amp;gt;] ? child_rip+0x0/0x20

mdt00_014     S 0000000000000012     0  6336      2 0x00000080
 ffff88062f783990 0000000000000046 0000000000000000 ffff8804137e8398
 0000000000000282 0000000000000003 0000073705fbcc95 0000000000000282
 ffff88062f783940 0000000100748091 ffff88063393c5f8 ffff88062f783fd8
Call Trace:
 [&amp;lt;ffffffffa0baebb2&amp;gt;] top_trans_stop+0x782/0xbd0 [ptlrpc]
 [&amp;lt;ffffffff81067650&amp;gt;] ? default_wake_function+0x0/0x20
 [&amp;lt;ffffffffa134caac&amp;gt;] lod_trans_stop+0x2bc/0x330 [lod]
 [&amp;lt;ffffffffa136dc3c&amp;gt;] ? lod_attr_get+0x9c/0x1b0 [lod]
 [&amp;lt;ffffffffa13f1f31&amp;gt;] mdd_trans_stop+0x21/0xc6 [mdd]
 [&amp;lt;ffffffffa13d67f2&amp;gt;] mdd_unlink+0x532/0x1020 [mdd]
 [&amp;lt;ffffffffa129cec8&amp;gt;] mdo_unlink+0x18/0x50 [mdt]
 [&amp;lt;ffffffffa12a5fcf&amp;gt;] mdt_reint_unlink+0xcaf/0x10c0 [mdt]
 [&amp;lt;ffffffffa129cf5d&amp;gt;] mdt_reint_rec+0x5d/0x200 [mdt]
 [&amp;lt;ffffffffa1287d9b&amp;gt;] mdt_reint_internal+0x62b/0xa50 [mdt]
 [&amp;lt;ffffffffa128866b&amp;gt;] mdt_reint+0x6b/0x120 [mdt]
 [&amp;lt;ffffffffa0b9a17c&amp;gt;] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
 [&amp;lt;ffffffffa0b4566b&amp;gt;] ptlrpc_server_handle_request+0x2eb/0xbd0 [ptlrpc]
 [&amp;lt;ffffffffa082a84a&amp;gt;] ? lc_watchdog_touch+0x7a/0x190 [libcfs]
 [&amp;lt;ffffffffa0b406e9&amp;gt;] ? ptlrpc_wait_event+0xa9/0x2e0 [ptlrpc]
 [&amp;lt;ffffffff81059ca9&amp;gt;] ? __wake_up_common+0x59/0x90
 [&amp;lt;ffffffffa0b46a11&amp;gt;] ptlrpc_main+0xac1/0x18d0 [ptlrpc]
 [&amp;lt;ffffffffa0b45f50&amp;gt;] ? ptlrpc_main+0x0/0x18d0 [ptlrpc]
 [&amp;lt;ffffffff810a138e&amp;gt;] kthread+0x9e/0xc0
 [&amp;lt;ffffffff8100c28a&amp;gt;] child_rip+0xa/0x20
 [&amp;lt;ffffffff810a12f0&amp;gt;] ? kthread+0x0/0xc0
 [&amp;lt;ffffffff8100c280&amp;gt;] ? child_rip+0x0/0x20
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="182298" author="jgmitter" created="Thu, 26 Jan 2017 18:30:59 +0000"  >&lt;p&gt;Hi Alex,&lt;br/&gt;
Since Nasf is out for the Chinese holiday, do you think you could look at the lfsck thread above to see if you have any input?&lt;br/&gt;
Thanks.&lt;br/&gt;
Joe&lt;/p&gt;</comment>
                            <comment id="182300" author="bzzz" created="Thu, 26 Jan 2017 18:34:29 +0000"  >&lt;p&gt;OK.&lt;/p&gt;</comment>
                            <comment id="182362" author="yong.fan" created="Fri, 27 Jan 2017 10:03:19 +0000"  >&lt;p&gt;As explained in &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-8815&quot; title=&quot;MDS fails to complete recovery&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-8815&quot;&gt;&lt;del&gt;LU-8815&lt;/del&gt;&lt;/a&gt;, someone holds a reference on the dying object, which blocks others that are trying to locate an object with the same FID. If the reference holder is itself blocked for some other reason, this causes a cascading block that makes the system look hung. Here, according to the stack trace, the LFSCK engine is one of the threads blocked by the dying object&apos;s reference holder. What needs to be found out:&lt;/p&gt;

&lt;p&gt;1) what the dying object is.&lt;br/&gt;
2) who hold the reference on such dying object.&lt;br/&gt;
3) why the reference holder does not release the dying object.&lt;/p&gt;

&lt;p&gt;In fact, if we can the answers for 1) and/or 2), then get the answers for the 3) may be not difficult. Unfortunately, I still has no answers for none of them. &lt;/p&gt;</comment>
                            <comment id="182365" author="bzzz" created="Fri, 27 Jan 2017 11:05:54 +0000"  >&lt;p&gt;I think we&apos;re ready to get rid of the check for LU_OBJECT_BANSHEE in htable_lookup() - the only reason for it was to prevent access to such objects, and we&apos;ve got that check in every method.&lt;/p&gt;</comment>
                            <comment id="182503" author="yujian" created="Sun, 29 Jan 2017 05:08:07 +0000"  >&lt;p&gt;sanity-lfsck test 30 hit a similar failure on the master branch in a review-dne-part-2 test session:&lt;br/&gt;
&lt;a href=&quot;https://testing.hpdd.intel.com/test_sets/1fd15748-e5ce-11e6-978f-5254006e85c2&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://testing.hpdd.intel.com/test_sets/1fd15748-e5ce-11e6-978f-5254006e85c2&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Console log on one of the two MDSs:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Lustre: DEBUG MARKER: == sanity-lfsck test 30: LFSCK can recover the orphans from backend /lost+found ====================== 20:00:39 (1485633639)
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x161d
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0
Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x1624
general protection fault: 0000 [#1] SMP 
Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) sha512_ssse3 sha512_generic crypto_null libcfs(OE) ldiskfs(OE) dm_mod rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core iosf_mbi crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd ppdev parport_pc parport pcspkr i2c_piix4 virtio_balloon nfsd nfs_acl lockd auth_rpcgss grace sunrpc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi cirrus virtio_blk drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops crct10dif_pclmul crct10dif_common ttm ata_piix crc32c_intel 8139too virtio_pci virtio_ring serio_raw virtio drm i2c_core 8139cp mii libata floppy
CPU: 0 PID: 11163 Comm: mdt00_001 Tainted: G           OE  ------------   3.10.0-514.2.2.el7_lustre.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
task: ffff880043db5e20 ti: ffff880044818000 task.ti: ffff880044818000
RIP: 0010:[&amp;lt;ffffffff811de300&amp;gt;]  [&amp;lt;ffffffff811de300&amp;gt;] kmem_cache_alloc_trace+0x80/0x200
RSP: 0018:ffff88004481b820  EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff880036ad1c00 RCX: 000000000068969a
RDX: 0000000000689699 RSI: 0000000000008050 RDI: ffff88007d001a00
RBP: ffff88004481b858 R08: 0000000000019aa0 R09: ffff88007d001a00
R10: ffffffffa0aeb29e R11: 64657474696d6d6f R12: 0002680000000100
R13: 0000000000008050 R14: 0000000000000058 R15: ffff88007d001a00
FS:  0000000000000000(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f203169a000 CR3: 000000007b855000 CR4: 00000000000406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Stack:
 ffff88007d001a00 ffffffffa0aeb29e ffff880036ad1c00 ffff8800787d3000
 ffff880078d57840 ffff88003f82d800 0000000000000000 ffff88004481b8d0
 ffffffffa0aeb29e ffff8800395034bb 0000000900000659 0000000000000000
Call Trace:
 [&amp;lt;ffffffffa0aeb29e&amp;gt;] ? tgt_last_rcvd_update+0x64e/0xc80 [ptlrpc]
 [&amp;lt;ffffffffa0aeb29e&amp;gt;] tgt_last_rcvd_update+0x64e/0xc80 [ptlrpc]
 [&amp;lt;ffffffffa0aef49d&amp;gt;] tgt_txn_stop_cb+0x19d/0x490 [ptlrpc]
 [&amp;lt;ffffffffa07308b8&amp;gt;] ? libcfs_debug_vmsg2+0x708/0xb70 [libcfs]
 [&amp;lt;ffffffffa087063e&amp;gt;] dt_txn_hook_stop+0x7e/0xd0 [obdclass]
 [&amp;lt;ffffffffa0cb71fb&amp;gt;] osd_trans_stop+0x12b/0x590 [osd_ldiskfs]
 [&amp;lt;ffffffffa0b09509&amp;gt;] top_trans_stop+0x99/0x970 [ptlrpc]
 [&amp;lt;ffffffffa0f27399&amp;gt;] lod_trans_stop+0x259/0x340 [lod]
 [&amp;lt;ffffffffa0fb5bad&amp;gt;] mdd_trans_stop+0x1d/0x28 [mdd]
 [&amp;lt;ffffffffa0fa5493&amp;gt;] mdd_unlink+0x523/0xd80 [mdd]
 [&amp;lt;ffffffffa0ea3145&amp;gt;] mdo_unlink+0x43/0x45 [mdt]
 [&amp;lt;ffffffffa0e770c4&amp;gt;] mdt_reint_unlink+0xc04/0x11b0 [mdt]
 [&amp;lt;ffffffffa0e7a500&amp;gt;] mdt_reint_rec+0x80/0x210 [mdt]
 [&amp;lt;ffffffffa0e5cdb2&amp;gt;] mdt_reint_internal+0x5b2/0x9b0 [mdt]
 [&amp;lt;ffffffffa0e680f7&amp;gt;] mdt_reint+0x67/0x140 [mdt]
 [&amp;lt;ffffffffa0af57b5&amp;gt;] tgt_request_handle+0x915/0x1320 [ptlrpc]
 [&amp;lt;ffffffffa0aa282b&amp;gt;] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
 [&amp;lt;ffffffffa0aa03e8&amp;gt;] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
 [&amp;lt;ffffffff810c4ff2&amp;gt;] ? default_wake_function+0x12/0x20
 [&amp;lt;ffffffff810ba238&amp;gt;] ? __wake_up_common+0x58/0x90
 [&amp;lt;ffffffffa0aa6830&amp;gt;] ptlrpc_main+0xaa0/0x1de0 [ptlrpc]
 [&amp;lt;ffffffffa0aa5d90&amp;gt;] ? ptlrpc_register_service+0xe40/0xe40 [ptlrpc]
 [&amp;lt;ffffffff810b064f&amp;gt;] kthread+0xcf/0xe0
 [&amp;lt;ffffffff810b0580&amp;gt;] ? kthread_create_on_node+0x140/0x140
 [&amp;lt;ffffffff81696898&amp;gt;] ret_from_fork+0x58/0x90
 [&amp;lt;ffffffff810b0580&amp;gt;] ? kthread_create_on_node+0x140/0x140
Code: d0 00 00 49 8b 50 08 4d 8b 20 49 8b 40 10 4d 85 e4 0f 84 24 01 00 00 48 85 c0 0f 84 1b 01 00 00 49 63 47 20 48 8d 4a 01 4d 8b 07 &amp;lt;49&amp;gt; 8b 1c 04 4c 89 e0 65 49 0f c7 08 0f 94 c0 84 c0 74 b9 49 63 
RIP  [&amp;lt;ffffffff811de300&amp;gt;] kmem_cache_alloc_trace+0x80/0x200
 RSP &amp;lt;ffff88004481b820&amp;gt;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="192736" author="cliffw" created="Wed, 19 Apr 2017 19:21:10 +0000"  >&lt;p&gt;Restarted soak with latest lustre-master (3561) and mounted all nodes.&lt;br/&gt;
Soak-10 does not complete recovery; the recovery timer increments until it can increment no more.&lt;br/&gt;
Attached Lustre kernel dump and system logs.&lt;/p&gt;</comment>
                            <comment id="192862" author="cliffw" created="Thu, 20 Apr 2017 15:35:48 +0000"  >&lt;p&gt;I left the system overnight, and recovery did eventually complete.&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;[17166.319310] Lustre: 3626:0:(ldlm_lib.c:1781:extend_recovery_timer()) soaked-MDT0002: extended recovery timer reaching hard limit: 900, extend: 1
[17166.339932] Lustre: 3626:0:(ldlm_lib.c:1781:extend_recovery_timer()) Skipped 211354 previous similar messages
[17462.141184] Lustre: 3626:0:(ldlm_lib.c:2046:target_recovery_overseer()) soaked-MDT0002 recovery is aborted by hard timeout
[17462.156428] Lustre: 3626:0:(ldlm_lib.c:2056:target_recovery_overseer()) recovery is aborted, evict exports in recovery
[17462.174477] Lustre: 3626:0:(ldlm_lib.c:2552:target_recovery_thread()) too long recovery - read logs
[17462.187265] Lustre: soaked-MDT0002: Recovery over after 283:48, of 3 clients 3 recovered and 0 were evicted.
[17462.187267] LustreError: dumping log to /tmp/lustre-log.1492647191.3626

&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;I would still consider this an issue.&lt;/p&gt;</comment>
                            <comment id="192902" author="pjones" created="Thu, 20 Apr 2017 17:40:10 +0000"  >&lt;p&gt;Lai&lt;/p&gt;

&lt;p&gt;Could you please advise on this one?&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="194674" author="gerrit" created="Fri, 5 May 2017 14:48:29 +0000"  >&lt;p&gt;Lai Siyao (lai.siyao@intel.com) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/26965&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/26965&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-9049&quot; title=&quot;DNE MDT Never completes recovery&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-9049&quot;&gt;&lt;del&gt;LU-9049&lt;/del&gt;&lt;/a&gt; obdclass: unhash object when destroying object&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: a4e1fe3886bfa945c9860dc53d45d2d0a1a0b0e0&lt;/p&gt;</comment>
                            <comment id="198972" author="cliffw" created="Tue, 13 Jun 2017 00:12:20 +0000"  >&lt;p&gt;Tested patch on soak. Used patch set 7, build 48150&lt;/p&gt;
&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;MDS aborts recovery on multiple occasions after failover/restart; clients are evicted.&lt;br/&gt;
Two MDS failures:
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;Jun 12 22:28:52 soak-8 kernel: Lustre: 4536:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1497306524/real 0]  req@ffff8803fc27f200 x1570030437379168/t0(0) o400-&amp;gt;soaked-MDT0001-osp-MDT0000@192.168.1.109@o2ib10:24/4 lens 224/224 e 0 to 1 dl 1497306531 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Jun 12 22:28:52 soak-8 kernel: Lustre: 4536:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Jun 12 22:28:52 soak-8 kernel: Lustre: soaked-MDT0001-osp-MDT0000: Connection to soaked-MDT0001 (at 192.168.1.109@o2ib10) was lost; in progress operations using this service will wait for recovery to complete
Jun 12 22:29:07 soak-8 kernel: LustreError: 137-5: soaked-MDT0001_UUID: not available for connect from 192.168.1.120@o2ib100 (no target). If you are running an HA pair check that the target is mounted on the other server.
Jun 12 22:29:10 soak-8 kernel: mdt_out01_013: page allocation failure: order:4, mode:0x10c050
Jun 12 22:29:10 soak-8 kernel: CPU: 10 PID: 5150 Comm: mdt_out01_013 Tainted: P        W  OE  ------------   3.10.0-514.16.1.el7_lustre.x86_64 #1
Jun 12 22:29:10 soak-8 kernel: Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
Jun 12 22:29:10 soak-8 kernel: Lustre: soaked-MDT0000: Rejecting reconnect from the known client soaked-MDT0000-lwp-MDT0002_UUID (at 192.168.1.110@o2ib10) because it is indicating it is a new client
Jun 12 22:29:10 soak-8 kernel: 000000000010c050
Jun 12 22:29:10 soak-8 kernel: 00000000cc5d6083 ffff8807e65978a8 ffffffff81686d1f
Jun 12 22:29:10 soak-8 kernel: ffff8807e6597938 ffffffff81187010 0000000000000000 ffff88083ffd8000
Jun 12 22:29:10 soak-8 kernel: 0000000000000004 000000000010c050 ffff8807e6597938 00000000cc5d6083
Jun 12 22:29:10 soak-8 kernel: Call Trace:
Jun 12 22:29:10 soak-8 kernel: [&amp;lt;ffffffff81686d1f&amp;gt;] dump_stack+0x19/0x1b
Jun 12 22:29:10 soak-8 kernel: [&amp;lt;ffffffff81187010&amp;gt;] warn_alloc_failed+0x110/0x180
Jun 12 22:29:10 soak-8 kernel: [&amp;lt;ffffffff816828d7&amp;gt;] __alloc_pages_slowpath+0x6b7/0x725
Jun 12 22:29:10 soak-8 kernel: [&amp;lt;ffffffff8118b5c5&amp;gt;] __alloc_pages_nodemask+0x405/0x420
Jun 12 22:29:10 soak-8 kernel: [&amp;lt;ffffffff811cf77a&amp;gt;] alloc_pages_current+0xaa/0x170
Jun 12 22:29:10 soak-8 kernel: [&amp;lt;ffffffff81185eee&amp;gt;] __get_free_pages+0xe/0x50
Jun 12 22:29:10 soak-8 kernel: [&amp;lt;ffffffff811db01e&amp;gt;] kmalloc_order_trace+0x2e/0xa0
Jun 12 22:29:10 soak-8 kernel: [&amp;lt;ffffffff811dd7f1&amp;gt;] __kmalloc+0x221/0x240
Jun 12 22:29:10 soak-8 kernel: [&amp;lt;ffffffffa0cc9599&amp;gt;] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
Jun 12 22:29:10 soak-8 kernel: [&amp;lt;ffffffffa0f8212d&amp;gt;] out_handle+0xa5d/0x1920 [ptlrpc]
Jun 12 22:29:10 soak-8 kernel: [&amp;lt;ffffffff810ce474&amp;gt;] ? update_curr+0x104/0x190
Jun 12 22:29:10 soak-8 kernel: [&amp;lt;ffffffff810c9ba8&amp;gt;] ? __enqueue_entity+0x78/0x80
Jun 12 22:29:10 soak-8 kernel: [&amp;lt;ffffffffa0f14ce2&amp;gt;] ? lustre_msg_get_opc+0x22/0xf0 [ptlrpc]
Jun 12 22:29:10 soak-8 kernel: [&amp;lt;ffffffffa0f77da9&amp;gt;] ? tgt_request_preprocess.isra.26+0x299/0x790 [ptlrpc]
Jun 12 22:29:10 soak-8 kernel: [&amp;lt;ffffffffa0f78bb5&amp;gt;] tgt_request_handle+0x915/0x1360 [ptlrpc]
Jun 12 22:29:10 soak-8 kernel: [&amp;lt;ffffffffa0f222b3&amp;gt;] ptlrpc_server_handle_request+0x233/0xa90 [ptlrpc]
Jun 12 22:29:10 soak-8 kernel: [&amp;lt;ffffffffa0f207a8&amp;gt;] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
Jun 12 22:29:10 soak-8 kernel: [&amp;lt;ffffffff810c54d2&amp;gt;] ? default_wake_function+0x12/0x20
Jun 12 22:29:10 soak-8 kernel: [&amp;lt;ffffffff810ba628&amp;gt;] ? __wake_up_common+0x58/0x90
Jun 12 22:29:10 soak-8 kernel: [&amp;lt;ffffffffa0f262c0&amp;gt;] ptlrpc_main+0xaa0/0x1dd0 [ptlrpc]
Jun 12 22:29:10 soak-8 kernel: [&amp;lt;ffffffffa0f25820&amp;gt;] ? ptlrpc_register_service+0xe60/0xe60 [ptlrpc]
Jun 12 22:29:10 soak-8 kernel: [&amp;lt;ffffffff810b0a4f&amp;gt;] kthread+0xcf/0xe0
Jun 12 22:29:10 soak-8 kernel: [&amp;lt;ffffffff810b0980&amp;gt;] ? kthread_create_on_node+0x140/0x140
Jun 12 22:29:10 soak-8 kernel: [&amp;lt;ffffffff81697318&amp;gt;] ret_from_fork+0x58/0x90
Jun 12 22:29:10 soak-8 kernel: [&amp;lt;ffffffff810b0980&amp;gt;] ? kthread_create_on_node+0x140/0x140
Jun 12 22:29:10 soak-8 kernel: Mem-Info:
Jun 12 22:29:10 soak-8 kernel: active_anon:7323 inactive_anon:7468 isolated_anon:0#012 active_file:4687260 inactive_file:2131312 isolated_file:0#012 unevictable:2202 dirty:8519 writeback:0 unstable:0#012 slab_reclaimable:292721 slab_unreclaimable:421232#012 mapped:14619 shmem:2286 pagetables:1440 bounce:0#012 free:271081 free_pcp:195 free_cma:0
Jun 12 22:29:10 soak-8 kernel: Node 1 Normal free:401704kB min:45728kB low:57160kB high:68592kB active_anon:17004kB inactive_anon:17016kB active_file:10443492kB inactive_file:3351644kB unevictable:4592kB isolated(anon):0kB isolated(file):0kB present:16777216kB managed:16498508kB mlocked:4592kB dirty:19776kB writeback:0kB mapped:10104kB shmem:8712kB slab_reclaimable:661888kB slab_unreclaimable:1229584kB kernel_stack:7616kB pagetables:2428kB unstable:0kB bounce:0kB free_pcp:780kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Jun 12 22:29:10 soak-8 kernel: lowmem_reserve[]:
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Second MDS&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;un 12 22:09:27 soak-11 systemd: Stopping user-0.slice.
Jun 12 22:29:10 soak-11 kernel: LustreError: 11-0: soaked-MDT0000-osp-MDT0003: operation out_update to node 192.168.1.108@o2ib10 failed: rc = -12
Jun 12 22:29:10 soak-11 kernel: LustreError: 4611:0:(layout.c:2085:__req_capsule_get()) @@@ Wrong buffer for field `object_update_reply&apos; (1 of 1) in format `OUT_UPDATE&apos;: 0 vs. 4096 (server)#012  req@ffff88075ff89e00 x1570036086727488/t0(0) o1000-&amp;gt;soaked-MDT0000-osp-MDT0003@192.168.1.108@o2ib10:24/4 lens 376/192 e 0 to 0 dl 1497306559 ref 2 fl Interpret:ReM/0/0 rc -12/-12
Jun 12 22:29:11 soak-11 kernel: general protection fault: 0000 [#1] SMP
Jun 12 22:29:11 soak-11 kernel: Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) zfs(POE) zunicode(POE) zavl(POE) zcommon(POE) znvpair(POE) spl(OE) zlib_deflate 8021q garp mrp stp llc rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx4_ib ib_core dm_round_robin intel_powerclamp coretemp intel_rapl iosf_mbi sg kvm ipmi_ssif ipmi_devintf irqbypass ipmi_si crc32_pclmul ghash_clmulni_intel aesni_intel mei_me mei lrw i2c_i801 gf128mul glue_helper ablk_helper cryptd sb_edac pcspkr ipmi_msghandler iTCO_wdt iTCO_vendor_support lpc_ich edac_core ioatdma
Jun 12 22:29:11 soak-11 kernel: ntb shpchp wmi dm_multipath dm_mod nfsd nfs_acl lockd grace auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_en mgag200 drm_kms_helper isci syscopyarea ahci sysfillrect libsas sysimgblt libahci igb fb_sys_fops crct10dif_pclmul ttm crct10dif_common ptp mpt3sas crc32c_intel pps_core libata drm mlx4_core raid_class dca i2c_algo_bit scsi_transport_sas devlink i2c_core fjes
Jun 12 22:29:11 soak-11 kernel: CPU: 14 PID: 4611 Comm: osp_up0-3 Tainted: P           OE  ------------   3.10.0-514.16.1.el7_lustre.x86_64 #1
Jun 12 22:29:11 soak-11 kernel: Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
Jun 12 22:29:11 soak-11 kernel: task: ffff88042b320000 ti: ffff88081efc8000 task.ti: ffff88081efc8000
Jun 12 22:29:11 soak-11 kernel: RIP: 0010:[&amp;lt;ffffffffa14fd583&amp;gt;]  [&amp;lt;ffffffffa14fd583&amp;gt;] osp_send_update_thread+0x1d3/0x600 [osp]
Jun 12 22:29:11 soak-11 kernel: RSP: 0018:ffff88081efcbe00  EFLAGS: 00010282
Jun 12 22:29:11 soak-11 kernel: RAX: ffff880762b38180 RBX: ffff88082885e960 RCX: 00000000005466fe
Jun 12 22:29:11 soak-11 kernel: RDX: 00000000005466fd RSI: 5a5a5a5a5a5a5a5a RDI: ffff88017fc07600
Jun 12 22:29:11 soak-11 kernel: Lustre: 4521:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1497306543/real 0]  req@ffff880368255d00 x1570036086580928/t0(0) o400-&amp;gt;soaked-MDT0001-osp-MDT0003@192.168.1.109@o2ib10:24/4 lens 224/224 e 0 to 1 dl 1497306550 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Jun 12 22:29:11 soak-11 kernel: Lustre: soaked-MDT0001-osp-MDT0003: Connection to soaked-MDT0001 (at 192.168.1.109@o2ib10) was lost; in progress operations using this service will wait for recovery to complete
Jun 12 22:29:11 soak-11 kernel: RBP: ffff88081efcbec0 R08: 0000000000019b20 R09: ffffffffa0cf4b61
Jun 12 22:29:11 soak-11 kernel: R10: ffff88082d999b20 R11: ffffea001d0f9600 R12: ffff88042b320000
Jun 12 22:29:11 soak-11 kernel: R13: ffff8803fdf77000 R14: 00000000fffffff4 R15: ffff88082885e970
Jun 12 22:29:11 soak-11 kernel: FS:  0000000000000000(0000) GS:ffff88082d980000(0000) knlGS:0000000000000000
Jun 12 22:29:11 soak-11 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 12 22:29:11 soak-11 kernel: CR2: 00007fcbe8826000 CR3: 00000000019be000 CR4: 00000000000407e0
Jun 12 22:29:11 soak-11 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jun 12 22:29:11 soak-11 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jun 12 22:29:11 soak-11 kernel: Call Trace:
Jun 12 22:29:11 soak-11 kernel: [&amp;lt;ffffffff810c54c0&amp;gt;] ? wake_up_state+0x20/0x20
Jun 12 22:29:11 soak-11 kernel: [&amp;lt;ffffffffa14fd3b0&amp;gt;] ? osp_invalidate_request+0x390/0x390 [osp]
Jun 12 22:29:11 soak-11 kernel: [&amp;lt;ffffffff810b0a4f&amp;gt;] kthread+0xcf/0xe0
Jun 12 22:29:11 soak-11 kernel: [&amp;lt;ffffffff810b0980&amp;gt;] ? kthread_create_on_node+0x140/0x140
Jun 12 22:29:11 soak-11 kernel: [&amp;lt;ffffffff81697318&amp;gt;] ret_from_fork+0x58/0x90
Jun 12 22:29:11 soak-11 kernel: [&amp;lt;ffffffff810b0980&amp;gt;] ? kthread_create_on_node+0x140/0x140
Jun 12 22:29:11 soak-11 kernel: Code: 19 e0 48 8b 95 58 ff ff ff 48 8b 52 48 48 3b 53 30 74 5d 4c 89 ff e8 ed 0f 19 e0 45 85 f6 78 65 48 8b 85 58 ff ff ff 48 8b 70 40 &amp;lt;f0&amp;gt; ff 8e 88 00 00 00 0f 94 c0 84 c0 0f 84 db fe ff ff 48 8d 7d
Jun 12 22:29:11 soak-11 kernel: RIP  [&amp;lt;ffffffffa14fd583&amp;gt;] osp_send_update_thread+0x1d3/0x600 [osp]
Jun 12 22:29:11 soak-11 kernel: RSP &amp;lt;ffff88081efcbe00&amp;gt;
Jun 12 22:29:11 soak-11 kernel: ---[ end trace 02ea2c24fb7e0d30 ]---
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The dump was messed up on soak; will see if we get more data.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
</comment>
                            <comment id="198981" author="laisiyao" created="Tue, 13 Jun 2017 02:34:57 +0000"  >&lt;p&gt;This looks to be caused by OOM, and it&apos;s not directly related to the patch.&lt;/p&gt;</comment>
                            <comment id="199045" author="laisiyao" created="Tue, 13 Jun 2017 15:14:10 +0000"  >&lt;p&gt;I just rebased the patch to the latest master. Cliff, will you test the latest update again to verify whether it&apos;s caused by this patch?&lt;/p&gt;</comment>
                            <comment id="199516" author="cliffw" created="Fri, 16 Jun 2017 21:48:53 +0000"  >&lt;p&gt;Tested the latest version of the patch. Had a hard failure on an OSS after OSS failover. May not be related.&lt;/p&gt;</comment>
                            <comment id="199666" author="laisiyao" created="Tue, 20 Jun 2017 03:04:13 +0000"  >&lt;p&gt;Cliff, could you test soak again with both &lt;a href=&quot;https://review.whamcloud.com/26965&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/26965&lt;/a&gt; and &lt;a href=&quot;https://review.whamcloud.com/#/c/27708/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/#/c/27708/&lt;/a&gt;? The latter is the fix for &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-9678&quot; title=&quot;BUG: Kernel NULL pointer - lu_site_purge_objects&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-9678&quot;&gt;&lt;del&gt;LU-9678&lt;/del&gt;&lt;/a&gt;, which, in Nasf&apos;s opinion, may be easily triggered by the former (though the former is not its cause).&lt;/p&gt;</comment>
                            <comment id="199904" author="cliffw" created="Thu, 22 Jun 2017 00:32:27 +0000"  >&lt;p&gt;Ran 40+ hours in soak, 12+ hours with LFSCK. Did not hit the LBUG. &lt;/p&gt;</comment>
                            <comment id="199905" author="cliffw" created="Thu, 22 Jun 2017 01:29:12 +0000"  >&lt;p&gt;May have spoken too soon. Right after I posted the above, we had this:&lt;br/&gt;
MDT000&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;Jun 22 01:13:20 soak-8 kernel: mdt_out01_020: page allocation failure: order:4, mode:0x10c050
Jun 22 01:13:20 soak-8 kernel: CPU: 10 PID: 6106 Comm: mdt_out01_020 Tainted: P           OE  ------------   3.10.0-514.21.1.el7_lustre.x86_64 #1
Jun 22 01:13:20 soak-8 kernel: Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
Jun 22 01:13:20 soak-8 kernel: 000000000010c050 000000005af41680 ffff8803eb6c38a8 ffffffff81687163
Jun 22 01:13:20 soak-8 kernel: ffff8803eb6c3938 ffffffff81187090 0000000000000000 ffff88083ffd8000
Jun 22 01:13:20 soak-8 kernel: 0000000000000004 000000000010c050 ffff8803eb6c3938 000000005af41680
Jun 22 01:13:20 soak-8 kernel: Call Trace:
Jun 22 01:13:20 soak-8 kernel: [&amp;lt;ffffffff81687163&amp;gt;] dump_stack+0x19/0x1b
Jun 22 01:13:20 soak-8 kernel: [&amp;lt;ffffffff81187090&amp;gt;] warn_alloc_failed+0x110/0x180
Jun 22 01:13:20 soak-8 kernel: [&amp;lt;ffffffff81682cf7&amp;gt;] __alloc_pages_slowpath+0x6b7/0x725
Jun 22 01:13:20 soak-8 kernel: [&amp;lt;ffffffff8118b645&amp;gt;] __alloc_pages_nodemask+0x405/0x420
Jun 22 01:13:20 soak-8 kernel: [&amp;lt;ffffffff811cf7fa&amp;gt;] alloc_pages_current+0xaa/0x170
Jun 22 01:13:20 soak-8 kernel: [&amp;lt;ffffffff81185f6e&amp;gt;] __get_free_pages+0xe/0x50
Jun 22 01:13:20 soak-8 kernel: [&amp;lt;ffffffff811db09e&amp;gt;] kmalloc_order_trace+0x2e/0xa0
Jun 22 01:13:20 soak-8 kernel: [&amp;lt;ffffffff811dd871&amp;gt;] __kmalloc+0x221/0x240
Jun 22 01:13:20 soak-8 kernel: [&amp;lt;ffffffffa0cd0399&amp;gt;] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
Jun 22 01:13:20 soak-8 kernel: [&amp;lt;ffffffffa0f82e1d&amp;gt;] out_handle+0xa5d/0x1920 [ptlrpc]
Jun 22 01:13:20 soak-8 kernel: [&amp;lt;ffffffff810ce4c4&amp;gt;] ? update_curr+0x104/0x190
Jun 22 01:13:20 soak-8 kernel: [&amp;lt;ffffffff810c9bf8&amp;gt;] ? __enqueue_entity+0x78/0x80
Jun 22 01:13:20 soak-8 kernel: [&amp;lt;ffffffffa0f15b82&amp;gt;] ? lustre_msg_get_opc+0x22/0xf0 [ptlrpc]
Jun 22 01:13:20 soak-8 kernel: [&amp;lt;ffffffffa0f78a99&amp;gt;] ? tgt_request_preprocess.isra.26+0x299/0x790 [ptlrpc]
Jun 22 01:13:20 soak-8 kernel: [&amp;lt;ffffffffa0f798a5&amp;gt;] tgt_request_handle+0x915/0x1360 [ptlrpc]
Jun 22 01:13:20 soak-8 kernel: [&amp;lt;ffffffffa0f23143&amp;gt;] ptlrpc_server_handle_request+0x233/0xa90 [ptlrpc]
Jun 22 01:13:20 soak-8 kernel: [&amp;lt;ffffffffa0f20938&amp;gt;] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
Jun 22 01:13:20 soak-8 kernel: [&amp;lt;ffffffff810c54f2&amp;gt;] ? default_wake_function+0x12/0x20
Jun 22 01:13:20 soak-8 kernel: [&amp;lt;ffffffff810ba628&amp;gt;] ? __wake_up_common+0x58/0x90
Jun 22 01:13:20 soak-8 kernel: [&amp;lt;ffffffffa0f27120&amp;gt;] ptlrpc_main+0xaa0/0x1dd0 [ptlrpc]
Jun 22 01:13:20 soak-8 kernel: [&amp;lt;ffffffffa0f26680&amp;gt;] ? ptlrpc_register_service+0xe30/0xe30 [ptlrpc]
Jun 22 01:13:20 soak-8 kernel: [&amp;lt;ffffffff810b0a4f&amp;gt;] kthread+0xcf/0xe0
Jun 22 01:13:20 soak-8 kernel: [&amp;lt;ffffffff810b0980&amp;gt;] ? kthread_create_on_node+0x140/0x140
Jun 22 01:13:20 soak-8 kernel: [&amp;lt;ffffffff81697798&amp;gt;] ret_from_fork+0x58/0x90
Jun 22 01:13:20 soak-8 kernel: [&amp;lt;ffffffff810b0980&amp;gt;] ? kthread_create_on_node+0x140/0x140
Jun 22 01:13:20 soak-8 kernel: Mem-Info:
Jun 22 01:13:20 soak-8 kernel: active_anon:8213 inactive_anon:9497 isolated_anon:0#012 active_file:4337450 inactive_file:1514906 isolated_file:0#012 unevictable:2118 dirty:265 writeback:1 unstable:0#012 slab_reclaimable:1636391 slab_unreclaimable:224759#012 mapped:10001 shmem:4325 pagetables:1435 bounce:0#012 free:77455 free_pcp:174 free_cma:0
Jun 22 01:13:20 soak-8 kernel: Node 1 Normal free:129652kB min:45728kB low:57160kB high:68592kB active_anon:22588kB inactive_anon:24092kB active_file:9510128kB inactive_file:2799920kB unevictable:6448kB isolated(anon):0kB isolated(file):0kB present:16777216kB managed:16498508kB mlocked:6448kB dirty:1388kB writeback:0kB mapped:19960kB shmem:16728kB slab_reclaimable:2991008kB slab_unreclaimable:678664kB kernel_stack:6800kB pagetables:3076kB unstable:0kB bounce:0kB free_pcp:812kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Jun 22 01:13:20 soak-8 kernel: lowmem_reserve[]: 0 0 0 0
Jun 22 01:13:20 soak-8 kernel: Node 1 Normal: 7564*4kB (UEM) 4942*8kB (UEM) 3212*16kB (UEM) 192*32kB (EM) 12*64kB (EM) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 128096kB
Jun 22 01:13:20 soak-8 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Jun 22 01:13:20 soak-8 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Jun 22 01:13:20 soak-8 kernel: Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Jun 22 01:13:20 soak-8 kernel: Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Jun 22 01:13:20 soak-8 kernel: 5859102 total pagecache pages
Jun 22 01:13:20 soak-8 kernel: 4 pages in swap cache
Jun 22 01:13:20 soak-8 kernel: Swap cache stats: add 16, delete 12, find 0/0
Jun 22 01:13:20 soak-8 kernel: Free swap  = 16319420kB
Jun 22 01:13:21 soak-8 kernel: Total swap = 16319484kB
Jun 22 01:13:21 soak-8 kernel: 8369064 pages RAM
Jun 22 01:13:21 soak-8 kernel: 0 pages HighMem/MovableOnly
Jun 22 01:13:21 soak-8 kernel: 241204 pages reserved
Jun 22 01:13:21 soak-8 kernel: BUG: unable to handle kernel paging request at ffffeb04007e4c40
Jun 22 01:13:21 soak-8 kernel: IP: [&amp;lt;ffffffffa0c3c2ff&amp;gt;] lnet_cpt_of_md+0xdf/0x120 [lnet]
Jun 22 01:13:21 soak-8 kernel: PGD 0
Jun 22 01:13:21 soak-8 kernel: Oops: 0000 [#1] SMP
Jun 22 01:13:21 soak-8 kernel: Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) zfs(POE) zunicode(POE) zavl(POE) zcommon(POE) znvpair(POE) spl(OE) zlib_deflate 8021q garp mrp stp llc rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx4_ib ib_core intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd dm_round_robin sb_edac ioatdma edac_core ipmi_devintf sg ntb mei_me mei iTCO_wdt iTCO_vendor_support i2c_i801 lpc_ich ipmi_ssif shpchp pcspkr wmi
Jun 22 01:13:21 soak-8 kernel: ipmi_si ipmi_msghandler dm_multipath nfsd dm_mod nfs_acl lockd grace auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_en mgag200 drm_kms_helper isci syscopyarea sysfillrect sysimgblt igb fb_sys_fops ahci crct10dif_pclmul ttm crct10dif_common ptp crc32c_intel libsas libahci pps_core drm mlx4_core mpt2sas dca libata raid_class i2c_algo_bit scsi_transport_sas devlink i2c_core fjes
Jun 22 01:13:21 soak-8 kernel: CPU: 9 PID: 5062 Comm: mdt_out01_014 Tainted: P           OE  ------------   3.10.0-514.21.1.el7_lustre.x86_64 #1
Jun 22 01:13:21 soak-8 kernel: Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
Jun 22 01:13:21 soak-8 kernel: task: ffff880829e50fb0 ti: ffff88080f344000 task.ti: ffff88080f344000
Jun 22 01:13:21 soak-8 kernel: RIP: 0010:[&amp;lt;ffffffffa0c3c2ff&amp;gt;]  [&amp;lt;ffffffffa0c3c2ff&amp;gt;] lnet_cpt_of_md+0xdf/0x120 [lnet]
Jun 22 01:13:21 soak-8 kernel: RSP: 0018:ffff88080f347a18  EFLAGS: 00010202
Jun 22 01:13:21 soak-8 kernel: RAX: 00000104007e4c40 RBX: 00050000c0a8016d RCX: 000077ff80000000
Jun 22 01:13:21 soak-8 kernel: RDX: ffffea0000000000 RSI: 0000000000000001 RDI: ffff8804182023c0
Jun 22 01:13:21 soak-8 kernel: RBP: ffff88080f347a18 R08: 0000000000000009 R09: 00000000000090c0
Jun 22 01:13:21 soak-8 kernel: R10: ffffffffa0c46171 R11: ffffc9001f931100 R12: ffff880503b5c000
Jun 22 01:13:21 soak-8 kernel: R13: 00050000c0a8016d R14: 0000000000000001 R15: ffff8804acad0600
Jun 22 01:13:21 soak-8 kernel: FS:  0000000000000000(0000) GS:ffff88082d840000(0000) knlGS:0000000000000000
Jun 22 01:13:21 soak-8 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 22 01:13:21 soak-8 kernel: CR2: ffffeb04007e4c40 CR3: 00000000019be000 CR4: 00000000000407e0
Jun 22 01:13:21 soak-8 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jun 22 01:13:21 soak-8 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jun 22 01:13:21 soak-8 kernel: Stack:
Jun 22 01:13:21 soak-8 kernel: ffff88080f347ad8 ffffffffa0c437dc ffff88080f347a90 ffff8804acad0638
Jun 22 01:13:21 soak-8 kernel: ffffea0020b0de00 ffff88080f347af8 ffffffffffffffff ffff880829e50fb0
Jun 22 01:13:21 soak-8 kernel: ffff880829e50fb0 ffff88082c379000 ffff88082c379000 0000000100017a88
Jun 22 01:13:21 soak-8 kernel: Call Trace:
Jun 22 01:13:21 soak-8 kernel: [&amp;lt;ffffffffa0c437dc&amp;gt;] lnet_select_pathway+0x5c/0x1140 [lnet]
Jun 22 01:13:21 soak-8 kernel: [&amp;lt;ffffffffa0c45fb1&amp;gt;] lnet_send+0x51/0x180 [lnet]
Jun 22 01:13:21 soak-8 kernel: [&amp;lt;ffffffffa0c46325&amp;gt;] LNetPut+0x245/0x7a0 [lnet]
Jun 22 01:13:21 soak-8 kernel: [&amp;lt;ffffffffa0f0ebc6&amp;gt;] ptl_send_buf+0x146/0x530 [ptlrpc]
Jun 22 01:13:21 soak-8 kernel: [&amp;lt;ffffffffa0baacde&amp;gt;] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
Jun 22 01:13:21 soak-8 kernel: [&amp;lt;ffffffffa0f31fb7&amp;gt;] ? at_measured+0x1c7/0x380 [ptlrpc]
Jun 22 01:13:21 soak-8 kernel: [&amp;lt;ffffffffa0f11e6b&amp;gt;] ptlrpc_send_reply+0x29b/0x840 [ptlrpc]
Jun 22 01:13:21 soak-8 kernel: [&amp;lt;ffffffffa0ed098e&amp;gt;] target_send_reply_msg+0x8e/0x170 [ptlrpc]
Jun 22 01:13:21 soak-8 kernel: [&amp;lt;ffffffffa0edb476&amp;gt;] target_send_reply+0x306/0x730 [ptlrpc]
Jun 22 01:13:21 soak-8 kernel: [&amp;lt;ffffffffa0f18b07&amp;gt;] ? lustre_msg_set_last_committed+0x27/0xa0 [ptlrpc]
Jun 22 01:13:21 soak-8 kernel: [&amp;lt;ffffffffa0f79517&amp;gt;] tgt_request_handle+0x587/0x1360 [ptlrpc]
Jun 22 01:13:21 soak-8 kernel: [&amp;lt;ffffffffa0f23143&amp;gt;] ptlrpc_server_handle_request+0x233/0xa90 [ptlrpc]
Jun 22 01:13:21 soak-8 kernel: [&amp;lt;ffffffffa0f20938&amp;gt;] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
Jun 22 01:13:21 soak-8 kernel: [&amp;lt;ffffffff810c54f2&amp;gt;] ? default_wake_function+0x12/0x20
Jun 22 01:13:21 soak-8 kernel: [&amp;lt;ffffffff810ba628&amp;gt;] ? __wake_up_common+0x58/0x90
Jun 22 01:13:21 soak-8 kernel: [&amp;lt;ffffffffa0f27120&amp;gt;] ptlrpc_main+0xaa0/0x1dd0 [ptlrpc]
Jun 22 01:13:21 soak-8 kernel: [&amp;lt;ffffffffa0f26680&amp;gt;] ? ptlrpc_register_service+0xe30/0xe30 [ptlrpc]
Jun 22 01:13:21 soak-8 kernel: [&amp;lt;ffffffff810b0a4f&amp;gt;] kthread+0xcf/0xe0
Jun 22 01:13:21 soak-8 kernel: [&amp;lt;ffffffff810b0980&amp;gt;] ? kthread_create_on_node+0x140/0x140
Jun 22 01:13:21 soak-8 kernel: [&amp;lt;ffffffff81697798&amp;gt;] ret_from_fork+0x58/0x90
Jun 22 01:13:21 soak-8 kernel: [&amp;lt;ffffffff810b0980&amp;gt;] ? kthread_create_on_node+0x140/0x140
Jun 22 01:13:21 soak-8 kernel: Code: ff 77 00 00 48 8b 3d 21 53 03 00 48 01 d0 48 0f 42 0d 26 8d d8 e0 48 ba 00 00 00 00 00 ea ff ff 48 01 c8 48 c1 e8 0c 48 c1 e0 06 &amp;lt;48&amp;gt; 8b 34 10 48 c1 ee 36 e8 e4 ec f6 ff 5d c3 66 90 b8 ff ff ff
Jun 22 01:13:21 soak-8 kernel: RIP  [&amp;lt;ffffffffa0c3c2ff&amp;gt;] lnet_cpt_of_md+0xdf/0x120 [lnet]
Jun 22 01:13:21 soak-8 kernel: RSP &amp;lt;ffff88080f347a18&amp;gt;
Jun 22 01:13:21 soak-8 kernel: CR2: ffffeb04007e4c40
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt; 
&lt;p&gt;And on MDT001&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;Jun 22 01:13:20 soak-9 kernel: LustreError: 11-0: soaked-MDT0000-osp-MDT0001: operation out_update to node 192.168.1.108@o2ib failed: rc = -12
Jun 22 01:13:20 soak-9 kernel: LustreError: 4612:0:(layout.c:2085:__req_capsule_get()) @@@ Wrong buffer for field `object_update_reply&apos; (1 of 1) in format `OUT_UPDATE&apos;: 0 vs. 4096 (server)#012  req@ffff88081121e300 x1570858955230384/t0(0) o1000-&amp;gt;soaked-MDT0000-osp-MDT0001@192.168.1.108@o2ib:24/4 lens 376/192 e 0 to 0 dl 1498094043 ref 2 fl Interpret:ReM/0/0 rc -12/-12
Jun 22 01:13:20 soak-9 kernel: general protection fault: 0000 [#1] SMP
Jun 22 01:13:20 soak-9 kernel: Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) zfs(POE) zunicode(POE) zavl(POE) zcommon(POE) znvpair(POE) spl(OE) zlib_deflate 8021q garp mrp stp llc rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx4_ib ib_core intel_powerclamp coretemp intel_rapl iosf_mbi kvm dm_round_robin irqbypass crc32_pclmul ghash_clmulni_intel ipmi_ssif aesni_intel sg lrw gf128mul ipmi_devintf glue_helper ablk_helper cryptd mei_me mei ntb ipmi_si ioatdma ipmi_msghandler iTCO_wdt iTCO_vendor_support lpc_ich wmi pcspkr sb_edac i2c_i801
Jun 22 01:13:20 soak-9 kernel: edac_core shpchp dm_multipath dm_mod nfsd nfs_acl lockd grace auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx4_en mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops igb ttm isci crct10dif_pclmul crct10dif_common ptp ahci libsas crc32c_intel libahci pps_core drm mlx4_core mpt2sas libata dca raid_class i2c_algo_bit scsi_transport_sas devlink i2c_core fjes
Jun 22 01:13:20 soak-9 kernel: CPU: 11 PID: 4612 Comm: osp_up0-1 Tainted: P           OE  ------------   3.10.0-514.21.1.el7_lustre.x86_64 #1
Jun 22 01:13:20 soak-9 kernel: Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
Jun 22 01:13:20 soak-9 kernel: task: ffff8808155c0fb0 ti: ffff8808155c8000 task.ti: ffff8808155c8000
Jun 22 01:13:20 soak-9 kernel: RIP: 0010:[&amp;lt;ffffffffa14fe0d3&amp;gt;]  [&amp;lt;ffffffffa14fe0d3&amp;gt;] osp_send_update_thread+0x1d3/0x600 [osp]
Jun 22 01:13:20 soak-9 kernel: RSP: 0018:ffff8808155cbe00  EFLAGS: 00010282
Jun 22 01:13:20 soak-9 kernel: RAX: ffff8803f3ad1900 RBX: ffff8804014772a0 RCX: 000000018040003e
Jun 22 01:13:20 soak-9 kernel: RDX: 000000018040003f RSI: 5a5a5a5a5a5a5a5a RDI: 0000000040000000
Jun 22 01:13:20 soak-9 kernel: RBP: ffff8808155cbec0 R08: ffff8807c6b12600 R09: 000000018040003e
Jun 22 01:13:20 soak-9 kernel: R10: 00000000c6b11e01 R11: ffffea001f1ac400 R12: ffff8808155c0fb0
Jun 22 01:13:20 soak-9 kernel: R13: ffff880401b55000 R14: 00000000fffffff4 R15: ffff8804014772b0
Jun 22 01:13:20 soak-9 kernel: FS:  0000000000000000(0000) GS:ffff88082d8c0000(0000) knlGS:0000000000000000
Jun 22 01:13:20 soak-9 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 22 01:13:20 soak-9 kernel: CR2: 00007f8d82582000 CR3: 00000000019be000 CR4: 00000000000407e0
Jun 22 01:13:20 soak-9 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jun 22 01:13:20 soak-9 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jun 22 01:13:20 soak-9 kernel: Stack:
Jun 22 01:13:20 soak-9 kernel: ffff880401b551e8 ffff8808155c0fb0 ffff8808155c0fb0 ffff8803f3ad1900
Jun 22 01:13:20 soak-9 kernel: 0000000000000000 0000000000000000 0000000000000000 ffff8808155c0fb0
Jun 22 01:13:20 soak-9 kernel: ffffffff810c54e0 dead000000000100 dead000000000200 0000000210000003
Jun 22 01:13:20 soak-9 kernel: Call Trace:
Jun 22 01:13:20 soak-9 kernel: [&amp;lt;ffffffff810c54e0&amp;gt;] ? wake_up_state+0x20/0x20
Jun 22 01:13:20 soak-9 kernel: [&amp;lt;ffffffffa14fdf00&amp;gt;] ? osp_invalidate_request+0x390/0x390 [osp]
Jun 22 01:13:20 soak-9 kernel: [&amp;lt;ffffffff810b0a4f&amp;gt;] kthread+0xcf/0xe0
Jun 22 01:13:20 soak-9 kernel: [&amp;lt;ffffffff810b0980&amp;gt;] ? kthread_create_on_node+0x140/0x140
Jun 22 01:13:20 soak-9 kernel: [&amp;lt;ffffffff81697798&amp;gt;] ret_from_fork+0x58/0x90
Jun 22 01:13:20 soak-9 kernel: [&amp;lt;ffffffff810b0980&amp;gt;] ? kthread_create_on_node+0x140/0x140
Jun 22 01:13:20 soak-9 kernel: Code: 19 e0 48 8b 95 58 ff ff ff 48 8b 52 48 48 3b 53 30 74 5d 4c 89 ff e8 fd 08 19 e0 45 85 f6 78 65 48 8b 85 58 ff ff ff 48 8b 70 40 &amp;lt;f0&amp;gt; ff 8e 88 00 00 00 0f 94 c0 84 c0 0f 84 db fe ff ff 48 8d 7d
Jun 22 01:13:20 soak-9 kernel: RIP  [&amp;lt;ffffffffa14fe0d3&amp;gt;] osp_send_update_thread+0x1d3/0x600 [osp]
Jun 22 01:13:20 soak-9 kernel: RSP &amp;lt;ffff8808155cbe00&amp;gt;
Jun 22 01:13:20 soak-9 kernel: ---[ end trace 6ac75bb1f736a48a ]---
Jun 22 01:13:20 soak-9 kernel: Kernel panic - not syncing: Fatal exception
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Is this related or new?&lt;/p&gt;</comment>
                            <comment id="199908" author="yong.fan" created="Thu, 22 Jun 2017 01:57:25 +0000"  >&lt;p&gt;The new lnet_cpt_of_md trouble looks more like &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-9203&quot; title=&quot;parallel-scale-nfsv3 test_compilebench: MDS hit BUG: unable to handle kernel paging request&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-9203&quot;&gt;&lt;del&gt;LU-9203&lt;/del&gt;&lt;/a&gt;. &lt;/p&gt;</comment>
                            <comment id="199914" author="gerrit" created="Thu, 22 Jun 2017 02:55:24 +0000"  >&lt;p&gt;Oleg Drokin (oleg.drokin@intel.com) merged in patch &lt;a href=&quot;https://review.whamcloud.com/26965/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/26965/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-9049&quot; title=&quot;DNE MDT Never completes recovery&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-9049&quot;&gt;&lt;del&gt;LU-9049&lt;/del&gt;&lt;/a&gt; obdclass: change object lookup to no wait mode&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: fa14bdf6b648d1d4023a4fa88789059d185f4a07&lt;/p&gt;</comment>
                            <comment id="199918" author="pjones" created="Thu, 22 Jun 2017 03:05:45 +0000"  >&lt;p&gt;Landed for 2.10&lt;/p&gt;</comment>
                            <comment id="227421" author="simmonsja" created="Mon, 7 May 2018 15:08:07 +0000"  >&lt;p&gt;While porting this to the upstream kernel, the question came up of which dependencies need to land before this patch lands.&lt;/p&gt;</comment>
                            <comment id="227475" author="laisiyao" created="Tue, 8 May 2018 03:20:34 +0000"  >&lt;p&gt;&lt;a href=&quot;https://review.whamcloud.com/#/c/27708/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/#/c/27708/&lt;/a&gt; is the fix for &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-9678&quot; title=&quot;BUG: Kernel NULL pointer - lu_site_purge_objects&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-9678&quot;&gt;&lt;del&gt;LU-9678&lt;/del&gt;&lt;/a&gt;; it is a fix for osd-zfs, which is server code.&lt;br/&gt;
&lt;a href=&quot;https://review.whamcloud.com/#/c/28165/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/#/c/28165/&lt;/a&gt; is the fix for &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-9203&quot; title=&quot;parallel-scale-nfsv3 test_compilebench: MDS hit BUG: unable to handle kernel paging request&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-9203&quot;&gt;&lt;del&gt;LU-9203&lt;/del&gt;&lt;/a&gt;, which affects both client and server.&lt;br/&gt;
&lt;a href=&quot;https://review.whamcloud.com/#/c/26965/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/#/c/26965/&lt;/a&gt; is the fix for this ticket; it is a fix for server code, but it modifies lu_object_put(), so it affects client code too.&lt;/p&gt;

&lt;p&gt;I&apos;d suggest porting the latter two patches; there is no direct dependency between them.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="41437">LU-8815</issuekey>
        </issuelink>
                            </outwardlinks>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="38641">LU-8481</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="26372" name="lu-9049.tar.gz" size="4690003" author="cliffw" created="Wed, 19 Apr 2017 19:21:30 +0000"/>
                            <attachment id="25072" name="lustre-log.1485304895.6159.txt.gz" size="10174192" author="cliffw" created="Wed, 25 Jan 2017 16:28:03 +0000"/>
                            <attachment id="25073" name="stack-dump-lola-10.txt" size="1277867" author="cliffw" created="Wed, 25 Jan 2017 16:27:57 +0000"/>
                            <attachment id="25074" name="trace" size="1077447" author="di.wang" created="Wed, 25 Jan 2017 17:32:30 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzz1nz:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>