<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:32:15 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary, append 'field=key&field=summary' to the URL of your request.
-->
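<!--
As an illustration of the field filter described above, the URL can be built as follows. This is a hedged sketch: the /si/jira.issueviews:issue-xml/KEY/KEY.xml path is JIRA's usual XML issue-view URL pattern, assumed here rather than taken from this document; adjust the base URL for your instance.

```shell
# Build the XML issue-view URL restricted to the issue key and summary.
# (Assumed JIRA issue-view URL pattern; not confirmed by this export.)
base="https://jira.whamcloud.com/si/jira.issueviews:issue-xml/LU-10123/LU-10123.xml"
url="${base}?field=key&field=summary"
echo "$url"
```
-->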
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-10123] Repeated timeouts during mount() - MDS and client</title>
                <link>https://jira.whamcloud.com/browse/LU-10123</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Attempting to mount a filesystem with DNE. &lt;br/&gt;
The MGT/MDS node hangs on the initial mount. &lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;Oct 16 16:38:01 soak-8 kernel: LDISKFS-fs warning (device dm-0): ldiskfs_multi_mount_protect:322: MMP interval 42 higher than expected, please wait.
Oct 16 16:38:44 soak-8 kernel: LDISKFS-fs (dm-0): recovery complete
Oct 16 16:38:44 soak-8 kernel: LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,user_xattr,no_mbcache,nodelalloc
Oct 16 16:38:45 soak-8 kernel: Lustre: MGS: Connection restored to e7ad9453-d165-e9e1-75ba-304460ab678f (at 0@lo)
Oct 16 16:38:50 soak-8 kernel: Lustre: 12855:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; slow reply: [sent 1508171925/real 1508171925]  req@ffff88081c348c00 x1581432791105776/t0(0) o8-&amp;gt;soaked-OST0001-osc-MDT0000@192.168.1.103@o2ib:28/4 lens 520/544 e 0 to 1 dl 1508171930 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Oct 16 16:39:10 soak-8 kernel: LustreError: 137-5: soaked-MDT0001_UUID: not available &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; connect from 0@lo (no target). If you are running an HA pair check that the target is mounted on the other server.
Oct 16 16:39:15 soak-8 kernel: Lustre: 12855:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; slow reply: [sent 1508171950/real 1508171950]  req@ffff880814820c00 x1581432791105856/t0(0) o8-&amp;gt;soaked-OST0000-osc-MDT0000@192.168.1.103@o2ib:28/4 lens 520/544 e 0 to 1 dl 1508171955 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Oct 16 16:39:15 soak-8 kernel: Lustre: 12855:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
Oct 16 16:39:24 soak-8 sssd_be: GSSAPI client step 1
Oct 16 16:39:24 soak-8 sssd_be: GSSAPI client step 1
Oct 16 16:39:24 soak-8 sssd_be: GSSAPI client step 1
Oct 16 16:39:24 soak-8 sssd_be: GSSAPI client step 2
Oct 16 16:39:45 soak-8 kernel: Lustre: 12855:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; slow reply: [sent 1508171975/real 1508171975]  req@ffff880814821800 x1581432791105936/t0(0) o38-&amp;gt;soaked-MDT0003-osp-MDT0000@192.168.1.111@o2ib:24/4 lens 520/544 e 0 to 1 dl 1508171985 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Oct 16 16:39:45 soak-8 kernel: Lustre: 12855:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
Oct 16 16:40:00 soak-8 kernel: LustreError: 137-5: soaked-MDT0001_UUID: not available &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; connect from 0@lo (no target). If you are running an HA pair check that the target is mounted on the other server.
Oct 16 16:40:01 soak-8 systemd: Started Session 7 of user root.
Oct 16 16:40:01 soak-8 systemd: Starting Session 7 of user root.
Oct 16 16:40:01 soak-8 CROND[13027]: (root) CMD (/usr/lib64/sa/sa1 1 1)
Oct 16 16:40:10 soak-8 kernel: Lustre: 12855:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; slow reply: [sent 1508172000/real 1508172000]  req@ffff880814822d00 x1581432791106048/t0(0) o8-&amp;gt;soaked-OST0000-osc-MDT0000@192.168.1.103@o2ib:28/4 lens 520/544 e 0 to 1 dl 1508172010 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Oct 16 16:40:10 soak-8 kernel: Lustre: 12855:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
Oct 16 16:40:40 soak-8 kernel: Lustre: 12855:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; slow reply: [sent 1508172025/real 1508172025]  req@ffff880814823c00 x1581432791106128/t0(0) o38-&amp;gt;soaked-MDT0003-osp-MDT0000@192.168.1.111@o2ib:24/4 lens 520/544 e 0 to 1 dl 1508172040 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Oct 16 16:40:40 soak-8 kernel: Lustre: 12855:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
Oct 16 16:40:50 soak-8 kernel: LustreError: 137-5: soaked-MDT0001_UUID: not available &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; connect from 0@lo (no target). If you are running an HA pair check that the target is mounted on the other server.
Oct 16 16:40:53 soak-8 kernel: INFO: task mount.lustre:12918 blocked &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; more than 120 seconds.
Oct 16 16:40:54 soak-8 kernel: mount.lustre    D 0000000000000001     0 12918  12917 0x00000082
Oct 16 16:40:54 soak-8 kernel: ffff8804b9bbb7b0 0000000000000082 ffff88083f9e0000 ffff8804b9bbbfd8
Oct 16 16:40:54 soak-8 kernel: ffff8804b9bbbfd8 ffff8804b9bbbfd8 ffff88083f9e0000 ffff8808203a8150
Oct 16 16:40:54 soak-8 kernel: 7fffffffffffffff ffff8808203a8148 ffff88083f9e0000 0000000000000001
Oct 16 16:40:54 soak-8 kernel: Call Trace:
Oct 16 16:40:54 soak-8 kernel: [&amp;lt;ffffffff816a94c9&amp;gt;] schedule+0x29/0x70
Oct 16 16:40:54 soak-8 kernel: [&amp;lt;ffffffff816a6fd9&amp;gt;] schedule_timeout+0x239/0x2c0
Oct 16 16:40:54 soak-8 kernel: [&amp;lt;ffffffff810c9ef9&amp;gt;] ? select_task_rq_fair+0x549/0x700
Oct 16 16:40:54 soak-8 kernel: [&amp;lt;ffffffff816a987d&amp;gt;] wait_for_completion+0xfd/0x140
Oct 16 16:40:54 soak-8 kernel: [&amp;lt;ffffffff810c4810&amp;gt;] ? wake_up_state+0x20/0x20
Oct 16 16:40:54 soak-8 kernel: [&amp;lt;ffffffffc0a2f6b4&amp;gt;] llog_process_or_fork+0x244/0x450 [obdclass]
Oct 16 16:40:54 soak-8 kernel: [&amp;lt;ffffffffc0a2f8d4&amp;gt;] llog_process+0x14/0x20 [obdclass]
Oct 16 16:40:54 soak-8 kernel: [&amp;lt;ffffffffc0a5fe45&amp;gt;] class_config_parse_llog+0x125/0x350 [obdclass]
Oct 16 16:40:54 soak-8 kernel: [&amp;lt;ffffffffc10235a8&amp;gt;] mgc_process_cfg_log+0x788/0xc40 [mgc]
Oct 16 16:40:54 soak-8 kernel: [&amp;lt;ffffffffc1026413&amp;gt;] mgc_process_log+0x3d3/0x890 [mgc]
Oct 16 16:40:54 soak-8 kernel: [&amp;lt;ffffffffc0a68550&amp;gt;] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
Oct 16 16:40:54 soak-8 kernel: [&amp;lt;ffffffffc1026b18&amp;gt;] ? do_config_log_add+0x248/0x580 [mgc]
Oct 16 16:40:54 soak-8 kernel: [&amp;lt;ffffffffc10279f0&amp;gt;] mgc_process_config+0x890/0x13f0 [mgc]
Oct 16 16:40:54 soak-8 kernel: [&amp;lt;ffffffffc0a6bfee&amp;gt;] lustre_process_log+0x2de/0xaf0 [obdclass]
Oct 16 16:40:54 soak-8 kernel: [&amp;lt;ffffffffc090dba7&amp;gt;] ? libcfs_debug_msg+0x57/0x80 [libcfs]
Oct 16 16:40:54 soak-8 kernel: [&amp;lt;ffffffffc0a57339&amp;gt;] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
Oct 16 16:40:54 soak-8 kernel: [&amp;lt;ffffffffc0a98c32&amp;gt;] server_start_targets+0x1352/0x2a70 [obdclass]
Oct 16 16:40:54 soak-8 kernel: [&amp;lt;ffffffffc0a6ef5d&amp;gt;] ? lustre_start_mgc+0x20d/0x2800 [obdclass]
Oct 16 16:40:54 soak-8 kernel: [&amp;lt;ffffffffc0a68550&amp;gt;] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
Oct 16 16:40:54 soak-8 kernel: [&amp;lt;ffffffffc0a9b3dd&amp;gt;] server_fill_super+0x108d/0x184c [obdclass]
Oct 16 16:40:54 soak-8 kernel: [&amp;lt;ffffffffc0a71ff8&amp;gt;] lustre_fill_super+0x328/0x950 [obdclass]
Oct 16 16:40:54 soak-8 kernel: [&amp;lt;ffffffffc0a71cd0&amp;gt;] ? lustre_common_put_super+0x270/0x270 [obdclass]
Oct 16 16:40:54 soak-8 kernel: [&amp;lt;ffffffff81204b0d&amp;gt;] mount_nodev+0x4d/0xb0
Oct 16 16:40:54 soak-8 kernel: [&amp;lt;ffffffffc0a69e18&amp;gt;] lustre_mount+0x38/0x60 [obdclass]
Oct 16 16:40:54 soak-8 kernel: [&amp;lt;ffffffff81205599&amp;gt;] mount_fs+0x39/0x1b0
Oct 16 16:40:54 soak-8 kernel: [&amp;lt;ffffffff81222077&amp;gt;] vfs_kern_mount+0x67/0x110
Oct 16 16:40:54 soak-8 kernel: [&amp;lt;ffffffff81224583&amp;gt;] do_mount+0x233/0xaf0
Oct 16 16:40:54 soak-8 kernel: [&amp;lt;ffffffff8118760e&amp;gt;] ? __get_free_pages+0xe/0x40
Oct 16 16:40:54 soak-8 kernel: [&amp;lt;ffffffff812251c6&amp;gt;] SyS_mount+0x96/0xf0
Oct 16 16:40:54 soak-8 kernel: [&amp;lt;ffffffff816b4fc9&amp;gt;] system_call_fastpath+0x16/0x1b
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Power-cycled soak-8; the mount completed after the power cycle. Continued bringing up the filesystem.&lt;br/&gt;
soak-10 (MDT0003) hung during mount.&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;Oct 16 16:57:37 soak-10 kernel: INFO: task mount.lustre:13350 blocked &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; more than 120 seconds.
Oct 16 16:57:37 soak-10 kernel: &lt;span class=&quot;code-quote&quot;&gt;&quot;echo 0 &amp;gt; /proc/sys/kernel/hung_task_timeout_secs&quot;&lt;/span&gt; disables &lt;span class=&quot;code-keyword&quot;&gt;this&lt;/span&gt; message.
Oct 16 16:57:38 soak-10 kernel: mount.lustre    D 0000000000000001     0 13350  13349 0x00000082
Oct 16 16:57:38 soak-10 kernel: ffff880817bdf7b0 0000000000000086 ffff880817783f40 ffff880817bdffd8
Oct 16 16:57:38 soak-10 kernel: ffff880817bdffd8 ffff880817bdffd8 ffff880817783f40 ffff88082ce87c30
Oct 16 16:57:38 soak-10 kernel: 7fffffffffffffff ffff88082ce87c28 ffff880817783f40 0000000000000001
Oct 16 16:57:38 soak-10 kernel: Call Trace:
Oct 16 16:57:38 soak-10 kernel: [&amp;lt;ffffffff816a94c9&amp;gt;] schedule+0x29/0x70
Oct 16 16:57:38 soak-10 kernel: [&amp;lt;ffffffff816a6fd9&amp;gt;] schedule_timeout+0x239/0x2c0
Oct 16 16:57:38 soak-10 kernel: [&amp;lt;ffffffff810c76f5&amp;gt;] ? sched_clock_cpu+0x85/0xc0
Oct 16 16:57:38 soak-10 kernel: [&amp;lt;ffffffff810c12c8&amp;gt;] ? check_preempt_curr+0x78/0xa0
Oct 16 16:57:38 soak-10 kernel: [&amp;lt;ffffffff810c1309&amp;gt;] ? ttwu_do_wakeup+0x19/0xd0
Oct 16 16:57:38 soak-10 kernel: [&amp;lt;ffffffff816a987d&amp;gt;] wait_for_completion+0xfd/0x140
Oct 16 16:57:38 soak-10 kernel: [&amp;lt;ffffffff810c4810&amp;gt;] ? wake_up_state+0x20/0x20
Oct 16 16:57:38 soak-10 kernel: [&amp;lt;ffffffffc0a406b4&amp;gt;] llog_process_or_fork+0x244/0x450 [obdclass]
Oct 16 16:57:38 soak-10 kernel: [&amp;lt;ffffffffc0a408d4&amp;gt;] llog_process+0x14/0x20 [obdclass]
Oct 16 16:57:38 soak-10 kernel: [&amp;lt;ffffffffc0a70e45&amp;gt;] class_config_parse_llog+0x125/0x350 [obdclass]
Oct 16 16:57:38 soak-10 kernel: [&amp;lt;ffffffffc14505a8&amp;gt;] mgc_process_cfg_log+0x788/0xc40 [mgc]
Oct 16 16:57:38 soak-10 kernel: [&amp;lt;ffffffffc1453413&amp;gt;] mgc_process_log+0x3d3/0x890 [mgc]
Oct 16 16:57:38 soak-10 kernel: [&amp;lt;ffffffffc1453b77&amp;gt;] ? do_config_log_add+0x2a7/0x580 [mgc]
Oct 16 16:57:38 soak-10 kernel: [&amp;lt;ffffffffc14549f0&amp;gt;] mgc_process_config+0x890/0x13f0 [mgc]
Oct 16 16:57:38 soak-10 kernel: [&amp;lt;ffffffffc0a7cfee&amp;gt;] lustre_process_log+0x2de/0xaf0 [obdclass]
Oct 16 16:57:38 soak-10 kernel: [&amp;lt;ffffffffc091b4e4&amp;gt;] ? libcfs_log_return+0x24/0x30 [libcfs]
Oct 16 16:57:38 soak-10 kernel: [&amp;lt;ffffffffc0aa9c32&amp;gt;] server_start_targets+0x1352/0x2a70 [obdclass]
Oct 16 16:57:38 soak-10 kernel: [&amp;lt;ffffffffc0a80259&amp;gt;] ? lustre_start_mgc+0x509/0x2800 [obdclass]
Oct 16 16:57:38 soak-10 kernel: [&amp;lt;ffffffffc0a79550&amp;gt;] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
Oct 16 16:57:38 soak-10 kernel: [&amp;lt;ffffffffc0aac3dd&amp;gt;] server_fill_super+0x108d/0x184c [obdclass]
Oct 16 16:57:38 soak-10 kernel: [&amp;lt;ffffffffc0a82ff8&amp;gt;] lustre_fill_super+0x328/0x950 [obdclass]
Oct 16 16:57:38 soak-10 kernel: [&amp;lt;ffffffffc0a82cd0&amp;gt;] ? lustre_common_put_super+0x270/0x270 [obdclass]
Oct 16 16:57:38 soak-10 kernel: [&amp;lt;ffffffff81204b0d&amp;gt;] mount_nodev+0x4d/0xb0
Oct 16 16:57:38 soak-10 kernel: [&amp;lt;ffffffffc0a7ae18&amp;gt;] lustre_mount+0x38/0x60 [obdclass]
Oct 16 16:57:38 soak-10 kernel: [&amp;lt;ffffffff81205599&amp;gt;] mount_fs+0x39/0x1b0
Oct 16 16:57:38 soak-10 kernel: [&amp;lt;ffffffff81222077&amp;gt;] vfs_kern_mount+0x67/0x110
Oct 16 16:57:38 soak-10 kernel: [&amp;lt;ffffffff81224583&amp;gt;] do_mount+0x233/0xaf0
Oct 16 16:57:38 soak-10 kernel: [&amp;lt;ffffffff8118760e&amp;gt;] ? __get_free_pages+0xe/0x40
Oct 16 16:57:38 soak-10 kernel: [&amp;lt;ffffffff812251c6&amp;gt;] SyS_mount+0x96/0xf0
Oct 16 16:57:38 soak-10 kernel: [&amp;lt;ffffffff816b4fc9&amp;gt;] system_call_fastpath+0x16/0x1b
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Power-cycled that node and continued with the mount; the second mount on soak-10 worked.&lt;br/&gt;
While mounting clients, soak-40 hung.&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;Oct 16 17:17:56 soak-40 kernel: INFO: task mount.lustre:2933 blocked &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; more than 120 seconds.
Oct 16 17:17:56 soak-40 kernel: &lt;span class=&quot;code-quote&quot;&gt;&quot;echo 0 &amp;gt; /proc/sys/kernel/hung_task_timeout_secs&quot;&lt;/span&gt; disables &lt;span class=&quot;code-keyword&quot;&gt;this&lt;/span&gt; message.
Oct 16 17:17:56 soak-40 kernel: mount.lustre    D 0000000000000001     0  2933   2932 0x00000080
Oct 16 17:17:56 soak-40 kernel: ffff88107f827940 0000000000000086 ffff88085c2e5ee0 ffff88107f827fd8
Oct 16 17:17:56 soak-40 kernel: ffff88107f827fd8 ffff88107f827fd8 ffff88085c2e5ee0 ffff88085cb5c5d0
Oct 16 17:17:57 soak-40 kernel: 7fffffffffffffff ffff88085cb5c5c8 ffff88085c2e5ee0 0000000000000001
Oct 16 17:17:57 soak-40 kernel: Call Trace:
Oct 16 17:17:57 soak-40 kernel: [&amp;lt;ffffffff816a94e9&amp;gt;] schedule+0x29/0x70
Oct 16 17:17:57 soak-40 kernel: [&amp;lt;ffffffff816a6ff9&amp;gt;] schedule_timeout+0x239/0x2c0
Oct 16 17:17:57 soak-40 kernel: [&amp;lt;ffffffff810c76f5&amp;gt;] ? sched_clock_cpu+0x85/0xc0
Oct 16 17:17:57 soak-40 kernel: [&amp;lt;ffffffff810c12c8&amp;gt;] ? check_preempt_curr+0x78/0xa0
Oct 16 17:17:57 soak-40 kernel: [&amp;lt;ffffffff810c1309&amp;gt;] ? ttwu_do_wakeup+0x19/0xd0
Oct 16 17:17:57 soak-40 kernel: [&amp;lt;ffffffff816a989d&amp;gt;] wait_for_completion+0xfd/0x140
Oct 16 17:17:57 soak-40 kernel: [&amp;lt;ffffffff810c4810&amp;gt;] ? wake_up_state+0x20/0x20
Oct 16 17:17:57 soak-40 kernel: [&amp;lt;ffffffffc0ba66b4&amp;gt;] llog_process_or_fork+0x244/0x450 [obdclass]
Oct 16 17:17:57 soak-40 kernel: [&amp;lt;ffffffffc0ba68d4&amp;gt;] llog_process+0x14/0x20 [obdclass]
Oct 16 17:17:57 soak-40 kernel: [&amp;lt;ffffffffc0bd6d65&amp;gt;] class_config_parse_llog+0x125/0x350 [obdclass]
Oct 16 17:17:57 soak-40 kernel: [&amp;lt;ffffffffc0844368&amp;gt;] mgc_process_cfg_log+0x788/0xc40 [mgc]
Oct 16 17:17:57 soak-40 kernel: [&amp;lt;ffffffffc08471e3&amp;gt;] mgc_process_log+0x3d3/0x890 [mgc]
Oct 16 17:17:57 soak-40 kernel: [&amp;lt;ffffffffc0847947&amp;gt;] ? do_config_log_add+0x2a7/0x580 [mgc]
Oct 16 17:17:57 soak-40 kernel: [&amp;lt;ffffffffc08487c0&amp;gt;] mgc_process_config+0x890/0x13f0 [mgc]
Oct 16 17:17:58 soak-40 kernel: [&amp;lt;ffffffffc0be2d7e&amp;gt;] lustre_process_log+0x2de/0xaf0 [obdclass]
Oct 16 17:17:58 soak-40 kernel: [&amp;lt;ffffffff816aba0e&amp;gt;] ? _raw_spin_unlock_bh+0x1e/0x20
Oct 16 17:17:58 soak-40 kernel: [&amp;lt;ffffffff8132318b&amp;gt;] ? fprop_local_init_percpu+0x1b/0x30
Oct 16 17:17:58 soak-40 kernel: [&amp;lt;ffffffffc0d7b608&amp;gt;] ll_fill_super+0xaf8/0x1220 [lustre]
Oct 16 17:17:58 soak-40 kernel: [&amp;lt;ffffffffc0be8ce6&amp;gt;] lustre_fill_super+0x286/0x910 [obdclass]
Oct 16 17:17:58 soak-40 kernel: [&amp;lt;ffffffffc0be8a60&amp;gt;] ? lustre_common_put_super+0x270/0x270 [obdclass]
Oct 16 17:17:58 soak-40 kernel: [&amp;lt;ffffffff81204b0d&amp;gt;] mount_nodev+0x4d/0xb0
Oct 16 17:17:58 soak-40 kernel: [&amp;lt;ffffffffc0be0ba8&amp;gt;] lustre_mount+0x38/0x60 [obdclass]
Oct 16 17:17:58 soak-40 kernel: [&amp;lt;ffffffff81205599&amp;gt;] mount_fs+0x39/0x1b0
Oct 16 17:17:58 soak-40 kernel: [&amp;lt;ffffffff81222067&amp;gt;] vfs_kern_mount+0x67/0x110
Oct 16 17:17:58 soak-40 kernel: [&amp;lt;ffffffff81224573&amp;gt;] do_mount+0x233/0xaf0
Oct 16 17:17:58 soak-40 kernel: [&amp;lt;ffffffff8118760e&amp;gt;] ? __get_free_pages+0xe/0x40
Oct 16 17:17:58 soak-40 kernel: [&amp;lt;ffffffff812251b6&amp;gt;] SyS_mount+0x96/0xf0
Oct 16 17:17:58 soak-40 kernel: [&amp;lt;ffffffff816b5009&amp;gt;] system_call_fastpath+0x16/0x1b
Oct 16 17:19:59 soak-40 kernel: INFO: task mount.lustre:2933 blocked &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; more than 120 seconds.
Oct 16 17:19:59 soak-40 kernel: &lt;span class=&quot;code-quote&quot;&gt;&quot;echo 0 &amp;gt; /proc/sys/kernel/hung_task_timeout_secs&quot;&lt;/span&gt; disables &lt;span class=&quot;code-keyword&quot;&gt;this&lt;/span&gt; message.
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Dumped a lustre-log (with the -1 debug mask) from the client; attached.&lt;/p&gt;</description>
                <environment>Soak cluster, lustre-master build 3654 version=2.10.54_13_g84f690e</environment>
        <key id="48766">LU-10123</key>
            <summary>Repeated timeouts during mount() - MDS and client</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="bfaccini">Bruno Faccini</assignee>
                                    <reporter username="cliffw">Cliff White</reporter>
                        <labels>
                            <label>soak</label>
                    </labels>
                <created>Mon, 16 Oct 2017 17:28:22 +0000</created>
                <updated>Fri, 15 May 2020 19:24:21 +0000</updated>
                            <resolved>Tue, 6 Feb 2018 17:04:17 +0000</resolved>
                                    <version>Lustre 2.11.0</version>
                                    <fixVersion>Lustre 2.11.0</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>9</watches>
                                                                            <comments>
                            <comment id="211212" author="pjones" created="Mon, 16 Oct 2017 17:36:48 +0000"  >&lt;p&gt;Bruno&lt;/p&gt;

&lt;p&gt;Could you please investigate this one?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="212704" author="cliffw" created="Thu, 2 Nov 2017 20:48:15 +0000"  >&lt;p&gt;These are appearing again with the latest lustre-master build&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;soak-24 login: [ 7800.834964] INFO: task mount.lustre:2613 blocked &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; more than 120 seconds.
[ 7800.844715] &lt;span class=&quot;code-quote&quot;&gt;&quot;echo 0 &amp;gt; /proc/sys/kernel/hung_task_timeout_secs&quot;&lt;/span&gt; disables &lt;span class=&quot;code-keyword&quot;&gt;this&lt;/span&gt; message.
[ 7800.855390] mount.lustre    D 0000000000000001     0  2613   2612 0x00000080
[ 7800.865073]  ffff881014bdf940 0000000000000086 ffff88101c1cdee0 ffff881014bdffd8
[ 7800.875067]  ffff881014bdffd8 ffff881014bdffd8 ffff88101c1cdee0 ffff88101c13d5d0
[ 7800.884968]  7fffffffffffffff ffff88101c13d5c8 ffff88101c1cdee0 0000000000000001
[ 7800.894763] Call Trace:
[ 7800.898926]  [&amp;lt;ffffffff816a94e9&amp;gt;] schedule+0x29/0x70
[ 7800.905812]  [&amp;lt;ffffffff816a6ff9&amp;gt;] schedule_timeout+0x239/0x2c0
[ 7800.913671]  [&amp;lt;ffffffff810c9ef9&amp;gt;] ? select_task_rq_fair+0x549/0x700
[ 7800.921952]  [&amp;lt;ffffffff816a989d&amp;gt;] wait_for_completion+0xfd/0x140
[ 7800.929957]  [&amp;lt;ffffffff810c4810&amp;gt;] ? wake_up_state+0x20/0x20
[ 7800.937435]  [&amp;lt;ffffffffc09f3724&amp;gt;] llog_process_or_fork+0x244/0x450 [obdclass]
[ 7800.946642]  [&amp;lt;ffffffffc09f3944&amp;gt;] llog_process+0x14/0x20 [obdclass]
[ 7800.954833]  [&amp;lt;ffffffffc0a24a05&amp;gt;] class_config_parse_llog+0x125/0x350 [obdclass]
[ 7800.964256]  [&amp;lt;ffffffffc08cd368&amp;gt;] mgc_process_cfg_log+0x788/0xc40 [mgc]
[ 7800.972714]  [&amp;lt;ffffffffc08d01e3&amp;gt;] mgc_process_log+0x3d3/0x890 [mgc]
[ 7800.980801]  [&amp;lt;ffffffffc0a2c770&amp;gt;] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
[ 7800.990481]  [&amp;lt;ffffffffc08d08e8&amp;gt;] ? do_config_log_add+0x248/0x580 [mgc]
[ 7800.998843]  [&amp;lt;ffffffffc08d17c0&amp;gt;] mgc_process_config+0x890/0x13f0 [mgc]
[ 7801.007224]  [&amp;lt;ffffffffc0a3006e&amp;gt;] lustre_process_log+0x2de/0xaf0 [obdclass]
[ 7801.015956]  [&amp;lt;ffffffffc0908c47&amp;gt;] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[ 7801.024307]  [&amp;lt;ffffffffc0ddc736&amp;gt;] ll_fill_super+0xae6/0x11e0 [lustre]
[ 7801.032419]  [&amp;lt;ffffffffc0a35d86&amp;gt;] lustre_fill_super+0x286/0x910 [obdclass]
[ 7801.041016]  [&amp;lt;ffffffffc0a35b00&amp;gt;] ? lustre_common_put_super+0x270/0x270 [obdclass]
[ 7801.050348]  [&amp;lt;ffffffff81204b0d&amp;gt;] mount_nodev+0x4d/0xb0
[ 7801.057065]  [&amp;lt;ffffffffc0a2de98&amp;gt;] lustre_mount+0x38/0x60 [obdclass]
[ 7801.064931]  [&amp;lt;ffffffff81205599&amp;gt;] mount_fs+0x39/0x1b0
[ 7801.071426]  [&amp;lt;ffffffff81222067&amp;gt;] vfs_kern_mount+0x67/0x110
[ 7801.078498]  [&amp;lt;ffffffff81224573&amp;gt;] do_mount+0x233/0xaf0
[ 7801.085079]  [&amp;lt;ffffffff8118760e&amp;gt;] ? __get_free_pages+0xe/0x40
[ 7801.092332]  [&amp;lt;ffffffff812251b6&amp;gt;] SyS_mount+0x96/0xf0
[ 7801.098807]  [&amp;lt;ffffffff816b5009&amp;gt;] system_call_fastpath+0x16/0x1b
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="213361" author="sarah" created="Fri, 10 Nov 2017 22:24:42 +0000"  >&lt;p&gt;seeing this again on lustre-master-ib #16 DNE&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[ 5643.580301] INFO: task mount.lustre:2915 blocked for more than 120 seconds.
[ 5643.589964] &quot;echo 0 &amp;gt; /proc/sys/kernel/hung_task_timeout_secs&quot; disables this message.
[ 5643.600403] mount.lustre    D 0000000000000001     0  2915   2914 0x00000082
[ 5643.609908]  ffff88042e27b7b0 0000000000000082 ffff880409d71fa0 ffff88042e27bfd8
[ 5643.619780]  ffff88042e27bfd8 ffff88042e27bfd8 ffff880409d71fa0 ffff88042b9ad7b0
[ 5643.629533]  7fffffffffffffff ffff88042b9ad7a8 ffff880409d71fa0 0000000000000001
[ 5643.639137] Call Trace:
[ 5643.643121]  [&amp;lt;ffffffff816a9589&amp;gt;] schedule+0x29/0x70
[ 5643.649842]  [&amp;lt;ffffffff816a7099&amp;gt;] schedule_timeout+0x239/0x2c0
[ 5643.657539]  [&amp;lt;ffffffff81050b5c&amp;gt;] ? native_smp_send_reschedule+0x4c/0x70
[ 5643.666145]  [&amp;lt;ffffffff810c0558&amp;gt;] ? resched_curr+0xa8/0xc0
[ 5643.673388]  [&amp;lt;ffffffff810c12d8&amp;gt;] ? check_preempt_curr+0x78/0xa0
[ 5643.681144]  [&amp;lt;ffffffff810c1319&amp;gt;] ? ttwu_do_wakeup+0x19/0xd0
[ 5643.688510]  [&amp;lt;ffffffff816a993d&amp;gt;] wait_for_completion+0xfd/0x140
[ 5643.696212]  [&amp;lt;ffffffff810c4820&amp;gt;] ? wake_up_state+0x20/0x20
[ 5643.703465]  [&amp;lt;ffffffffc0a53724&amp;gt;] llog_process_or_fork+0x244/0x450 [obdclass]
[ 5643.712438]  [&amp;lt;ffffffffc0a53944&amp;gt;] llog_process+0x14/0x20 [obdclass]
[ 5643.720410]  [&amp;lt;ffffffffc0a84ac5&amp;gt;] class_config_parse_llog+0x125/0x350 [obdclass]
[ 5643.729617]  [&amp;lt;ffffffffc14215a8&amp;gt;] mgc_process_cfg_log+0x788/0xc40 [mgc]
[ 5643.737880]  [&amp;lt;ffffffffc1424413&amp;gt;] mgc_process_log+0x3d3/0x890 [mgc]
[ 5643.745784]  [&amp;lt;ffffffffc0a8c810&amp;gt;] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
[ 5643.755294]  [&amp;lt;ffffffffc1424b18&amp;gt;] ? do_config_log_add+0x248/0x580 [mgc]
[ 5643.763560]  [&amp;lt;ffffffffc14259f0&amp;gt;] mgc_process_config+0x890/0x13f0 [mgc]
[ 5643.771778]  [&amp;lt;ffffffffc0a902ae&amp;gt;] lustre_process_log+0x2de/0xaf0 [obdclass]
[ 5643.780388]  [&amp;lt;ffffffffc0968c47&amp;gt;] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[ 5643.788598]  [&amp;lt;ffffffffc0a7bbd9&amp;gt;] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
[ 5643.797374]  [&amp;lt;ffffffffc0abca12&amp;gt;] server_start_targets+0x1352/0x2a70 [obdclass]
[ 5643.806348]  [&amp;lt;ffffffffc0a93270&amp;gt;] ? lustre_start_mgc+0x260/0x25b0 [obdclass]
[ 5643.815030]  [&amp;lt;ffffffffc0a8c810&amp;gt;] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
[ 5643.824484]  [&amp;lt;ffffffffc0abf1bd&amp;gt;] server_fill_super+0x108d/0x184c [obdclass]
[ 5643.833157]  [&amp;lt;ffffffffc0a96068&amp;gt;] lustre_fill_super+0x328/0x950 [obdclass]
[ 5643.841638]  [&amp;lt;ffffffffc0a95d40&amp;gt;] ? lustre_common_put_super+0x270/0x270 [obdclass]
[ 5643.850881]  [&amp;lt;ffffffff81204a7d&amp;gt;] mount_nodev+0x4d/0xb0
[ 5643.857516]  [&amp;lt;ffffffffc0a8e0d8&amp;gt;] lustre_mount+0x38/0x60 [obdclass]
[ 5643.865298]  [&amp;lt;ffffffff81205509&amp;gt;] mount_fs+0x39/0x1b0
[ 5643.871716]  [&amp;lt;ffffffff81221fe7&amp;gt;] vfs_kern_mount+0x67/0x110
[ 5643.878713]  [&amp;lt;ffffffff812244f3&amp;gt;] do_mount+0x233/0xaf0
[ 5643.885221]  [&amp;lt;ffffffff8118761e&amp;gt;] ? __get_free_pages+0xe/0x40
[ 5643.892384]  [&amp;lt;ffffffff81225136&amp;gt;] SyS_mount+0x96/0xf0
[ 5643.898795]  [&amp;lt;ffffffff816b5089&amp;gt;] system_call_fastpath+0x16/0x1b
[root@soak-9 ~]# 

&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="213735" author="yujian" created="Wed, 15 Nov 2017 07:38:59 +0000"  >&lt;p&gt;conf-sanity test 53b also hit this issue on master branch, which is affecting patch review testing:&lt;br/&gt;
&lt;a href=&quot;https://testing.hpdd.intel.com/test_sets/e63e81e6-c94c-11e7-a066-52540065bddc&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://testing.hpdd.intel.com/test_sets/e63e81e6-c94c-11e7-a066-52540065bddc&lt;/a&gt;&lt;br/&gt;
&lt;a href=&quot;https://testing.hpdd.intel.com/test_sets/683390e6-c570-11e7-a066-52540065bddc&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://testing.hpdd.intel.com/test_sets/683390e6-c570-11e7-a066-52540065bddc&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="213867" author="talis" created="Thu, 16 Nov 2017 01:25:02 +0000"  >&lt;p&gt;I realize u work on 2.11 future rel.  This post not related to Soak, but looking at the Log it appears striped sets giving timeout problem.. May i ask why not use IB  RDMA?  i plan a 8-node running 2.10 Lustre with IB connect.. i wish to hold the DB in-memory..distrib across nodes. What is the best installation using  Mellanox ConnectX2  VPI 40Gb / node in pure IB thru a SX6036 switch.  These methods need integration within Lustre ..  At What stage ?&lt;/p&gt;

&lt;p&gt;Is the WQE encapsulated inside the PCIe doorbell and written via Programmed IO? Or is the WQE fetched in a separate DMA read by the HCA?&lt;br/&gt;
It can be either. The usual way is writing WQE and ringing the doorbell, in which case the HCA will fetch the WQE through DMA. The less common way is what is called &quot;Blue Flame&quot; - part of the PCIe BAR is used for WQEs, and the WQE is written to the HCA the same way the doorbell is written.&lt;/p&gt;

&lt;p&gt;The main issue: how does Lustre map to an in-memory distributed store? Even on a two-node split using only ConnectX-2 VPI (dual port) or X3 (though I hear IB performance is slower on X3). The new features of Lustre 2.10 make local routing a breeze, direct from the header.&lt;/p&gt;

&lt;p&gt;I don't wish for this MPI issue to dominate (even though I am not using MPI directly). Can you say whether MPI libraries are used at all in Lustre, or affect the UDP binaries, in case I use UDP?&lt;/p&gt;

&lt;p&gt;Is IB datagram mode available as a DMA (or other-layer) transport within Lustre 2.10?&lt;/p&gt;

&lt;p&gt;MPI IP performance is affected by the socket-buffer sizes for sending and receiving UDP data. These are defined by two network tuning parameters udp_sendspace and udp_recvspace. When the buffer for sending data is too small and quickly becomes full, UDP data transfer can be delayed. When the buffer for receiving data is too small, incoming UDP data can be dropped due to insufficient buffer space, resulting in send-side retransmission and very poor performance.&lt;/p&gt;

&lt;p&gt;Here's v2 of the Mellanox IB utils:&lt;br/&gt;
&lt;a href=&quot;http://www.mellanox.com/page/products_dyn?product_family=110&amp;amp;mtag=monitoring_debug&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://www.mellanox.com/page/products_dyn?product_family=110&amp;amp;mtag=monitoring_debug&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Worth asking whether Sarah or others might be interested in a remote install/dev on a NY cluster, on a casual basis.&lt;/p&gt;</comment>
                            <comment id="213902" author="bfaccini" created="Thu, 16 Nov 2017 13:21:54 +0000"  >&lt;p&gt;The attached Lustre debug log indicates that during the mount process, Client is fetching the soaked-client config LLOG from MGS and running the different config steps according to its content.&lt;/p&gt;

&lt;p&gt;And it seems to be stuck during the LNet peer discovery of the 192.168.1.110@o2ib NID for the soaked-MDT0002-mdc import creation:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00000040:00001000:67.0:1508174041.158589:0:2946:0:(llog.c:513:llog_process_thread()) processing rec 0xffff8808f39f5698 type 0x10620000
00000040:00001000:67.0:1508174041.158590:0:2946:0:(llog.c:519:llog_process_thread()) after swabbing, type=0x10620000 idx=38
00000040:00001000:67.0:1508174041.158591:0:2946:0:(llog.c:595:llog_process_thread()) lrh_index: 38 lrh_len: 144 (2408 remains)
00000020:00000001:67.0:1508174041.158592:0:2946:0:(obd_config.c:1496:class_config_llog_handler()) Process entered
00000020:00000010:67.0:1508174041.158592:0:2946:0:(obd_config.c:1629:class_config_llog_handler()) kmalloced &apos;inst_name&apos;: 39 at ffff88083c75d040.
00000020:01000000:67.0:1508174041.158594:0:2946:0:(obd_config.c:1637:class_config_llog_handler()) cmd cf003, instance name: soaked-MDT0002-mdc-ffff88105dcb8800
00000020:00000010:67.0:1508174041.158595:0:2946:0:(obd_config.c:1707:class_config_llog_handler()) kmalloced &apos;lcfg_new&apos;: 136 at ffff88103f70e900.
00000020:00000080:67.0:1508174041.158596:0:2946:0:(obd_config.c:1180:class_process_config()) processing cmd: cf003
00000020:00000001:67.0:1508174041.158597:0:2946:0:(obd_config.c:479:class_setup()) Process entered
00000001:00000001:67.0:1508174041.158598:0:2946:0:(hash.c:1005:cfs_hash_create()) Process entered
00000001:00000010:67.0:1508174041.158599:0:2946:0:(hash.c:1032:cfs_hash_create()) alloc &apos;(hs)&apos;: 144 at ffff88103f70e840 (tot 21621812).
00000001:00000010:67.0:1508174041.158600:0:2946:0:(hash.c:893:cfs_hash_buckets_realloc()) alloc &apos;(new_bkts)&apos;: 32 at ffff88104f3f1580 (tot 21621844).
00000001:00000010:67.0:1508174041.158602:0:2946:0:(hash.c:906:cfs_hash_buckets_realloc()) alloc &apos;(new_bkts[i])&apos;: 280 at ffff88107fd4c600 (tot 21622124).
00000001:00000010:67.0:1508174041.158604:0:2946:0:(hash.c:906:cfs_hash_buckets_realloc()) alloc &apos;(new_bkts[i])&apos;: 280 at ffff88107fd4c800 (tot 21622404).
00000001:00000010:67.0:1508174041.158605:0:2946:0:(hash.c:906:cfs_hash_buckets_realloc()) alloc &apos;(new_bkts[i])&apos;: 280 at ffff88107fd4ca00 (tot 21622684).
00000001:00000010:67.0:1508174041.158607:0:2946:0:(hash.c:906:cfs_hash_buckets_realloc()) alloc &apos;(new_bkts[i])&apos;: 280 at ffff88107fd4cc00 (tot 21622964).
00000001:00000001:67.0:1508174041.158608:0:2946:0:(hash.c:1005:cfs_hash_create()) Process entered
00000001:00000010:67.0:1508174041.158609:0:2946:0:(hash.c:1032:cfs_hash_create()) alloc &apos;(hs)&apos;: 144 at ffff88103f70e780 (tot 21623108).
00000001:00000010:67.0:1508174041.158610:0:2946:0:(hash.c:893:cfs_hash_buckets_realloc()) alloc &apos;(new_bkts)&apos;: 32 at ffff88104f3f15a0 (tot 21623140).
00000001:00000010:67.0:1508174041.158611:0:2946:0:(hash.c:906:cfs_hash_buckets_realloc()) alloc &apos;(new_bkts[i])&apos;: 280 at ffff88107fd4ce00 (tot 21623420).
00000001:00000010:67.0:1508174041.158613:0:2946:0:(hash.c:906:cfs_hash_buckets_realloc()) alloc &apos;(new_bkts[i])&apos;: 280 at ffff88107fd4d000 (tot 21623700).
00000001:00000010:67.0:1508174041.158614:0:2946:0:(hash.c:906:cfs_hash_buckets_realloc()) alloc &apos;(new_bkts[i])&apos;: 280 at ffff88107fd4d200 (tot 21623980).
00000001:00000010:67.0:1508174041.158616:0:2946:0:(hash.c:906:cfs_hash_buckets_realloc()) alloc &apos;(new_bkts[i])&apos;: 280 at ffff88107fd4d400 (tot 21624260).
00000001:00000001:67.0:1508174041.158617:0:2946:0:(hash.c:1005:cfs_hash_create()) Process entered
00000001:00000010:67.0:1508174041.158617:0:2946:0:(hash.c:1032:cfs_hash_create()) alloc &apos;(hs)&apos;: 144 at ffff88103f70e6c0 (tot 21624404).
00000001:00000010:67.0:1508174041.158619:0:2946:0:(hash.c:893:cfs_hash_buckets_realloc()) alloc &apos;(new_bkts)&apos;: 32 at ffff88104f3f15e0 (tot 21624436).
00000001:00000010:67.0:1508174041.158620:0:2946:0:(hash.c:906:cfs_hash_buckets_realloc()) alloc &apos;(new_bkts[i])&apos;: 280 at ffff88107fd4d600 (tot 21624716).
00000001:00000010:67.0:1508174041.158622:0:2946:0:(hash.c:906:cfs_hash_buckets_realloc()) alloc &apos;(new_bkts[i])&apos;: 280 at ffff88107fd4d800 (tot 21624996).
00000001:00000010:67.0:1508174041.158623:0:2946:0:(hash.c:906:cfs_hash_buckets_realloc()) alloc &apos;(new_bkts[i])&apos;: 280 at ffff88107fd4da00 (tot 21625276).
00000001:00000010:67.0:1508174041.158625:0:2946:0:(hash.c:906:cfs_hash_buckets_realloc()) alloc &apos;(new_bkts[i])&apos;: 280 at ffff88107fd4dc00 (tot 21625556).
00000001:00000001:67.0:1508174041.158626:0:2946:0:(hash.c:1005:cfs_hash_create()) Process entered
00000001:00000010:67.0:1508174041.158626:0:2946:0:(hash.c:1032:cfs_hash_create()) alloc &apos;(hs)&apos;: 144 at ffff88103f70e600 (tot 21625700).
00000001:00000010:67.0:1508174041.158627:0:2946:0:(hash.c:893:cfs_hash_buckets_realloc()) alloc &apos;(new_bkts)&apos;: 32 at ffff88104f3f1600 (tot 21625732).
00000001:00000010:67.0:1508174041.158629:0:2946:0:(hash.c:906:cfs_hash_buckets_realloc()) alloc &apos;(new_bkts[i])&apos;: 280 at ffff88107fd4de00 (tot 21626012).
00000001:00000010:67.0:1508174041.158630:0:2946:0:(hash.c:906:cfs_hash_buckets_realloc()) alloc &apos;(new_bkts[i])&apos;: 280 at ffff88107fd4e000 (tot 21626292).
00000001:00000010:67.0:1508174041.158631:0:2946:0:(hash.c:906:cfs_hash_buckets_realloc()) alloc &apos;(new_bkts[i])&apos;: 280 at ffff88107fd4e200 (tot 21626572).
00000001:00000010:67.0:1508174041.158633:0:2946:0:(hash.c:906:cfs_hash_buckets_realloc()) alloc &apos;(new_bkts[i])&apos;: 280 at ffff88107fd4e400 (tot 21626852).
00000020:00000001:67.0:1508174041.158634:0:2946:0:(genops.c:853:class_new_export()) Process entered
00000020:00000010:67.0:1508174041.158636:0:2946:0:(genops.c:855:class_new_export()) kmalloced &apos;export&apos;: 968 at ffff88107f8f0400.
00000020:00000001:67.0:1508174041.158637:0:2946:0:(lustre_handles.c:64:class_handle_hash()) Process entered
00000020:00000040:67.0:1508174041.158638:0:2946:0:(lustre_handles.c:98:class_handle_hash()) added object ffff88107f8f0400 with handle 0x36c91a8e859b7cc8 to hash
00000020:00000001:67.0:1508174041.158639:0:2946:0:(lustre_handles.c:99:class_handle_hash()) Process leaving
00000020:00000001:67.0:1508174041.158640:0:2946:0:(obd_class.h:991:obd_init_export()) Process entered
00000020:00000001:67.0:1508174041.158641:0:2946:0:(obd_class.h:995:obd_init_export()) Process leaving (rc=0 : 0 : 0)
00000020:00000040:67.0:1508174041.158642:0:2946:0:(obd_config.c:753:class_incref()) incref soaked-MDT0002-mdc-ffff88105dcb8800 (ffff88107fdacf60) now 2
00000020:00000001:67.0:1508174041.158643:0:2946:0:(genops.c:928:class_new_export()) Process leaving (rc=18446612203173774336 : -131870535777280 : ffff88107f8f0400)
00000020:00000040:67.0:1508174041.158644:0:2946:0:(genops.c:828:class_export_put()) PUTting export ffff88107f8f0400 : new refcount 1
00000020:00000001:67.0:1508174041.158645:0:2946:0:(obd_class.h:582:obd_setup()) Process entered
00000002:00000001:67.0:1508174041.158646:0:2946:0:(mdc_request.c:2483:mdc_setup()) Process entered
00000100:00000001:67.0:1508174041.158647:0:2946:0:(ptlrpcd.c:944:ptlrpcd_addref()) Process entered
00000100:00000001:67.0:1508174041.158648:0:2946:0:(ptlrpcd.c:953:ptlrpcd_addref()) Process leaving (rc=0 : 0 : 0)
00010000:00000001:67.0:1508174041.158649:0:2946:0:(ldlm_lib.c:277:client_obd_setup()) Process entered
00010000:00000010:67.0:1508174041.158651:0:2946:0:(ldlm_lib.c:444:client_obd_setup()) kmalloced &apos;cli-&amp;gt;cl_mod_tag_bitmap&apos;: 64 at ffff88083c75df80.
00010000:00000001:67.0:1508174041.158652:0:2946:0:(ldlm_lockd.c:2788:ldlm_get_ref()) Process entered
00010000:00000001:67.0:1508174041.158653:0:2946:0:(ldlm_lockd.c:2797:ldlm_get_ref()) Process leaving (rc=0 : 0 : 0)
00000020:00000010:67.0:1508174041.158654:0:2946:0:(genops.c:1067:class_new_import()) kmalloced &apos;imp&apos;: 1480 at ffff88083e668000.
00000020:00000040:67.0:1508174041.158655:0:2946:0:(obd_config.c:753:class_incref()) incref soaked-MDT0002-mdc-ffff88105dcb8800 (ffff88107fdacf60) now 3
00000020:00000001:67.0:1508174041.158657:0:2946:0:(lustre_handles.c:64:class_handle_hash()) Process entered
00000020:00000040:67.0:1508174041.158658:0:2946:0:(lustre_handles.c:98:class_handle_hash()) added object ffff88083e668000 with handle 0x36c91a8e859b7ccf to hash
00000020:00000001:67.0:1508174041.158659:0:2946:0:(lustre_handles.c:99:class_handle_hash()) Process leaving
00000020:00000001:67.0:1508174041.158661:0:2946:0:(genops.c:1030:class_import_put()) Process entered
00000020:00000040:67.0:1508174041.158661:0:2946:0:(genops.c:1037:class_import_put()) import ffff88083e668000 refcount=1 obd=soaked-MDT0002-mdc-ffff88105dcb8800
00000020:00000001:67.0:1508174041.158663:0:2946:0:(genops.c:1046:class_import_put()) Process leaving
00010000:00000001:67.0:1508174041.158664:0:2946:0:(ldlm_lib.c:61:import_set_conn()) Process entered
00000100:00000200:67.0:1508174041.158666:0:2946:0:(events.c:542:ptlrpc_uuid_to_peer()) 192.168.1.110@o2ib-&amp;gt;12345-192.168.1.110@o2ib
00000100:00000001:67.0:1508174041.158667:0:2946:0:(connection.c:48:ptlrpc_connection_get()) Process entered
00000400:00000010:67.0:1508174041.158671:0:2946:0:(peer.c:239:lnet_peer_alloc()) alloc &apos;(lp)&apos;: 184 at ffff88083e75c900 (tot 21627036).
00000400:00000200:67.0:1508174041.158673:0:2946:0:(peer.c:261:lnet_peer_alloc()) ffff88083e75c900 nid 192.168.1.110@o2ib
00000400:00000010:67.0:1508174041.158675:0:2946:0:(peer.c:204:lnet_peer_net_alloc()) alloc &apos;(lpn)&apos;: 48 at ffff88105b155080 (tot 21627084).
00000400:00000200:67.0:1508174041.158676:0:2946:0:(peer.c:212:lnet_peer_net_alloc()) ffff88105b155080 net o2ib
00000400:00000010:67.0:1508174041.158680:0:2946:0:(peer.c:154:lnet_peer_ni_alloc()) alloc &apos;(lpni)&apos;: 336 at ffff88085d7a8200 (tot 21627420).
00000400:00000200:67.0:1508174041.158682:0:2946:0:(peer.c:194:lnet_peer_ni_alloc()) ffff88085d7a8200 nid 192.168.1.110@o2ib
00000400:00000200:67.0:1508174041.158685:0:2946:0:(peer.c:1186:lnet_peer_attach_peer_ni()) peer 192.168.1.110@o2ib NID 192.168.1.110@o2ib flags 0x0
00000400:00000200:67.0:1508174041.158691:0:2946:0:(peer.c:1751:lnet_peer_queue_for_discovery()) Queue peer 192.168.1.110@o2ib: 0
00000400:00000200:68.0:1508174041.158724:0:2519:0:(peer.c:3030:lnet_peer_discovery_wait_for_work()) woken: 0
00000400:00000200:68.0:1508174041.158728:0:2519:0:(peer.c:3136:lnet_peer_discovery()) peer 192.168.1.110@o2ib state 0x20
00000400:00000010:68.0:1508174041.158730:0:2519:0:(api-ni.c:1076:lnet_ping_buffer_alloc()) alloc &apos;(pbuf)&apos;: 280 at ffff88105c1e3600 (tot 21627700).
00000400:00000010:68.0:1508174041.158732:0:2519:0:(lib-lnet.h:228:lnet_md_alloc()) slab-alloced &apos;md&apos; of size 128 at ffff8810537faf00.
00000400:00000010:68.0:1508174041.158736:0:2519:0:(lib-lnet.h:479:lnet_msg_alloc()) alloc &apos;(msg)&apos;: 416 at ffff88105c1e3400 (tot 21628116).
00000400:00000200:68.0:1508174041.158738:0:2519:0:(lib-move.c:3187:LNetGet()) LNetGet -&amp;gt; 12345-192.168.1.110@o2ib
00000400:00000200:68.0:1508174041.158742:0:2519:0:(peer.c:856:lnet_peer_ni_set_non_mr_pref_nid()) peer 192.168.1.110@o2ib nid 192.168.1.140@o2ib: 0
00000400:00000200:68.0:1508174041.158746:0:2519:0:(lib-move.c:2114:lnet_select_pathway()) TRACE: 192.168.1.140@o2ib(192.168.1.140@o2ib:&amp;lt;?&amp;gt;) -&amp;gt; 192.168.1.110@o2ib(192.168.1.110@o2ib:192.168.1.110@o2ib) : GET
00000800:00000200:68.0:1508174041.158749:0:2519:0:(o2iblnd_cb.c:1511:kiblnd_send()) sending 0 bytes in 0 frags to 12345-192.168.1.110@o2ib
00000800:00000010:68.0:1508174041.158757:0:2519:0:(o2iblnd.c:326:kiblnd_create_peer()) alloc &apos;(peer_ni)&apos;: 112 at ffff88085c21d900 (tot 21628228).
00000800:00000200:68.0:1508174041.158759:0:2519:0:(o2iblnd_cb.c:1479:kiblnd_launch_tx()) peer_ni[ffff88085c21d900] -&amp;gt; 192.168.1.110@o2ib (1)++
00000800:00000200:68.0:1508174041.158763:0:2519:0:(o2iblnd_cb.c:1281:kiblnd_connect_peer()) peer_ni[ffff88085c21d900] -&amp;gt; 192.168.1.110@o2ib (2)++
00000800:00000200:68.0:1508174041.158782:0:2519:0:(o2iblnd_cb.c:1234:kiblnd_resolve_addr()) bound to port 1023
00000800:00000200:68.0:1508174041.158783:0:2519:0:(o2iblnd_cb.c:1302:kiblnd_connect_peer()) 192.168.1.110@o2ib: connection bound to ib0:192.168.1.140:mlx4_0
00000800:00000200:68.0:1508174041.158785:0:2519:0:(o2iblnd_cb.c:1486:kiblnd_launch_tx()) peer_ni[ffff88085c21d900] -&amp;gt; 192.168.1.110@o2ib (3)--
00000400:00000200:68.0:1508174041.158787:0:2519:0:(peer.c:2763:lnet_peer_send_ping()) peer 192.168.1.110@o2ib
00000400:00000200:68.0:1508174041.158788:0:2519:0:(peer.c:3157:lnet_peer_discovery()) peer 192.168.1.110@o2ib state 0x120 rc 0
00000800:00000200:15.0:1508174041.159054:0:691:0:(o2iblnd_cb.c:2991:kiblnd_cm_callback()) 192.168.1.110@o2ib Addr resolved: 0
00000800:00000200:14.0:1508174041.159155:0:6:0:(o2iblnd_cb.c:3021:kiblnd_cm_callback()) 192.168.1.110@o2ib Route resolved: 0
00000800:00000010:14.0:1508174041.159164:0:6:0:(o2iblnd.c:777:kiblnd_create_conn()) alloc &apos;(init_qp_attr)&apos;: 96 at ffff8808592e7ba0 (tot 21628324).
00000800:00000010:14.0:1508174041.159179:0:6:0:(o2iblnd.c:784:kiblnd_create_conn()) alloc &apos;(conn)&apos;: 280 at ffff8800748ee000 (tot 21628604).
00000800:00000010:14.0:1508174041.159188:0:6:0:(o2iblnd.c:808:kiblnd_create_conn()) alloc &apos;(conn-&amp;gt;ibc_connvars)&apos;: 136 at ffff8808467e26c0 (tot 21628740).
00000800:00000010:14.0:1508174041.159204:0:6:0:(o2iblnd.c:843:kiblnd_create_conn()) alloc &apos;(conn-&amp;gt;ibc_rxs)&apos;: 1872 at ffff88085c4f2000 (tot 21630612).
00000800:00000010:14.0:1508174041.159206:0:6:0:(o2iblnd.c:1242:kiblnd_alloc_pages()) alloc &apos;(p)&apos;: 152 at ffff8808467e2480 (tot 21630764).
00000800:00000200:14.0:1508174041.159214:0:6:0:(o2iblnd.c:1316:kiblnd_map_rx_descs()) rx 0: ffff880841b63000 0x841b63000(0x841b63000)
00000800:00000200:14.0:1508174041.159217:0:6:0:(o2iblnd.c:1316:kiblnd_map_rx_descs()) rx 1: ffff880859708000 0x859708000(0x859708000)
&#8230;&#8230;&#8230;&#8230;&#8230;&#8230;&#8230;&#8230;
00000800:00000200:14.0:1508174041.159237:0:6:0:(o2iblnd.c:1316:kiblnd_map_rx_descs()) rx 16: ffff88083a615000 0x83a615000(0x83a615000)
00000800:00000200:14.0:1508174041.159238:0:6:0:(o2iblnd.c:1316:kiblnd_map_rx_descs()) rx 17: ffff88083b2d8000 0x83b2d8000(0x83b2d8000)
00000800:00000200:14.0:1508174041.160476:0:6:0:(o2iblnd.c:915:kiblnd_create_conn()) original send wr 2056, created with 4079
00000800:00000010:14.0:1508174041.160478:0:6:0:(o2iblnd.c:917:kiblnd_create_conn()) kfreed &apos;init_qp_attr&apos;: 96 at ffff8808592e7ba0 (tot 21630668).
00000800:00000200:14.0:1508174041.160480:0:6:0:(o2iblnd_cb.c:194:kiblnd_post_rx()) conn[ffff8800748ee000] (19)++
00000800:00000200:14.0:1508174041.160482:0:6:0:(o2iblnd_cb.c:223:kiblnd_post_rx()) conn[ffff8800748ee000] (20)--
&#8230;&#8230;&#8230;&#8230;&#8230;&#8230;&#8230;..
00000800:00000200:14.0:1508174041.160513:0:6:0:(o2iblnd_cb.c:194:kiblnd_post_rx()) conn[ffff8800748ee000] (19)++
00000800:00000200:14.0:1508174041.160514:0:6:0:(o2iblnd_cb.c:223:kiblnd_post_rx()) conn[ffff8800748ee000] (20)--
00000800:00000200:15.0:1508174041.162919:0:691:0:(o2iblnd_cb.c:3088:kiblnd_cm_callback()) ESTABLISHED(active): 192.168.1.110@o2ib
00000800:00000200:15.0:1508174041.162923:0:691:0:(o2iblnd_cb.c:2123:kiblnd_connreq_done()) 192.168.1.110@o2ib: active(1), version(12), status(0)
00000800:00000010:15.0:1508174041.162925:0:691:0:(o2iblnd_cb.c:2131:kiblnd_connreq_done()) kfreed &apos;conn-&amp;gt;ibc_connvars&apos;: 136 at ffff8808467e26c0 (tot 21630532).
00000800:00000200:15.0:1508174041.162941:0:691:0:(o2iblnd_cb.c:2150:kiblnd_connreq_done()) conn[ffff8800748ee000] (19)++
00000800:00000200:15.0:1508174041.162944:0:691:0:(o2iblnd_cb.c:2192:kiblnd_connreq_done()) conn[ffff8800748ee000] (20)++
00000800:00000200:15.0:1508174041.162945:0:691:0:(o2iblnd_cb.c:1162:kiblnd_queue_tx_locked()) conn[ffff8800748ee000] (21)++
00000800:00000200:15.0:1508174041.162950:0:691:0:(o2iblnd_cb.c:2212:kiblnd_connreq_done()) conn[ffff8800748ee000] (22)--
00000800:00000200:20.2F:1508174041.162997:0:0:0:(o2iblnd_cb.c:3516:kiblnd_cq_completion()) conn[ffff8800748ee000] (21)++
00000800:00000200:20.0F:1508174041.163022:0:2510:0:(o2iblnd_cb.c:3638:kiblnd_scheduler()) conn[ffff8800748ee000] (22)++
00000800:00000200:20.0:1508174041.163028:0:2510:0:(o2iblnd_cb.c:74:kiblnd_tx_done()) conn[ffff8800748ee000] (23)--
00000400:00000200:20.0:1508174041.163032:0:2510:0:(peer.c:2326:lnet_discovery_event_handler()) Received event: 5
00000400:00000200:20.0:1508174041.163036:0:2510:0:(peer.c:2278:lnet_discovery_event_send()) Ping Send to 192.168.1.110@o2ib: 0
00000400:00000010:20.0:1508174041.163040:0:2510:0:(lib-lnet.h:489:lnet_msg_free()) kfreed &apos;msg&apos;: 416 at ffff88105c1e3400 (tot 21630116).
00000800:00000200:20.0:1508174041.163043:0:2510:0:(o2iblnd_cb.c:3654:kiblnd_scheduler()) conn[ffff8800748ee000] (22)--
00000800:00000200:20.0:1508174041.163045:0:2510:0:(o2iblnd_cb.c:3654:kiblnd_scheduler()) conn[ffff8800748ee000] (21)--
00000800:00000200:20.2:1508174041.163169:0:0:0:(o2iblnd_cb.c:3516:kiblnd_cq_completion()) conn[ffff8800748ee000] (20)++
00000800:00000200:20.0:1508174041.163176:0:2511:0:(o2iblnd_cb.c:3638:kiblnd_scheduler()) conn[ffff8800748ee000] (21)++
00000800:00000200:20.0:1508174041.163182:0:2511:0:(o2iblnd_cb.c:326:kiblnd_handle_rx()) Received d1[1] from 192.168.1.110@o2ib
00000400:00000200:20.0:1508174041.163187:0:2511:0:(lib-move.c:2599:lnet_parse()) TRACE: 192.168.1.140@o2ib(192.168.1.140@o2ib) &amp;lt;- 192.168.1.110@o2ib : REPLY - for me
00000400:00000010:20.0:1508174041.163191:0:2511:0:(lib-lnet.h:479:lnet_msg_alloc()) alloc &apos;(msg)&apos;: 416 at ffff88085cab6400 (tot 21630532).
00000400:00000200:20.0:1508174041.163198:0:2511:0:(lib-move.c:2366:lnet_parse_reply()) 192.168.1.140@o2ib: Reply from 12345-192.168.1.110@o2ib of length 48/48 into md 0xad
00000400:00000200:20.0:1508174041.163202:0:2511:0:(peer.c:2326:lnet_discovery_event_handler()) Received event: 3
00000400:00000200:20.0:1508174041.163205:0:2511:0:(peer.c:881:lnet_peer_ni_clr_non_mr_pref_nid()) peer 192.168.1.110@o2ib: 0
00000400:00000200:20.0:1508174041.163209:0:2511:0:(peer.c:2237:lnet_discovery_event_reply()) peer 192.168.1.110@o2ib data present 1
00000400:00000200:20.0:1508174041.163214:0:2511:0:(lib-md.c:69:lnet_md_unlink()) Unlinking md ffff8810537faf00
00000400:00000010:20.0:1508174041.163216:0:2511:0:(lib-lnet.h:259:lnet_md_free()) slab-freed &apos;md&apos; at ffff8810537faf00.
00000800:00000200:47.0:1508174041.163218:0:2508:0:(o2iblnd_cb.c:3654:kiblnd_scheduler()) conn[ffff8800748ee000] (22)--
00000400:00000010:20.0:1508174041.163219:0:2511:0:(lib-lnet.h:489:lnet_msg_free()) kfreed &apos;msg&apos;: 416 at ffff88085cab6400 (tot 21630116).
00000800:00000200:20.0:1508174041.163221:0:2511:0:(o2iblnd_cb.c:194:kiblnd_post_rx()) conn[ffff8800748ee000] (21)++
00000800:00000200:20.0:1508174041.163224:0:2511:0:(o2iblnd_cb.c:223:kiblnd_post_rx()) conn[ffff8800748ee000] (22)--
00000800:00000200:20.0:1508174041.163225:0:2511:0:(o2iblnd_cb.c:3654:kiblnd_scheduler()) conn[ffff8800748ee000] (21)--
00000400:00000200:68.0:1508174041.163252:0:2519:0:(peer.c:3030:lnet_peer_discovery_wait_for_work()) woken: 0
00000400:00000200:68.0:1508174041.163257:0:2519:0:(peer.c:3136:lnet_peer_discovery()) peer 192.168.1.110@o2ib state 0x61
00000400:00000010:68.0:1508174041.163259:0:2519:0:(peer.c:2399:lnet_peer_merge_data()) alloc &apos;(curnis)&apos;: 16 at ffff88105a776630 (tot 21630132).
00000400:00000010:68.0:1508174041.163261:0:2519:0:(peer.c:2400:lnet_peer_merge_data()) alloc &apos;(addnis)&apos;: 16 at ffff88105a776640 (tot 21630148).
00000400:00000010:68.0:1508174041.163262:0:2519:0:(peer.c:2401:lnet_peer_merge_data()) alloc &apos;(delnis)&apos;: 16 at ffff88105a776650 (tot 21630164).
00000400:00000010:68.0:1508174041.163263:0:2519:0:(peer.c:2470:lnet_peer_merge_data()) kfreed &apos;curnis&apos;: 16 at ffff88105a776630 (tot 21630148).
00000400:00000010:68.0:1508174041.163264:0:2519:0:(peer.c:2471:lnet_peer_merge_data()) kfreed &apos;addnis&apos;: 16 at ffff88105a776640 (tot 21630132).
00000400:00000010:68.0:1508174041.163265:0:2519:0:(peer.c:2472:lnet_peer_merge_data()) kfreed &apos;delnis&apos;: 16 at ffff88105a776650 (tot 21630116).
00000400:00000010:68.0:1508174041.163266:0:2519:0:(api-ni.c:1089:lnet_ping_buffer_free()) kfreed &apos;pbuf&apos;: 280 at ffff88105c1e3600 (tot 21629836).
00000400:00000200:68.0:1508174041.163268:0:2519:0:(peer.c:2474:lnet_peer_merge_data()) peer 192.168.1.110@o2ib: 0
00000400:00000200:68.0:1508174041.163269:0:2519:0:(peer.c:2634:lnet_peer_data_present()) peer 192.168.1.110@o2ib: 0
00000400:00000200:68.0:1508174041.163271:0:2519:0:(peer.c:3157:lnet_peer_discovery()) peer 192.168.1.110@o2ib state 0xa1 rc 1
00000400:00000200:68.0:1508174041.163274:0:2519:0:(peer.c:3136:lnet_peer_discovery()) peer 192.168.1.110@o2ib state 0xa1
00000400:00000010:68.0:1508174041.163275:0:2519:0:(lib-lnet.h:228:lnet_md_alloc()) slab-alloced &apos;md&apos; of size 128 at ffff88085c21d500.
00000400:00000010:68.0:1508174041.163279:0:2519:0:(lib-lnet.h:479:lnet_msg_alloc()) alloc &apos;(msg)&apos;: 416 at ffff88105c1e3600 (tot 21630252).
00000400:00000200:68.0:1508174041.163281:0:2519:0:(lib-move.c:2984:LNetPut()) LNetPut -&amp;gt; 12345-192.168.1.110@o2ib
00000400:00000200:68.0:1508174041.163288:0:2519:0:(lib-move.c:2114:lnet_select_pathway()) TRACE: 192.168.1.140@o2ib(192.168.1.140@o2ib:&amp;lt;?&amp;gt;) -&amp;gt; 192.168.1.110@o2ib(192.168.1.110@o2ib:192.168.1.110@o2ib) : PUT
00000800:00000200:68.0:1508174041.163291:0:2519:0:(o2iblnd_cb.c:1511:kiblnd_send()) sending 48 bytes in 1 frags to 12345-192.168.1.110@o2ib
00000800:00000200:68.0:1508174041.163296:0:2519:0:(o2iblnd.c:405:kiblnd_find_peer_locked()) got peer_ni [ffff88085c21d900] -&amp;gt; 192.168.1.110@o2ib (2) version: 12
00000800:00000200:68.0:1508174041.163298:0:2519:0:(o2iblnd_cb.c:1392:kiblnd_launch_tx()) conn[ffff8800748ee000] (20)++
00000800:00000200:68.0:1508174041.163299:0:2519:0:(o2iblnd_cb.c:1162:kiblnd_queue_tx_locked()) conn[ffff8800748ee000] (21)++
00000800:00000200:68.0:1508174041.163303:0:2519:0:(o2iblnd_cb.c:1398:kiblnd_launch_tx()) conn[ffff8800748ee000] (22)--
00000400:00000200:68.0:1508174041.163304:0:2519:0:(peer.c:2869:lnet_peer_send_push()) peer 192.168.1.110@o2ib
00000400:00000200:68.0:1508174041.163305:0:2519:0:(peer.c:3157:lnet_peer_discovery()) peer 192.168.1.110@o2ib state 0x2a1 rc 0
00000800:00000200:20.2:1508174041.163309:0:0:0:(o2iblnd_cb.c:3516:kiblnd_cq_completion()) conn[ffff8800748ee000] (21)++
00000800:00000200:20.0:1508174041.163315:0:2510:0:(o2iblnd_cb.c:3638:kiblnd_scheduler()) conn[ffff8800748ee000] (22)++
00000800:00000200:20.0:1508174041.163318:0:2510:0:(o2iblnd_cb.c:74:kiblnd_tx_done()) conn[ffff8800748ee000] (23)--
00000400:00000200:20.0:1508174041.163320:0:2510:0:(peer.c:2326:lnet_discovery_event_handler()) Received event: 5
00000400:00000200:20.0:1508174041.163323:0:2510:0:(peer.c:2278:lnet_discovery_event_send()) Push Send to 192.168.1.110@o2ib: 0
00000400:00000010:20.0:1508174041.163326:0:2510:0:(lib-lnet.h:489:lnet_msg_free()) kfreed &apos;msg&apos;: 416 at ffff88105c1e3600 (tot 21629836).
00000800:00000200:20.0:1508174041.163328:0:2510:0:(o2iblnd_cb.c:3654:kiblnd_scheduler()) conn[ffff8800748ee000] (22)--
00000800:00000200:20.0:1508174041.163330:0:2510:0:(o2iblnd_cb.c:3654:kiblnd_scheduler()) conn[ffff8800748ee000] (21)--
00000800:00000200:20.2:1508174041.163475:0:0:0:(o2iblnd_cb.c:3516:kiblnd_cq_completion()) conn[ffff8800748ee000] (20)++
00000800:00000200:20.0:1508174041.163481:0:2508:0:(o2iblnd_cb.c:3638:kiblnd_scheduler()) conn[ffff8800748ee000] (21)++
00000800:00000200:20.0:1508174041.163485:0:2508:0:(o2iblnd_cb.c:326:kiblnd_handle_rx()) Received d1[0] from 192.168.1.110@o2ib
00000400:00000200:20.0:1508174041.163489:0:2508:0:(lib-move.c:2599:lnet_parse()) TRACE: 192.168.1.140@o2ib(192.168.1.140@o2ib) &amp;lt;- 192.168.1.110@o2ib : ACK - for me
00000400:00000010:20.0:1508174041.163491:0:2508:0:(lib-lnet.h:479:lnet_msg_alloc()) alloc &apos;(msg)&apos;: 416 at ffff88085cab6400 (tot 21630252).
00000400:00000200:20.0:1508174041.163496:0:2508:0:(lib-move.c:2419:lnet_parse_ack()) 192.168.1.140@o2ib: ACK from 12345-192.168.1.110@o2ib into md 0xb5
00000400:00000200:20.0:1508174041.163498:0:2508:0:(peer.c:2326:lnet_discovery_event_handler()) Received event: 4
00000400:00000200:20.0:1508174041.163500:0:2508:0:(peer.c:2111:lnet_discovery_event_ack()) peer 192.168.1.110@o2ib ev-&amp;gt;status 0
00000400:00000200:20.0:1508174041.163505:0:2508:0:(lib-md.c:69:lnet_md_unlink()) Unlinking md ffff88085c21d500
00000400:00000010:20.0:1508174041.163506:0:2508:0:(lib-lnet.h:259:lnet_md_free()) slab-freed &apos;md&apos; at ffff88085c21d500.
00000400:00000010:20.0:1508174041.163508:0:2508:0:(lib-lnet.h:489:lnet_msg_free()) kfreed &apos;msg&apos;: 416 at ffff88085cab6400 (tot 21629836).
00000800:00000200:20.0:1508174041.163510:0:2508:0:(o2iblnd_cb.c:194:kiblnd_post_rx()) conn[ffff8800748ee000] (22)++
00000800:00000200:20.0:1508174041.163512:0:2508:0:(o2iblnd_cb.c:223:kiblnd_post_rx()) conn[ffff8800748ee000] (23)--
00000800:00000200:20.0:1508174041.163513:0:2508:0:(o2iblnd_cb.c:3654:kiblnd_scheduler()) conn[ffff8800748ee000] (22)--
00000800:00000200:20.0:1508174041.163515:0:2508:0:(o2iblnd_cb.c:3654:kiblnd_scheduler()) conn[ffff8800748ee000] (21)--
00000800:00000200:20.2:1508174041.163551:0:0:0:(o2iblnd_cb.c:3516:kiblnd_cq_completion()) conn[ffff8800748ee000] (20)++
00000800:00000200:20.0:1508174041.163556:0:2510:0:(o2iblnd_cb.c:3638:kiblnd_scheduler()) conn[ffff8800748ee000] (21)++
00000800:00000200:20.0:1508174041.163560:0:2510:0:(o2iblnd_cb.c:326:kiblnd_handle_rx()) Received d1[1] from 192.168.1.110@o2ib
00000400:00000200:20.0:1508174041.163563:0:2510:0:(lib-move.c:2599:lnet_parse()) TRACE: 192.168.1.140@o2ib(192.168.1.140@o2ib) &amp;lt;- 192.168.1.110@o2ib : PUT - for me
00000400:00000010:20.0:1508174041.163565:0:2510:0:(lib-lnet.h:479:lnet_msg_alloc()) alloc &apos;(msg)&apos;: 416 at ffff88085cab6400 (tot 21630252).
00000800:00000200:4.0:1508174041.163565:0:2509:0:(o2iblnd_cb.c:3654:kiblnd_scheduler()) conn[ffff8800748ee000] (22)--
00000400:00000200:20.0:1508174041.163569:0:2510:0:(lib-ptl.c:571:lnet_ptl_match_md()) Request from 12345-192.168.1.110@o2ib of length 48 into portal 0 MB=0x8000000000000000
00000400:00000200:20.0:1508174041.163575:0:2510:0:(lib-ptl.c:200:lnet_try_match_md()) Incoming put index 0 from 12345-192.168.1.110@o2ib of length 48/48 into md 0x9 [1] + 0
00000400:00000200:68.0:1508174041.163577:0:2519:0:(peer.c:3030:lnet_peer_discovery_wait_for_work()) woken: 0
00000400:00000200:20.0:1508174041.163581:0:2510:0:(peer.c:1939:lnet_peer_push_event()) Stale Push from 192.168.1.110@o2ib: got 1 have 1
00000400:00000200:68.0:1508174041.163585:0:2519:0:(peer.c:3136:lnet_peer_discovery()) peer 192.168.1.110@o2ib state 0xa1
00000400:00000200:68.0:1508174041.163588:0:2519:0:(peer.c:2917:lnet_peer_discovered()) peer 192.168.1.110@o2ib
00000400:00000200:68.0:1508174041.163589:0:2519:0:(peer.c:3157:lnet_peer_discovery()) peer 192.168.1.110@o2ib state 0x89 rc 0
00000400:00000200:20.0:1508174041.163592:0:2510:0:(peer.c:1751:lnet_peer_queue_for_discovery()) Queue peer 192.168.1.110@o2ib: -114
00000400:00000200:20.0:1508174041.163603:0:2510:0:(lib-move.c:2114:lnet_select_pathway()) TRACE: 192.168.1.140@o2ib(192.168.1.140@o2ib:192.168.1.140@o2ib) -&amp;gt; 192.168.1.110@o2ib(192.168.1.110@o2ib:192.168.1.110@o2ib) : ACK
00000800:00000200:20.0:1508174041.163608:0:2510:0:(o2iblnd_cb.c:1511:kiblnd_send()) sending 0 bytes in 1 frags to 12345-192.168.1.110@o2ib
00000800:00000200:20.0:1508174041.163613:0:2510:0:(o2iblnd.c:405:kiblnd_find_peer_locked()) got peer_ni [ffff88085c21d900] -&amp;gt; 192.168.1.110@o2ib (2) version: 12
00000800:00000200:20.0:1508174041.163616:0:2510:0:(o2iblnd_cb.c:1392:kiblnd_launch_tx()) conn[ffff8800748ee000] (21)++
00000800:00000200:20.0:1508174041.163618:0:2510:0:(o2iblnd_cb.c:1162:kiblnd_queue_tx_locked()) conn[ffff8800748ee000] (22)++
00000800:00000200:20.0:1508174041.163621:0:2510:0:(o2iblnd_cb.c:1398:kiblnd_launch_tx()) conn[ffff8800748ee000] (23)--
00000800:00000200:20.0:1508174041.163622:0:2510:0:(o2iblnd_cb.c:194:kiblnd_post_rx()) conn[ffff8800748ee000] (22)++
00000800:00000200:20.2:1508174041.163625:0:2510:0:(o2iblnd_cb.c:3516:kiblnd_cq_completion()) conn[ffff8800748ee000] (22)++
00000800:00000200:20.0:1508174041.163630:0:2510:0:(o2iblnd_cb.c:223:kiblnd_post_rx()) conn[ffff8800748ee000] (24)--
00000800:00000200:20.0:1508174041.163631:0:2510:0:(o2iblnd_cb.c:3654:kiblnd_scheduler()) conn[ffff8800748ee000] (23)--
00000800:00000200:20.0:1508174041.163633:0:2510:0:(o2iblnd_cb.c:3638:kiblnd_scheduler()) conn[ffff8800748ee000] (22)++
00000800:00000200:20.0:1508174041.163638:0:2510:0:(o2iblnd_cb.c:74:kiblnd_tx_done()) conn[ffff8800748ee000] (23)--
00000400:00000010:20.0:1508174041.163640:0:2510:0:(lib-lnet.h:489:lnet_msg_free()) kfreed &apos;msg&apos;: 416 at ffff88085cab6400 (tot 21629836).
00000800:00000200:20.0:1508174041.163641:0:2510:0:(o2iblnd_cb.c:3654:kiblnd_scheduler()) conn[ffff8800748ee000] (22)--
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Here PID 2946 should be the forked LLOG parser thread for the soak-client config, PID 2519 should be the LNet discovery thread, and PIDs 2508/2510 should be ko2iblnd scheduler threads.&lt;/p&gt;

&lt;p&gt;It looks like there is a race between the LNet discovery thread&apos;s bottom-half handling of the ACK for the previously sent PUT message to 192.168.1.110@o2ib, and the ko2iblnd scheduler thread (PID 2510) handling the consecutive &quot;stale&quot; PUT message received from 192.168.1.110@o2ib (unexpected, or due to multi-rail?). This causes the LNET_PEER_DISCOVERING flag to be set again and the peer to be kept in the discovery process/loop, but very likely on the ln_dc_working queue instead of the ln_dc_request queue, so it is out of the discovery thread&apos;s control and should probably have been timed out later.&lt;/p&gt;</comment>
                            <comment id="214003" author="gerrit" created="Fri, 17 Nov 2017 12:42:32 +0000"  >&lt;p&gt;Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/30147&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/30147&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-10123&quot; title=&quot;Repeated timeouts during mount()- MDS and client. &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-10123&quot;&gt;&lt;del&gt;LU-10123&lt;/del&gt;&lt;/a&gt; lnet: ensure peer put back on dc request queue&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 3bb495fcca17bdedd89a11cc9d9bf200604af0aa&lt;/p&gt;</comment>
                            <comment id="214062" author="cliffw" created="Fri, 17 Nov 2017 21:21:07 +0000"  >&lt;p&gt;Mount performance is much better with this patch&lt;/p&gt;</comment>
                            <comment id="214303" author="bfaccini" created="Tue, 21 Nov 2017 11:14:04 +0000"  >&lt;p&gt;Hmm, it seems that my patch has also introduced a regression and causes this new LBUG on the client side in various auto-test sub-tests/steps (such as lnet-selftest, lustre-initialization, upon mount, ...):&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[ 5914.956548] LNetError: 23347:0:(peer.c:272:lnet_destroy_peer_locked()) ASSERTION( list_empty(&amp;amp;lp-&amp;gt;lp_peer_nets) ) failed: 
[ 5914.960593] LNetError: 23347:0:(peer.c:272:lnet_destroy_peer_locked()) LBUG
[ 5914.964063] Pid: 23347, comm: lnet_discovery
[ 5914.967234] 
[ 5914.967234] Call Trace:
[ 5914.973027]  [&amp;lt;ffffffffc07337ae&amp;gt;] libcfs_call_trace+0x4e/0x60 [libcfs]
[ 5914.976440]  [&amp;lt;ffffffffc073383c&amp;gt;] lbug_with_loc+0x4c/0xb0 [libcfs]
[ 5914.979816]  [&amp;lt;ffffffffc06bab42&amp;gt;] lnet_destroy_peer_locked+0x232/0x340 [lnet]
[ 5914.983278]  [&amp;lt;ffffffffc06bb175&amp;gt;] lnet_peer_discovery_complete+0x295/0x340 [lnet]
[ 5914.986724]  [&amp;lt;ffffffffc06c0c30&amp;gt;] lnet_peer_discovery+0x6f0/0x11b0 [lnet]
[ 5914.990141]  [&amp;lt;ffffffff816a904d&amp;gt;] ? __schedule+0x39d/0x8b0
[ 5914.993348]  [&amp;lt;ffffffff810b1920&amp;gt;] ? autoremove_wake_function+0x0/0x40
[ 5914.996599]  [&amp;lt;ffffffffc06c0540&amp;gt;] ? lnet_peer_discovery+0x0/0x11b0 [lnet]
[ 5914.999836]  [&amp;lt;ffffffff810b099f&amp;gt;] kthread+0xcf/0xe0
[ 5915.002827]  [&amp;lt;ffffffff810b08d0&amp;gt;] ? kthread+0x0/0xe0
[ 5915.005771]  [&amp;lt;ffffffff816b4fd8&amp;gt;] ret_from_fork+0x58/0x90
[ 5915.008696]  [&amp;lt;ffffffff810b08d0&amp;gt;] ? kthread+0x0/0xe0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Cliff, do you also trigger this?&lt;/p&gt;</comment>
                            <comment id="214338" author="sarah" created="Tue, 21 Nov 2017 17:53:19 +0000"  >&lt;p&gt;Hi Bruno,&lt;/p&gt;

&lt;p&gt;Yesterday I tried to make a patch that includes both your patch and the one for &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-10134&quot; title=&quot;LBUG lfsck_namespace_double_scan()) ASSERTION( list_empty(&amp;amp;lad-&amp;gt;lad_req_list) ) failed:&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-10134&quot;&gt;&lt;del&gt;LU-10134&lt;/del&gt;&lt;/a&gt; for soak. I checked out your patch and rebased it on master, then checked out the &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-10134&quot; title=&quot;LBUG lfsck_namespace_double_scan()) ASSERTION( list_empty(&amp;amp;lad-&amp;gt;lad_req_list) ) failed:&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-10134&quot;&gt;&lt;del&gt;LU-10134&lt;/del&gt;&lt;/a&gt; patch and rebased it on your patch. The new patch set #2 (which hit the LBUG) should contain no change other than the rebase onto master.&lt;/p&gt;

&lt;p&gt;The test patch, which includes both patches, also hit the LBUG:&lt;br/&gt;
&lt;a href=&quot;https://review.whamcloud.com/#/c/30187/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/#/c/30187/&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="215205" author="bfaccini" created="Mon, 4 Dec 2017 12:29:04 +0000"  >&lt;p&gt;Sarah, Cliff,&lt;br/&gt;
Can you give a try to the more recent patch set #7 of my change #30147?&lt;/p&gt;</comment>
                            <comment id="215261" author="cliffw" created="Mon, 4 Dec 2017 20:00:17 +0000"  >&lt;p&gt;We are testing b2-10 now; we will consider this when we move back to master.&lt;/p&gt;</comment>
                            <comment id="216897" author="cliffw" created="Wed, 20 Dec 2017 21:38:39 +0000"  >&lt;p&gt;We are attempting to test master now; can you rebase your patch on current master?&lt;/p&gt;</comment>
                            <comment id="216919" author="bfaccini" created="Thu, 21 Dec 2017 06:42:56 +0000"  >&lt;p&gt;Cliff, I have just rebased my change #30147 on latest master.&lt;/p&gt;</comment>
                            <comment id="217546" author="cliffw" created="Thu, 4 Jan 2018 23:48:00 +0000"  >&lt;p&gt;We still see client mount hangs on the latest master; is this patch in the queue for landing?&lt;/p&gt;</comment>
                            <comment id="217575" author="pjones" created="Fri, 5 Jan 2018 13:49:57 +0000"  >&lt;p&gt;It had not been, but I have just added it to the chaser list. If there&apos;s something you think is important to get landed for soak, please flag it either directly or via James so we can expedite it.&lt;/p&gt;</comment>
                            <comment id="219499" author="gerrit" created="Wed, 31 Jan 2018 05:52:03 +0000"  >&lt;p&gt;Oleg Drokin (oleg.drokin@intel.com) merged in patch &lt;a href=&quot;https://review.whamcloud.com/30147/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/30147/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-10123&quot; title=&quot;Repeated timeouts during mount()- MDS and client. &quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-10123&quot;&gt;&lt;del&gt;LU-10123&lt;/del&gt;&lt;/a&gt; lnet: ensure peer put back on dc request queue&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: d0185dd433943506f434eb5cf9707c2b6807b528&lt;/p&gt;</comment>
                            <comment id="220150" author="pjones" created="Tue, 6 Feb 2018 17:04:17 +0000"  >&lt;p&gt;Landed for 2.11&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10010">
                    <name>Duplicate</name>
                                                                <inwardlinks description="is duplicated by">
                                        <issuelink>
            <issuekey id="48203">LU-9954</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                                                <inwardlinks description="is related to">
                                                        </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="28475" name="soak-40.lustre.log.gz" size="277047" author="cliffw" created="Mon, 16 Oct 2017 17:28:11 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                    <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzzlxz:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>