<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:28:06 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary, append 'field=key&field=summary' to the URL of your request.
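For instance, a request URL for this issue's XML view might look like the following (the path pattern is the standard JIRA issue-XML view, shown here only as an example):
https://jira.whamcloud.com/si/jira.issueviews:issue-xml/LU-9659/LU-9659.xml?field=key&field=summary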
-->
<rss version="0.92">
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-9659] lnet assert after timeout on reconnect ASSERTION( !peer_ni-&gt;ibp_accepting &amp;&amp; !peer_ni-&gt;ibp_connecting &amp;&amp; list_empty(&amp;peer_ni-&gt;ibp_conns) )</title>
                <link>https://jira.whamcloud.com/browse/LU-9659</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Running on CentOS 7.3, Lustre 2.9.58, build 15 coral-rc1-combined (based on 0.7.0 RC4).&lt;/p&gt;

&lt;p&gt;I found Lustre down this morning, so I rebooted all of the Lustre servers to get ready for a demo.  The MDS came up, but when I brought the OSS online it crashed.  When I brought the MDS node back up, all of the OSS nodes crashed as soon as I mounted Lustre: &lt;/p&gt;

&lt;p&gt;From ssu2_oss2&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[ 1886.319606] LNet: 44156:0:(o2iblnd_cb.c:3207:kiblnd_check_conns()) Timed out tx for 192.168.1.101@o2ib: 48 seconds
[ 1925.319158] LNetError: 44156:0:(o2iblnd_cb.c:1355:kiblnd_reconnect_peer()) ASSERTION( !peer_ni-&amp;gt;ibp_accepting &amp;amp;&amp;amp; !peer_ni-&amp;gt;ibp_connecting &amp;amp;&amp;amp; list_empty(&amp;amp;peer_ni-&amp;gt;ibp_conns) ) failed: 
[ 1925.319167] LNetError: 44156:0:(o2iblnd_cb.c:1355:kiblnd_reconnect_peer()) LBUG
[ 1925.319170] Pid: 44156, comm: kiblnd_connd
[ 1925.319171] 
Call Trace:
[ 1925.319198]  [&amp;lt;ffffffffa09697ee&amp;gt;] libcfs_call_trace+0x4e/0x60 [libcfs]
[ 1925.319208]  [&amp;lt;ffffffffa096987c&amp;gt;] lbug_with_loc+0x4c/0xb0 [libcfs]
[ 1925.319225]  [&amp;lt;ffffffffa0a6e206&amp;gt;] kiblnd_reconnect_peer+0x216/0x220 [ko2iblnd]
[ 1925.319234]  [&amp;lt;ffffffffa0a77214&amp;gt;] kiblnd_connd+0x464/0x900 [ko2iblnd]
[ 1925.319245]  [&amp;lt;ffffffff810c54c0&amp;gt;] ? default_wake_function+0x0/0x20
[ 1925.319253]  [&amp;lt;ffffffffa0a76db0&amp;gt;] ? kiblnd_connd+0x0/0x900 [ko2iblnd]
[ 1925.319259]  [&amp;lt;ffffffff810b0a4f&amp;gt;] kthread+0xcf/0xe0
[ 1925.319264]  [&amp;lt;ffffffff810b0980&amp;gt;] ? kthread+0x0/0xe0
[ 1925.319269]  [&amp;lt;ffffffff816970d8&amp;gt;] ret_from_fork+0x58/0x90
[ 1925.319273]  [&amp;lt;ffffffff810b0980&amp;gt;] ? kthread+0x0/0xe0
[ 1925.319275] 
[ 1925.319277] Kernel panic - not syncing: LBUG
[ 1925.319323] CPU: 3 PID: 44156 Comm: kiblnd_connd Tainted: P           OE  ------------   3.10.0-514.16.1.el7.x86_64 #1
[ 1925.319401] Hardware name: Intel Corporation S2600WTTR/S2600WTTR, BIOS SE5C610.86B.01.01.0015.012820160943 01/28/2016
[ 1925.319477]  ffffffffa0987dac 00000000665c43d0 ffff885e892f3d38 ffffffff81686ac3
[ 1925.319543]  ffff885e892f3db8 ffffffff8167feca ffffffff00000008 ffff885e892f3dc8
[ 1925.319606]  ffff885e892f3d68 00000000665c43d0 00000000665c43d0 0000000000000046
[ 1925.319668] Call Trace:
[ 1925.319696]  [&amp;lt;ffffffff81686ac3&amp;gt;] dump_stack+0x19/0x1b
[ 1925.319741]  [&amp;lt;ffffffff8167feca&amp;gt;] panic+0xe3/0x1f2
[ 1925.319789]  [&amp;lt;ffffffffa0969894&amp;gt;] lbug_with_loc+0x64/0xb0 [libcfs]
[ 1925.319844]  [&amp;lt;ffffffffa0a6e206&amp;gt;] kiblnd_reconnect_peer+0x216/0x220 [ko2iblnd]
[ 1925.319904]  [&amp;lt;ffffffffa0a77214&amp;gt;] kiblnd_connd+0x464/0x900 [ko2iblnd]
[ 1925.319957]  [&amp;lt;ffffffff810c54c0&amp;gt;] ? wake_up_state+0x20/0x20
[ 1925.320006]  [&amp;lt;ffffffffa0a76db0&amp;gt;] ? kiblnd_check_conns+0x840/0x840 [ko2iblnd]
[ 1925.320062]  [&amp;lt;ffffffff810b0a4f&amp;gt;] kthread+0xcf/0xe0
[ 1925.320103]  [&amp;lt;ffffffff810b0980&amp;gt;] ? kthread_create_on_node+0x140/0x140
[ 1925.320154]  [&amp;lt;ffffffff816970d8&amp;gt;] ret_from_fork+0x58/0x90
[ 1925.320198]  [&amp;lt;ffffffff810b0980&amp;gt;] ? kthread_create_on_node+0x140/0x140
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;  

&lt;p&gt;Here is a different node ssu1_oss1: &lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt; 
[  726.973784] Lustre: 12887:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 5 previous similar messages
[  815.624944] Lustre: nlsdraid-OST0000: recovery is timed out, evict stale exports
[  815.624967] Lustre: nlsdraid-OST0000: disconnecting 2 stale clients
[  815.856729] Lustre: nlsdraid-OST0000: Recovery over after 5:01, of 9 clients 7 recovered and 2 were evicted.
[  876.965091] Lustre: 12887:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1497360193/real 1497360193]  req@ffff885e1fea5700 x1570095499313792/t0(0) o38-&amp;gt;nlsdraid-MDT0000-lwp-OST0000@192.168.1.101@o2ib:12/10 lens 520/544 e 0 to 1 dl 1497360248 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
[  876.965100] Lustre: 12887:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 11 previous similar messages
[ 1151.949457] Lustre: 12887:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1497360468/real 1497360468]  req@ffff885e1fea6300 x1570095499314144/t0(0) o38-&amp;gt;nlsdraid-MDT0000-lwp-OST0000@192.168.1.101@o2ib:12/10 lens 520/544 e 0 to 1 dl 1497360523 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
[ 1151.949467] Lustre: 12887:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 21 previous similar messages
[ 1676.919121] Lustre: 12887:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1497360993/real 1497360993]  req@ffff885e1fea1b00 x1570095499314816/t0(0) o38-&amp;gt;nlsdraid-MDT0000-lwp-OST0000@192.168.1.101@o2ib:12/10 lens 520/544 e 0 to 1 dl 1497361048 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
[ 1676.919131] Lustre: 12887:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 41 previous similar messages
[ 1942.553084] LNet: 12875:0:(o2iblnd_cb.c:3207:kiblnd_check_conns()) Timed out tx for 192.168.1.101@o2ib: 9 seconds
[ 1955.552348] LNet: 12875:0:(o2iblnd_cb.c:3207:kiblnd_check_conns()) Timed out tx for 192.168.1.101@o2ib: 22 seconds
[ 1955.552355] LNet: 12875:0:(o2iblnd_cb.c:3207:kiblnd_check_conns()) Skipped 1 previous similar message
[ 1977.551131] LNetError: 12875:0:(o2iblnd_cb.c:1355:kiblnd_reconnect_peer()) ASSERTION( !peer_ni-&amp;gt;ibp_accepting &amp;amp;&amp;amp; !peer_ni-&amp;gt;ibp_connecting &amp;amp;&amp;amp; list_empty(&amp;amp;peer_ni-&amp;gt;ibp_conns) ) failed: 
[ 1977.551141] LNetError: 12875:0:(o2iblnd_cb.c:1355:kiblnd_reconnect_peer()) LBUG
[ 1977.551144] Pid: 12875, comm: kiblnd_connd
[ 1977.551146] 
Call Trace:
[ 1977.551175]  [&amp;lt;ffffffffa09dc7ee&amp;gt;] libcfs_call_trace+0x4e/0x60 [libcfs]
[ 1977.551185]  [&amp;lt;ffffffffa09dc87c&amp;gt;] lbug_with_loc+0x4c/0xb0 [libcfs]
[ 1977.551201]  [&amp;lt;ffffffffa0b96206&amp;gt;] kiblnd_reconnect_peer+0x216/0x220 [ko2iblnd]
[ 1977.551211]  [&amp;lt;ffffffffa0b9f214&amp;gt;] kiblnd_connd+0x464/0x900 [ko2iblnd]
[ 1977.551221]  [&amp;lt;ffffffff810c54c0&amp;gt;] ? default_wake_function+0x0/0x20
[ 1977.551230]  [&amp;lt;ffffffffa0b9edb0&amp;gt;] ? kiblnd_connd+0x0/0x900 [ko2iblnd]
[ 1977.551236]  [&amp;lt;ffffffff810b0a4f&amp;gt;] kthread+0xcf/0xe0
[ 1977.551241]  [&amp;lt;ffffffff810b0980&amp;gt;] ? kthread+0x0/0xe0
[ 1977.551246]  [&amp;lt;ffffffff816970d8&amp;gt;] ret_from_fork+0x58/0x90
[ 1977.551251]  [&amp;lt;ffffffff810b0980&amp;gt;] ? kthread+0x0/0xe0
[ 1977.551253] 
[ 1977.551255] Kernel panic - not syncing: LBUG
[ 1977.551300] CPU: 17 PID: 12875 Comm: kiblnd_connd Tainted: P           OE  ------------   3.10.0-514.16.1.el7.x86_64 #1
[ 1977.551378] Hardware name: Intel Corporation S2600WTTR/S2600WTTR, BIOS SE5C610.86B.01.01.0015.012820160943 01/28/2016
[ 1977.551454]  ffffffffa09fadac 000000009c68e137 ffff882fb8a23d38 ffffffff81686ac3
[ 1977.551519]  ffff882fb8a23db8 ffffffff8167feca ffffffff00000008 ffff882fb8a23dc8
[ 1977.551582]  ffff882fb8a23d68 000000009c68e137 000000009c68e137 0000000000000046
[ 1977.551645] Call Trace:
[ 1977.551674]  [&amp;lt;ffffffff81686ac3&amp;gt;] dump_stack+0x19/0x1b
[ 1977.551719]  [&amp;lt;ffffffff8167feca&amp;gt;] panic+0xe3/0x1f2
[ 1977.551767]  [&amp;lt;ffffffffa09dc894&amp;gt;] lbug_with_loc+0x64/0xb0 [libcfs]
[ 1977.551823]  [&amp;lt;ffffffffa0b96206&amp;gt;] kiblnd_reconnect_peer+0x216/0x220 [ko2iblnd]
[ 1977.551883]  [&amp;lt;ffffffffa0b9f214&amp;gt;] kiblnd_connd+0x464/0x900 [ko2iblnd]
[ 1977.551936]  [&amp;lt;ffffffff810c54c0&amp;gt;] ? wake_up_state+0x20/0x20
[ 1977.551985]  [&amp;lt;ffffffffa0b9edb0&amp;gt;] ? kiblnd_check_conns+0x840/0x840 [ko2iblnd]
[ 1977.552041]  [&amp;lt;ffffffff810b0a4f&amp;gt;] kthread+0xcf/0xe0
[ 1977.552083]  [&amp;lt;ffffffff810b0980&amp;gt;] ? kthread_create_on_node+0x140/0x140
[ 1977.552137]  [&amp;lt;ffffffff816970d8&amp;gt;] ret_from_fork+0x58/0x90
[ 1977.552182]  [&amp;lt;ffffffff810b0980&amp;gt;] ? kthread_create_on_node+0x140/0x140
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;  

&lt;p&gt;On onyx: /scratch/dumps/lustre/jsalinas/kiblnd_reconnect_peer_assert.tgz &lt;/p&gt;</description>
                <environment>CentOS 7.3&lt;br/&gt;
Lustre 2.9.58 &lt;br/&gt;
ZFS from coral-rc1-combined (based on 0.7.0 RC4) </environment>
        <key id="46671">LU-9659</key>
            <summary>lnet assert after timeout on reconnect ASSERTION( !peer_ni-&gt;ibp_accepting &amp;&amp; !peer_ni-&gt;ibp_connecting &amp;&amp; list_empty(&amp;peer_ni-&gt;ibp_conns) )</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="2" iconUrl="https://jira.whamcloud.com/images/icons/priorities/critical.svg">Critical</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="wc-triage">WC Triage</assignee>
                                    <reporter username="jsalians_intel">John Salinas</reporter>
                        <labels>
                            <label>LS_RZ</label>
                            <label>prod</label>
                    </labels>
                <created>Tue, 13 Jun 2017 15:08:40 +0000</created>
                <updated>Tue, 13 Jun 2017 21:18:25 +0000</updated>
                <version>Lustre 2.9.0</version>
                <due></due>
                <votes>0</votes>
                <watches>5</watches>
                    <comments>
                            <comment id="199051" author="jsalians_intel" created="Tue, 13 Jun 2017 15:28:50 +0000"  >&lt;p&gt;Currently this blocks us from bringing up our cluster&lt;/p&gt;</comment>
                            <comment id="199052" author="ashehata" created="Tue, 13 Jun 2017 15:30:52 +0000"  >&lt;p&gt;That assertion has been removed in the latest code. Would you be able to try the latest master?&lt;/p&gt;

&lt;p&gt;The exact patch is:&lt;br/&gt;
&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-9507&quot; title=&quot;o2iblnd assert on reconnect&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-9507&quot;&gt;&lt;del&gt;LU-9507&lt;/del&gt;&lt;/a&gt; lnd: Don&apos;t Assert On Reconnect with MultiQP&lt;/p&gt;
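
&lt;p&gt;To try the latest master, the usual flow is something like this (a sketch; configure flags and packaging vary by site):&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;# clone the Whamcloud tree and build RPMs against the running kernel
git clone git://git.whamcloud.com/fs/lustre-release.git
cd lustre-release
sh autogen.sh
./configure
make rpms
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;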
</comment>
                            <comment id="199053" author="jsalians_intel" created="Tue, 13 Jun 2017 15:32:48 +0000"  >&lt;p&gt;Hi Amir, let see if I can trigger a build with that patch&lt;/p&gt;</comment>
                            <comment id="199056" author="jsalians_intel" created="Tue, 13 Jun 2017 15:40:21 +0000"  >&lt;p&gt;Any chance this could also be our mds issue?  I am not getting a dump the last messages I see in /var/log/messages are: Jun 13 09:46:56 mgs_mds_1 kernel: LNet: 7800:0:(o2iblnd_cb.c:3207:kiblnd_check_conns()) Timed out tx for 192.168.1.103@o2ib: 4294948 seconds&lt;br/&gt;
Jun 13 09:46:56 mgs_mds_1 kernel: LNet: 7800:0:(o2iblnd_cb.c:3207:kiblnd_check_conns()) Skipped 3 previous similar messages&lt;br/&gt;
Jun 13 09:46:56 mgs_mds_1 kernel: Lustre: 7812:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1497361611/rea&lt;br/&gt;
l 1497361616]  req@ffff882f44c59b00 x1570097395139168/t0(0) o8-&amp;gt;nlsdraid-OST0003-osc-MDT0000@192.168.1.106@o2ib:28/4 lens 520/544 e 0 to 1 dl 1497361626 ref 1 fl R&lt;br/&gt;
pc:eXN/0/ffffffff rc 0/-1&lt;br/&gt;
Jun 13 09:46:56 mgs_mds_1 kernel: Lustre: 7812:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 3 previous similar messages&lt;br/&gt;
Jun 13 09:47:21 mgs_mds_1 kernel: LNet: 7800:0:(o2iblnd_cb.c:3207:kiblnd_check_conns()) Timed out tx for 192.168.1.103@o2ib: 4294973 seconds&lt;br/&gt;
Jun 13 09:47:21 mgs_mds_1 kernel: LNet: 7800:0:(o2iblnd_cb.c:3207:kiblnd_check_conns()) Skipped 3 previous similar messages&lt;br/&gt;
Jun 13 09:47:21 mgs_mds_1 kernel: Lustre: 7812:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1497361636/rea&lt;br/&gt;
l 1497361641]  req@ffff882f44c5ad00 x1570097395139264/t0(0) o8-&amp;gt;nlsdraid-OST0003-osc-MDT0000@192.168.1.106@o2ib:28/4 lens 520/544 e 0 to 1 dl 1497361656 ref 1 fl R&lt;br/&gt;
pc:eXN/0/ffffffff rc 0/-1&lt;br/&gt;
Jun 13 09:47:21 mgs_mds_1 kernel: Lustre: 7812:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 3 previous similar messages&lt;br/&gt;
Jun 13 09:47:41 mgs_mds_1 kernel: Lustre: 7812:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1497361661/rea&lt;br/&gt;
l 1497361661]  req@ffff882f44c5b600 x1570097395139312/t0(0) o8-&amp;gt;nlsdraid-OST0000-osc-MDT0000@192.168.1.103@o2ib:28/4 lens 520/544 e 0 to 1 dl 1497361686 ref 1 fl R&lt;br/&gt;
pc:eXN/0/ffffffff rc 0/-1&lt;br/&gt;
Jun 13 09:47:41 mgs_mds_1 kernel: Lustre: 7812:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 3 previous similar messages &lt;/p&gt;</comment>
                            <comment id="199058" author="doug" created="Tue, 13 Jun 2017 15:43:35 +0000"  >&lt;p&gt;John,&lt;/p&gt;

&lt;p&gt;Is this system using Omni-Path?&lt;/p&gt;</comment>
                            <comment id="199059" author="ashehata" created="Tue, 13 Jun 2017 15:44:39 +0000"  >&lt;p&gt;can you enable net and neterror to see more info from LNet/LND.&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;lctl set_param debug=+net
lctl set_param debug=+neterror
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
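
&lt;p&gt;Once the problem reproduces, the debug buffer can be dumped to a file for attachment (a minimal sketch; the output path is only an example):&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;# dump (and clear) the kernel debug log to a file
lctl dk /tmp/lustre-debug.log
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;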

&lt;p&gt;Do your nodes have any OPA interfaces? Did you make sure that all the LND tunables are the same across your network?&lt;/p&gt;</comment>
                            <comment id="199060" author="jsalians_intel" created="Tue, 13 Jun 2017 15:44:50 +0000"  >&lt;p&gt;Yes, this is OPA1 &lt;/p&gt;</comment>
                            <comment id="199063" author="jsalians_intel" created="Tue, 13 Jun 2017 15:46:34 +0000"  >&lt;p&gt;I will try to set this and see &lt;/p&gt;

&lt;p&gt;Everything was working, I had some Lustre errors on a few nodes so I rebooted everything and couldn&apos;t come up after that.  Maybe some settings are inconsistent?  I will look and see.  I can also try the debug settings. &lt;/p&gt;</comment>
                            <comment id="199067" author="doug" created="Tue, 13 Jun 2017 15:58:33 +0000"  >&lt;p&gt;I think there are two issues here:&lt;/p&gt;
&lt;ol&gt;
	&lt;li&gt;Asserting on reconnects. &#160;This is addressed by &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-9507&quot; title=&quot;o2iblnd assert on reconnect&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-9507&quot;&gt;&lt;del&gt;LU-9507&lt;/del&gt;&lt;/a&gt;.&lt;/li&gt;
	&lt;li&gt;Timeouts on transmissions are failing connections, causing them to need to reconnect. &#160;This one is a mystery and could indicate a hardware issue or a switch that needs to be restarted, etc. (one way to compare tunables is sketched below).&lt;/li&gt;
&lt;/ol&gt;
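
&lt;p&gt;For the second point, one way to compare LND tunables side by side (a sketch; the hostnames are placeholders, and the parameter directory assumes ko2iblnd is loaded):&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;# print every ko2iblnd module parameter on each node, then diff the outputs
for n in oss1 oss2 oss3 oss4 mds1; do
    echo &quot;== $n ==&quot;
    ssh $n &apos;grep -r . /sys/module/ko2iblnd/parameters/&apos;
done
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;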


&lt;p&gt;Doug&lt;/p&gt;</comment>
                            <comment id="199123" author="jsalians_intel" created="Tue, 13 Jun 2017 19:13:28 +0000"  >&lt;p&gt;On mds node we rebooted switch, node, set the debug and we see: &lt;br/&gt;
[  428.470987] Lustre: Lustre: Build Version: 2.9.58_dirty&lt;br/&gt;
[  428.527447] LNet: Using FMR for registration&lt;br/&gt;
[  428.576708] LNet: Added LNI 192.168.1.101@o2ib &lt;span class=&quot;error&quot;&gt;&amp;#91;128/2048/0/180&amp;#93;&lt;/span&gt;&lt;br/&gt;
[  428.795760] Lustre: MGS: Connection restored to 1ae30eb4-5113-4ca8-1a68-afd0617f82be (at 0@lo)&lt;br/&gt;
[  428.902858] device-mapper: multipath: Reinstating path 66:192.&lt;br/&gt;
[  429.283680] Lustre: nlsdraid-MDT0000: Imperative Recovery not enabled, recovery window 300-900&lt;br/&gt;
[  429.385910] LustreError: 7665:0:(osd_oi.c:503:osd_oid()) nlsdraid-MDT0000-osd: unsupported quota oid: 0x16&lt;br/&gt;
[  433.527252] LNet: 7723:0:(o2iblnd_cb.c:3207:kiblnd_check_conns()) Timed out tx for 192.168.1.103@o2ib: 4295100 seconds&lt;br/&gt;
[  433.527328] Lustre: 7735:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: &lt;span class=&quot;error&quot;&gt;&amp;#91;sent 1497380279/real 1497380283&amp;#93;&lt;/span&gt;  req@ffff882f857e0600 x1570117022384320/t0(0) o8-&amp;gt;nlsdraid-OST0002-osc-MDT0000@192.168.1.105@o2ib:28/4 lens 520/544 e 0 to 1 dl 1497380284 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1 &lt;/p&gt;

&lt;p&gt;These timeouts appear to be to Lustre OSS nodes that are not up: &lt;br/&gt;
192.168.1.103	ssu1_oss1 lustre1&lt;br/&gt;
192.168.1.104	ssu1_oss2 lustre2&lt;br/&gt;
192.168.1.105	ssu2_oss1 lustre3&lt;br/&gt;
192.168.1.106	ssu2_oss2 lustre4&lt;br/&gt;
192.168.1.101   mgs_mds_1 lustre5 &lt;/p&gt;

&lt;p&gt;Testing with LNet selftest, I can produce failures that look like this on the head node (the run is sketched after this log): &lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;3432045.164205&amp;#93;&lt;/span&gt; hfi1 0000:81:00.0: hfi1_0: 8051: Link down&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;3432045.164236&amp;#93;&lt;/span&gt; hfi1 0000:81:00.0: hfi1_0: set_link_state: current ACTIVE, new OFFLINE &lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;3432046.288501&amp;#93;&lt;/span&gt; hfi1 0000:81:00.0: hfi1_0: logical state changed to PORT_DOWN (0x1)&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;3432046.288574&amp;#93;&lt;/span&gt; hfi1 0000:81:00.0: hfi1_0: hfi1_ibphys_portstate: physical state changed to PHYS_OFFLINE (0x9), phy 0x90&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;3432046.288587&amp;#93;&lt;/span&gt; hfi1 0000:81:00.0: hfi1_0: set_link_state: current OFFLINE, new POLL &lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;3432046.288688&amp;#93;&lt;/span&gt; hfi1 0000:81:00.0: hfi1_0: Entering SPC freeze &lt;/p&gt;
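
&lt;p&gt;An lnet_selftest run of this shape can reproduce the load (a sketch; session and group names and the brw parameters are illustrative, the NIDs are from the table above):&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;modprobe lnet_selftest
export LST_SESSION=$$
lst new_session reconnect_check
lst add_group clients 192.168.1.101@o2ib
lst add_group servers 192.168.1.103@o2ib
lst add_batch bulk
lst add_test --batch bulk --from clients --to servers brw write size=1M
lst run bulk
lst stat clients servers
lst end_session
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;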

&lt;p&gt;I have opened a ticket with the admin of the system to look at that.&lt;/p&gt;</comment>
                            <comment id="199135" author="doug" created="Tue, 13 Jun 2017 21:18:25 +0000"  >&lt;p&gt;Yeah, seeing HFI take a port down is not a good sign.  Could be a marginal cable.&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzzez3:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>