<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:37:29 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-10707] TCP eth routed LNet traffic broken</title>
                <link>https://jira.whamcloud.com/browse/LU-10707</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Hi Folks,&lt;/p&gt;

&lt;p&gt;We&apos;ve been experiencing a problem with our LNet routers in lustre 2.10.x and are hoping we could get some guidance on a resolution.&lt;/p&gt;

&lt;p&gt;In short: Connections from clients which reside in a TCP ethernet environment are timing out and expiring after the (default) &quot;peer_timeout 180&quot; limit is up. The same client/router configuration with lustre-client 2.9.0 on our routers does not have the same behaviour. As far as can be determined, the issue is only present on the ethernet side and only when the router uses lustre version 2.10.x (tried 2.10.1 / 2.10.2 / 2.10.3).&lt;/p&gt;

&lt;p&gt;Our routers have a single port OPA, dual port ConnectX-3, dual port ConnectX-4 100GbE, and dual port 10GbE. I tested with various combinations of those cards installed, the most basic failing configuration being a 10GbE and CX-3 to our QLogic fabric.&lt;/p&gt;

&lt;p&gt;On the ethernet side, we&apos;ve tried multiple ethernet fabrics (Cisco Nexus, Mellanox w/Cumulus) and multiple adapter configurations (native vlan vs tagged vlan, bonded vs non-bonded). Issues with all of them.&lt;/p&gt;

&lt;p&gt;Multiple router/client lustre.conf configs were tried, including various settings (and empty) ko2iblnd.conf configs on the router too.&lt;/p&gt;

&lt;p&gt;What&apos;s observed from the eth client: &lt;br/&gt;
 If I only ping the @tcp router address, it will respond up until the 180 second timeout. Routes are marked as up during this period until the peer_timeout is reached, at which point the routes will be marked down.&lt;/p&gt;

&lt;p&gt;However, if I ping a machine on IB network, I&apos;ll receive an &quot;Input/output error&quot;, eg:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt; 
&quot;failed to ping 192.168.55.143@o2ib10: Input/output error&quot;


&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Routes will then be marked down 50 seconds after the first &quot;Input/output error&quot; to an IB network.&lt;/p&gt;

&lt;p&gt;On the lnet router, I&apos;m not seeing any errors logged when I ping an IB network from the client and receive an error. I do see a ping error in the logs when pinging an @tcp address, but only after the routes are marked down. eg:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[VM root@data-mover-dev ~]# lctl ping 10.8.49.16@tcp101
12345-0@lo
12345-192.168.44.16@o2ib44
12345-192.168.55.232@o2ib10
12345-192.168.55.232@o2ib
12345-10.8.49.16@tcp101
[VM root@data-mover-dev ~]#


&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;wait the 180 secs..&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[VM root@data-mover-dev ~]# lctl ping 10.8.49.16@tcp101
failed to ping 10.8.49.16@tcp101: Input/output error
[VM root@data-mover-dev ~]#


&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Feb 23 23:14:05 lnet02 kernel: LNetError: 33850:0:(lib-move.c:2120:lnet_parse_get()) 10.8.49.16@tcp101: Unable to send REPLY for GET from 12345-10.8.49.155@tcp101: -113


&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;I found it a little tricky to debug the LNET traffic flow, and would welcome recommendations. At a TCP level I&apos;ve captured the flow and can show the differences between a non-working 2.9.0 client / 2.10 router and a working 2.9.0 client/router. Would that be of any use? It only really shows the non-working lctl ping reply.&lt;/p&gt;

&lt;p&gt;Ethernet client&apos;s lustre.conf:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;options lnet networks=tcp101(eth3.3015) routes=&quot;o2ib0 1 10.8.44.16@tcp101;o2ib10 1 10.8.44.16@tcp101;o2ib44 1 10.8.44.16@tcp101&quot;


&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Lnet router&apos;s lustre.conf:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;options lnet networks=&quot;o2ib44(ib0), o2ib10(ib1), o2ib0(ib1), tcp101(bond0.3015)&quot; forwarding=enabled

&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;After searching around there&apos;s this thread which is pretty similar: &lt;br/&gt;
 &lt;a href=&quot;https://www.mail-archive.com/lustre-discuss@lists.lustre.org/msg14168.html&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://www.mail-archive.com/lustre-discuss@lists.lustre.org/msg14168.html&lt;/a&gt; &lt;br/&gt;
 AFAIK we need 2.10.x for EL7.4. I&apos;m not sure lustre-client 2.9.0 will build on EL7.4? (Can&apos;t build it via DKMS, and from source RPM it fails &#8211; looked like OFED changes in 7.4??).&lt;/p&gt;

&lt;p&gt;Glad to provide more information on request. &lt;/p&gt;

&lt;p&gt;Regards,&lt;br/&gt;
 Simon&lt;/p&gt;</description>
                <environment>CentOS 7.4, OPA, QDR, kernel OFED, lustre-client 2.10.3. </environment>
        <key id="50923">LU-10707</key>
            <summary>TCP eth routed LNet traffic broken</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="simmonsja">James A Simmons</assignee>
                                    <reporter username="scadmin">SC Admin</reporter>
                        <labels>
                            <label>lnet</label>
                            <label>patch</label>
                    </labels>
                <created>Fri, 23 Feb 2018 12:52:22 +0000</created>
                <updated>Fri, 8 Nov 2019 02:49:36 +0000</updated>
                            <resolved>Thu, 3 May 2018 19:27:15 +0000</resolved>
                                    <version>Lustre 2.10.1</version>
                    <version>Lustre 2.10.2</version>
                    <version>Lustre 2.10.3</version>
                                    <fixVersion>Lustre 2.10.4</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>10</watches>
                                                                            <comments>
                            <comment id="221558" author="scadmin" created="Fri, 23 Feb 2018 13:07:40 +0000"  >&lt;p&gt;Realised I pasted the older config line from the ethernet client&apos;s lustre.conf. The current one is:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;options lnet networks=tcp101(eth3.3015) routes=&quot;o2ib0 1 10.8.49.16@tcp101;o2ib10 1 10.8.49.16@tcp101;o2ib44 1 10.8.49.16@tcp101&quot;

&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="221560" author="pjones" created="Fri, 23 Feb 2018 13:31:50 +0000"  >&lt;p&gt;Amir&lt;/p&gt;

&lt;p&gt;Can you please advise&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="221586" author="ashehata" created="Fri, 23 Feb 2018 17:32:11 +0000"  >&lt;p&gt;Let&apos;s start with the most basic configuration&lt;/p&gt;

&lt;p&gt;OPA NODE &amp;lt;-----&amp;gt; router &amp;lt;-----&amp;gt; TCP NODE&lt;/p&gt;

&lt;p&gt;Can you provide the following information on both the OPA and TCP nodes:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;lnetctl export &amp;gt; config.yaml
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Also can you enable net logging on OPA, router and TCP nodes using &lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;lctl set_param debug=+&lt;span class=&quot;code-quote&quot;&gt;&quot;net neterror&quot;&lt;/span&gt;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And then run the failed ping test. Afterwards collect the dump from all 3 nodes:&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;lctl dk &amp;gt; &amp;lt;node&amp;gt;.log
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="221595" author="ashehata" created="Fri, 23 Feb 2018 18:27:40 +0000"  >&lt;p&gt;I&apos;m able to reproduce the problem. I&apos;ll update the ticket once I have a resolution.&lt;/p&gt;</comment>
                            <comment id="221616" author="ashehata" created="Sat, 24 Feb 2018 02:48:31 +0000"  >&lt;p&gt;The problem was introduced by the following two patches:&lt;br/&gt;
&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6245&quot; title=&quot;Untangle userland and kernel space support for libcfs&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6245&quot;&gt;&lt;del&gt;LU-6245&lt;/del&gt;&lt;/a&gt; libcfs: add ktime_get_real_seconds support&lt;br/&gt;
&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-9397&quot; title=&quot;Inconsistence use of cfs_time_current() and ktime_get_real_seconds()&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-9397&quot;&gt;&lt;del&gt;LU-9397&lt;/del&gt;&lt;/a&gt; ksocklnd: move remaining time handling to 64 bits&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-9397&quot; title=&quot;Inconsistence use of cfs_time_current() and ktime_get_real_seconds()&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-9397&quot;&gt;&lt;del&gt;LU-9397&lt;/del&gt;&lt;/a&gt; needs to be reverted and the socklnd changes that were made as part of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-6245&quot; title=&quot;Untangle userland and kernel space support for libcfs&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-6245&quot;&gt;&lt;del&gt;LU-6245&lt;/del&gt;&lt;/a&gt; need to be reverted.&lt;/p&gt;</comment>
                            <comment id="221625" author="simmonsja" created="Sun, 25 Feb 2018 00:19:36 +0000"  >&lt;p&gt;Instead of reverting, let&apos;s figure out what the problem is. Note that if you revert, we end up with the problem of jiffies being used for node to node communication. If one node uses a different value of HZ then we can also run into problems. This would be trading one corner case for another. I will discuss with you a clear way to duplicate it so we can properly fix it.&lt;/p&gt;</comment>
                            <comment id="224475" author="pjones" created="Sat, 24 Mar 2018 14:29:12 +0000"  >&lt;p&gt;&lt;a href=&quot;https://jira.whamcloud.com/secure/ViewProfile.jspa?name=simmonsja&quot; class=&quot;user-hover&quot; rel=&quot;simmonsja&quot;&gt;simmonsja&lt;/a&gt; it seems like this might take a while to work through. How about we revert to a consistent state for 2.10.4 while the longer term work is ongoing?&lt;/p&gt;</comment>
                            <comment id="224485" author="simmonsja" created="Sat, 24 Mar 2018 15:34:22 +0000"  >&lt;p&gt;It&apos;s just a matter of me getting a test setup. I will talk to Amir. I think I know what fix is needed.&lt;/p&gt;</comment>
                            <comment id="224702" author="gerrit" created="Wed, 28 Mar 2018 11:53:04 +0000"  >&lt;p&gt;James Simmons (uja.ornl@yahoo.com) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/31810&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/31810&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-10707&quot; title=&quot;TCP eth routed LNet traffic broken&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-10707&quot;&gt;&lt;del&gt;LU-10707&lt;/del&gt;&lt;/a&gt; socklnd: replace cfs_duration_sec with cfs_time_seconds&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: eb040d9f64db9bae0029b6a8481c5efc24c0462d&lt;/p&gt;</comment>
                            <comment id="224848" author="simmonsja" created="Thu, 29 Mar 2018 23:48:10 +0000"  >&lt;p&gt;Hmm. The direction of the patch will be determined by porting back newer MOFED stack support since these patches have many dependencies on each other.&lt;/p&gt;</comment>
                            <comment id="225203" author="scadmin" created="Thu, 5 Apr 2018 14:02:45 +0000"  >&lt;p&gt;Hi,&lt;/p&gt;

&lt;p&gt;Had a test of 2.10.3 + patches #1, #2 but realised I&apos;d missed out on a minor network config change due to some physical node changes in the last 3 weeks. I&apos;d moved on to testing the new lustre-client 2.11.0 on the lnet router by the time I realised that. I can report that using 2.11.0 as the lnet router&apos;s lustre-client version (with the cfs_time_seconds change) does mean it remains up and pingable for the ethernet client.&lt;/p&gt;

&lt;p&gt;Routes currently look like this:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[VM root@data-mover-dev ~]# lctl show_route
net               o2ib hops 4294967295 gw                10.8.49.16@tcp101 up pri 0
net             o2ib10 hops 4294967295 gw                10.8.49.16@tcp101 up pri 0
net             o2ib44 hops 4294967295 gw                10.8.49.16@tcp101 down pri 0
[VM root@data-mover-dev ~]#&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;From the LNET router itself I&apos;m able to mount and use lustre filesystems on&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;o2ib0&lt;/li&gt;
	&lt;li&gt;o2ib10&lt;/li&gt;
	&lt;li&gt;o2ib44&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;From the Ethernet client I can ping only o2ib0 &amp;amp; o2ib10. I can ping MDTs but not mount filesystems. Obviously I can&apos;t ping or mount on o2ib44 either. Possibly a misconfiguration with the lustre module for OPA? This is our first go at routing between eth/opa/ib and multiple FS&apos;s on each.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[VM root@data-mover-dev ~]# mount -t lustre 192.168.55.143@o2ib10:/beer /beer
mount.lustre: mount 192.168.55.143@o2ib10:/beer at /beer failed: Input/output error
Is the MGS running?
[VM root@data-mover-dev ~]# mount -t lustre 192.168.55.129@o2ib:192.168.55.130@o2ib:/lustre /lustre
mount.lustre: mount 192.168.55.129@o2ib:192.168.55.130@o2ib:/lustre at /lustre failed: Input/output error
Is the MGS running?
[VM root@data-mover-dev ~]# lctl ping 192.168.55.143@o2ib10
failed to ping 192.168.55.143@o2ib10: Input/output error
[VM root@data-mover-dev ~]# lctl ping 192.168.55.143@o2ib10
12345-0@lo
12345-192.168.55.143@o2ib10
[VM root@data-mover-dev ~]#
[VM root@data-mover-dev ~]#
[VM root@data-mover-dev ~]#
[VM root@data-mover-dev ~]# lctl ping 192.168.55.129@o2ib; lctl ping 192.168.55.130@o2ib
failed to ping 192.168.55.129@o2ib: Input/output error
12345-0@lo
12345-192.168.55.130@o2ib
[VM root@data-mover-dev ~]# lctl ping 192.168.55.129@o2ib; lctl ping 192.168.55.130@o2ib
12345-0@lo
12345-192.168.55.129@o2ib
failed to ping 192.168.55.130@o2ib: Input/output error
[VM root@data-mover-dev ~]#
[VM root@data-mover-dev ~]#
[VM root@data-mover-dev ~]#&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Getting late.. I&apos;ll re-do with 2.10.3 + patch tomorrow.&#160; It&apos;s also a chance to review and make sure I didn&apos;t miss anything else. It&apos;s been over a month since I looked at this &lt;img class=&quot;emoticon&quot; src=&quot;https://jira.whamcloud.com/images/icons/emoticons/smile.png&quot; height=&quot;16&quot; width=&quot;16&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/p&gt;

&lt;p&gt;Cheers,&lt;/p&gt;

&lt;p&gt;Simon&lt;/p&gt;</comment>
                            <comment id="226117" author="gerrit" created="Mon, 16 Apr 2018 23:26:08 +0000"  >&lt;p&gt;James Simmons (uja.ornl@yahoo.com) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/32015&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/32015&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-10707&quot; title=&quot;TCP eth routed LNet traffic broken&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-10707&quot;&gt;&lt;del&gt;LU-10707&lt;/del&gt;&lt;/a&gt; ksocklnd: revert back to jiffies&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_10&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 70f5192518961cfb056bd4fa1960c6520a030289&lt;/p&gt;</comment>
                            <comment id="226166" author="scadmin" created="Tue, 17 Apr 2018 14:54:30 +0000"  >&lt;p&gt;Hi, &lt;/p&gt;

&lt;p&gt;Thanks for putting out another update on this. I had a look at the latest 70f5192.diff this evening. I tested a patched lnet router running 2.10.3 with both unpatched and patched 2.10.3 lustre-clients. No go unfortunately. Similar scenario to the previous patches. Pings will keep working past the 180 second mark, though intermittently. eg: &lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[VM root@data-mover-dev ~]# lctl ping 10.8.49.16@tcp101
12345-0@lo
12345-192.168.44.16@o2ib44
12345-192.168.55.232@o2ib10
12345-192.168.55.232@o2ib
12345-10.8.49.16@tcp101
[VM root@data-mover-dev ~]# lctl ping 10.8.49.16@tcp101
12345-0@lo
12345-192.168.44.16@o2ib44
12345-192.168.55.232@o2ib10
12345-192.168.55.232@o2ib
12345-10.8.49.16@tcp101
[VM root@data-mover-dev ~]# lctl ping 10.8.49.16@tcp101
12345-0@lo
12345-192.168.44.16@o2ib44
12345-192.168.55.232@o2ib10
12345-192.168.55.232@o2ib
12345-10.8.49.16@tcp101
[VM root@data-mover-dev ~]# lctl ping 10.8.49.16@tcp101
failed to ping 10.8.49.16@tcp101: Input/output error
[VM root@data-mover-dev ~]# lctl ping 10.8.49.16@tcp101
failed to ping 10.8.49.16@tcp101: Input/output error
[VM root@data-mover-dev ~]# lctl ping 10.8.49.16@tcp101
12345-0@lo
12345-192.168.44.16@o2ib44
12345-192.168.55.232@o2ib10
12345-192.168.55.232@o2ib
12345-10.8.49.16@tcp101
[VM root@data-mover-dev ~]#
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Routes seem to all be &apos;up&apos; for 120 seconds, but I&apos;m not able to actually route any traffic. eg: &lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;LNET configured
Wed Apr 18 00:09:17 AEST 2018
net               o2ib hops 4294967295 gw                10.8.49.16@tcp101 up pri 0
net             o2ib10 hops 4294967295 gw                10.8.49.16@tcp101 up pri 0
net             o2ib44 hops 4294967295 gw                10.8.49.16@tcp101 up pri 0

Wed Apr 18 00:11:19 AEST 2018
net               o2ib hops 4294967295 gw                10.8.49.16@tcp101 up pri 0
net             o2ib10 hops 4294967295 gw                10.8.49.16@tcp101 up pri 0
net             o2ib44 hops 4294967295 gw                10.8.49.16@tcp101 up pri 0

Wed Apr 18 00:11:20 AEST 2018
net               o2ib hops 4294967295 gw                10.8.49.16@tcp101 down pri 0
net             o2ib10 hops 4294967295 gw                10.8.49.16@tcp101 down pri 0
net             o2ib44 hops 4294967295 gw                10.8.49.16@tcp101 down pri 0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I can still ping the lnet router from the client after the routes are marked down. eg: &lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[VM root@data-mover-dev ~]# lctl ping 10.8.49.16@tcp101
12345-0@lo
12345-192.168.44.16@o2ib44
12345-192.168.55.232@o2ib10
12345-192.168.55.232@o2ib
12345-10.8.49.16@tcp101
[VM root@data-mover-dev ~]#
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;


&lt;p&gt;But from the client, whilst the routes are marked &apos;up&apos; I&apos;m not able to ping a routed network. eg: &lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[VM root@data-mover-dev ~]# lctl ping 192.168.55.143@o2ib10
failed to ping 192.168.55.143@o2ib10: Input/output error
[VM root@data-mover-dev ~]# lctl ping 192.168.55.143@o2ib10
failed to ping 192.168.55.143@o2ib10: Input/output error
[VM root@data-mover-dev ~]# lctl ping 192.168.55.143@o2ib10
failed to ping 192.168.55.143@o2ib10: Input/output error
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;


&lt;p&gt;This works on another client which is routed via a 2.9.x lnet router. eg: &lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[VM root@data-mover01 ~]# lctl ping 192.168.55.143@o2ib10
12345-0@lo
12345-192.168.55.143@o2ib10
[VM root@data-mover01 ~]#
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Cheers, &lt;br/&gt;
Simon&lt;/p&gt;
</comment>
                            <comment id="226207" author="scadmin" created="Wed, 18 Apr 2018 02:01:36 +0000"  >&lt;p&gt;An update: looked more into it this morning. Routes can stay up now. &lt;/p&gt;

&lt;p&gt;&#160;On some destination hosts we saw:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Apr 18 08:39:54 metadata01 kernel: LNetError: 1719:0:(o2iblnd_cb.c:2643:kiblnd_rejected()) 192.168.55.232@o2ib rejected: incompatible # of RDMA fragments 32, 256&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;On the LNET router I changed &quot;map_on_demand=32&quot; to &quot;0&quot;, reloaded and got: &lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Apr 18 08:46:50 metadata01 kernel: LNetError: 1719:0:(o2iblnd_cb.c:2311:kiblnd_passive_connect()) Can&apos;t accept 192.168.55.232@o2ib: incompatible queue depth 128 (8 wanted)
Apr 18 08:46:50 metadata01 kernel: LNetError: 1719:0:(o2iblnd_cb.c:2311:kiblnd_passive_connect()) Skipped 3 previous similar messages
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Again on the LNET router, I changed &quot;peer_credits=128&quot; to &quot;8&quot; and haven&apos;t seen further LNetErrors on the test hosts, nor routes marked down again. &lt;/p&gt;
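&lt;p&gt;For anyone following along, the two router-side changes above would look roughly like this in ko2iblnd.conf. This is only a sketch: the two values are the ones described above, and it assumes they sit on a single ko2iblnd options line, with any other options left out.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# LNet router ko2iblnd.conf (sketch, other options omitted)
# previously: map_on_demand=32 peer_credits=128
options ko2iblnd map_on_demand=0 peer_credits=8
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;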

&lt;p&gt;Pings are still erratic, in that repeated lctl pings to a host will result in success, then input/output errors, success, etc. Is this to be expected?&lt;/p&gt;

&lt;p&gt;Still not able to mount our old filesystems (the ones on the truescale qlogic gear), but now the challenge seems to be having both OPA and truescale in the LNET router and finding the right ko2iblnd.conf settings. Will aim to get the OPA lustre storage servers configured to test the LNET router soon. &lt;/p&gt;

&lt;p&gt;cheers&lt;br/&gt;
simon&lt;/p&gt;</comment>
                            <comment id="226238" author="scadmin" created="Wed, 18 Apr 2018 13:03:24 +0000"  >&lt;p&gt;Good news. Got our QDR and OPA lustre FS&apos;s up and going via the patched lnet router earlier tonight and they&apos;ve remained that way since! &lt;img class=&quot;emoticon&quot; src=&quot;https://jira.whamcloud.com/images/icons/emoticons/smile.png&quot; height=&quot;16&quot; width=&quot;16&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/p&gt;

&lt;p&gt;Cheers,&lt;/p&gt;

&lt;p&gt;simon&lt;/p&gt;</comment>
                            <comment id="226403" author="gerrit" created="Thu, 19 Apr 2018 17:43:40 +0000"  >&lt;p&gt;Amir Shehata (amir.shehata@intel.com) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/32082&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/32082&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-10707&quot; title=&quot;TCP eth routed LNet traffic broken&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-10707&quot;&gt;&lt;del&gt;LU-10707&lt;/del&gt;&lt;/a&gt; lnet: revert to cfs_time functions&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_10&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 44cdaa38e5ed2e53572b91ba08ba91680a616532&lt;/p&gt;</comment>
                            <comment id="226467" author="sbuisson" created="Fri, 20 Apr 2018 13:18:42 +0000"  >&lt;p&gt;With patch &lt;a href=&quot;https://review.whamcloud.com/32082&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/32082&lt;/a&gt;,&#160;I am not able to reproduce the ping timeout issue anymore.&lt;/p&gt;</comment>
                            <comment id="226519" author="scadmin" created="Sat, 21 Apr 2018 11:13:44 +0000"  >&lt;p&gt;Added in the updated patch &lt;a href=&quot;https://review.whamcloud.com/32082&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/32082 &lt;/a&gt; and it&apos;s resolved the reconnects on lustre nodes, plus lnet_selftest passes now.&lt;/p&gt;

&lt;p&gt;cheers,&lt;br/&gt;
Simon&lt;/p&gt;</comment>
                            <comment id="226521" author="pjones" created="Sat, 21 Apr 2018 13:02:53 +0000"  >&lt;p&gt;That&apos;s good news Simon. We&apos;ll look to queue up this fix for the upcoming 2.10.4 release&lt;/p&gt;</comment>
                            <comment id="226522" author="simmonsja" created="Sat, 21 Apr 2018 13:38:10 +0000"  >&lt;p&gt;Sadly some of the work reverted was a backport from the linux lustre client. This means that the upstream client is broken with routers &lt;img class=&quot;emoticon&quot; src=&quot;https://jira.whamcloud.com/images/icons/emoticons/sad.png&quot; height=&quot;16&quot; width=&quot;16&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&#160;&lt;/p&gt;</comment>
                            <comment id="226525" author="simmonsja" created="Sun, 22 Apr 2018 17:56:07 +0000"  >&lt;p&gt;I see two patches are needed. One patch from me &lt;a href=&quot;https://review.whamcloud.com/#/c/32015&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/#/c/32015&lt;/a&gt;&#160;and another patch &lt;a href=&quot;https://review.whamcloud.com/#/c/32082/2&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/#/c/32082&lt;/a&gt;&#160;from Amir. Sebastien can you change your review on my patch so both can land.&lt;/p&gt;</comment>
                            <comment id="227235" author="gerrit" created="Thu, 3 May 2018 19:16:59 +0000"  >&lt;p&gt;John L. Hammond (john.hammond@intel.com) merged in patch &lt;a href=&quot;https://review.whamcloud.com/32015/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/32015/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-10707&quot; title=&quot;TCP eth routed LNet traffic broken&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-10707&quot;&gt;&lt;del&gt;LU-10707&lt;/del&gt;&lt;/a&gt; ksocklnd: revert back to jiffies&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_10&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 62947eaec70d74d753faadee3f22f928b59fec52&lt;/p&gt;</comment>
                            <comment id="227236" author="gerrit" created="Thu, 3 May 2018 19:17:13 +0000"  >&lt;p&gt;John L. Hammond (john.hammond@intel.com) merged in patch &lt;a href=&quot;https://review.whamcloud.com/32082/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/32082/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-10707&quot; title=&quot;TCP eth routed LNet traffic broken&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-10707&quot;&gt;&lt;del&gt;LU-10707&lt;/del&gt;&lt;/a&gt; lnet: revert to cfs_time functions&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_10&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 0049c057d0ad5e1c56dc972004ca414dbfe6a6b8&lt;/p&gt;</comment>
                            <comment id="227237" author="simmonsja" created="Thu, 3 May 2018 19:27:15 +0000"  >&lt;p&gt;Should be fixed now. If you still see problem feel free to open&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="45718">LU-9397</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="28700">LU-6245</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="42994">LU-9019</issuekey>
        </issuelink>
                            </outwardlinks>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="51318">LU-10807</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzzt87:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10021"><![CDATA[2]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>