[LU-10707] TCP eth routed LNet traffic broken - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Major
Fix Version/s: Lustre 2.10.4
Affects Version/s: Lustre 2.10.1, Lustre 2.10.2, Lustre 2.10.3
Labels:
- lnet
- patch
Environment:
CentOS 7.4, OPA, QDR, kernel OFED, lustre-client 2.10.3.

Severity:
2

Description

Hi Folks,

We've been experiencing a problem with our LNet routers in lustre 2.10.x and are hoping to get some guidance on a resolution.

In short: connections from clients that reside in a TCP ethernet environment are timing out and expiring once the (default) "peer_timeout 180" limit is up. The same client/router configuration with lustre-client 2.9.0 on our routers does not show this behaviour. As far as can be determined, the issue is only present on the ethernet side and only when the router runs lustre 2.10.x (tried 2.10.1 / 2.10.2 / 2.10.3).

Our routers have a single-port OPA, dual-port ConnectX-3, dual-port ConnectX-4 100GbE and dual-port 10GbE. I tested with various combinations of those cards installed, the most basic failing configuration being a 10GbE and a CX-3 to our Qlogic fabric.

On the ethernet side, we've tried multiple ethernet fabrics (Cisco Nexus, Mellanox w/Cumulus) and multiple adapter configurations: native vlan vs tagged vlan, bonded vs non-bonded. The issue occurs with all of them.

Multiple router/client lustre.conf configs were tried, as were various (and empty) ko2iblnd.conf configs on the router.

What's observed from the eth client:
If I only ping the @tcp router address, it will respond up until the 180 second timeout. Routes are marked as up during this period until the peer_timeout is reached, at which point the routes will be marked down.

However, if I ping a machine on the IB network, I receive an "Input/output error", e.g.:

"failed to ping 192.168.55.143@o2ib10: Input/output error"

Routes will then be marked down 50 seconds after the first "Input/output error" to an IB network.

On the lnet router, I'm not seeing any errors logged when I ping an IB network from the client and receive an error. I do see a ping error in the logs when pinging an @tcp address, but only after the routes are marked down, e.g.:

[VM root@data-mover-dev ~]# lctl ping 10.8.49.16@tcp101
12345-0@lo
12345-192.168.44.16@o2ib44
12345-192.168.55.232@o2ib10
12345-192.168.55.232@o2ib
12345-10.8.49.16@tcp101
[VM root@data-mover-dev ~]#

wait the 180 secs..

[VM root@data-mover-dev ~]# lctl ping 10.8.49.16@tcp101
failed to ping 10.8.49.16@tcp101: Input/output error
[VM root@data-mover-dev ~]#

Feb 23 23:14:05 lnet02 kernel: LNetError: 33850:0:(lib-move.c:2120:lnet_parse_get()) 10.8.49.16@tcp101: Unable to send REPLY for GET from 12345-10.8.49.155@tcp101: -113
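The timing described above (pings answered until roughly the peer_timeout, then "Input/output error") could be measured with a small polling loop. This is only a sketch: PING_CMD, INTERVAL and MAX_WAIT are illustrative variable names, not Lustre settings, and it assumes that repeated pings do not themselves change when the routes are marked down.

```shell
# Sketch only: poll a NID with `lctl ping` and report how long it keeps answering.
# PING_CMD, INTERVAL and MAX_WAIT are hypothetical names chosen for this example.
PING_CMD=${PING_CMD:-"lctl ping"}   # command used to ping a NID
INTERVAL=${INTERVAL:-10}            # seconds between pings
MAX_WAIT=${MAX_WAIT:-300}           # give up after this many seconds

# time_to_first_failure NID
# Prints the seconds elapsed before the first failed ping and returns 1,
# or prints "none" and returns 0 if every ping within MAX_WAIT succeeded.
time_to_first_failure() {
    nid=$1
    start=$(date +%s)
    while :; do
        elapsed=$(( $(date +%s) - start ))
        # PING_CMD is left unquoted on purpose so "lctl ping" splits into two words.
        if ! $PING_CMD "$nid" >/dev/null 2>&1; then
            echo "$elapsed"
            return 1
        fi
        if [ "$elapsed" -ge "$MAX_WAIT" ]; then
            echo "none"
            return 0
        fi
        sleep "$INTERVAL"
    done
}
```

For example, `time_to_first_failure 10.8.49.16@tcp101` run from the ethernet client would be expected to print a value near 180 if the default peer_timeout is what triggers the failure.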

I found it a little tricky to debug the LNet traffic flow, and would welcome recommendations. At the TCP level I've captured the flow and can show the differences between a non-working 2.9.0 client / 2.10 router and a working 2.9.0 client/router. Would that be of any use? It only really shows the non-working lctl ping reply.
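One possible way to get LNet-level detail around a single failing ping is to enable the "net" debug flag, clear the kernel debug buffer, run the ping and dump the buffer to a file. The sketch below wraps those steps. The function name, the LCTL variable and the output path are illustrative assumptions, and whether the "net" flag alone captures enough detail for this problem is untested here.

```shell
# Sketch only: capture the Lustre kernel debug log around one `lctl ping`.
# LCTL and capture_lnet_debug are hypothetical names chosen for this example.
LCTL=${LCTL:-lctl}

# capture_lnet_debug NID OUTFILE
# Returns the exit status of the ping itself.
capture_lnet_debug() {
    nid=$1
    outfile=$2
    "$LCTL" set_param debug=+net >/dev/null   # add LNet messages to the debug mask
    "$LCTL" clear >/dev/null                  # empty the kernel debug buffer
    "$LCTL" ping "$nid"                       # the operation under investigation
    rc=$?
    "$LCTL" dk "$outfile" >/dev/null          # dump the debug buffer to OUTFILE
    return "$rc"
}
```

Running something like `capture_lnet_debug 10.8.49.16@tcp101 /tmp/lnet-ping.log` on both the client and the router would give two logs to compare. The debug mask is left with "net" enabled afterwards.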

Ethernet client's lustre.conf:

options lnet networks=tcp101(eth3.3015) routes="o2ib0 1 10.8.44.16@tcp101;o2ib10 1 10.8.44.16@tcp101;o2ib44 1 10.8.44.16@tcp101"

Lnet router's lustre.conf:

options lnet networks="o2ib44(ib0), o2ib10(ib1), o2ib0(ib1), tcp101(bond0.3015)" forwarding=enabled

After searching around there's this thread which is pretty similar:
https://www.mail-archive.com/lustre-discuss@lists.lustre.org/msg14168.html
AFAIK we need 2.10.x for EL7.4. I'm not sure lustre-client 2.9.0 will build on EL7.4 (it can't be built via DKMS, and building from the source RPM fails; it looked like OFED changes in 7.4).

Glad to provide more information on request.

Regards,
Simon

Issue Links

is related to

LU-10807 ksocknal_reaper() jitter on b2_10

Resolved

is related to

LU-9397 Inconsistence use of cfs_time_current() and ktime_get_real_seconds()

Resolved

LU-6245 Untangle userland and kernel space support for libcfs

Resolved

LU-9019 Migrate lustre to standard 64 bit time kernel API

Resolved

People

Assignee: James A Simmons

Reporter: SC Admin

Votes: 0

Watchers: 10

Dates

Created: 23/Feb/18 12:52 PM

Updated: 08/Nov/19 2:49 AM

Resolved: 03/May/18 7:27 PM