[LU-13566] socklnd: wrong NID to interface mapping Created: 15/May/20  Updated: 29/Jul/20  Resolved: 28/Jun/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.14.0

Type: Bug Priority: Minor
Reporter: Amir Shehata (Inactive) Assignee: Amir Shehata (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-13790 socklnd: NID to interface mapping issues Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

In a Multi-Rail setup using Ethernet interfaces, the mapping between the LNet-level NIDs and the Ethernet interfaces appears to be wrong.
When sending traffic, LNet reports that all NIDs are being used, but when we monitor the traffic with netstat -i, we see traffic on only a subset of the interfaces.

When we restrict LNet traffic to a subset of the NIDs, the interfaces don't match even for that subset. For example, netstat -i can show traffic on eth0 and eth2 while LNet reports that it is using eth1 and eth2.

However, when using iperf, all Ethernet interfaces are used according to netstat -i.

This behavior is easily reproducible on a simple two-VM Multi-Rail (MR) setup.
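
One way to make the netstat -i comparison repeatable is to diff the kernel's per-interface transmit counters around a test run. The following is a minimal Python sketch, not part of the original report. It assumes the Linux /proc/net/dev layout (eight receive columns followed by eight transmit columns), and the helper names parse_tx_bytes, read_tx_bytes, and tx_delta are hypothetical.

```python
def parse_tx_bytes(proc_net_dev_text):
    """Return {interface: transmitted_bytes} from /proc/net/dev-style text."""
    counters = {}
    for line in proc_net_dev_text.splitlines():
        if ":" not in line:
            continue  # the two header lines contain no colon
        name, fields = line.split(":", 1)
        values = fields.split()
        if len(values) < 16:
            continue  # unexpected layout, skip rather than guess
        # Columns 0-7 are receive counters; column 8 is transmitted bytes.
        counters[name.strip()] = int(values[8])
    return counters


def read_tx_bytes(path="/proc/net/dev"):
    """Snapshot the current transmit byte counters on a Linux host."""
    with open(path) as handle:
        return parse_tx_bytes(handle.read())


def tx_delta(before, after):
    """Bytes transmitted per interface between two snapshots."""
    return {name: after[name] - before.get(name, 0) for name in after}
```

Taking a snapshot with read_tx_bytes() before and after an LNet test and passing both to tx_delta() gives the per-interface byte counts. An interface that LNet reports as active but whose delta stays near zero would match the mismatch described above.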



 Comments   
Comment by Andreas Dilger [ 21/May/20 ]

Is this happening with multiple Ethernet interfaces on the same subnet? I recall that ages ago there was a problem with "source routing" for Ethernet, in that the kernel would select whatever interface it wanted on that subnet, even if LNet was trying to use a specific interface for outgoing packets.

This might be helped by patch https://review.whamcloud.com/37702 "LU-10391 socklnd: use interface index to track local addr" to ensure that the specific interface is used rather than trying to use the address to guide interface selection.
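
The difference between steering by source address and pinning a socket to a specific interface can be sketched at the socket level. This is a userspace Python illustration only, not the kernel socklnd code or the patch referenced above. The helper name open_bound_socket is hypothetical, SO_BINDTODEVICE is Linux-specific, and setting it may require elevated privileges on older kernels.

```python
import socket

# Linux option number for SO_BINDTODEVICE, used only if this Python build
# does not export the constant (assumption: generic Linux value 25).
SO_BINDTODEVICE = getattr(socket, "SO_BINDTODEVICE", 25)


def open_bound_socket(local_ip, ifname=None):
    """Create a TCP socket bound to local_ip and, optionally, to a device.

    Binding to an address only fixes the source IP; with several
    interfaces on one subnet the kernel's routing decision may still
    pick a different egress interface. Passing ifname additionally
    pins the socket to that device.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        if ifname is not None:
            sock.setsockopt(socket.SOL_SOCKET, SO_BINDTODEVICE,
                            ifname.encode() + b"\0")
        sock.bind((local_ip, 0))  # port 0: let the kernel pick a port
    except OSError:
        sock.close()
        raise
    return sock
```

For example, open_bound_socket("192.168.1.10", "eth0") would request both the source address and the device, assuming that address is configured on eth0; the addresses and interface name here are placeholders.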

Comment by Amir Shehata (Inactive) [ 22/May/20 ]

I actually found a problem with that patch. It breaks binding a socket to the correct interface. As a result we keep binding to the same interface.

However, even after I fixed this issue, netstat -i still shows all traffic going over only one of the interfaces. I'm continuing my investigation.

*Correction: the same problem was present in socklnd from the beginning. It was not introduced by the patch.

Comment by Gerrit Updater [ 28/May/20 ]

Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38743
Subject: LU-13566 socklnd: fix local interface binding
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 6ba5f24c9f3845dc991bb45d344172af0c9e6d90

Comment by Amir Shehata (Inactive) [ 28/May/20 ]

For TCP workloads, it's important to properly configure ARP, reverse path filtering, and routing to make sure packets egress over the intended interfaces in a Multi-Rail setup.
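
As a rough illustration of what such a configuration might involve, the Python sketch below only builds candidate commands as strings; it does not change any system state. The sysctl keys and ip subcommands are standard Linux ones, but the specific values, the routing table numbering, and the helper name multirail_commands are assumptions of this sketch, not settings taken from this ticket. Check them against site policy and the Lustre documentation before use.

```python
import ipaddress


def multirail_commands(interfaces, first_table=100):
    """Build per-interface ARP, rp_filter, and policy-routing commands.

    interfaces: list of (ifname, "address/prefix") tuples.
    Returns a list of shell command strings (nothing is executed).
    """
    commands = []
    for offset, (ifname, cidr) in enumerate(interfaces):
        iface = ipaddress.ip_interface(cidr)
        table = first_table + offset  # one routing table per interface
        commands += [
            # Answer ARP only for addresses configured on the receiving interface.
            f"sysctl -w net.ipv4.conf.{ifname}.arp_ignore=1",
            # Prefer a source address that fits the target's subnet in ARP requests.
            f"sysctl -w net.ipv4.conf.{ifname}.arp_announce=2",
            # Loose reverse-path filtering (assumed value for this sketch).
            f"sysctl -w net.ipv4.conf.{ifname}.rp_filter=2",
            # Route the local subnet out of this interface in its own table.
            f"ip route add {iface.network} dev {ifname} src {iface.ip} table {table}",
            # Select that table for traffic sourced from this interface's address.
            f"ip rule add from {iface.ip} table {table}",
        ]
    return commands
```

Printing the result of multirail_commands([("eth0", "192.168.1.10/24"), ("eth1", "192.168.1.11/24")]) lets the commands be reviewed before anything is applied by hand; the interface names and addresses are placeholders.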

Comment by Gerrit Updater [ 28/Jun/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38743/
Subject: LU-13566 socklnd: fix local interface binding
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: a7c9aba5eb96dd1e53899108a65af381b49e657b

Comment by Peter Jones [ 28/Jun/20 ]

Landed for 2.14
