socklnd needs improved interface selection and configuration (LU-14064)

[LU-13565] LNet socklnd with NAT is not working properly Created: 15/May/20  Updated: 22/Oct/20

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Technical task Priority: Minor
Reporter: Amir Shehata (Inactive) Assignee: Amir Shehata (Inactive)
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
Rank (Obsolete): 9223372036854775807

 Description   

sihara found this issue.

When NAT is used in a setup, the wrong NID appears to be used:

peer1: 192.168.122.135 (NAT: 10.128.13.120)

(lib-move.c:1790:lnet_handle_send()) TRACE: 192.168.122.135@tcp(192.168.122.135@tcp:<?>) -> 10.128.13.130@tcp(10.128.13.130@tcp:10.128.13.130@tcp) : GET try# 0

peer2: 10.128.13.130

(lib-move.c:4302:lnet_parse()) TRACE: 10.128.13.130@tcp(10.128.13.130@tcp) <- 10.128.13.120@tcp : GET - for me

(lib-move.c:1858:lnet_handle_send()) TRACE: 10.128.13.130@tcp(10.128.13.130@tcp:10.128.13.130@tcp) -> 10.128.13.120@tcp(10.128.13.120@tcp:10.128.13.120@tcp)

peer1:

(lib-move.c:4236:lnet_parse()) TRACE: 10.128.13.120@tcp(192.168.122.135@tcp) <- 10.128.13.130@tcp : REPLY - routed

It appears that the NID of the node changes somewhere along the line.
LNet shouldn't care about NAT in this case and should work.



 Comments   
Comment by Amir Shehata (Inactive) [ 29/May/20 ]

The problem is that socklnd on the passive side uses the IP address taken from the socket. When a connection is established, the passive side looks up the peer IP address from the socket, but that address is the NATed one. The passive side therefore creates its local peer structure with a NID built from the NATed IP address of the active side. When a response is eventually sent to the active node, the NID in the message contains the NATed IP address rather than the private IP address that LNet on the active node was configured with, and the message is dropped.
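
The failure can be reproduced as a toy model using the addresses from the description. This is only an illustrative sketch, not LNet or socklnd code: every function name is hypothetical, and the check on the active side is simplified to a plain NID comparison.

```python
# Toy model of the NID mismatch; none of these names exist in LNet.

def nat_translate(src_ip, nat_table):
    """Return the source IP as seen by the passive side after NAT."""
    return nat_table.get(src_ip, src_ip)


def passive_peer_nid(socket_peer_ip, net="tcp"):
    """Passive side builds the peer NID from the socket's remote IP."""
    return f"{socket_peer_ip}@{net}"


def active_accepts(reply_dest_nid, configured_nid):
    """Simplified check: a message is 'for me' only if the NIDs match."""
    return reply_dest_nid == configured_nid


# Addresses taken from the trace in the description.
nat_table = {"192.168.122.135": "10.128.13.120"}
configured_nid = "192.168.122.135@tcp"

seen_ip = nat_translate("192.168.122.135", nat_table)
reply_dest = passive_peer_nid(seen_ip)
accepted = active_accepts(reply_dest, configured_nid)

print(reply_dest, accepted)
```

The reply is addressed to the NATed NID, which the active node does not recognise as its own.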

What we need to do is keep a mapping between private and public IP addresses in socklnd, so that the correct IP address ends up being used.
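
One possible shape for such a mapping is sketched below. It is a hypothetical illustration, not the proposed implementation: the class and method names are invented, and it assumes the passive side has some way to learn the active node's private address (the ticket does not say how).

```python
# Hypothetical public-to-private address map; not socklnd code.

class NatAddressMap:
    """Maps the public (NATed) IP seen on a socket to the private IP."""

    def __init__(self):
        self._public_to_private = {}

    def learn(self, public_ip, private_ip):
        """Record that public_ip is the NATed form of private_ip.

        How the private IP is learned is an assumption left open here.
        """
        self._public_to_private[public_ip] = private_ip

    def peer_nid(self, socket_peer_ip, net="tcp"):
        """Build the peer NID, preferring the private IP when known."""
        private_ip = self._public_to_private.get(socket_peer_ip,
                                                 socket_peer_ip)
        return f"{private_ip}@{net}"


nat_map = NatAddressMap()
nat_map.learn("10.128.13.120", "192.168.122.135")

# NATed peer: the NID is built from the private address.
print(nat_map.peer_nid("10.128.13.120"))
# Peer with no mapping: the socket address is used unchanged.
print(nat_map.peer_nid("10.128.13.130"))
```

With the mapping in place, the passive side would address replies to the NID the active node was configured with, while peers that are not behind NAT are unaffected.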
