Details
-
Technical task
-
Resolution: Unresolved
-
Minor
Description
sihara found this issue.
When using NAT in a setup, it appears that the wrong NID is used:
peer 1: 192.168.122.135 (NAT: 10.128.13.120)
(lib-move.c:1790:lnet_handle_send()) TRACE: 192.168.122.135@tcp(192.168.122.135@tcp:<?>) -> 10.128.13.130@tcp(10.128.13.130@tcp:10.128.13.130@tcp) : GET try# 0
peer2: 10.128.13.130
(lib-move.c:4302:lnet_parse()) TRACE: 10.128.13.130@tcp(10.128.13.130@tcp) <- 10.128.13.120@tcp : GET - for me
(lib-move.c:1858:lnet_handle_send()) TRACE: 10.128.13.130@tcp(10.128.13.130@tcp:10.128.13.130@tcp) -> 10.128.13.120@tcp(10.128.13.120@tcp:10.128.13.120@tcp)
peer1:
:(lib-move.c:4236:lnet_parse()) TRACE: 10.128.13.120@tcp(192.168.122.135@tcp) <- 10.128.13.130@tcp : REPLY - routed
It appears that the NID of the node changes somewhere along the line.
LNet shouldn't care about NAT in this case and should work.
The problem is that socklnd uses the IP address taken from the socket on the passive side. When a connection is established, the passive side looks up the peer IP address from the socket, but that address is the NATed IP address. The local peer structure on the passive side is therefore created with a NID based on the NATed IP address of the active node. When a response is finally sent to the active node, the NID in the message contains the NATed IP address rather than the private IP address that LNet on the active node was configured with, and the message is dropped.
What we need to do is keep a mapping between private and public IP addresses in socklnd, so that the correct IP address ends up being used.
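To illustrate the idea, here is a minimal user-space sketch of such a mapping. It is not socklnd code: the table layout and the names `nat_map`, `nat_map_add`, `nat_map_resolve` and `ipv4` are hypothetical, and it assumes IPv4 addresses and a small fixed-size table. The passive side would translate the address read from the socket before building the peer NID, and fall back to the socket address when no mapping exists.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NAT_MAP_MAX 16 /* hypothetical limit, chosen for the sketch */

/* One public (NATed) to private address pair. */
struct nat_map_entry {
	uint32_t public_ip;
	uint32_t private_ip;
};

struct nat_map {
	struct nat_map_entry entries[NAT_MAP_MAX];
	size_t count;
};

/* Pack a dotted quad into a host-order 32-bit value. */
static uint32_t ipv4(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
	return ((uint32_t)a << 24) | ((uint32_t)b << 16) |
	       ((uint32_t)c << 8) | (uint32_t)d;
}

static void nat_map_init(struct nat_map *map)
{
	assert(map != NULL);
	map->count = 0;
}

/* Add or update a mapping. Returns 0 on success, -1 if the table is full. */
static int nat_map_add(struct nat_map *map, uint32_t public_ip,
		       uint32_t private_ip)
{
	size_t i;

	assert(map != NULL);
	for (i = 0; i < map->count; i++) {
		if (map->entries[i].public_ip == public_ip) {
			map->entries[i].private_ip = private_ip;
			return 0;
		}
	}
	if (map->count == NAT_MAP_MAX)
		return -1;
	map->entries[map->count].public_ip = public_ip;
	map->entries[map->count].private_ip = private_ip;
	map->count++;
	return 0;
}

/*
 * Translate the address seen on the socket into the address the peer
 * was configured with. Unmapped addresses are returned unchanged.
 */
static uint32_t nat_map_resolve(const struct nat_map *map, uint32_t socket_ip)
{
	size_t i;

	assert(map != NULL);
	for (i = 0; i < map->count; i++) {
		if (map->entries[i].public_ip == socket_ip)
			return map->entries[i].private_ip;
	}
	return socket_ip;
}
```

With the addresses from the trace above, a mapping from 10.128.13.120 to 192.168.122.135 would make the passive side resolve the socket address back to the private address of peer 1, while 10.128.13.130 would pass through unchanged.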