[LU-15192] socklnd: using typed_conns=0 disables communication Created: 03/Nov/21  Updated: 06/Nov/21

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Serguei Smirnov Assignee: Serguei Smirnov
Resolution: Unresolved Votes: 0
Labels: ksocklnd, lnet

Issue Links:
Duplicate
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Switching to "untyped" socklnd connections by setting the following module option

options ksocklnd typed_conns=0

appears to make socklnd unable to communicate. Even a self-ping fails:

lnetctl ping 192.168.122.123@tcp
manage:
    - ping:
          errno: -1
          descr: failed to ping 192.168.122.123@tcp: Input/output error
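
Before pinging, it may be worth confirming that the module actually picked up the option. The following is a minimal sketch, assuming ksocklnd exports typed_conns read-only under /sys/module/ksocklnd/parameters (the usual sysfs location for kernel module parameters). The helper name read_module_param is hypothetical and not part of any Lustre tooling.

```python
from pathlib import Path


def read_module_param(name, module="ksocklnd", root="/sys/module"):
    """Return the current value of a kernel module parameter as a string.

    Assumption: the parameter is exported via sysfs, i.e. the file
    <root>/<module>/parameters/<name> exists once the module is loaded.
    Returns None if the file is missing (module not loaded, or the
    parameter is not exported).
    """
    path = Path(root) / module / "parameters" / name
    try:
        return path.read_text().strip()
    except FileNotFoundError:
        return None


if __name__ == "__main__":
    value = read_module_param("typed_conns")
    if value is None:
        print("typed_conns not visible; is ksocklnd loaded?")
    else:
        print(f"typed_conns = {value}")
```

If the reported value is not 0, the options line was probably not applied when the module was loaded, and the ping failure would need a different explanation.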

A typical net debug trace:

00000400:00000200:0.0:1635961675.271877:0:9092:0:(lib-move.c:4834:LNetGet()) LNetGet -> 12345-192.168.122.123@tcp
00000400:00000200:0.0:1635961675.271885:0:9092:0:(lib-move.c:2450:lnet_handle_send_case_locked()) Source ANY to NMR:  192.168.122.123@tcp local destination
00000400:00000200:0.0:1635961675.271892:0:9092:0:(lib-move.c:1714:lnet_handle_send()) rspt_next_hop_nid = 192.168.122.123@tcp
00000400:00000200:0.0:1635961675.271899:0:9092:0:(lib-move.c:1728:lnet_handle_send()) TRACE: 192.168.122.123@tcp(192.168.122.123@tcp:<?>) -> 192.168.122.123@tcp(192.168.122.123@tcp:192.168.122.123@tcp) : GET try# 0
00000800:00000200:0.0:1635961675.271905:0:9092:0:(socklnd_cb.c:1003:ksocknal_send()) sending 0 bytes in 0 frags to 12345-192.168.122.123@tcp
00000800:00000200:0.0:1635961675.271912:0:9092:0:(socklnd.c:195:ksocknal_find_peer_locked()) got peer_ni [ffff9cacc153ae00] -> 12345-192.168.122.123@tcp (4)
00000800:00000200:0.1F:1635961675.271919:0:9092:0:(socklnd.c:195:ksocknal_find_peer_locked()) got peer_ni [ffff9cacc153ae00] -> 12345-192.168.122.123@tcp (4)
00000800:00000100:0.0:1635961675.271924:0:9092:0:(socklnd_cb.c:979:ksocknal_launch_packet()) No usable routes to 12345-192.168.122.123@tcp
00000400:00000200:0.0:1635961675.271926:0:9092:0:(lib-msg.c:816:lnet_is_health_check()) health check = 1, status = -5, hstatus = 7
00000400:00000200:0.0:1635961675.271932:0:9092:0:(lib-msg.c:630:lnet_health_check()) health check: 192.168.122.123@tcp->192.168.122.123@tcp: GET: REMOTE_ERROR
00000400:00000200:0.0:1635961675.271937:0:9092:0:(api-ni.c:4096:lnet_ping()) poll 1(5 -5)
00000400:00000200:0.0:1635961675.271940:0:9092:0:(lib-md.c:69:lnet_md_unlink()) Unlinking md ffff9cad1077c110
00000400:00000200:0.0:1635961675.271942:0:9092:0:(api-ni.c:4096:lnet_ping()) poll 1(6 0) unlinked
00000800:00000200:0.0:1635961678.781862:0:8854:0:(socklnd.c:195:ksocknal_find_peer_locked()) got peer_ni [ffff9cacc153ae00] -> 12345-192.168.122.123@tcp (4)
00000800:00000200:0.1:1635961678.781869:0:8854:0:(socklnd.c:195:ksocknal_find_peer_locked()) got peer_ni [ffff9cacc153ae00] -> 12345-192.168.122.123@tcp (4)
00000800:00000100:0.0:1635961678.781873:0:8854:0:(socklnd_cb.c:979:ksocknal_launch_packet()) No usable routes to 12345-192.168.122.123@tcp
00000800:00000200:0.0:1635961678.781878:0:8854:0:(socklnd.c:195:ksocknal_find_peer_locked()) got peer_ni [ffff9cad3a278300] -> 12345-192.168.122.137@tcp (4)
00000800:00000200:0.1:1635961678.781881:0:8854:0:(socklnd.c:195:ksocknal_find_peer_locked()) got peer_ni [ffff9cad3a278300] -> 12345-192.168.122.137@tcp (4)
00000800:00000100:0.0:1635961678.781884:0:8854:0:(socklnd_cb.c:979:ksocknal_launch_packet()) No usable routes to 12345-192.168.122.137@tcp
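
To see at a glance which peers are affected in a longer trace, the "No usable routes" lines can be extracted with a short script. This is only a sketch: the field layout in the regular expression (subsystem, mask, CPU, timestamp, stack, pid, external pid, then file:line:function) is inferred from the lines above rather than taken from a format specification, and the function name unroutable_peers is hypothetical.

```python
import re

# Assumed layout, inferred from the trace in this ticket:
#   subsys:mask:cpu:timestamp:stack:pid:extpid:(file:line:func()) message
LINE_RE = re.compile(
    r"^(?P<subsys>[0-9a-f]{8}):(?P<mask>[0-9a-f]{8}):(?P<cpu>[^:]+):"
    r"(?P<ts>\d+\.\d+):\d+:(?P<pid>\d+):\d+:"
    r"\((?P<file>[^:]+):(?P<line>\d+):(?P<func>\w+)\(\)\)\s*(?P<msg>.*)$"
)

PREFIX = "No usable routes to "


def unroutable_peers(lines):
    """Return the peers named in 'No usable routes' messages, in first-seen order."""
    peers = []
    for raw in lines:
        m = LINE_RE.match(raw.strip())
        if m is None or m.group("func") != "ksocknal_launch_packet":
            continue
        msg = m.group("msg")
        if msg.startswith(PREFIX):
            peer = msg[len(PREFIX):]
            if peer not in peers:
                peers.append(peer)
    return peers


if __name__ == "__main__":
    import sys

    # Usage: python this_script.py < debug_trace.txt
    for peer in unroutable_peers(sys.stdin):
        print(peer)
```

Applied to the trace above, this reports both 12345-192.168.122.123@tcp (the local NID) and 12345-192.168.122.137@tcp, which suggests the failure is not limited to self-ping.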


 Comments   
Comment by Serguei Smirnov [ 03/Nov/21 ]

I tried to find the commit that broke the untyped connection functionality. I went as far back as

9a2013af0668737dc564 ("LU-11893 ksocklnd: add secondary IP address handling")

and still see the same problem.

Comment by Serguei Smirnov [ 05/Nov/21 ]

It looks like commit 0a9c9e444635dcf35a74bfb2f46efb3040ca17a0 broke "typeless" connection functionality. It is an old commit from 2009, so I don't see any associated ticket. The commit description is:

  Socklnd protocol V3:

    . dedicated connection for emergency message (ZC-ACK)

    . keepalive ping

Fixing some of the code introduced by this commit seems to bring "typeless" connection mode back to life, but the question is whether the mode should simply be deprecated, given that no one has needed it for so long.

Comment by Andreas Dilger [ 06/Nov/21 ]

I guess the question is what motivated you to open this ticket in the first place. Was there a reason you were testing with "typed_conns=0"? Was it to reduce the number of sockets used for a very large number of TCP clients, or something else? If it has not been working since 2009, then there is definitely no reason it cannot be removed.
