[LU-15192] socklnd: using typed_conns=0 disables communication Created: 03/Nov/21 Updated: 06/Nov/21 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Serguei Smirnov | Assignee: | Serguei Smirnov |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | ksocklnd, lnet | ||
| Issue Links: |
|
||||
| Severity: | 3 | ||||
| Rank (Obsolete): | 9223372036854775807 | ||||
| Description |
|
Switching to "untyped" socklnd connections by using the following option
options ksocklnd typed_conns=0
appears to make socklnd unable to communicate. Self-pinging fails:
lnetctl ping 192.168.122.123@tcp
manage:
    - ping:
          errno: -1
          descr: failed to ping 192.168.122.123@tcp: Input/output error
A typical net debug trace is:
00000400:00000200:0.0:1635961675.271877:0:9092:0:(lib-move.c:4834:LNetGet()) LNetGet -> 12345-192.168.122.123@tcp
00000400:00000200:0.0:1635961675.271885:0:9092:0:(lib-move.c:2450:lnet_handle_send_case_locked()) Source ANY to NMR: 192.168.122.123@tcp local destination
00000400:00000200:0.0:1635961675.271892:0:9092:0:(lib-move.c:1714:lnet_handle_send()) rspt_next_hop_nid = 192.168.122.123@tcp
00000400:00000200:0.0:1635961675.271899:0:9092:0:(lib-move.c:1728:lnet_handle_send()) TRACE: 192.168.122.123@tcp(192.168.122.123@tcp:<?>) -> 192.168.122.123@tcp(192.168.122.123@tcp:192.168.122.123@tcp) : GET try# 0
00000800:00000200:0.0:1635961675.271905:0:9092:0:(socklnd_cb.c:1003:ksocknal_send()) sending 0 bytes in 0 frags to 12345-192.168.122.123@tcp
00000800:00000200:0.0:1635961675.271912:0:9092:0:(socklnd.c:195:ksocknal_find_peer_locked()) got peer_ni [ffff9cacc153ae00] -> 12345-192.168.122.123@tcp (4)
00000800:00000200:0.1F:1635961675.271919:0:9092:0:(socklnd.c:195:ksocknal_find_peer_locked()) got peer_ni [ffff9cacc153ae00] -> 12345-192.168.122.123@tcp (4)
00000800:00000100:0.0:1635961675.271924:0:9092:0:(socklnd_cb.c:979:ksocknal_launch_packet()) No usable routes to 12345-192.168.122.123@tcp
00000400:00000200:0.0:1635961675.271926:0:9092:0:(lib-msg.c:816:lnet_is_health_check()) health check = 1, status = -5, hstatus = 7
00000400:00000200:0.0:1635961675.271932:0:9092:0:(lib-msg.c:630:lnet_health_check()) health check: 192.168.122.123@tcp->192.168.122.123@tcp: GET: REMOTE_ERROR
00000400:00000200:0.0:1635961675.271937:0:9092:0:(api-ni.c:4096:lnet_ping()) poll 1(5 -5)
00000400:00000200:0.0:1635961675.271940:0:9092:0:(lib-md.c:69:lnet_md_unlink()) Unlinking md ffff9cad1077c110
00000400:00000200:0.0:1635961675.271942:0:9092:0:(api-ni.c:4096:lnet_ping()) poll 1(6 0) unlinked
00000800:00000200:0.0:1635961678.781862:0:8854:0:(socklnd.c:195:ksocknal_find_peer_locked()) got peer_ni [ffff9cacc153ae00] -> 12345-192.168.122.123@tcp (4)
00000800:00000200:0.1:1635961678.781869:0:8854:0:(socklnd.c:195:ksocknal_find_peer_locked()) got peer_ni [ffff9cacc153ae00] -> 12345-192.168.122.123@tcp (4)
00000800:00000100:0.0:1635961678.781873:0:8854:0:(socklnd_cb.c:979:ksocknal_launch_packet()) No usable routes to 12345-192.168.122.123@tcp
00000800:00000200:0.0:1635961678.781878:0:8854:0:(socklnd.c:195:ksocknal_find_peer_locked()) got peer_ni [ffff9cad3a278300] -> 12345-192.168.122.137@tcp (4)
00000800:00000200:0.1:1635961678.781881:0:8854:0:(socklnd.c:195:ksocknal_find_peer_locked()) got peer_ni [ffff9cad3a278300] -> 12345-192.168.122.137@tcp (4)
00000800:00000100:0.0:1635961678.781884:0:8854:0:(socklnd_cb.c:979:ksocknal_launch_packet()) No usable routes to 12345-192.168.122.137@tcp
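To make traces like the one above easier to scan, here is a minimal Python sketch that splits each debug line into fields and lists the peers reported in "No usable routes to" messages. The column layout and the field names (`subsys`, `mask`, `cpu`, `type`, `flag`, `stack`, `pid`, `extpid`) are assumptions inferred from the lines in this ticket, not taken from the Lustre sources, and `parse_trace_line` and `unroutable_peers` are hypothetical helper names.

```python
import re

# Column layout assumed from the trace in this ticket (field names are guesses):
# subsys:mask:cpu.type[flag]:sec.usec:stack:pid:extpid:(file:line:func()) message
TRACE_RE = re.compile(
    r"^(?P<subsys>[0-9a-f]{8}):(?P<mask>[0-9a-f]{8}):"
    r"(?P<cpu>\d+)\.(?P<type>\d+)(?P<flag>[A-Z]?):"
    r"(?P<time>\d+\.\d+):(?P<stack>\d+):(?P<pid>\d+):(?P<extpid>\d+):"
    r"\((?P<file>[^:]+):(?P<line>\d+):(?P<func>\w+)\(\)\) (?P<msg>.*)$"
)

NO_ROUTES_PREFIX = "No usable routes to "


def parse_trace_line(line):
    """Return a dict of fields for one trace line, or None if it does not match."""
    match = TRACE_RE.match(line.strip())
    return match.groupdict() if match else None


def unroutable_peers(lines):
    """Return peers named in 'No usable routes to' messages, in first-seen order."""
    peers = []
    for line in lines:
        record = parse_trace_line(line)
        if record and record["msg"].startswith(NO_ROUTES_PREFIX):
            peer = record["msg"][len(NO_ROUTES_PREFIX):]
            if peer not in peers:
                peers.append(peer)
    return peers


# Lines copied verbatim from the trace in this ticket.
SAMPLE = [
    "00000800:00000200:0.1F:1635961675.271919:0:9092:0:(socklnd.c:195:ksocknal_find_peer_locked()) got peer_ni [ffff9cacc153ae00] -> 12345-192.168.122.123@tcp (4)",
    "00000800:00000100:0.0:1635961675.271924:0:9092:0:(socklnd_cb.c:979:ksocknal_launch_packet()) No usable routes to 12345-192.168.122.123@tcp",
    "00000800:00000100:0.0:1635961678.781873:0:8854:0:(socklnd_cb.c:979:ksocknal_launch_packet()) No usable routes to 12345-192.168.122.123@tcp",
    "00000800:00000100:0.0:1635961678.781884:0:8854:0:(socklnd_cb.c:979:ksocknal_launch_packet()) No usable routes to 12345-192.168.122.137@tcp",
]

if __name__ == "__main__":
    for peer in unroutable_peers(SAMPLE):
        print(peer)
```

Run against a full debug log, this would show at a glance whether the failure is limited to the self-ping target or affects every peer, as the 192.168.122.137@tcp line in the trace suggests.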
|
| Comments |
| Comment by Serguei Smirnov [ 03/Nov/21 ] |
|
Trying to find the commit that broke the untyped connection functionality. Went as far back as 9a2013af0668737dc564, still seeing the same problem. |
| Comment by Serguei Smirnov [ 05/Nov/21 ] |
|
It looks like commit 0a9c9e444635dcf35a74bfb2f46efb3040ca17a0 broke "typeless" connection functionality. It is an old commit from 2009, so I don't see any associated ticket. The commit description is:
Socklnd protocol V3:
. dedicated connection for emergency message (ZC-ACK)
. keepalive ping
Fixing some of the code that this commit introduced does seem to bring "typeless" connection mode back to life, but the question is whether the mode should instead just be deprecated, given that no one has needed it for so long. |
| Comment by Andreas Dilger [ 06/Nov/21 ] |
|
I guess the question is what motivated you to open this ticket in the first place? Was there a reason you were testing with "typed_conns=0"? Is that to reduce the number of sockets used for a very large number of TCP clients, or something else? Definitely if it has not been working since 2009, then there shouldn't be any reason that it cannot be removed. |