socklnd: using typed_conns=0 disables communication

Details

    • Bug
    • Resolution: Unresolved
    • Minor

    Description

       Switching to "untyped" socklnd connections by using the following option

      options ksocklnd typed_conns=0

      appears to make socklnd unable to communicate. Self-pinging fails:

      lnetctl ping 192.168.122.123@tcp
      manage:
          - ping:
                errno: -1
                descr: failed to ping 192.168.122.123@tcp: Input/output error
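Before reproducing, it may help to confirm that the loaded module actually picked up the option. The following is a minimal sketch, not part of the original report. It assumes ksocklnd exports `typed_conns` read-only under `/sys/module/ksocklnd/parameters/`; that path and the helper name `read_typed_conns` are assumptions made for illustration.

```python
from pathlib import Path

# Assumed sysfs location; only present while ksocklnd is loaded
# and only if the module exports this parameter.
PARAM_PATH = Path("/sys/module/ksocklnd/parameters/typed_conns")


def read_typed_conns(path=PARAM_PATH):
    """Return the parameter as an int, or None if it cannot be read or parsed."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    value = read_typed_conns()
    if value is None:
        print("typed_conns is not readable (is ksocklnd loaded?)")
    else:
        print(f"typed_conns={value}")
```

If this reports 1 after the modprobe option was added, the module was probably loaded before the option took effect and needs to be reloaded.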

      A typical net debug trace is:

      00000400:00000200:0.0:1635961675.271877:0:9092:0:(lib-move.c:4834:LNetGet()) LNetGet -> 12345-192.168.122.123@tcp
      00000400:00000200:0.0:1635961675.271885:0:9092:0:(lib-move.c:2450:lnet_handle_send_case_locked()) Source ANY to NMR:  192.168.122.123@tcp local destination
      00000400:00000200:0.0:1635961675.271892:0:9092:0:(lib-move.c:1714:lnet_handle_send()) rspt_next_hop_nid = 192.168.122.123@tcp
      00000400:00000200:0.0:1635961675.271899:0:9092:0:(lib-move.c:1728:lnet_handle_send()) TRACE: 192.168.122.123@tcp(192.168.122.123@tcp:<?>) -> 192.168.122.123@tcp(192.168.122.123@tcp:192.168.122.123@tcp) : GET try# 0
      00000800:00000200:0.0:1635961675.271905:0:9092:0:(socklnd_cb.c:1003:ksocknal_send()) sending 0 bytes in 0 frags to 12345-192.168.122.123@tcp
      00000800:00000200:0.0:1635961675.271912:0:9092:0:(socklnd.c:195:ksocknal_find_peer_locked()) got peer_ni [ffff9cacc153ae00] -> 12345-192.168.122.123@tcp (4)
      00000800:00000200:0.1F:1635961675.271919:0:9092:0:(socklnd.c:195:ksocknal_find_peer_locked()) got peer_ni [ffff9cacc153ae00] -> 12345-192.168.122.123@tcp (4)
      00000800:00000100:0.0:1635961675.271924:0:9092:0:(socklnd_cb.c:979:ksocknal_launch_packet()) No usable routes to 12345-192.168.122.123@tcp
      00000400:00000200:0.0:1635961675.271926:0:9092:0:(lib-msg.c:816:lnet_is_health_check()) health check = 1, status = -5, hstatus = 7
      00000400:00000200:0.0:1635961675.271932:0:9092:0:(lib-msg.c:630:lnet_health_check()) health check: 192.168.122.123@tcp->192.168.122.123@tcp: GET: REMOTE_ERROR
      00000400:00000200:0.0:1635961675.271937:0:9092:0:(api-ni.c:4096:lnet_ping()) poll 1(5 -5)
      00000400:00000200:0.0:1635961675.271940:0:9092:0:(lib-md.c:69:lnet_md_unlink()) Unlinking md ffff9cad1077c110
      00000400:00000200:0.0:1635961675.271942:0:9092:0:(api-ni.c:4096:lnet_ping()) poll 1(6 0) unlinked
      00000800:00000200:0.0:1635961678.781862:0:8854:0:(socklnd.c:195:ksocknal_find_peer_locked()) got peer_ni [ffff9cacc153ae00] -> 12345-192.168.122.123@tcp (4)
      00000800:00000200:0.1:1635961678.781869:0:8854:0:(socklnd.c:195:ksocknal_find_peer_locked()) got peer_ni [ffff9cacc153ae00] -> 12345-192.168.122.123@tcp (4)
      00000800:00000100:0.0:1635961678.781873:0:8854:0:(socklnd_cb.c:979:ksocknal_launch_packet()) No usable routes to 12345-192.168.122.123@tcp
      00000800:00000200:0.0:1635961678.781878:0:8854:0:(socklnd.c:195:ksocknal_find_peer_locked()) got peer_ni [ffff9cad3a278300] -> 12345-192.168.122.137@tcp (4)
      00000800:00000200:0.1:1635961678.781881:0:8854:0:(socklnd.c:195:ksocknal_find_peer_locked()) got peer_ni [ffff9cad3a278300] -> 12345-192.168.122.137@tcp (4)
      00000800:00000100:0.0:1635961678.781884:0:8854:0:(socklnd_cb.c:979:ksocknal_launch_packet()) No usable routes to 12345-192.168.122.137@tcp
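The relevant lines in the trace above are the `No usable routes` messages from `ksocknal_launch_packet()`. The sketch below is one way to pull them out of a longer debug dump and count them per peer. The field names in the regular expression (`subsys`, `mask`, `cpu`, and so on) are my own reading of the colon-separated prefix and should be treated as assumptions, as should the helper name `unreachable_peers`.

```python
import re
from collections import Counter

# Prefix layout as it appears in the trace:
# subsys:mask:cpu:timestamp:stack:pid:extpid:(file:line:func()) message
LINE_RE = re.compile(
    r"^(?P<subsys>[0-9a-f]{8}):(?P<mask>[0-9a-f]{8}):(?P<cpu>[^:]+):"
    r"(?P<ts>\d+\.\d+):(?P<stack>\d+):(?P<pid>\d+):(?P<extpid>\d+):"
    r"\((?P<file>[^:]+):(?P<line>\d+):(?P<func>\w+)\(\)\)\s(?P<msg>.*)$"
)
NO_ROUTE_RE = re.compile(r"No usable routes to (?P<peer>\S+)")


def unreachable_peers(lines):
    """Count 'No usable routes' messages per peer in an iterable of log lines."""
    counts = Counter()
    for raw in lines:
        m = LINE_RE.match(raw.strip())
        if not m or m["func"] != "ksocknal_launch_packet":
            continue  # not a parsable line, or a different call site
        hit = NO_ROUTE_RE.search(m["msg"])
        if hit:
            counts[hit["peer"]] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python3 this_script.py < debug_dump.txt
    for peer, n in sorted(unreachable_peers(sys.stdin).items()):
        print(f"{peer}: {n}")
```

Run against the trace in this description, this should show both the local NID and 192.168.122.137@tcp as unreachable, which suggests the failure is not specific to self-ping.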

        Activity

          adilger Andreas Dilger added a comment:

          I guess the question is what motivated you to open this ticket in the first place? Was there a reason you were testing with "typed_conns=0"? Is that to reduce the number of sockets used for a very large number of TCP clients, or something else? Definitely if it has not been working since 2009, then there shouldn't be any reason that it cannot be removed.

          ssmirnov Serguei Smirnov added a comment:

          It looks like commit 0a9c9e444635dcf35a74bfb2f46efb3040ca17a0 broke "typeless" connection functionality. It is an old commit from 2009, so I don't see any associated ticket. The commit description is:

            Socklnd protocol V3:

              . dedicated connection for emergency message (ZC-ACK)

              . keepalive ping

          Fixing some of the code that this commit introduced does seem to bring "typeless" connection mode back to life, but the question is whether it should instead just be deprecated, given that no one has needed it for so long.

          ssmirnov Serguei Smirnov added a comment:

          Trying to find the commit which broke the untyped connection functionality. Went as far back as

          9a2013af0668737dc564 ("LU-11893 ksocklnd: add secondary IP address handling")

          and am still seeing the same problem.

          People

            ssmirnov Serguei Smirnov