Lustre / LU-16644

LNetError kiblnd_connect_peer() Can't resolve addr for 192.168.112.4@o2ib17: -13


Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.9
    • Labels: None
    • Environment: lustre-2.12.9_3.llnl-3.t4.x86_64
      toss-4.5-4
      4.18.0-425.13.1.1toss.t4.x86_64
    • Severity: 3

    Description

      Intermittent console messages like this:

      [Wed Mar 15 16:54:55 2023] LNetError: 3119923:0:(o2iblnd_cb.c:1473:kiblnd_connect_peer()) Can't resolve addr for 192.168.112.4@o2ib17: -13
      [Wed Mar 15 17:15:15 2023] LNetError: 3124466:0:(o2iblnd_cb.c:1473:kiblnd_connect_peer()) Can't resolve addr for 192.168.112.4@o2ib17: -13
      

      along with other messages indicating LNet routes are down:

      [Wed Mar 15 17:15:15 2023] LNetError: 31377:0:(lib-move.c:2005:lnet_handle_find_routed_path()) no route to 172.21.3.62@o2ib700 from <?>
      

      and possibly other LustreError messages.

      The error emitted by kiblnd_connect_peer() is coming from calls to rdma_resolve_addr(), like so:

      kiblnd_connect_peer() -> kiblnd_resolve_addr() -> rdma_resolve_addr()
      

      where the actual call looks like

              /* look for a free privileged port */
              for (port = PROT_SOCK-1; port > 0; port--) {
                      srcaddr->sin_port = htons(port);
                      rc = rdma_resolve_addr(cmid,
                                             (struct sockaddr *)srcaddr,
                                             (struct sockaddr *)dstaddr,
                                             timeout_ms);
                      if (rc == 0)
                              return 0;
                      else if (rc != -EADDRINUSE && rc != -EADDRNOTAVAIL)
                              return rc;      /* e.g. -EACCES */
              }

      Lustre 2.12.9 does not have either of these patches:

      • 30b356a28b LU-14296 lnet: use an unbound cred in kiblnd_resolve_addr()
      • 1e4bd16acf LU-14006 o2ib: raise bind cap before resolving address

      I can pull these patches onto our stack and push them to b2_12 for testing and review, but I don't understand two things:

      (1) We see this routinely on one fairly small cluster (<100 nodes), and almost never on any other cluster (collectively, >5000 nodes). Do you know why this would be?

      (2) I added some debugging and was able to determine that when I see this, some threads have CAP_NET_BIND_SERVICE, so their rdma_resolve_addr() calls succeed, while other threads do not have it. Is there a reason the LNet threads involved in connection setup would not all have the same capabilities?

      Those questions make me wonder if the problem is really elsewhere, e.g. some code that drops capabilities and then fails to restore them after it's finished with the sensitive task.

      For my records, my local ticket is TOSS5940

            People

              Assignee: ssmirnov Serguei Smirnov
              Reporter: ofaaland Olaf Faaland
              Votes: 0
              Watchers: 4
