LU-16644: LNetError kiblnd_connect_peer() Can't resolve addr for 192.168.112.4@o2ib17: -13

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major
    • Affects Version/s: Lustre 2.12.9
    • Environment: lustre-2.12.9_3.llnl-3.t4.x86_64
      toss-4.5-4
      4.18.0-425.13.1.1toss.t4.x86_64
    • Severity: 3

    Description

      Intermittent console messages like this:

      [Wed Mar 15 16:54:55 2023] LNetError: 3119923:0:(o2iblnd_cb.c:1473:kiblnd_connect_peer()) Can't resolve addr for 192.168.112.4@o2ib17: -13
      [Wed Mar 15 17:15:15 2023] LNetError: 3124466:0:(o2iblnd_cb.c:1473:kiblnd_connect_peer()) Can't resolve addr for 192.168.112.4@o2ib17: -13
      

      along with other messages indicating LNet routes are down:

      [Wed Mar 15 17:15:15 2023] LNetError: 31377:0:(lib-move.c:2005:lnet_handle_find_routed_path()) no route to 172.21.3.62@o2ib700 from <?>
      

      and possibly other LustreError messages.

      The error emitted by kiblnd_connect_peer() (-13 is -EACCES) comes from calls to rdma_resolve_addr(), like so:

      kiblnd_connect_peer() -> kiblnd_resolve_addr() -> rdma_resolve_addr()
      

      where the actual call looks like

        /* look for a free privileged port */
        for (port = PROT_SOCK-1; port > 0; port--) {
                srcaddr->sin_port = htons(port);
                rc = rdma_resolve_addr(cmid,
                                       (struct sockaddr *)srcaddr,
                                       (struct sockaddr *)dstaddr,
                                       timeout_ms);
                if (rc == 0)
                        return 0;
                else if (rc != -EADDRINUSE && rc != -EADDRNOTAVAIL)
                        return rc;
                CDEBUG(D_NET, "bind to port %hu failed: %d\n", port, rc);
        }
      

      Lustre 2.12.9 does not have either of these patches:

      • 30b356a28b LU-14296 lnet: use an unbound cred in kiblnd_resolve_addr()
      • 1e4bd16acf LU-14006 o2ib: raise bind cap before resolving address
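
      For context, LU-14006 raises CAP_NET_BIND_SERVICE around the resolve, while LU-14296 switches to a credential not tied to the calling task. A minimal sketch of the raise-the-capability pattern (illustrative only, not the literal patch; see the changes above for the real implementation):

        #include <linux/cred.h>
        #include <linux/capability.h>
        #include <rdma/rdma_cm.h>

        /* Sketch only: run rdma_resolve_addr() under a copy of the current
         * creds with CAP_NET_BIND_SERVICE raised, so the privileged-port bind
         * works no matter which task (LNet thread or application thread in a
         * syscall) drives the connection attempt. */
        static int resolve_addr_with_bind_cap(struct rdma_cm_id *cmid,
                                              struct sockaddr *srcaddr,
                                              struct sockaddr *dstaddr,
                                              int timeout_ms)
        {
                const struct cred *saved = NULL;
                struct cred *cred = prepare_creds(); /* copy of current creds */
                int rc;

                if (cred) {
                        cap_raise(cred->cap_effective, CAP_NET_BIND_SERVICE);
                        saved = override_creds(cred); /* install raised creds */
                }

                rc = rdma_resolve_addr(cmid, srcaddr, dstaddr, timeout_ms);

                if (cred) {
                        revert_creds(saved); /* restore the original creds */
                        put_cred(cred);
                }
                return rc;
        }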

      I can pull these patches onto our stack and push them to b2_12 for testing and review, but I don't understand two things:

      (1) We see this routinely on one fairly small cluster (<100 nodes), and almost never on any other cluster (collectively, >5000 nodes). Do you know why this would be?

      (2) I added some debugging and was able to determine that when I see this, some threads have CAP_NET_BIND_SERVICE, so the rdma_resolve_addr() calls succeed, and other threads do not. Is there a reason that the lnet threads involved in connection setup would not all have the same capabilities?
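
      (For reference, this kind of check is inherently per-task: capable() consults only the calling task's effective capability set. A sketch of such a debug check, not my exact patch:)

        #include <linux/capability.h>
        #include <linux/printk.h>
        #include <linux/sched.h>

        /* Sketch of a debug check: capabilities are per-task state, so this
         * reports different answers depending on whether an LNet kernel thread
         * or an application thread is executing the connect path. */
        static void dbg_report_bind_cap(void)
        {
                pr_info("%s[%d] %s CAP_NET_BIND_SERVICE\n",
                        current->comm, task_pid_nr(current),
                        capable(CAP_NET_BIND_SERVICE) ? "has" : "lacks");
        }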

      Those questions make me wonder if the problem is really elsewhere, e.g. some code that drops capabilities and then fails to restore them after it's finished with the sensitive task.

      For my records, my local ticket is TOSS5940


          Activity

            pjones Peter Jones added a comment -

            I think yes - we can close this ticket as a duplicate

            ofaaland Olaf Faaland added a comment - - edited

            Serguei,

            I don't need anything else for this issue.  Since 2.15 is the current stable branch, should we just resolve this ticket now?

            thanks

            ofaaland Olaf Faaland added a comment - - edited

            I'd already pulled those two patches into our stack.  In case it's useful to others, I also pushed those two patches to gerrit against b2_12:

            https://review.whamcloud.com/c/fs/lustre-release/+/50642 LU-14006 o2ib: raise bind cap before resolving address
            https://review.whamcloud.com/c/fs/lustre-release/+/50643 LU-14296 lnet: use an unbound cred in kiblnd_resolve_addr()
            ofaaland Olaf Faaland added a comment -

            > It is possible for application threads to trigger connection attempts

            Thanks, Chris

            hornc Chris Horn added a comment - - edited

            For example, this is ftrace data that was gathered while we were investigating this issue. The application is the mdtest benchmark program. An open() syscall results in the failing call to rdma_resolve_addr():

                     mdtest-677   [018] .... 614101.377767: <stack trace>
             => rdma_resolve_addr+0x1/0x7c0 [rdma_cm]
             => kiblnd_connect_peer+0x204/0x660 [ko2iblnd]
             => kiblnd_launch_tx+0x96d/0xbe0 [ko2iblnd]
             => kiblnd_send+0x1f6/0x9b0 [ko2iblnd]
             => lnet_ni_send+0x42/0xd0 [lnet]
             => lnet_post_send_locked+0x115/0x620 [lnet]
             => lnet_return_tx_credits_locked+0x1f2/0x4e0 [lnet]
             => lnet_msg_decommit+0x102/0x6c0 [lnet]
             => lnet_finalize+0x407/0x900 [lnet]
             => kiblnd_tx_done+0xf9/0x3a0 [ko2iblnd]
             => kiblnd_txlist_done+0x5f/0x80 [ko2iblnd]
             => kiblnd_peer_connect_failed+0x184/0x320 [ko2iblnd]
             => kiblnd_connect_peer+0x3e1/0x660 [ko2iblnd]
             => kiblnd_launch_tx+0x96d/0xbe0 [ko2iblnd]
             => kiblnd_send+0x1f6/0x9b0 [ko2iblnd]
             => lnet_ni_send+0x42/0xd0 [lnet]
             => lnet_send+0x97/0x1d0 [lnet]
             => LNetPut+0x2c0/0x920 [lnet]
             => ptl_send_buf+0x206/0x560 [ptlrpc]
             => ptl_send_rpc+0x93e/0xea0 [ptlrpc]
             => ptlrpc_send_new_req+0x585/0x9d0 [ptlrpc]
             => ptlrpc_set_wait+0x28e/0x730 [ptlrpc]
             => ptlrpc_queue_wait+0x82/0x230 [ptlrpc]
             => ldlm_cli_enqueue+0x4c4/0xa50 [ptlrpc]
             => mdc_enqueue_base+0x31b/0x1af0 [mdc]
             => mdc_intent_lock+0x2bf/0x4e0 [mdc]
             => lmv_intent_open+0x248/0xba0 [lmv]
             => lmv_intent_lock+0x1c5/0x980 [lmv]
             => ll_lookup_it+0x395/0x1bb0 [lustre]
             => ll_atomic_open+0x133/0x1020 [lustre]
             => path_openat+0xd99/0x1520
             => do_filp_open+0x9b/0x110
             => do_sys_open+0x1bd/0x260
             => do_syscall_64+0x5b/0x1e0
            

            The application thread does not have the CAP_NET_BIND_SERVICE capability, so it is unable to bind to the default privileged port. rdma_bind_addr() fails with -EACCES (-13).
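
            The check behind that -EACCES sits in the RDMA CM's port allocation (cma_get_port() in drivers/infiniband/core/cma.c). Restated approximately in a standalone helper (not a verbatim quote; the exact form varies by kernel version):

             #include <linux/capability.h>
             #include <linux/errno.h>
             #include <net/sock.h> /* PROT_SOCK */

             /* Approximate restatement: binding a port below PROT_SOCK (1024)
              * requires CAP_NET_BIND_SERVICE, otherwise the bind fails with
              * -EACCES, which is the -13 seen in the console messages. */
             static int check_privileged_bind(unsigned short snum)
             {
                     if (snum < PROT_SOCK && !capable(CAP_NET_BIND_SERVICE))
                             return -EACCES;
                     return 0;
             }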

            hornc Chris Horn added a comment -

            > (2) I added some debugging and was able to determine that when I see this, some threads have CAP_NET_BIND_SERVICE, so the rdma_resolve_addr() calls succeed, and other threads do not. Is there a reason that the lnet threads involved in connection setup would not all have the same capabilities?

            It is possible for application threads to trigger connection attempts and they will then run into the capabilities issue. It is easy to reproduce by running mdtest and then inducing network errors with net drop rules.
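
            For anyone trying to reproduce this: LNet's fault-injection rules can induce the drops, along these lines (placeholder NIDs and rate; the exact syntax is an assumption, check lctl(8) on your version):

             lctl net_drop_add -s *@o2ib17 -d *@o2ib17 -r 10
             (run mdtest, watch the console for the -13 resolve errors)
             lctl net_drop_del -a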


            ssmirnov Serguei Smirnov added a comment -

            It is good to rule out opensm issues. If there's no evidence of port exhaustion, I think long recovery queues are unlikely to be the problem. You may still experiment with the "use_privileged_port=0" setting, though, as this test should be able to confirm whether porting the capability fixes is going to help.

            Thanks,

            Serguei.
            ofaaland Olaf Faaland added a comment - - edited

            Hi Serguei,

            We checked opensm logs and also tested with other tools such as ib_read_bw, ping, etc. We found no evidence of problems with the IB fabric.

            We never see the CERROR "Failed to bind to a free privileged port" in the console logs, so I conclude we have never exhausted the local ports. Also, in cases where my debug patch reported more information about the failure, the local port number that was tried was 1023. That is PROT_SOCK-1, the first port the loop tries, so the bind is being rejected outright rather than falling through a range of in-use ports. The output is like this:

            416421:2023-03-16 19:39:16.142718 00000800:00000200:3.0::0:416421:0:(o2iblnd_cb.c:1420:kiblnd_resolve_addr()) bind to port 1023 failed: -13
            416421:2023-03-16 19:40:25.774547 00000800:00000200:3.0::0:416421:0:(o2iblnd_cb.c:1420:kiblnd_resolve_addr()) bind to port 1023 failed: -13

            Since there's no evidence of port exhaustion, should I still look at the recovery debug output?

            thanks,
            Olaf


            ssmirnov Serguei Smirnov added a comment -

            Hi Olaf,

            At this point I don't have definitive answers to the questions you posted, but here are some ideas for clarification.

            I guess before pulling the patches you mentioned, you could experiment with "options ko2iblnd use_privileged_port=0". 

            The reason for the "-13" error may be exhaustion of the local privileged ports that are needed for binding when resolving the IP address.

            We could check if recovery activities may be interfering by taking a look at the recovery queues:

            lnetctl debug recovery --peer
            lnetctl debug recovery --local

            Long recovery queues may contribute to port exhaustion.

            We can also take a look at opensm logs to check if there are any indications of the problem on that level. I wonder if there's anything going on there which could explain "no route to" errors.

            Thanks,

            Serguei.

            pjones Peter Jones added a comment -

            Serguei

            Could you please advise?

            Thanks

            Peter


            People

              Assignee: ssmirnov Serguei Smirnov
              Reporter: ofaaland Olaf Faaland
              Votes: 0
              Watchers: 4
