[LU-16644] LNetError kiblnd_connect_peer() Can't resolve addr for 192.168.112.4@o2ib17: -13 Created: 17/Mar/23  Updated: 21/Apr/23  Resolved: 21/Apr/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.9
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Olaf Faaland Assignee: Serguei Smirnov
Resolution: Duplicate Votes: 0
Labels: None
Environment:

lustre-2.12.9_3.llnl-3.t4.x86_64
toss-4.5-4
4.18.0-425.13.1.1toss.t4.x86_64


Issue Links:
Related
is related to LU-14006 raise CAP_NET_BIND_SERVICE before cal... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Intermittent console messages like this:

[Wed Mar 15 16:54:55 2023] LNetError: 3119923:0:(o2iblnd_cb.c:1473:kiblnd_connect_peer()) Can't resolve addr for 192.168.112.4@o2ib17: -13
[Wed Mar 15 17:15:15 2023] LNetError: 3124466:0:(o2iblnd_cb.c:1473:kiblnd_connect_peer()) Can't resolve addr for 192.168.112.4@o2ib17: -13

along with other messages indicating LNet routes are down:

[Wed Mar 15 17:15:15 2023] LNetError: 31377:0:(lib-move.c:2005:lnet_handle_find_routed_path()) no route to 172.21.3.62@o2ib700 from <?>

and possibly other LustreError messages.

The error emitted by kiblnd_connect_peer() comes from calls to rdma_resolve_addr(), via this chain:

kiblnd_connect_peer() -> kiblnd_resolve_addr() -> rdma_resolve_addr()

where the actual call looks like:

        /* look for a free privileged port */
        for (port = PROT_SOCK-1; port > 0; port--) {
                srcaddr->sin_port = htons(port);
                rc = rdma_resolve_addr(cmid,
                                       (struct sockaddr *)srcaddr,
                                       (struct sockaddr *)dstaddr,
                                       timeout_ms);

Lustre 2.12.9 does not have either of these patches:

  • 30b356a28b LU-14296 lnet: use an unbound cred in kiblnd_resolve_addr()
  • 1e4bd16acf LU-14006 o2ib: raise bind cap before resolving address
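
For reference, both patches address the problem the same broad way: they adjust the effective credentials around the rdma_resolve_addr() call so the privileged-port bind no longer depends on which thread happens to drive the connection attempt. Below is a minimal sketch of that idea, assuming the standard kernel cred APIs (prepare_kernel_cred, cap_raise, override_creds, revert_creds) and reusing the existing kiblnd_resolve_addr() helper quoted above; the wrapper name is illustrative and this is not the exact upstream patch code:

#include <linux/cred.h>
#include <linux/capability.h>
#include <rdma/rdma_cm.h>

/* Hypothetical wrapper: raise CAP_NET_BIND_SERVICE for the duration of the
 * privileged-port bind loop, then drop back to the caller's credentials. */
static int kiblnd_resolve_addr_with_cap(struct rdma_cm_id *cmid,
                                        struct sockaddr_in *srcaddr,
                                        struct sockaddr_in *dstaddr,
                                        int timeout_ms)
{
        const struct cred *old_creds;
        struct cred *new_creds;
        int rc;

        /* Build a kernel credential and raise the bind capability on it. */
        new_creds = prepare_kernel_cred(NULL);
        if (!new_creds)
                return -ENOMEM;
        cap_raise(new_creds->cap_effective, CAP_NET_BIND_SERVICE);

        /* Run the existing privileged-port loop with the raised credential. */
        old_creds = override_creds(new_creds);
        rc = kiblnd_resolve_addr(cmid, srcaddr, dstaddr, timeout_ms);
        revert_creds(old_creds);
        put_cred(new_creds);

        return rc;
}

With something along these lines in place, even an application thread that reaches kiblnd_connect_peer() via the LNet send path can bind a port below PROT_SOCK (1024), which is exactly the case that currently fails with -13 (-EACCES).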

I can pull these patches onto our stack and push them to b2_12 for testing and review, but I don't understand two things:

(1) We see this routinely on one fairly small cluster (<100 nodes), and almost never on any other cluster (collectively, >5000 nodes). Do you know why this would be?

(2) I added some debugging and was able to determine that when I see this, some threads have CAP_NET_BIND_SERVICE, so their rdma_resolve_addr() calls succeed, while other threads do not have it. Is there a reason why the LNet threads involved in connection setup would not all have the same capabilities?

Those questions make me wonder if the problem is really elsewhere, e.g. some code that drops capabilities and then fails to restore them after it's finished with the sensitive task.

For my records, my local ticket is TOSS5940



 Comments   
Comment by Olaf Faaland [ 17/Mar/23 ]

Related to LU-14006

Comment by Peter Jones [ 17/Mar/23 ]

Serguei

Could you please advise?

Thanks

Peter

Comment by Serguei Smirnov [ 20/Mar/23 ]

Hi Olaf,

At this point I don't have definitive answers to the questions you posted, but here are some ideas for clarification.

I guess before pulling the patches you mentioned, you could experiment with "options ko2iblnd use_privileged_port=0". 
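If it helps, the usual way to try that is a modprobe option file plus a reload of the module (the file name below is just a typical choice; the parameter itself is the existing ko2iblnd use_privileged_port module parameter):

# /etc/modprobe.d/ko2iblnd.conf
options ko2iblnd use_privileged_port=0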

The "-13" error may be caused by exhaustion of the local ports available to bind to when resolving the IP address.

We could check whether recovery activities are interfering by taking a look at the recovery queues:

lnetctl debug recovery --peer
lnetctl debug recovery --local

Long recovery queues may contribute to port exhaustion.

We can also take a look at opensm logs to check if there are any indications of the problem on that level. I wonder if there's anything going on there which could explain "no route to" errors.

Thanks,

Serguei.

Comment by Olaf Faaland [ 21/Mar/23 ]

Hi Serguei,

We checked opensm logs and also tested with other tools such as ib_read_bw, ping, etc. We found no evidence of problems with the IB fabric.

We never see the CERROR "Failed to bind to a free privileged port" in the console logs, so I conclude we have never exhausted the local ports. Also, in cases where my debug patch reported more information about the failure, the local port number that was tried was 1023. The output is like this:

416421:2023-03-16 19:39:16.142718 00000800:00000200:3.0::0:416421:0:(o2iblnd_cb.c:1420:kiblnd_resolve_addr()) bind to port 1023 failed: -13
416421:2023-03-16 19:40:25.774547 00000800:00000200:3.0::0:416421:0:(o2iblnd_cb.c:1420:kiblnd_resolve_addr()) bind to port 1023 failed: -13

Since there's no evidence of port exhaustion, should I still look at the recovery debug output?

thanks,
Olaf

Comment by Serguei Smirnov [ 21/Mar/23 ]

It is good to rule out opensm issues. If there's no evidence of port exhaustion, I think long recovery queues are unlikely to be the problem. You may still experiment with the "use_privileged_port=0" setting, though, as this test should confirm whether porting the capability fixes is going to help.

Thanks,

Serguei.

Comment by Chris Horn [ 14/Apr/23 ]

> (2) I added some debugging and was able to determine that when I see this, some threads have CAP_NET_BIND_SERVICE and so the rdma_resolve_addr() calls succeed, and other threads do not have it. Is there a reason that all the lnet threads involved in connection setup would not all have the same capabilities?

It is possible for application threads to trigger connection attempts and they will then run into the capabilities issue. It is easy to reproduce by running mdtest and then inducing network errors with net drop rules.
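
For anyone trying to reproduce this, one way to induce the network errors is LNet's fault-injection drop rules, along these lines (the NIDs are placeholders for the test cluster, and exact option spellings may differ slightly between Lustre versions):

# drop 1 out of every 10 messages between two NIDs while mdtest is running
lctl net_drop_add -s 192.168.112.10@o2ib17 -d 192.168.112.4@o2ib17 -r 10

# remove all drop rules afterwards
lctl net_drop_del -a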

Comment by Chris Horn [ 14/Apr/23 ]

For example, this is ftrace data that was gathered while we were investigating this issue. The application is the mdtest benchmark program. An open() syscall results in the failing call to rdma_resolve_addr():

         mdtest-677   [018] .... 614101.377767: <stack trace>
 => rdma_resolve_addr+0x1/0x7c0 [rdma_cm]
 => kiblnd_connect_peer+0x204/0x660 [ko2iblnd]
 => kiblnd_launch_tx+0x96d/0xbe0 [ko2iblnd]
 => kiblnd_send+0x1f6/0x9b0 [ko2iblnd]
 => lnet_ni_send+0x42/0xd0 [lnet]
 => lnet_post_send_locked+0x115/0x620 [lnet]
 => lnet_return_tx_credits_locked+0x1f2/0x4e0 [lnet]
 => lnet_msg_decommit+0x102/0x6c0 [lnet]
 => lnet_finalize+0x407/0x900 [lnet]
 => kiblnd_tx_done+0xf9/0x3a0 [ko2iblnd]
 => kiblnd_txlist_done+0x5f/0x80 [ko2iblnd]
 => kiblnd_peer_connect_failed+0x184/0x320 [ko2iblnd]
 => kiblnd_connect_peer+0x3e1/0x660 [ko2iblnd]
 => kiblnd_launch_tx+0x96d/0xbe0 [ko2iblnd]
 => kiblnd_send+0x1f6/0x9b0 [ko2iblnd]
 => lnet_ni_send+0x42/0xd0 [lnet]
 => lnet_send+0x97/0x1d0 [lnet]
 => LNetPut+0x2c0/0x920 [lnet]
 => ptl_send_buf+0x206/0x560 [ptlrpc]
 => ptl_send_rpc+0x93e/0xea0 [ptlrpc]
 => ptlrpc_send_new_req+0x585/0x9d0 [ptlrpc]
 => ptlrpc_set_wait+0x28e/0x730 [ptlrpc]
 => ptlrpc_queue_wait+0x82/0x230 [ptlrpc]
 => ldlm_cli_enqueue+0x4c4/0xa50 [ptlrpc]
 => mdc_enqueue_base+0x31b/0x1af0 [mdc]
 => mdc_intent_lock+0x2bf/0x4e0 [mdc]
 => lmv_intent_open+0x248/0xba0 [lmv]
 => lmv_intent_lock+0x1c5/0x980 [lmv]
 => ll_lookup_it+0x395/0x1bb0 [lustre]
 => ll_atomic_open+0x133/0x1020 [lustre]
 => path_openat+0xd99/0x1520
 => do_filp_open+0x9b/0x110
 => do_sys_open+0x1bd/0x260
 => do_syscall_64+0x5b/0x1e0

The application thread does not have the CAP_NET_BIND_SERVICE capability, so it is unable to bind to the privileged source port. rdma_bind_addr() fails with -EACCES.

Comment by Olaf Faaland [ 14/Apr/23 ]

> It is possible for application threads to trigger connection attempts

Thanks, Chris

Comment by Olaf Faaland [ 14/Apr/23 ]

I'd already pulled those two patches into our stack.  In case it's useful to others, I also pushed those two patches to gerrit against b2_12:

 

Comment by Olaf Faaland [ 14/Apr/23 ]

Serguei,

I don't need anything else for this issue.  Since 2.15 is the current stable branch, should we just resolve this ticket now?

thanks

Comment by Peter Jones [ 21/Apr/23 ]

I think yes - we can close this ticket as a duplicate
