Lustre / LU-16644

LNetError kiblnd_connect_peer() Can't resolve addr for 192.168.112.4@o2ib17: -13


Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.9
    • Labels: None
    • Environment: lustre-2.12.9_3.llnl-3.t4.x86_64
      toss-4.5-4
      4.18.0-425.13.1.1toss.t4.x86_64
    • Severity: 3

    Description

      Intermittent console messages like this:

      [Wed Mar 15 16:54:55 2023] LNetError: 3119923:0:(o2iblnd_cb.c:1473:kiblnd_connect_peer()) Can't resolve addr for 192.168.112.4@o2ib17: -13
      [Wed Mar 15 17:15:15 2023] LNetError: 3124466:0:(o2iblnd_cb.c:1473:kiblnd_connect_peer()) Can't resolve addr for 192.168.112.4@o2ib17: -13
      

      along with other messages indicating LNet routes are down:

      [Wed Mar 15 17:15:15 2023] LNetError: 31377:0:(lib-move.c:2005:lnet_handle_find_routed_path()) no route to 172.21.3.62@o2ib700 from <?>
      

      and possibly other LustreError messages.

      The error emitted by kiblnd_connect_peer() is coming from calls to rdma_resolve_addr(), like so:

      kiblnd_connect_peer() -> kiblnd_resolve_addr() -> rdma_resolve_addr()
      

      where the actual call looks like

              /* look for a free privileged port */
              for (port = PROT_SOCK-1; port > 0; port--) {
                      srcaddr->sin_port = htons(port);
                      rc = rdma_resolve_addr(cmid,
                                             (struct sockaddr *)srcaddr,
                                             (struct sockaddr *)dstaddr,
                                             timeout_ms);
                      if (rc == 0)
                              return 0;
                      else if (rc != -EADDRINUSE && rc != -EADDRNOTAVAIL)
                              return rc;      /* e.g. -EACCES */
              }

      Lustre 2.12.9 does not have either of these patches:

      • 30b356a28b LU-14296 lnet: use an unbound cred in kiblnd_resolve_addr()
      • 1e4bd16acf LU-14006 o2ib: raise bind cap before resolving address

      I can pull these patches onto our stack and push them to b2_12 for testing and review, but I don't understand two things:

      (1) We see this routinely on one fairly small cluster (<100 nodes), and almost never on any other cluster (collectively, >5000 nodes). Do you know why this would be?

      (2) I added some debugging and was able to determine that when I see this, some threads have CAP_NET_BIND_SERVICE, so their rdma_resolve_addr() calls succeed, while other threads do not have it. Is there a reason the LNet threads involved in connection setup would not all have the same capabilities?

      Those questions make me wonder if the problem is really elsewhere, e.g. some code that drops capabilities and then fails to restore them after it's finished with the sensitive task.

      For my records, my local ticket is TOSS5940

            People

              Assignee: ssmirnov Serguei Smirnov
              Reporter: ofaaland Olaf Faaland
              Votes: 0
              Watchers: 4
