Details
-
Bug
-
Resolution: Duplicate
-
Major
-
None
-
Lustre 2.12.9
-
None
-
lustre-2.12.9_3.llnl-3.t4.x86_64
toss-4.5-4
4.18.0-425.13.1.1toss.t4.x86_64
-
3
Description
Intermittent console messages like this:
[Wed Mar 15 16:54:55 2023] LNetError: 3119923:0:(o2iblnd_cb.c:1473:kiblnd_connect_peer()) Can't resolve addr for 192.168.112.4@o2ib17: -13
[Wed Mar 15 17:15:15 2023] LNetError: 3124466:0:(o2iblnd_cb.c:1473:kiblnd_connect_peer()) Can't resolve addr for 192.168.112.4@o2ib17: -13
along with other messages indicating LNet routes are down:
[Wed Mar 15 17:15:15 2023] LNetError: 31377:0:(lib-move.c:2005:lnet_handle_find_routed_path()) no route to 172.21.3.62@o2ib700 from <?>
and possibly other LustreError messages.
The error emitted by kiblnd_connect_peer() is coming from calls to rdma_resolve_addr(), like so:
kiblnd_connect_peer() -> kiblnd_resolve_addr() -> rdma_resolve_addr()
where the actual call looks like
/* look for a free privileged port */
for (port = PROT_SOCK-1; port > 0; port--) {
        srcaddr->sin_port = htons(port);
        rc = rdma_resolve_addr(cmid,
                               (struct sockaddr *)srcaddr,
                               (struct sockaddr *)dstaddr,
                               timeout_ms);
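For reference, the -13 in the console messages is -EACCES, which is what the kernel returns when a caller without CAP_NET_BIND_SERVICE tries to bind a port below PROT_SOCK (1024). The following is a minimal userspace sketch of the same downward port scan, using a plain TCP socket and bind() in place of rdma_resolve_addr(). It is an analogue for illustration, not the o2iblnd code; the helper name bind_privileged_port() and the choice to retry only on EADDRINUSE are assumptions.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define PRIV_PORT_LIMIT 1024 /* same value as the kernel's PROT_SOCK */

/*
 * Hypothetical helper: bind a TCP socket to the highest free
 * privileged port, scanning downward from PRIV_PORT_LIMIT - 1.
 * Returns the bound port (> 0) on success or a negative errno.
 * The socket is closed before returning; only the result matters here.
 */
static int bind_privileged_port(void)
{
        struct sockaddr_in src;
        int fd, port, rc = -EADDRINUSE;

        fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0)
                return -errno;

        memset(&src, 0, sizeof(src));
        src.sin_family = AF_INET;
        src.sin_addr.s_addr = htonl(INADDR_ANY);

        for (port = PRIV_PORT_LIMIT - 1; port > 0; port--) {
                src.sin_port = htons(port);
                if (bind(fd, (struct sockaddr *)&src, sizeof(src)) == 0) {
                        rc = port;
                        break;
                }
                rc = -errno;
                /* A busy port is worth retrying; -EACCES will not improve. */
                if (rc != -EADDRINUSE)
                        break;
        }

        close(fd);
        return rc;
}
```

Called from a process without CAP_NET_BIND_SERVICE, this would normally fail on the first port with -13, matching the value in the LNetError lines. On systems where the net.ipv4.ip_unprivileged_port_start sysctl has been lowered, the bind can succeed without the capability.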
Lustre 2.12.9 does not have either of these patches:
- 30b356a28b LU-14296 lnet: use an unbound cred in kiblnd_resolve_addr()
- 1e4bd16acf LU-14006 o2ib: raise bind cap before resolving address
I can pull these patches onto our stack and push them to b2_12 for testing and review, but I don't understand two things:
(1) We see this routinely on one fairly small cluster (<100 nodes), and almost never on any other cluster (collectively, >5000 nodes). Do you know why this would be?
(2) I added some debugging and determined that, when I see this, some threads have CAP_NET_BIND_SERVICE, so their rdma_resolve_addr() calls succeed, while other threads do not have it. Is there a reason the LNet threads involved in connection setup would not all have the same capabilities?
Those questions make me wonder if the problem is really elsewhere, e.g. some code that drops capabilities and then fails to restore them after it's finished with the sensitive task.
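One way to compare capabilities across threads without patching the module might be to read the CapEff mask from each thread's status file under /proc and test the CAP_NET_BIND_SERVICE bit. The sketch below assumes that approach; it is not the debugging code referred to above, and the helper names mask_has_cap() and read_cap_eff() are hypothetical. The bit number 10 is the value of CAP_NET_BIND_SERVICE in linux/capability.h.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CAP_NET_BIND_SERVICE_BIT 10 /* from linux/capability.h */

/* Return 1 if capability number 'cap' is set in 'mask', else 0. */
static int mask_has_cap(uint64_t mask, int cap)
{
        assert(cap >= 0 && cap < 64);
        return (int)((mask >> cap) & 1);
}

/*
 * Hypothetical helper: parse the "CapEff:" line of a /proc status
 * file into *mask. Returns 0 on success, -1 if the file cannot be
 * opened or has no CapEff line.
 */
static int read_cap_eff(const char *path, uint64_t *mask)
{
        char line[256];
        FILE *f = fopen(path, "r");
        int rc = -1;

        if (f == NULL)
                return -1;

        while (fgets(line, sizeof(line), f) != NULL) {
                if (strncmp(line, "CapEff:", 7) == 0) {
                        /* The value is a hexadecimal bitmask. */
                        *mask = strtoull(line + 7, NULL, 16);
                        rc = 0;
                        break;
                }
        }

        fclose(f);
        return rc;
}
```

Pointing read_cap_eff() at /proc/<pid>/status for each thread that logs the error, then calling mask_has_cap(mask, CAP_NET_BIND_SERVICE_BIT), would show which threads lack the capability at the moment of sampling. It would not show whether a thread dropped the capability temporarily and restored it between samples.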
For my records, my local ticket is TOSS5940
Issue Links
- is related to: LU-14006 raise CAP_NET_BIND_SERVICE before calling rdma_resolve_addr() (Resolved)