[LU-11112] lnet: improve error msg in lnet_sock_create() Created: 02/Jul/18 Updated: 03/Jul/18 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Minor |
| Reporter: | Daniel Kobras (Inactive) | Assignee: | Sonia Sharma (Inactive) |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
The kernel_bind() call in lnet_sock_create() may fail either due to
Background: We've encountered an issue where LNET had picked a virtual IP address (used for non-Lustre services) for its local_ip, and lnet_sock_create() would fail once the IP address was migrated to another node. The error message included only the port, not the IP address, so it took a while to correlate the events. Why LNET chose this particular source address is a separate question we need to investigate, but for starters, improving the error message to include all relevant information seems like a good idea to me. |
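To illustrate the kind of message the description asks for, here is a minimal userspace sketch that formats a bind failure with both the local IP address and the port. It is not the proposed patch: the helper name format_bind_error() and the message wording are hypothetical, and the kernel code would use its own logging macros rather than snprintf(). The sketch assumes local_ip is held in host byte order, as the htonl(local_ip) call in lnet_sock_create() suggests.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper (not part of LNet): build an error string that
 * names both the local address and the port of a failed bind.
 * local_ip is assumed to be in host byte order.
 */
int format_bind_error(char *buf, size_t len, uint32_t local_ip,
                      int local_port, int rc)
{
    struct in_addr addr;
    char ip[INET_ADDRSTRLEN];

    /* Convert to network byte order before turning it into text. */
    addr.s_addr = htonl(local_ip);
    if (inet_ntop(AF_INET, &addr, ip, sizeof(ip)) == NULL)
        snprintf(ip, sizeof(ip), "?");

    return snprintf(buf, len, "Error trying to bind to %s port %d: %d",
                    ip, local_port, rc);
}
```

With a made-up documentation address such as 192.0.2.10 (0xC000020A), port 1023 and return code -99, the resulting string names the address that could not be bound, which is the information that was missing when correlating the events.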
| Comments |
| Comment by Peter Jones [ 02/Jul/18 ] |
|
Sonia, could you please investigate? Thanks, Peter |
| Comment by Peter Jones [ 02/Jul/18 ] |
|
Daniel, could you please push your proposed patch into Gerrit so it can be reviewed/landed? Peter |
| Comment by Daniel Kobras (Inactive) [ 02/Jul/18 ] |
|
It seems Gerrit has moved to a different IP address, and I cannot access it due to local firewall restrictions. Attaching the patch here while I try to sort things out. |
| Comment by Karsten Weiss [ 02/Jul/18 ] |
|
FWIW: This is the original lnet error message:
LNetError: 2099:0:(lib-socket.c:455:lnet_sock_create()) Error trying to bind to port 1023: -99
LNetError: 2099:0:(lib-socket.c:455:lnet_sock_create()) Skipped 8 previous similar messages
LNetError: 11e-e: Unexpected error -99 connecting to 192.168.10.6@tcp at host 192.168.10.6 on port 988 |
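For readers decoding the -99 in the log above: on Linux systems that use the generic errno numbering (an assumption about the affected node, true for x86_64 for example), 99 is EADDRNOTAVAIL, the error bind() returns when the requested address is not configured locally. A small userspace sketch, using a hypothetical helper describe_rc(), shows one way to turn such a kernel-style negative return code into readable text:

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: render a kernel-style return code (negative errno)
 * together with the C library's description of that errno value.
 */
int describe_rc(char *buf, size_t len, int rc)
{
    int err = rc < 0 ? -rc : rc;

    return snprintf(buf, len, "%d (%s)", rc, strerror(err));
}
```

The exact description text depends on the C library, so only the numeric part of the output should be relied on.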
| Comment by Sonia Sharma (Inactive) [ 02/Jul/18 ] |
|
Hi Daniel, In the lnet_sock_connect function, I see "INADDR_ANY" is assigned if local_ip == 0. With "INADDR_ANY" in the bind call, the socket will be bound to all the local interfaces.
    if (local_ip != 0 || local_port != 0) {
        memset(&locaddr, 0, sizeof(locaddr));
        locaddr.sin_family = AF_INET;
        locaddr.sin_port = htons(local_port);
        locaddr.sin_addr.s_addr = (local_ip == 0) ?
                                  INADDR_ANY : htonl(local_ip);
Was this virtual address assigned to one of the interfaces on the node? It would help to know what particular action or command is resulting in this error. Thanks |
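The address selection quoted in the comment above can be tried out in userspace. The following is a sketch under stated assumptions, not the LNet code itself: fill_local_addr() is a hypothetical name, local_ip is taken to be in host byte order as in the excerpt, and the return value (1 if an explicit bind is needed, 0 otherwise) is an addition for illustration.

```c
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical userspace rendering of the excerpt: fill in the local
 * address for bind(). Returns 1 if the caller should bind, 0 if neither
 * an address nor a port was requested.
 */
int fill_local_addr(struct sockaddr_in *locaddr, uint32_t local_ip,
                    int local_port)
{
    if (local_ip == 0 && local_port == 0)
        return 0;

    memset(locaddr, 0, sizeof(*locaddr));
    locaddr->sin_family = AF_INET;
    locaddr->sin_port = htons((uint16_t)local_port);
    /* A zero local_ip means "any local address"; otherwise pin the source. */
    locaddr->sin_addr.s_addr = (local_ip == 0) ?
                               htonl(INADDR_ANY) : htonl(local_ip);
    return 1;
}
```

When local_ip is non-zero, the socket is pinned to that one address, so a bind fails as soon as the address leaves the node. This matches the virtual-IP scenario in the description.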
| Comment by Daniel Kobras (Inactive) [ 03/Jul/18 ] |
|
Hi Sonia! For the scope of this LU, my line of argument just goes:
The case given was only meant as an example to show that having the full information in the error output occasionally really matters.
Why LNET chose to open a connection with a fixed source IP address rather than just using INADDR_ANY isn't clear to me, yet. One should be able to reproduce it with
but that's a topic probably more suited to a separate LU (once we've collected more information about it). |
| Comment by Gerrit Updater [ 03/Jul/18 ] |
|
Daniel Kobras (d.kobras@science-computing.de) uploaded a new patch: https://review.whamcloud.com/32758 |