[LU-11112] lnet: improve error msg in lnet_sock_create() Created: 02/Jul/18  Updated: 03/Jul/18

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Minor
Reporter: Daniel Kobras (Inactive) Assignee: Sonia Sharma (Inactive)
Resolution: Unresolved Votes: 0
Labels: None

Attachments: File 0001-LU-11112-lnet-improve-error-msg-in-lnet_sock_create.patch    
Rank (Obsolete): 9223372036854775807

 Description   

The kernel_bind() call in lnet_sock_create() may fail because of a problem with either the local port or the local IP address, but the error message currently includes only the port. It would be helpful if the message included both items when reporting a fatal error.

Background: We've encountered an issue where LNET had picked a virtual IP address (used for non-Lustre services) as its local_ip, and lnet_sock_create() failed once the IP address was migrated to another node. The error message included only the port, not the IP address, so it took a while to correlate the events. Why LNET picked this particular source address is a separate question we need to investigate, but for starters, improving the error message to include all relevant information seems like a good idea to me.
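To illustrate the request, here is a minimal userspace sketch of a message that carries both values. The helper name format_bind_error() and the exact message layout are assumptions made for illustration; they are not taken from the attached patch or from LNet. It also assumes local_ip is held in host byte order, as the htonl(local_ip) call quoted later in this ticket suggests.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper, not part of LNet: formats a bind failure so that
 * both the local address and the local port appear in the message.
 * local_ip is assumed to be in host byte order. */
static int format_bind_error(char *buf, size_t len, uint32_t local_ip,
                             int local_port, int rc)
{
        return snprintf(buf, len,
                        "Error trying to bind to %u.%u.%u.%u:%d: %d",
                        (unsigned)((local_ip >> 24) & 0xff),
                        (unsigned)((local_ip >> 16) & 0xff),
                        (unsigned)((local_ip >> 8) & 0xff),
                        (unsigned)(local_ip & 0xff),
                        local_port, rc);
}
```

For example, with the illustrative address 192.0.2.10 (0xC000020A), port 1023 and return code -99, the buffer would hold "Error trying to bind to 192.0.2.10:1023: -99".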



 Comments   
Comment by Peter Jones [ 02/Jul/18 ]

Sonia

Could you please investigate?

Thanks

Peter

Comment by Peter Jones [ 02/Jul/18 ]

Daniel

Could you please push your proposed patch into Gerrit so it can be reviewed/landed?

Peter

Comment by Daniel Kobras (Inactive) [ 02/Jul/18 ]

It seems Gerrit has moved to a different IP address, and I cannot access it due to local firewall restrictions. Attaching the patch here while I try to sort things out.

Comment by Karsten Weiss [ 02/Jul/18 ]

FWIW: This is the original lnet error message:

LNetError: 2099:0:(lib-socket.c:455:lnet_sock_create()) Error trying to bind to port 1023: -99
LNetError: 2099:0:(lib-socket.c:455:lnet_sock_create()) Skipped 8 previous similar messages
LNetError: 11e-e: Unexpected error -99 connecting to 192.168.10.6@tcp at host 192.168.10.6 on port 988
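
The -99 in these messages is a negated errno value. As a rough aid for reading such logs, the following userspace sketch maps a kernel-style return code to its description. The helper name describe_rc() is made up for this example. On Linux with the generic errno numbering, 99 is EADDRNOTAVAIL ("Cannot assign requested address"), which is consistent with a local address that is no longer present on the node; other platforms may number errors differently.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: turns a kernel-style negative return code
 * (for example -99) into the text for the matching errno value. */
static const char *describe_rc(int rc)
{
        return strerror(rc < 0 ? -rc : rc);
}
```

Calling describe_rc(-99) on the affected node is one quick way to confirm what the code means there.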
Comment by Sonia Sharma (Inactive) [ 02/Jul/18 ]

Hi Daniel

In the lnet_sock_connect function, I see that "INADDR_ANY" is assigned if local_ip == 0. With "INADDR_ANY" in the bind call, the socket will be bound to all the local interfaces.

        if (local_ip != 0 || local_port != 0) {
                memset(&locaddr, 0, sizeof(locaddr));
                locaddr.sin_family = AF_INET;
                locaddr.sin_port = htons(local_port);
                locaddr.sin_addr.s_addr = (local_ip == 0) ?
                                          INADDR_ANY : htonl(local_ip);

Was this virtual address assigned to one of the interfaces on the node? It would also help to know which particular action or command triggers this error.

Thanks
Sonia

Comment by Daniel Kobras (Inactive) [ 03/Jul/18 ]

Hi Sonia!

For the scope of this LU, my line of argument just goes:

  • lnet_sock_connect() can fail due to either of two arguments (local_ip and local_port);
  • the resulting error message just includes one of the arguments (local_port);
  • there exists at least one real-world case where it fails due to the other argument (local_ip);
  • hence the error message should be improved to include both arguments.

The case given was only meant as an example to show that having the full information in the error output occasionally really matters.

 

Why LNET chose to open a connection with a fixed source IP address rather than just using INADDR_ANY isn't clear to me yet. One should be able to reproduce it with these steps:

  • assign IP alias to interface;
  • start LNET/Lustre;
  • remove IP alias from interface;

but that's a topic probably more suited to a separate LU (once we've collected more information about it).
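
The failure mode behind these steps can also be observed from userspace without LNET. The sketch below is only an approximation: try_bind() is a hypothetical helper that mirrors the address setup quoted earlier in this ticket using the ordinary sockets API, not kernel_bind(), and it again assumes local_ip is in host byte order.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical userspace helper, not LNet code: binds a TCP socket to
 * local_ip (host byte order, 0 meaning INADDR_ANY) and local_port,
 * then closes it. Returns 0 on success or a negative errno value. */
static int try_bind(uint32_t local_ip, uint16_t local_port)
{
        struct sockaddr_in locaddr;
        int fd, rc = 0;

        fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0)
                return -errno;

        memset(&locaddr, 0, sizeof(locaddr));
        locaddr.sin_family = AF_INET;
        locaddr.sin_port = htons(local_port);
        locaddr.sin_addr.s_addr = (local_ip == 0) ?
                                  htonl(INADDR_ANY) : htonl(local_ip);

        /* Save errno before close() has a chance to change it. */
        if (bind(fd, (struct sockaddr *)&locaddr, sizeof(locaddr)) < 0)
                rc = -errno;

        close(fd);
        return rc;
}
```

After the alias has been removed, calling try_bind() with that address would be expected to return -EADDRNOTAVAIL on Linux, unless the net.ipv4.ip_nonlocal_bind sysctl is enabled, while try_bind(0, 0) should still succeed because INADDR_ANY does not depend on any particular address.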

Comment by Gerrit Updater [ 03/Jul/18 ]

Daniel Kobras (d.kobras@science-computing.de) uploaded a new patch: https://review.whamcloud.com/32758
Subject: LU-11112 lnet: improve error msg in lnet_sock_create()
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a0b382caf10e590f112030cc528e1c5fdd470390

Generated at Sat Feb 10 02:41:03 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.