[LU-14615] can't add tcp nid Created: 14/Apr/21  Updated: 16/Sep/22  Resolved: 16/Sep/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.5, Lustre 2.12.6
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Mahmoud Hanafi Assignee: Amir Shehata (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

On a few hosts, when adding a tcp NID using

lnetctl lnet configure
lnetctl net add --net tcp --if ib1

we get this error:

[1137072.940179] LNet: 39800:0:(config.c:1641:lnet_inet_enumerate()) lnet: Ignoring interface eth2: it's down
[1137072.950118] LNet: 39800:0:(config.c:1641:lnet_inet_enumerate()) Skipped 2 previous similar messages
[1137072.959931] LNet: Added LNI 10.151.27.21@tcp [8/256/0/180]
[1137072.959988] LNetError: 39814:0:(lib-socket.c:315:lnet_sock_listen()) Can't create socket: port 988 already in use
[1137072.970687] LNetError: 122-1: Can't start acceptor on port 988: port already in use
[1137072.970724] LNetError: 39800:0:(api-ni.c:3123:lnet_add_net_common()) Failed to start up acceptor thread
[1137073.977512] LNet: Removed LNI 10.151.27.21@tcp

Nothing is using that port:

# lsof -i tcp@localhost:988
# lsof -i udp@localhost:988
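
A note on the check above: lsof reports sockets through the processes that hold them, so a listening socket created inside the kernel may not show up there. As a cross-check, here is a minimal sketch that reads the kernel's own socket table instead. It assumes a Linux host with /proc/net/tcp and /proc/net/tcp6 in their usual layout; the function name listeners_on_port is made up for illustration.

```python
LISTEN_STATE = "0A"  # TCP_LISTEN as encoded in the "st" column of /proc/net/tcp


def listeners_on_port(proc_net_tcp_text, port):
    """Return the hex local addresses of sockets listening on `port`.

    `proc_net_tcp_text` is the full text of /proc/net/tcp or /proc/net/tcp6.
    """
    hits = []
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        local, state = fields[1], fields[3]
        addr_hex, port_hex = local.rsplit(":", 1)
        if state == LISTEN_STATE and int(port_hex, 16) == port:
            hits.append(addr_hex)
    return hits


if __name__ == "__main__":
    for path in ("/proc/net/tcp", "/proc/net/tcp6"):
        try:
            with open(path) as f:
                print(path, listeners_on_port(f.read(), 988))
        except OSError as exc:
            print(path, "unreadable:", exc)
```

An all-zero address in the result means the listener is bound to every local address, which is enough to block any other bind on that port.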


 Comments   
Comment by Amir Shehata (Inactive) [ 15/Apr/21 ]

Are you using ib1 for an o2iblnd network as well?

Comment by Peter Jones [ 15/Apr/21 ]

What version is this, Mahmoud?

Comment by Mahmoud Hanafi [ 15/Apr/21 ]

This is 2.12.5, and yes, we are also using o2ib on ib1. It worked on most of the nodes.

Comment by Amir Shehata (Inactive) [ 16/Apr/21 ]

Can you check the service port for the o2iblnd? What is that set to? I'm thinking if it's 988 instead of 987, then you could run into this problem.

Comment by Mahmoud Hanafi [ 29/Apr/21 ]

We don't have o2ib listed in /etc/services.

How do I check the port? This also occurs randomly when nodes are rebooted.

 

Comment by Amir Shehata (Inactive) [ 03/May/21 ]

Does this show any useful results?

 netstat -lnp

The error in the description is printed if kernel_bind() returns EADDRINUSE.

I looked at the kernel_bind() code, and it seems that a port can be shared in the following circumstances:

// from include/net/inet_hashtables.h
 * 1) Sockets bound to different interfaces may share a local port.
 *    Failing that, goto test 2.
 * 2) If all sockets have sk->sk_reuse set, and none of them are in
 *    TCP_LISTEN state, the port may be shared.
 *    Failing that, goto test 3.
 * 3) If all sockets are bound to a specific inet_sk(sk)->rcv_saddr local
 *    address, and none of them are the same, the port may be
 *    shared.
 *    Failing this, the port cannot be shared.

When we create the listening socket, we set SO_REUSEADDR and bind to any address on the system.
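
To illustrate test 2 from the kernel comment, here is a user-space sketch, assuming Linux. It is not the LNet code path, which calls kernel_bind() inside the kernel. It shows that SO_REUSEADDR does not help once another socket on the same port is already in TCP_LISTEN state. The function name second_bind_result is made up for illustration, and an ephemeral port is used so the sketch does not need root.

```python
import errno
import socket


def second_bind_result():
    """Bind two SO_REUSEADDR sockets to the same wildcard address and port.

    Returns the symbolic errno of the second bind (e.g. "EADDRINUSE"),
    or "OK" if the second bind unexpectedly succeeds.
    """
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        for s in (first, second):
            s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        first.bind(("0.0.0.0", 0))   # port 0: let the kernel pick a free port
        first.listen(1)              # first socket is now in TCP_LISTEN
        port = first.getsockname()[1]
        try:
            second.bind(("0.0.0.0", port))  # same wildcard address, same port
        except OSError as exc:
            return errno.errorcode.get(exc.errno, str(exc.errno))
        return "OK"
    finally:
        first.close()
        second.close()


if __name__ == "__main__":
    print(second_bind_result())
```

So if anything else on the node already holds a listening socket on port 988, the acceptor's bind to any address would be expected to fail this way, whether or not SO_REUSEADDR is set.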

Some debugging steps I would take

  1. Try a different listening port instead of 988
  2. Does it happen on all privileged ports? Is the behaviour different if we use a non-privileged port?
  3. What's the ib status of the interface at the time of the bind? Could it be possible that the IB HCA hasn't fully initialized yet? We've seen cases when the IB stack might not have been initialized by the time we bring up the LND. I know you're using the interface for ethernet, but it's worth looking at the status of the card.
  4. Dump the output of netstat -lnp and ibstatus early on, right after a node is rebooted.

Are you able to consistently reproduce this problem?

 

Comment by Mahmoud Hanafi [ 04/May/21 ]

netstat shows nothing. Nothing else is using that port.

  1. How do I try a different port?
  2. We don't have privileged ports configured. (Why is it using a privileged port?)
    I don't see the same options for tcp:
     options ko2iblnd require_privileged_port=0 use_privileged_port=0

  3. We have seen this issue, but that gives a different error, and we can bring up the interface later.

When the node is in this state, if we remove the tcp option from lustre.conf and try to load the module, we get the same error.

Comment by Amir Shehata (Inactive) [ 04/May/21 ]

You can set:

options lnet accept_port=XXX 

By non-privileged I was thinking anything above 1024.

The accept_port would need to be set consistently in order for the nodes to connect.
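
One way to check that consistency across nodes is to extract the value from each node's modprobe configuration and compare. This is a rough sketch: it assumes the option lives on an "options lnet ..." line as shown above, and that 988 (the port in the error message) applies when nothing is set. The function name lnet_accept_port and the port numbers in the usage lines are made up for illustration.

```python
DEFAULT_ACCEPT_PORT = 988  # assumption: the port named in the error above


def lnet_accept_port(modprobe_conf_text):
    """Return the accept_port set on an 'options lnet' line, else the default."""
    port = DEFAULT_ACCEPT_PORT
    for raw in modprobe_conf_text.splitlines():
        words = raw.split("#", 1)[0].split()  # drop comments, split on whitespace
        if len(words) >= 2 and words[0] == "options" and words[1] == "lnet":
            for opt in words[2:]:
                name, sep, value = opt.partition("=")
                if name == "accept_port" and sep:
                    port = int(value)
    return port


if __name__ == "__main__":
    # Hypothetical config text; on a real node, read the lustre.conf contents.
    print(lnet_accept_port("options lnet accept_port=1988\n"))
    print(lnet_accept_port("options ko2iblnd use_privileged_port=0\n"))
```

Running this over the configuration collected from every node and comparing the results should flag any node that would listen on a different port from its peers.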

Another thing to look out for is the actual network interface not coming up, or IPoIB not having finished configuring (or not being loaded) before LNet tries to bind to the port. (Maybe something like: https://unix.stackexchange.com/questions/126009/cause-a-script-to-execute-after-networking-has-started)

In syslog you should see:

 IPv6: ADDRCONF(NETDEV_CHANGE): ib0: link becomes ready

Or something to that effect. Does that happen before or after LNet throws the error?

Maybe if we can grab the entire syslog when this problem happens, we can look at the context for other clues.
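
The before/after question can be answered mechanically from a captured log. This is a minimal sketch, assuming the log keeps the kernel's "[seconds.microseconds]" timestamps seen in the description; the function names are made up for illustration, and the interface name is passed in by the caller.

```python
import re

# Kernel-style timestamp such as "[1137072.970687]" followed by the message.
STAMP = re.compile(r"\[\s*(\d+\.\d+)\]\s*(.*)$")


def first_time(log_text, needle):
    """Timestamp in seconds of the first log line containing `needle`, or None."""
    for line in log_text.splitlines():
        m = STAMP.search(line)
        if m and needle in m.group(2):
            return float(m.group(1))
    return None


def link_ready_before_error(log_text, ifname):
    """True if `ifname` reported link ready before the acceptor error,
    False if after, None if either message is missing from the log."""
    ready = first_time(log_text, f"{ifname}: link becomes ready")
    error = first_time(log_text, "Can't start acceptor")
    if ready is None or error is None:
        return None
    return ready < error


if __name__ == "__main__":
    import sys
    # Usage: python3 this_script.py <ifname> < captured_log
    print(link_ready_before_error(sys.stdin.read(), sys.argv[1]))
```

A result of False would point at LNet starting before the interface was ready. True or None would suggest looking elsewhere in the log.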

Comment by Mahmoud Hanafi [ 16/Sep/22 ]

Please close this case

Generated at Sat Feb 10 03:11:16 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.