[LU-2491] Spurious LNet Error: Can't accept connection on "bad dst nid" Created: 13/Dec/12  Updated: 08/Jan/13  Resolved: 08/Jan/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Prakash Surya (Inactive) Assignee: Isaac Huang (Inactive)
Resolution: Won't Fix Votes: 0
Labels: LB, sequoia, shh

Severity: 3
Rank (Obsolete): 5845

 Description   

I see the following LNet Error in the logs:

2012-12-13 13:09:08 LNetError: 21195:0:(o2iblnd_cb.c:2261:kiblnd_passive_connect()) Can't accept 172.20.13.43@o2ib500 on NA (ib0:0:172.20.5.2): bad dst nid 172.20.5.2@o2ib500
2012-12-13 13:09:08 LNetError: 21180:0:(o2iblnd_cb.c:2261:kiblnd_passive_connect()) Can't accept 172.20.14.153@o2ib500 on NA (ib0:0:172.20.5.2): bad dst nid 172.20.5.2@o2ib500
2012-12-13 13:09:08 LNet: Added LNI 172.20.5.2@o2ib500 [8/1024/0/180]
2012-12-13 13:09:09 LNET configured

Without looking at the code, it seems like LNet is denying the incoming connection because it is not yet configured. If that is the case, I don't think that warrants a console message. It should just silently refuse the connection until it is fully configured.



 Comments   
Comment by Isaac Huang (Inactive) [ 13/Dec/12 ]

Yes, there's a small window after o2iblnd has created a listening CMID but before the lnd_startup() call completes. Console errors shouldn't be used in such cases.
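A toy model of that window, in case it helps to picture the ordering. It assumes the rejection comes from a destination NID lookup that cannot succeed until startup has finished; the ticket does not spell out the exact mechanism. Every name here (toy_net, toy_start_listener, toy_finish_startup, toy_passive_connect) is hypothetical and is not taken from o2iblnd.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for the state of one network interface. */
struct toy_net {
    bool listening;      /* listener created: peers can already reach us */
    bool ni_registered;  /* startup finished: NID lookups can succeed */
    char nid[32];
};

enum toy_result {
    TOY_ACCEPT,
    TOY_REJECT_NO_LISTENER,
    TOY_REJECT_BAD_DST_NID
};

/* Step 1 of startup: the listener goes live first. */
void toy_start_listener(struct toy_net *n)
{
    n->listening = true;
}

/* Step 2 of startup: only now does the interface become visible. */
void toy_finish_startup(struct toy_net *n, const char *nid)
{
    assert(n->listening); /* in this model the listener always comes first */
    strncpy(n->nid, nid, sizeof(n->nid) - 1);
    n->nid[sizeof(n->nid) - 1] = '\0';
    n->ni_registered = true;
}

/* Incoming connection request addressed to dst_nid. */
enum toy_result toy_passive_connect(const struct toy_net *n, const char *dst_nid)
{
    if (!n->listening)
        return TOY_REJECT_NO_LISTENER;
    /* Between step 1 and step 2 a correctly addressed request lands here. */
    if (!n->ni_registered || strcmp(n->nid, dst_nid) != 0)
        return TOY_REJECT_BAD_DST_NID;
    return TOY_ACCEPT;
}
```

Calling toy_passive_connect() after toy_start_listener() but before toy_finish_startup() yields TOY_REJECT_BAD_DST_NID even though the peer used the right NID, which is the spurious case reported here.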

Comment by Isaac Huang (Inactive) [ 27/Dec/12 ]

I looked at the code, and it does not appear easy to do properly. The CERROR() is shared by several similar error cases, and if it were simply changed to a CDEBUG(), some important error cases that actually deserve immediate attention would be muted too. It's hard to single out the exact case here, i.e. an incoming connection arriving while a matching interface is still being initialized, especially taking into consideration the upcoming LNet dynamic config project.
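A simplified sketch of why one shared message is hard to demote. It assumes several distinct conditions funnel into a single report site whose log level is the only knob; the three conditions below are illustrative guesses, and toy_iface, toy_level and toy_check_dst are hypothetical names rather than the o2iblnd code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum toy_level { TOY_LVL_DEBUG, TOY_LVL_ERROR };

/* Hypothetical description of a configured interface. */
struct toy_iface {
    const char *nid;
    int dev;
};

/*
 * Returns true if the request must be rejected. *on_console reports whether
 * the single shared message would be printed at the chosen level.
 */
bool toy_check_dst(const struct toy_iface *ni, const char *dst_nid, int dev,
                   enum toy_level shared_level, bool *on_console)
{
    assert(on_console != NULL);

    bool bad = ni == NULL                      /* interface not known (yet) */
            || strcmp(ni->nid, dst_nid) != 0   /* request is misaddressed */
            || ni->dev != dev;                 /* arrived on the wrong device */

    /* All three causes share one message, so they share one level. */
    *on_console = bad && shared_level == TOY_LVL_ERROR;
    return bad;
}
```

With shared_level set to TOY_LVL_DEBUG, the spurious startup case goes quiet, but so does a genuinely misaddressed request. Telling the two apart would need extra state about interfaces that are still initializing, which is the added complexity in question.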

In short, a correct fix would involve considerable complexity, making the code harder to maintain in the long run. So I have to ask: how many of these have been seen at LLNL to make it a concern for you guys?

I'd tend to think it's not a problem because:

  • The window during which it could happen shouldn't be longer than a couple of milliseconds.
  • When it does happen on a node, LNet is still initializing itself, so Lustre isn't running yet. The console messages can't cause other Lustre debug messages to go unnoticed, because Lustre isn't running yet.

Please let me know if I've missed something that makes it more problematic than I thought. Otherwise I'd prefer to leave it there and keep the code simple.

Comment by Prakash Surya (Inactive) [ 02/Jan/13 ]

Well, I'd like to see it fixed, but if it would cause a lot of added complexity to code which will be reworked with the upcoming LNET changes, I'm OK leaving it as is. It's more of an annoyance than what I'd call a problem.

Comment by Isaac Huang (Inactive) [ 02/Jan/13 ]

It's difficult to filter out exactly the spurious case only. Do you guys enable neterror console logging by default? If not, it'd be a good simple trade-off to just change the CERROR() into a CNETERR().
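The trade-off can be sketched like this, assuming console output is gated by a bitmask of message classes, so a CNETERR()-style message is printed only when a neterror bit is enabled. The names and bit values are made up for illustration and are not the libcfs definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical message classes; real bit assignments may differ. */
#define TOY_D_ERROR    (1u << 0)
#define TOY_D_NETERROR (1u << 1)

/* A message reaches the console only if its class is in the console mask. */
bool toy_reaches_console(unsigned int console_mask, unsigned int msg_class)
{
    /* msg_class must be exactly one class bit */
    assert(msg_class != 0 && (msg_class & (msg_class - 1)) == 0);
    return (console_mask & msg_class) != 0;
}
```

Under this model, reclassifying the message from the error class to the neterror class hides it only on nodes whose console mask leaves the neterror bit off. Where that bit is on by default, the message still reaches the console.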

Comment by Prakash Surya (Inactive) [ 08/Jan/13 ]

IIRC, we do enable neterror by default. If it isn't worth the effort to filter the spurious case, let's just close this as "Won't Fix". This definitely shouldn't be a blocker, IMO.

Comment by Isaac Huang (Inactive) [ 08/Jan/13 ]

It's hard to filter out the exact spurious case without adding lots of complexity elsewhere.

Generated at Sat Feb 10 01:25:39 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.