[LU-2491] Spurious LNet Error: Can't accept connection on "bad dst nid" Created: 13/Dec/12 Updated: 08/Jan/13 Resolved: 08/Jan/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Prakash Surya (Inactive) | Assignee: | Isaac Huang (Inactive) |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | LB, sequoia, shh | ||
| Severity: | 3 |
| Rank (Obsolete): | 5845 |
| Description |
|
I see the following LNet Error in the logs:

2012-12-13 13:09:08 LNetError: 21195:0:(o2iblnd_cb.c:2261:kiblnd_passive_connect()) Can't accept 172.20.13.43@o2ib500 on NA (ib0:0:172.20.5.2): bad dst nid 172.20.5.2@o2ib500
2012-12-13 13:09:08 LNetError: 21180:0:(o2iblnd_cb.c:2261:kiblnd_passive_connect()) Can't accept 172.20.14.153@o2ib500 on NA (ib0:0:172.20.5.2): bad dst nid 172.20.5.2@o2ib500
2012-12-13 13:09:08 LNet: Added LNI 172.20.5.2@o2ib500 [8/1024/0/180]
2012-12-13 13:09:09 LNET configured

Without looking at the code, it seems like LNet is denying the incoming connection because it is not yet configured. If that is the case, I don't think that warrants a console message. It should just silently refuse the connection until it is fully configured. |
| Comments |
| Comment by Isaac Huang (Inactive) [ 13/Dec/12 ] |
|
Yes, there's a small window after the o2iblnd has created a listening CMID but before the lnd_startup() call completes. Console errors shouldn't be used in such cases. |
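The window described above can be sketched with a small stand-alone model. This is not the o2iblnd code: the `sk_` names, the struct layout, and the verdict enum are hypothetical and exist only to illustrate the ordering problem, under the assumption that a connection arriving after the listener exists but before startup completes is rejected because the local NID is not yet registered.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified model of one network interface during startup. */
struct sk_ni {
    bool listening;   /* listener created, peers can already connect */
    bool configured;  /* startup finished, local NID is registered */
    char nid[64];     /* local NID, meaningful only once configured */
};

enum sk_verdict {
    SK_NO_LISTENER,   /* nothing is listening yet */
    SK_BAD_DST_NID,   /* listener is up but the destination NID is unknown */
    SK_ACCEPTED,
};

/* Step 1: the listener comes up first. */
static void sk_create_listener(struct sk_ni *ni)
{
    ni->listening = true;
}

/* Step 2: startup completes and the local NID becomes known. */
static void sk_finish_startup(struct sk_ni *ni, const char *nid)
{
    snprintf(ni->nid, sizeof(ni->nid), "%s", nid);
    ni->configured = true;
}

/* Decide what happens to an incoming connection addressed to dst_nid. */
static enum sk_verdict sk_passive_connect(const struct sk_ni *ni,
                                          const char *dst_nid)
{
    if (!ni->listening)
        return SK_NO_LISTENER;
    if (!ni->configured || strcmp(ni->nid, dst_nid) != 0)
        return SK_BAD_DST_NID;
    return SK_ACCEPTED;
}
```

In this model, calling `sk_passive_connect()` between `sk_create_listener()` and `sk_finish_startup()` yields `SK_BAD_DST_NID` even for the NID that is about to be registered, which matches the ordering of the log lines in the description.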
| Comment by Isaac Huang (Inactive) [ 27/Dec/12 ] |
|
I looked at the code, and it does not appear easy to do this properly. The CERROR() is shared by several similar error cases, and if it were simply changed to a CDEBUG(), some important error cases that actually deserve immediate attention would be muted too. It's hard to single out the exact case here, i.e. an incoming connection arriving while a matching interface is still being initialized, especially taking into consideration the upcoming LNet dynamic config project. In short, a correct fix would involve quite some complexity, making the code harder to maintain in the long run. So I have to ask: how many of these have been seen at LLNL to make it a concern for you guys? I'd tend to think it's not a problem because:
Please let me know if I've missed something that makes it more problematic than I thought. Otherwise I'd prefer to leave it there and keep the code simple. |
| Comment by Prakash Surya (Inactive) [ 02/Jan/13 ] |
|
Well, I'd like to see it fixed, but if it would cause a lot of added complexity to code which will be reworked with the upcoming LNET changes, I'm OK leaving it as is. It's more of an annoyance than what I'd call a problem. |
| Comment by Isaac Huang (Inactive) [ 02/Jan/13 ] |
|
It's difficult to filter out exactly the spurious case only. Do you guys enable neterror console logging by default? If not, it'd be a good simple trade-off to just change the CERROR() into a CNETERR(). |
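The trade-off can be illustrated with a small stand-alone model of class-based console filtering. This is only a sketch: the `SK_` macros, the mask variable, and the bit values are hypothetical stand-ins and are not the libcfs definitions of CERROR() or CNETERR(). It assumes only that each message carries a class and that the console prints a message when its class is enabled.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical class bits for this sketch; the real values live in libcfs. */
enum {
    SK_D_ERROR    = 1 << 0,
    SK_D_NETERROR = 1 << 1,
};

/* Classes currently allowed onto the console (stand-in for the console mask). */
static unsigned int sk_console_mask = SK_D_ERROR;

/* Print the message only if its class is enabled; return 1 if it was printed. */
static int sk_console_log(unsigned int msg_class, const char *fmt, ...)
{
    va_list ap;

    if (!(sk_console_mask & msg_class))
        return 0;

    va_start(ap, fmt);
    vfprintf(stderr, fmt, ap);
    va_end(ap);
    return 1;
}

/* Stand-ins for the two macros discussed above. */
#define SK_CERROR(...)  sk_console_log(SK_D_ERROR, __VA_ARGS__)
#define SK_CNETERR(...) sk_console_log(SK_D_NETERROR, __VA_ARGS__)
```

With `sk_console_mask` set to `SK_D_ERROR` only, a message emitted through `SK_CNETERR()` stays off the console while `SK_CERROR()` messages still appear. Once `SK_D_NETERROR` is also set in the mask, both are printed, so the change would hide the message only at sites that do not enable network errors on the console.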
| Comment by Prakash Surya (Inactive) [ 08/Jan/13 ] |
|
IIRC, we do enable neterror by default. If it isn't worth the effort to filter the spurious case, let's just close this as "Won't Fix". This definitely shouldn't be a blocker, IMO. |
| Comment by Isaac Huang (Inactive) [ 08/Jan/13 ] |
|
It's hard to filter out the exact spurious case without adding lots of complexity elsewhere. |