[LU-16393] o2iblnd: connections rejected before lnd startup is complete Created: 13/Dec/22 Updated: 08/Dec/23 Resolved: 19/Aug/23 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Serguei Smirnov | Assignee: | Serguei Smirnov |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
Before LND startup is complete, there is a window during which o2iblnd can reject incoming connection requests, logging errors similar to the following:
Nov 16 08:24:18 ai400x2vm-008 kernel: LNetError: 7758:0:(o2iblnd_cb.c:2480:kiblnd_passive_connect()) Can't accept conn from 172.16.16.12@o2ib on NA (ib0:0:172.16.0.192): bad dst nid 172.16.0.192@o2ib
Nov 16 08:24:19 ai400x2vm-008 kernel: LNetError: 7758:0:(o2iblnd_cb.c:2480:kiblnd_passive_connect()) Can't accept conn from 172.16.16.187@o2ib on NA (ib0:0:172.16.0.192): bad dst nid 172.16.0.192@o2ib
Nov 16 08:24:19 ai400x2vm-008 kernel: LNetError: 7758:0:(o2iblnd_cb.c:2480:kiblnd_passive_connect()) Skipped 54 previous similar messages
Nov 16 08:24:19 ai400x2vm-008 kernel: LNet: Added LNI 172.16.0.192@o2ib [32/5120/0/180]
Nov 16 08:24:19 ai400x2vm-008 kernel: LNet: Using FastReg for registration
Nov 16 08:24:20 ai400x2vm-008 kernel: LNetError: 7758:0:(o2iblnd_cb.c:2480:kiblnd_passive_connect()) Can't accept conn from 172.16.0.58@o2ib on NA (ib0:1:172.16.0.192): bad dst nid 172.16.0.192@o2ib
Nov 16 08:24:20 ai400x2vm-008 kernel: LNetError: 7758:0:(o2iblnd_cb.c:2480:kiblnd_passive_connect()) Skipped 180 previous similar messages
Nov 16 08:24:20 ai400x2vm-008 kernel: LNet: Added LNI 172.16.16.192@o2ib [32/5120/0/180]
Look into getting rid of this race condition. |
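As a triage aid, the log excerpt above can be scanned to see which rejections were logged before the destination NID's "Added LNI" line. The following is a minimal sketch, not part of the fix: the regular expressions are assumptions derived only from the message format shown in the excerpt, and the name `classify_rejections` is hypothetical.

```python
import re

# Patterns assumed from the log excerpt in this ticket; other Lustre
# versions may format these messages differently.
REJECT_RE = re.compile(
    r"kiblnd_passive_connect\(\)\) Can't accept conn from (?P<src>\S+) "
    r"on \S+ \([^)]*\): bad dst nid (?P<dst>\S+)"
)
ADDED_RE = re.compile(r"LNet: Added LNI (?P<nid>\S+)")


def classify_rejections(lines):
    """Return (src_nid, dst_nid, dst_already_added) for each rejection.

    dst_already_added is True when an "Added LNI" line for the same
    destination NID appeared earlier in the input.
    """
    added = set()
    results = []
    for line in lines:
        m = ADDED_RE.search(line)
        if m:
            added.add(m.group("nid"))
            continue
        m = REJECT_RE.search(line)
        if m:
            dst = m.group("dst")
            results.append((m.group("src"), dst, dst in added))
    return results


if __name__ == "__main__":
    import sys

    # Usage (hypothetical): dmesg | python3 classify.py
    for src, dst, was_added in classify_rejections(sys.stdin):
        state = "after" if was_added else "before"
        print(f"{src} -> {dst}: rejected {state} 'Added LNI' for {dst}")
```

Applied to the excerpt above, this sketch flags two rejections logged before the "Added LNI" line for 172.16.0.192@o2ib and one logged after it (the 08:24:20 message on ib0:1). "Skipped N previous similar messages" lines are not counted, so the totals understate the real number of rejections.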
| Comments |
| Comment by Nathan Dauchy [ 17/Jan/23 ] |
|
Is it correct that a client with a rejected connection (during this race window on a server) would report an error message like the following?
LNetError: 353407:0:(o2iblnd_cb.c:2951:kiblnd_rejected()) 192.168.23.45@o2ib rejected: o2iblnd fatal error
|
| Comment by Serguei Smirnov [ 18/Jan/23 ] |
|
Hi Nathan, yes, my understanding is that this is correct. |
| Comment by Gerrit Updater [ 13/Jul/23 ] |
|
"Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51651 |
| Comment by Gerrit Updater [ 19/Aug/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51651/ |
| Comment by Peter Jones [ 19/Aug/23 ] |
|
Landed for 2.16 |