[LU-16393] o2iblnd: connections rejected before lnd startup is complete Created: 13/Dec/22  Updated: 08/Dec/23  Resolved: 19/Aug/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: Bug
Priority: Minor
Reporter: Serguei Smirnov
Assignee: Serguei Smirnov
Resolution: Fixed
Votes: 0
Labels: None

Issue Links:
Related
is related to LU-17071 o2iblnd: Oops caused by IBLND_REJECT_... Resolved
Severity: 3

Description

Before LND startup is complete, there is a window of time during which o2iblnd can reject incoming connection requests, logging errors similar to the following:

Nov 16 08:24:18 ai400x2vm-008 kernel: LNetError: 7758:0:(o2iblnd_cb.c:2480:kiblnd_passive_connect()) Can't accept conn from 172.16.16.12@o2ib on NA (ib0:0:172.16.0.192): bad dst nid 172.16.0.192@o2ib
Nov 16 08:24:19 ai400x2vm-008 kernel: LNetError: 7758:0:(o2iblnd_cb.c:2480:kiblnd_passive_connect()) Can't accept conn from 172.16.16.187@o2ib on NA (ib0:0:172.16.0.192): bad dst nid 172.16.0.192@o2ib
Nov 16 08:24:19 ai400x2vm-008 kernel: LNetError: 7758:0:(o2iblnd_cb.c:2480:kiblnd_passive_connect()) Skipped 54 previous similar messages
Nov 16 08:24:19 ai400x2vm-008 kernel: LNet: Added LNI 172.16.0.192@o2ib [32/5120/0/180]
Nov 16 08:24:19 ai400x2vm-008 kernel: LNet: Using FastReg for registration
Nov 16 08:24:20 ai400x2vm-008 kernel: LNetError: 7758:0:(o2iblnd_cb.c:2480:kiblnd_passive_connect()) Can't accept conn from 172.16.0.58@o2ib on NA (ib0:1:172.16.0.192): bad dst nid 172.16.0.192@o2ib
Nov 16 08:24:20 ai400x2vm-008 kernel: LNetError: 7758:0:(o2iblnd_cb.c:2480:kiblnd_passive_connect()) Skipped 180 previous similar messages
Nov 16 08:24:20 ai400x2vm-008 kernel: LNet: Added LNI 172.16.16.192@o2ib [32/5120/0/180]

Look into getting rid of this race condition.
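
When checking whether a node is hitting this window, it may help to tally the rejections against the "Added LNI" lines in the same log. The Python sketch below does that for lines shaped like the excerpt above. The regular expressions, the summarize() helper and the SAMPLE text are illustrative assumptions written against this excerpt only; they are not part of Lustre, and other releases may word these messages differently.

```python
import re
from collections import Counter

# Patterns are written against the excerpt quoted in this ticket only
# (assumption: other Lustre versions may word these messages differently).
REJECT_RE = re.compile(
    r"Can't accept conn from (\S+) on NA \(([^)]*)\): bad dst nid (\S+)"
)
SKIPPED_RE = re.compile(r"Skipped (\d+) previous similar messages")
ADDED_RE = re.compile(r"LNet: Added LNI (\S+)")
TIME_RE = re.compile(r"\b(\d{2}:\d{2}:\d{2})\b")


def summarize(lines):
    """Return (rejects per source NID, suppressed count, [(time, NID added)])."""
    rejects = Counter()  # explicit rejection lines, keyed by source NID
    skipped = 0          # rejections hidden behind "Skipped N ..." lines
    added = []           # (syslog time, NID) for each "Added LNI" line
    for line in lines:
        stamp = TIME_RE.search(line)
        when = stamp.group(1) if stamp else None
        m = REJECT_RE.search(line)
        if m:
            rejects[m.group(1)] += 1
            continue
        m = SKIPPED_RE.search(line)
        if m:
            skipped += int(m.group(1))
            continue
        m = ADDED_RE.search(line)
        if m:
            added.append((when, m.group(1)))
    return rejects, skipped, added


# Three lines copied from the excerpt above, used as sample input.
SAMPLE = """\
Nov 16 08:24:18 ai400x2vm-008 kernel: LNetError: 7758:0:(o2iblnd_cb.c:2480:kiblnd_passive_connect()) Can't accept conn from 172.16.16.12@o2ib on NA (ib0:0:172.16.0.192): bad dst nid 172.16.0.192@o2ib
Nov 16 08:24:19 ai400x2vm-008 kernel: LNetError: 7758:0:(o2iblnd_cb.c:2480:kiblnd_passive_connect()) Skipped 54 previous similar messages
Nov 16 08:24:19 ai400x2vm-008 kernel: LNet: Added LNI 172.16.0.192@o2ib [32/5120/0/180]
"""

if __name__ == "__main__":
    rejects, skipped, added = summarize(SAMPLE.splitlines())
    print("explicit rejections per peer:", dict(rejects))
    print("suppressed rejections:", skipped)
    print("LNIs added:", added)
```

In practice the input would be the full kernel log rather than SAMPLE. Because the kernel rate-limits repeated messages, the "Skipped N previous similar messages" totals have to be counted alongside the explicit lines to get a realistic picture of how many requests were rejected.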



Comments
Comment by Nathan Dauchy [ 17/Jan/23 ]

Is it correct that a client with a rejected connection (during this race window on a server) would report an error message like the following?

LNetError: 353407:0:(o2iblnd_cb.c:2951:kiblnd_rejected()) 192.168.23.45@o2ib rejected: o2iblnd fatal error

 

Comment by Serguei Smirnov [ 18/Jan/23 ]

Hi Nathan,

Yes, my understanding is that this is correct.

Comment by Gerrit Updater [ 13/Jul/23 ]

"Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51651
Subject: LU-16393 o2iblnd: add IBLND_REJECT_EARLY reject reason
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 93fe169eef88e8ab31acd01b8c5b3084f1de93ad

Comment by Gerrit Updater [ 19/Aug/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51651/
Subject: LU-16393 o2iblnd: add IBLND_REJECT_EARLY reject reason
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 673ff86a84ad5d11cde24aa7411c45385ad1c633

Comment by Peter Jones [ 19/Aug/23 ]

Landed for 2.16

Generated at Sat Feb 10 03:26:38 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.