[LU-17258] socklnd connection type not established upon connection race Created: 02/Nov/23  Updated: 07/Feb/24

Status: Reopened
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Minor
Reporter: Nikitas Angelinas Assignee: Serguei Smirnov
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
is related to LU-17513 how does 'conns_per_peer' apply with ... Open
is related to LU-17515 dynamically shrink 'conns_per_peer' a... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The following assertion was triggered on one of our clusters:

socklnd_cb.c:1950:ksocknal_connect()) ASSERTION( (wanted & ((((1UL))) << (3))) != 0 ) failed:
socklnd_cb.c:1950:ksocknal_connect()) LBUG
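
The parenthesis-heavy expression in the assertion is what a bit macro looks like after preprocessing: it reduces to checking that bit 3 of `wanted` is set. A minimal sketch of that check follows; `SKETCH_BIT` and `mask_has_bit` are hypothetical names used for illustration, not identifiers taken from socklnd.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a bit macro that expands to ((1UL) << (n)). */
#define SKETCH_BIT(n) (1UL << (n))

/* Returns true when bit `n` is set in `mask`. The failed assertion
 * checks this condition for n == 3. */
static bool mask_has_bit(unsigned long mask, unsigned int n)
{
	return (mask & SKETCH_BIT(n)) != 0;
}
```

With this helper, `mask_has_bit(0x8, 3)` holds while `mask_has_bit(0x4, 3)` does not; the LBUG means bit 3 was clear in `wanted` when the assertion ran.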

The crash dump shows the conn_cb in the following state:

struct ksock_conn_cb {
...
ksnr_scheduled = 1,
ksnr_connecting = 1,
ksnr_connected = 10,
ksnr_deleted = 0,
ksnr_ctrl_conn_count = 1,
ksnr_blki_conn_count = 1,
ksnr_blko_conn_count = 0,
ksnr_conn_count = 2,
ksnr_max_conns = 8,
ksnr_busy_retry_count = 3
}
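
Reading `ksnr_connected = 10` as a bitmask gives binary 1010: bits 1 and 3 are set and bit 2 is clear. The sketch below shows why bit 3 cannot appear in `wanted` if, as assumed here, `wanted` is the set of all typed connections minus those already marked connected. The type numbering (control = 1, bulk in = 2, bulk out = 3), the `SK_` names and both helper functions are assumptions made for illustration and are not confirmed by this ticket.

```c
#include <stdbool.h>

/* Assumed connection-type bit positions (hypothetical names). */
enum sk_conn_type {
	SK_CONN_CONTROL  = 1,
	SK_CONN_BULK_IN  = 2,
	SK_CONN_BULK_OUT = 3,
};

#define SK_BIT(n) (1UL << (n))

/* Assumed full set of typed connections: bits 1..3, i.e. 0xE. */
#define SK_ALL_TYPES (SK_BIT(SK_CONN_CONTROL) | \
		      SK_BIT(SK_CONN_BULK_IN) | \
		      SK_BIT(SK_CONN_BULK_OUT))

/* Assumption: the types still to be connected are all types
 * that are not yet marked as connected. */
static unsigned long sk_wanted(unsigned long connected)
{
	return SK_ALL_TYPES & ~connected;
}

/* Returns true when the bit for `type` is set in `mask`. */
static bool sk_has_type(unsigned long mask, enum sk_conn_type type)
{
	return (mask & SK_BIT(type)) != 0;
}
```

Under these assumptions, `sk_wanted(10)` evaluates to 4: only the bulk-in bit remains, so a check for bit 3 in `wanted` fails, which matches the assertion text. The dump also shows `ksnr_blki_conn_count = 1` with bit 2 clear in `ksnr_connected`, and `ksnr_blko_conn_count = 0` with bit 3 set, which under the same numbering looks like the mask and the counters disagreeing about which connection type was established.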

The debug log shows that a connection race between the two peers was hit three times, which accounts for ksnr_busy_retry_count = 3 in the conn_cb.

hornc has suggested a fix for this, which we will submit shortly.



 Comments   
Comment by Gerrit Updater [ 02/Nov/23 ]

"Nikitas Angelinas <nikitas.angelinas@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52957
Subject: LU-17258 socklnd: ensure connection type established upon race
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 188a3a633dd2df8084722f95772831f46064fc12

Comment by Gerrit Updater [ 08/Nov/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/52957/
Subject: LU-17258 socklnd: ensure connection type established upon race
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 5afe3b0538c533c3cca370bc9c0901abccca299a

Comment by Peter Jones [ 09/Nov/23 ]

Landed for 2.16

Comment by Gerrit Updater [ 07/Feb/24 ]

"Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53955
Subject: LU-17258 socklnd: stop connecting on too many retries
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: f3d666c9f05fa174365fdc3b032b84f50781f36c

Generated at Sat Feb 10 03:33:57 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.