[LU-15860] ksocknal_add_peer() race results in extra ksock_conn_cb
Created: 16/May/22  Updated: 11/Apr/23  Resolved: 27/Jun/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0, Lustre 2.15.3

Type: Bug
Priority: Minor
Reporter: Chris Horn
Assignee: Chris Horn
Resolution: Fixed
Votes: 0
Labels: None

Severity: 3

Description

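There appears to be a race in which two ksock_conn_cb structures are created for the same peer_ni. In the logs below, two threads (PIDs 4361 and 4360) each allocate a conn_cb and attach it to peer_ni ffff9215aa86b700: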

Bad conn_cb
00000800:00000010:19.0:1652717291.344361:0:4361:0:(socklnd.c:170:ksocknal_create_peer()) alloc '(peer_ni)': 240 at ffff9215aa86b700 (tot 41715364).
00000800:00000010:19.0:1652717291.344362:0:4361:0:(socklnd.c:119:ksocknal_create_conn_cb()) alloc '(conn_cb)': 200 at ffff9215aa86b600 (tot 41715564).
00000800:00000200:19.1:1652717291.344363:0:4361:0:(socklnd.c:556:ksocknal_add_conn_cb_locked()) pre ffff9215aa86b700 1
00000800:00000200:19.1:1652717291.344364:0:4361:0:(socklnd.c:556:ksocknal_add_conn_cb_locked()) post ffff9215aa86b700 2
00000800:00000200:19.0:1652717291.344365:0:4361:0:(socklnd.c:239:ksocknal_find_peer_locked()) got peer_ni [ffff9215aa86b700] -> 12345-172.18.2.8@tcp (2)
00000800:00000200:19.1:1652717291.344366:0:4361:0:(socklnd.c:239:ksocknal_find_peer_locked()) got peer_ni [ffff9215aa86b700] -> 12345-172.18.2.8@tcp (2)
00000800:00000200:19.1:1652717291.344367:0:4361:0:(socklnd_cb.c:645:ksocknal_launch_connection_locked()) pre ffff9215aa86b600 1
00000800:00000200:19.1:1652717291.344368:0:4361:0:(socklnd_cb.c:645:ksocknal_launch_connection_locked()) post ffff9215aa86b600 2

Good conn_cb
00000800:00000010:16.0:1652717291.344365:0:4360:0:(socklnd.c:119:ksocknal_create_conn_cb()) alloc '(conn_cb)': 200 at ffff9215aca22600 (tot 41715508).
...
00000800:00000200:16.1:1652717291.344371:0:4360:0:(socklnd.c:239:ksocknal_find_peer_locked()) got peer_ni [ffff9215aa86b700] -> 12345-172.18.2.8@tcp (2)
00000800:00000200:16.1:1652717291.344375:0:4360:0:(socklnd.c:556:ksocknal_add_conn_cb_locked()) pre ffff9215aa86b700 2
00000800:00000200:16.1:1652717291.344375:0:4360:0:(socklnd.c:556:ksocknal_add_conn_cb_locked()) post ffff9215aa86b700 3
00000800:00000200:16.0:1652717291.344377:0:4360:0:(socklnd.c:239:ksocknal_find_peer_locked()) got peer_ni [ffff9215aa86b700] -> 12345-172.18.2.8@tcp (3)
00000800:00000200:16.1:1652717291.344378:0:4360:0:(socklnd.c:239:ksocknal_find_peer_locked()) got peer_ni [ffff9215aa86b700] -> 12345-172.18.2.8@tcp (3)
00000800:00000200:16.1:1652717291.344378:0:4360:0:(socklnd_cb.c:645:ksocknal_launch_connection_locked()) pre ffff9215aca22600 1
00000800:00000200:16.1:1652717291.344379:0:4360:0:(socklnd_cb.c:645:ksocknal_launch_connection_locked()) post ffff9215aca22600 2

The second conn_cb (ffff9215aca22600) overwrites the first (ffff9215aa86b600) when ksocknal_add_peer() calls ksocknal_add_conn_cb_locked(). The first conn_cb is then stranded and is never freed on shutdown.
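The overwrite can be illustrated with a small user-space model. This is only a sketch under stated assumptions: the struct and function names (conn_cb, peer_ni, attach_unchecked, attach_checked) are hypothetical stand-ins, not the socklnd code, and the guarded variant shows one possible way to avoid the duplicate, not necessarily what the landed patch does. The peer_ni reference counts mirror the "pre 1 / post 2" and "pre 2 / post 3" values in the logs above.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for ksock_conn_cb and ksock_peer_ni. */
struct conn_cb {
	int refcount;
};

struct peer_ni {
	int refcount;
	struct conn_cb *conn_cb;	/* a peer is expected to hold at most one */
};

/* Allocate a conn_cb holding the creator's single reference. */
static struct conn_cb *conn_cb_create(void)
{
	struct conn_cb *cb = calloc(1, sizeof(*cb));

	assert(cb != NULL);
	cb->refcount = 1;
	return cb;
}

/*
 * Racy pattern: attach unconditionally. Each attach takes a peer reference,
 * and a second attach silently replaces the first pointer. The replaced
 * conn_cb is returned only so that a caller can observe the orphan.
 */
static struct conn_cb *attach_unchecked(struct peer_ni *peer,
					struct conn_cb *cb)
{
	struct conn_cb *old = peer->conn_cb;

	peer->conn_cb = cb;
	peer->refcount++;
	return old;
}

/*
 * Guarded pattern (an assumption, not the actual fix): if the peer already
 * has a conn_cb, drop the new one instead of overwriting.
 * Returns 1 if cb was attached, 0 if it was discarded.
 */
static int attach_checked(struct peer_ni *peer, struct conn_cb *cb)
{
	if (peer->conn_cb != NULL) {
		free(cb);
		return 0;
	}
	peer->conn_cb = cb;
	peer->refcount++;
	return 1;
}
```

With attach_unchecked(), attaching two conn_cbs to a peer that starts at refcount 1 leaves the peer at refcount 3 with only the second conn_cb reachable, so the reference taken for the first is never dropped. With attach_checked(), the peer ends at refcount 2 and keeps the first conn_cb.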



Comments
Comment by Gerrit Updater [ 16/May/22 ]

"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/47361
Subject: LU-15860 socklnd: Duplicate ksock_conn_cb
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 1feb1708ac65b2aa89d20a06988734d2ff807ec7

Comment by Gerrit Updater [ 27/Jun/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47361/
Subject: LU-15860 socklnd: Duplicate ksock_conn_cb
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 0c91d49a44e1214b5c65d4a557f6969b3d217881

Comment by Peter Jones [ 27/Jun/22 ]

Landed for 2.16

Comment by Gerrit Updater [ 18/Oct/22 ]

"James Simmons <jsimmons@infradead.org>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/48911
Subject: LU-15860 socklnd: Duplicate ksock_conn_cb
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: 04d225733104ad973fc4da82e9e4c8eed4677d8a

Comment by Gerrit Updater [ 11/Apr/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/48911/
Subject: LU-15860 socklnd: Duplicate ksock_conn_cb
Project: fs/lustre-release
Branch: b2_15
Current Patch Set:
Commit: ea34ee7b40271ec23b6d9ed916a43971dd73fad5
