[LU-15860] ksocknal_add_peer() race results in extra ksock_conn_cb Created: 16/May/22 Updated: 11/Apr/23 Resolved: 27/Jun/22 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.16.0, Lustre 2.15.3 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Chris Horn | Assignee: | Chris Horn |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Seems there is a race where two ksock_conn_cb can be created:

Bad conn_cb
00000800:00000010:19.0:1652717291.344361:0:4361:0:(socklnd.c:170:ksocknal_create_peer()) alloc '(peer_ni)': 240 at ffff9215aa86b700 (tot 41715364).
00000800:00000010:19.0:1652717291.344362:0:4361:0:(socklnd.c:119:ksocknal_create_conn_cb()) alloc '(conn_cb)': 200 at ffff9215aa86b600 (tot 41715564).
00000800:00000200:19.1:1652717291.344363:0:4361:0:(socklnd.c:556:ksocknal_add_conn_cb_locked()) pre ffff9215aa86b700 1
00000800:00000200:19.1:1652717291.344364:0:4361:0:(socklnd.c:556:ksocknal_add_conn_cb_locked()) post ffff9215aa86b700 2
00000800:00000200:19.0:1652717291.344365:0:4361:0:(socklnd.c:239:ksocknal_find_peer_locked()) got peer_ni [ffff9215aa86b700] -> 12345-172.18.2.8@tcp (2)
00000800:00000200:19.1:1652717291.344366:0:4361:0:(socklnd.c:239:ksocknal_find_peer_locked()) got peer_ni [ffff9215aa86b700] -> 12345-172.18.2.8@tcp (2)
00000800:00000200:19.1:1652717291.344367:0:4361:0:(socklnd_cb.c:645:ksocknal_launch_connection_locked()) pre ffff9215aa86b600 1
00000800:00000200:19.1:1652717291.344368:0:4361:0:(socklnd_cb.c:645:ksocknal_launch_connection_locked()) post ffff9215aa86b600 2

Good conn_cb
00000800:00000010:16.0:1652717291.344365:0:4360:0:(socklnd.c:119:ksocknal_create_conn_cb()) alloc '(conn_cb)': 200 at ffff9215aca22600 (tot 41715508).
...
00000800:00000200:16.1:1652717291.344371:0:4360:0:(socklnd.c:239:ksocknal_find_peer_locked()) got peer_ni [ffff9215aa86b700] -> 12345-172.18.2.8@tcp (2)
00000800:00000200:16.1:1652717291.344375:0:4360:0:(socklnd.c:556:ksocknal_add_conn_cb_locked()) pre ffff9215aa86b700 2
00000800:00000200:16.1:1652717291.344375:0:4360:0:(socklnd.c:556:ksocknal_add_conn_cb_locked()) post ffff9215aa86b700 3
00000800:00000200:16.0:1652717291.344377:0:4360:0:(socklnd.c:239:ksocknal_find_peer_locked()) got peer_ni [ffff9215aa86b700] -> 12345-172.18.2.8@tcp (3)
00000800:00000200:16.1:1652717291.344378:0:4360:0:(socklnd.c:239:ksocknal_find_peer_locked()) got peer_ni [ffff9215aa86b700] -> 12345-172.18.2.8@tcp (3)
00000800:00000200:16.1:1652717291.344378:0:4360:0:(socklnd_cb.c:645:ksocknal_launch_connection_locked()) pre ffff9215aca22600 1
00000800:00000200:16.1:1652717291.344379:0:4360:0:(socklnd_cb.c:645:ksocknal_launch_connection_locked()) post ffff9215aca22600 2

The second one overwrites the first in ksocknal_add_peer()->ksocknal_add_conn_cb_locked(). The first one gets stuck and is never freed on shutdown. |
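To illustrate the overwrite described above, here is a minimal user-space sketch. It is not the socklnd code and not the landed patch: `struct peer_ni`, `struct conn_cb`, `attach_unguarded()`, `attach_guarded()` and `replay_race()` are hypothetical, simplified stand-ins, and the interleaving is replayed sequentially (both callers allocate, then both attach) rather than with real threads. It only shows why an attach that does not check for an existing conn_cb leaves the first one unreachable at shutdown, and one possible way such a duplicate could be dropped.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for the socklnd structures. */
struct conn_cb {
	int refcount;
};

struct peer_ni {
	struct conn_cb *conn_cb;	/* the peer tracks a single conn_cb */
};

static int live_conn_cbs;	/* conn_cbs allocated and not yet released */

static struct conn_cb *conn_cb_alloc(void)
{
	struct conn_cb *cb = calloc(1, sizeof(*cb));

	assert(cb != NULL);
	cb->refcount = 1;
	live_conn_cbs++;
	return cb;
}

static void conn_cb_release(struct conn_cb *cb)
{
	free(cb);
	live_conn_cbs--;
}

/* Unguarded attach: overwrites whatever conn_cb is already installed. */
void attach_unguarded(struct peer_ni *peer, struct conn_cb *cb)
{
	peer->conn_cb = cb;
}

/* Guarded attach (assumed approach): keep the existing conn_cb and
 * release the duplicate instead of overwriting. */
void attach_guarded(struct peer_ni *peer, struct conn_cb *cb)
{
	if (peer->conn_cb != NULL) {
		conn_cb_release(cb);
		return;
	}
	peer->conn_cb = cb;
}

/* Replays the race sequentially and returns how many conn_cbs are
 * still accounted as live after the simulated shutdown. */
int replay_race(void (*attach)(struct peer_ni *, struct conn_cb *))
{
	struct peer_ni peer = { .conn_cb = NULL };
	struct conn_cb *first, *second;
	int first_orphaned, leaked;

	live_conn_cbs = 0;
	first = conn_cb_alloc();	/* caller A */
	second = conn_cb_alloc();	/* caller B */
	attach(&peer, first);
	attach(&peer, second);
	first_orphaned = (peer.conn_cb != first);

	/* Shutdown only releases the conn_cb reachable from the peer. */
	if (peer.conn_cb != NULL) {
		conn_cb_release(peer.conn_cb);
		peer.conn_cb = NULL;
	}
	leaked = live_conn_cbs;

	/* Demo cleanup only: free the orphan so the sketch itself does
	 * not leak memory; real code has no pointer left to do this. */
	if (first_orphaned)
		free(first);
	return leaked;
}
```

Calling `replay_race(attach_unguarded)` reports one conn_cb left behind after shutdown, matching the stuck first conn_cb in the logs, while `replay_race(attach_guarded)` reports none.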
| Comments |
| Comment by Gerrit Updater [ 16/May/22 ] |
|
"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/47361 |
| Comment by Gerrit Updater [ 27/Jun/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47361/ |
| Comment by Peter Jones [ 27/Jun/22 ] |
|
Landed for 2.16 |
| Comment by Gerrit Updater [ 18/Oct/22 ] |
|
"James Simmons <jsimmons@infradead.org>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/48911 |
| Comment by Gerrit Updater [ 11/Apr/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/48911/ |