[LU-15618] ksock_conn ref leak on shutdown Created: 04/Mar/22 Updated: 28/Nov/22 Resolved: 11/Jun/22 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Chris Horn | Assignee: | Chris Horn |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||||||||||||||
| Severity: | 3 | ||||||||||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||||||||||
| Description |
|
This is a bug with: commit 47b7b319783f27023b0cefe54a2a2eea678284f2
Author: Doug Oucharek <doug.s.oucharek@intel.com>
Date: Wed Mar 2 12:08:00 2016 +0800
LU-8106 lnet: Do not drop message when shutting down LNet
That changes makes it so that if we fail to lookup the peer NI w/ESHUTDOWN, then lnet_parse() returns 0 instead of dropping the message. This can lead to a situation where ksocknal_process_receive() isn't aware that anything went wrong with lnet_parse() and so an extra ref is left on the ksock_conn. ksocknal_conn_addref(conn); /* ++ref while parsing */
rc = lnet_parse(conn->ksnc_peer->ksnp_ni,
&hdr,
&conn->ksnc_peer->ksnp_id.nid,
conn, 0);
if (rc < 0) {
/* I just received garbage: give up on this conn */
ksocknal_new_packet(conn, 0);
ksocknal_close_conn_and_siblings(conn, rc);
CDEBUG(D_NET, "pre %p %u\n", conn,
refcount_read(&conn->ksnc_conn_refcount));
ksocknal_conn_decref(conn);
return (-EPROTO);
}
<<<< REF LEAKED >>>>
This prevents ksocklnd from shutting down, as it gets stuck waiting forever for the associated peer_ni to be destroyed. A symptom of this could be message like this printed to the console log: 00000800:00000200:0.0:1646425022.404025:0:6589:0:(socklnd.c:2369:ksocknal_shutdown()) waiting for 2 peers to disconnect 00000800:00000200:0.0:1646425023.404594:0:6589:0:(socklnd.c:2369:ksocknal_shutdown()) waiting for 2 peers to disconnect 00000800:00000200:0.0:1646425024.405196:0:6589:0:(socklnd.c:2369:ksocknal_shutdown()) waiting for 2 peers to disconnect 00000800:00000200:0.0:1646425025.405475:0:6589:0:(socklnd.c:2369:ksocknal_shutdown()) waiting for 2 peers to disconnect 00000800:00000200:0.0:1646425026.406188:0:6589:0:(socklnd.c:2369:ksocknal_shutdown()) waiting for 2 peers to disconnect 00000800:00000200:0.0:1646425027.407158:0:6589:0:(socklnd.c:2369:ksocknal_shutdown()) waiting for 2 peers to disconnect 00000800:00000200:0.0:1646425028.407259:0:6589:0:(socklnd.c:2369:ksocknal_shutdown()) waiting for 2 peers to disconnect 00000800:00000200:0.0:1646425029.407153:0:6589:0:(socklnd.c:2369:ksocknal_shutdown()) waiting for 2 peers to disconnect 00000800:00000200:0.0:1646425030.407047:0:6589:0:(socklnd.c:2369:ksocknal_shutdown()) waiting for 2 peers to disconnect 00000800:00000200:0.0:1646425031.407210:0:6589:0:(socklnd.c:2369:ksocknal_shutdown()) waiting for 2 peers to disconnect It seems more appropriate to have lnet_parse() return <0 in this case to signal to LNDs that the parse failed. They can handle it in the same way as EPROTO errors, etc. |
| Comments |
| Comment by Gerrit Updater [ 04/Mar/22 ] |
|
"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/46711 |
| Comment by Cyril Bordage [ 07/Mar/22 ] |
|
Is this only for ksock? |
| Comment by Chris Horn [ 07/Mar/22 ] |
|
The change impacts all LNDs, but I haven't observed or figured out if this bug causes a problem with other LNDs. |
| Comment by Chris Horn [ 07/Mar/22 ] |
|
If I run the reproducer with o2iblnd I see similar symptom, but I haven't root caused it to the same bug. |
| Comment by Chris Horn [ 11/Mar/22 ] |
|
If I run the reproducer on o2iblnd with the fix applied then I can't reproduce the symptoms. So it seems like the fix applies for ko2iblnd as well. |
| Comment by Andreas Dilger [ 06/Jun/22 ] |
|
Chris, any idea why this started failing recently, even though the patch is 6y old? |
| Comment by Gerrit Updater [ 11/Jun/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/46711/ |
| Comment by Peter Jones [ 11/Jun/22 ] |
|
Landed for 2.16 |
| Comment by Gerrit Updater [ 28/Nov/22 ] |
|
"Olaf Faaland <faaland1@llnl.gov>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49259 |