Details
Type: Bug
Resolution: Fixed
Priority: Minor
Description
This is a bug with:
commit 47b7b319783f27023b0cefe54a2a2eea678284f2
Author: Doug Oucharek <doug.s.oucharek@intel.com>
Date:   Wed Mar 2 12:08:00 2016 +0800

    LU-8106 lnet: Do not drop message when shutting down LNet
That change makes it so that if we fail to look up the peer NI with ESHUTDOWN, lnet_parse() returns 0 instead of dropping the message. As a result, ksocknal_process_receive() is not aware that anything went wrong in lnet_parse(), and an extra ref is left on the ksock_conn.
	ksocknal_conn_addref(conn);	/* ++ref while parsing */

	rc = lnet_parse(conn->ksnc_peer->ksnp_ni,
			&hdr,
			&conn->ksnc_peer->ksnp_id.nid,
			conn, 0);
	if (rc < 0) {
		/* I just received garbage: give up on this conn */
		ksocknal_new_packet(conn, 0);
		ksocknal_close_conn_and_siblings(conn, rc);
		CDEBUG(D_NET, "pre %p %u\n", conn,
		       refcount_read(&conn->ksnc_conn_refcount));
		ksocknal_conn_decref(conn);
		return (-EPROTO);
	}
	<<<< REF LEAKED >>>>
This prevents ksocklnd from shutting down, because it gets stuck waiting forever for the associated peer_ni to be destroyed. One symptom is a message like the following repeated in the console log:
00000800:00000200:0.0:1646425022.404025:0:6589:0:(socklnd.c:2369:ksocknal_shutdown()) waiting for 2 peers to disconnect
00000800:00000200:0.0:1646425023.404594:0:6589:0:(socklnd.c:2369:ksocknal_shutdown()) waiting for 2 peers to disconnect
00000800:00000200:0.0:1646425024.405196:0:6589:0:(socklnd.c:2369:ksocknal_shutdown()) waiting for 2 peers to disconnect
00000800:00000200:0.0:1646425025.405475:0:6589:0:(socklnd.c:2369:ksocknal_shutdown()) waiting for 2 peers to disconnect
00000800:00000200:0.0:1646425026.406188:0:6589:0:(socklnd.c:2369:ksocknal_shutdown()) waiting for 2 peers to disconnect
00000800:00000200:0.0:1646425027.407158:0:6589:0:(socklnd.c:2369:ksocknal_shutdown()) waiting for 2 peers to disconnect
00000800:00000200:0.0:1646425028.407259:0:6589:0:(socklnd.c:2369:ksocknal_shutdown()) waiting for 2 peers to disconnect
00000800:00000200:0.0:1646425029.407153:0:6589:0:(socklnd.c:2369:ksocknal_shutdown()) waiting for 2 peers to disconnect
00000800:00000200:0.0:1646425030.407047:0:6589:0:(socklnd.c:2369:ksocknal_shutdown()) waiting for 2 peers to disconnect
00000800:00000200:0.0:1646425031.407210:0:6589:0:(socklnd.c:2369:ksocknal_shutdown()) waiting for 2 peers to disconnect
It seems more appropriate to have lnet_parse() return a negative value in this case, to signal to LNDs that the parse failed. They can then handle it the same way as EPROTO and similar errors.
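To make the refcount imbalance concrete, here is a minimal userspace model of the receive path. It is only a sketch: mock_conn, mock_parse_shutdown() and mock_process_receive() are hypothetical stand-ins rather than LNet or ksocklnd code, and the model assumes that on a genuine success the parse reference is released later by someone other than the caller, which is what the shutdown path skips.

```c
#include <assert.h>
#include <errno.h>

#ifndef ESHUTDOWN
#define ESHUTDOWN 108	/* fallback for platforms whose errno.h lacks it */
#endif

/* Hypothetical stand-in for ksock_conn: only the refcount is modelled. */
struct mock_conn {
	int refs;
};

static void mock_conn_addref(struct mock_conn *c)
{
	c->refs++;
}

static void mock_conn_decref(struct mock_conn *c)
{
	assert(c->refs > 0);
	c->refs--;
}

/*
 * Hypothetical stand-in for lnet_parse() on the shutdown path.
 * signal_failure == 0 models the behaviour described in this ticket
 * (return 0 even though the message was not processed);
 * signal_failure != 0 models the proposed behaviour (return < 0).
 */
static int mock_parse_shutdown(int signal_failure)
{
	return signal_failure ? -ESHUTDOWN : 0;
}

/* Models the caller: take a ref, parse, drop the ref only on error. */
static int mock_process_receive(struct mock_conn *conn, int signal_failure)
{
	int rc;

	mock_conn_addref(conn);		/* ++ref while parsing */

	rc = mock_parse_shutdown(signal_failure);
	if (rc < 0) {
		mock_conn_decref(conn);	/* error path releases the parse ref */
		return -EPROTO;
	}

	/*
	 * rc == 0: the caller assumes the message was accepted and that
	 * the parse ref will be released later (an assumption of this
	 * model). On the shutdown path nothing ever releases it.
	 */
	return 0;
}
```

Starting from a connection with refs == 1, calling mock_process_receive(&conn, 0) leaves refs at 2 (the leaked reference), while mock_process_receive(&conn, 1) returns -EPROTO and leaves refs at 1.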