[LU-15618] ksock_conn ref leak on shutdown Created: 04/Mar/22  Updated: 28/Nov/22  Resolved: 11/Jun/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Minor
Reporter: Chris Horn Assignee: Chris Horn
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
is duplicated by LU-12148 conf-sanity test_64: timed out Closed
is duplicated by LU-13218 conf-sanity test 98 hangs in socknal_... Closed
Related
is related to LU-8106 kiblnd_pool_alloc_node() crashed beca... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This is a bug with:

commit 47b7b319783f27023b0cefe54a2a2eea678284f2
Author: Doug Oucharek <doug.s.oucharek@intel.com>
Date:   Wed Mar 2 12:08:00 2016 +0800

    LU-8106 lnet: Do not drop message when shutting down LNet

That changes makes it so that if we fail to lookup the peer NI w/ESHUTDOWN, then lnet_parse() returns 0 instead of dropping the message. This can lead to a situation where ksocknal_process_receive() isn't aware that anything went wrong with lnet_parse() and so an extra ref is left on the ksock_conn.

                ksocknal_conn_addref(conn);     /* ++ref while parsing */


                rc = lnet_parse(conn->ksnc_peer->ksnp_ni,
                                &hdr,
                                &conn->ksnc_peer->ksnp_id.nid,
                                conn, 0);
                if (rc < 0) {
                        /* I just received garbage: give up on this conn */
                        ksocknal_new_packet(conn, 0);
                        ksocknal_close_conn_and_siblings(conn, rc);
                        CDEBUG(D_NET, "pre %p %u\n", conn,
                               refcount_read(&conn->ksnc_conn_refcount));
                        ksocknal_conn_decref(conn);
                        return (-EPROTO);
                }
                <<<< REF LEAKED >>>>

This prevents ksocklnd from shutting down, as it gets stuck waiting forever for the associated peer_ni to be destroyed. A symptom of this could be message like this printed to the console log:

00000800:00000200:0.0:1646425022.404025:0:6589:0:(socklnd.c:2369:ksocknal_shutdown()) waiting for 2 peers to disconnect
00000800:00000200:0.0:1646425023.404594:0:6589:0:(socklnd.c:2369:ksocknal_shutdown()) waiting for 2 peers to disconnect
00000800:00000200:0.0:1646425024.405196:0:6589:0:(socklnd.c:2369:ksocknal_shutdown()) waiting for 2 peers to disconnect
00000800:00000200:0.0:1646425025.405475:0:6589:0:(socklnd.c:2369:ksocknal_shutdown()) waiting for 2 peers to disconnect
00000800:00000200:0.0:1646425026.406188:0:6589:0:(socklnd.c:2369:ksocknal_shutdown()) waiting for 2 peers to disconnect
00000800:00000200:0.0:1646425027.407158:0:6589:0:(socklnd.c:2369:ksocknal_shutdown()) waiting for 2 peers to disconnect
00000800:00000200:0.0:1646425028.407259:0:6589:0:(socklnd.c:2369:ksocknal_shutdown()) waiting for 2 peers to disconnect
00000800:00000200:0.0:1646425029.407153:0:6589:0:(socklnd.c:2369:ksocknal_shutdown()) waiting for 2 peers to disconnect
00000800:00000200:0.0:1646425030.407047:0:6589:0:(socklnd.c:2369:ksocknal_shutdown()) waiting for 2 peers to disconnect
00000800:00000200:0.0:1646425031.407210:0:6589:0:(socklnd.c:2369:ksocknal_shutdown()) waiting for 2 peers to disconnect

It seems more appropriate to have lnet_parse() return <0 in this case to signal to LNDs that the parse failed. They can handle it in the same way as EPROTO errors, etc.



 Comments   
Comment by Gerrit Updater [ 04/Mar/22 ]

"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/46711
Subject: LU-15618 lnet: Return ESHUTDOWN in lnet_parse()
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 88092ffe31af313abefd70df88d419a8f0651954

Comment by Cyril Bordage [ 07/Mar/22 ]

Is this only for ksock?

Comment by Chris Horn [ 07/Mar/22 ]

The change impacts all LNDs, but I haven't observed or figured out if this bug causes a problem with other LNDs.

Comment by Chris Horn [ 07/Mar/22 ]

If I run the reproducer with o2iblnd I see similar symptom, but I haven't root caused it to the same bug.

Comment by Chris Horn [ 11/Mar/22 ]

If I run the reproducer on o2iblnd with the fix applied then I can't reproduce the symptoms. So it seems like the fix applies for ko2iblnd as well.

Comment by Andreas Dilger [ 06/Jun/22 ]

Chris, any idea why this started failing recently, even though the patch is 6y old?

Comment by Gerrit Updater [ 11/Jun/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/46711/
Subject: LU-15618 lnet: Return ESHUTDOWN in lnet_parse()
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 4fbd0705a3d25bbc85e953f81e697e5006b215ce

Comment by Peter Jones [ 11/Jun/22 ]

Landed for 2.16

Comment by Gerrit Updater [ 28/Nov/22 ]

"Olaf Faaland <faaland1@llnl.gov>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49259
Subject: LU-15618 lnet: Return ESHUTDOWN in lnet_parse()
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: 28788d5e91eae26dc562f085f8577fd8c2813718

Generated at Sat Feb 10 03:19:52 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.