Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-15618

ksock_conn ref leak on shutdown

    XMLWordPrintable

Details

    • 3
    • 9223372036854775807

    Description

      This is a bug with:

      commit 47b7b319783f27023b0cefe54a2a2eea678284f2
      Author: Doug Oucharek <doug.s.oucharek@intel.com>
      Date:   Wed Mar 2 12:08:00 2016 +0800
      
          LU-8106 lnet: Do not drop message when shutting down LNet
      

      That changes makes it so that if we fail to lookup the peer NI w/ESHUTDOWN, then lnet_parse() returns 0 instead of dropping the message. This can lead to a situation where ksocknal_process_receive() isn't aware that anything went wrong with lnet_parse() and so an extra ref is left on the ksock_conn.

                      ksocknal_conn_addref(conn);     /* ++ref while parsing */
      
      
                      rc = lnet_parse(conn->ksnc_peer->ksnp_ni,
                                      &hdr,
                                      &conn->ksnc_peer->ksnp_id.nid,
                                      conn, 0);
                      if (rc < 0) {
                              /* I just received garbage: give up on this conn */
                              ksocknal_new_packet(conn, 0);
                              ksocknal_close_conn_and_siblings(conn, rc);
                              CDEBUG(D_NET, "pre %p %u\n", conn,
                                     refcount_read(&conn->ksnc_conn_refcount));
                              ksocknal_conn_decref(conn);
                              return (-EPROTO);
                      }
                      <<<< REF LEAKED >>>>
      

      This prevents ksocklnd from shutting down, as it gets stuck waiting forever for the associated peer_ni to be destroyed. A symptom of this could be message like this printed to the console log:

      00000800:00000200:0.0:1646425022.404025:0:6589:0:(socklnd.c:2369:ksocknal_shutdown()) waiting for 2 peers to disconnect
      00000800:00000200:0.0:1646425023.404594:0:6589:0:(socklnd.c:2369:ksocknal_shutdown()) waiting for 2 peers to disconnect
      00000800:00000200:0.0:1646425024.405196:0:6589:0:(socklnd.c:2369:ksocknal_shutdown()) waiting for 2 peers to disconnect
      00000800:00000200:0.0:1646425025.405475:0:6589:0:(socklnd.c:2369:ksocknal_shutdown()) waiting for 2 peers to disconnect
      00000800:00000200:0.0:1646425026.406188:0:6589:0:(socklnd.c:2369:ksocknal_shutdown()) waiting for 2 peers to disconnect
      00000800:00000200:0.0:1646425027.407158:0:6589:0:(socklnd.c:2369:ksocknal_shutdown()) waiting for 2 peers to disconnect
      00000800:00000200:0.0:1646425028.407259:0:6589:0:(socklnd.c:2369:ksocknal_shutdown()) waiting for 2 peers to disconnect
      00000800:00000200:0.0:1646425029.407153:0:6589:0:(socklnd.c:2369:ksocknal_shutdown()) waiting for 2 peers to disconnect
      00000800:00000200:0.0:1646425030.407047:0:6589:0:(socklnd.c:2369:ksocknal_shutdown()) waiting for 2 peers to disconnect
      00000800:00000200:0.0:1646425031.407210:0:6589:0:(socklnd.c:2369:ksocknal_shutdown()) waiting for 2 peers to disconnect
      

      It seems more appropriate to have lnet_parse() return <0 in this case to signal to LNDs that the parse failed. They can handle it in the same way as EPROTO errors, etc.

      Attachments

        Issue Links

          Activity

            People

              hornc Chris Horn
              hornc Chris Horn
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: