    Description

      This bug was introduced by:

      commit 47b7b319783f27023b0cefe54a2a2eea678284f2
      Author: Doug Oucharek <doug.s.oucharek@intel.com>
      Date:   Wed Mar 2 12:08:00 2016 +0800
      
          LU-8106 lnet: Do not drop message when shutting down LNet
      

      That change makes lnet_parse() return 0, instead of dropping the message, when the peer NI lookup fails with ESHUTDOWN. As a result, ksocknal_process_receive() is not aware that anything went wrong in lnet_parse(), so the extra ref it took on the ksock_conn is never released:

                      ksocknal_conn_addref(conn);     /* ++ref while parsing */
      
      
                      rc = lnet_parse(conn->ksnc_peer->ksnp_ni,
                                      &hdr,
                                      &conn->ksnc_peer->ksnp_id.nid,
                                      conn, 0);
                      if (rc < 0) {
                              /* I just received garbage: give up on this conn */
                              ksocknal_new_packet(conn, 0);
                              ksocknal_close_conn_and_siblings(conn, rc);
                              CDEBUG(D_NET, "pre %p %u\n", conn,
                                     refcount_read(&conn->ksnc_conn_refcount));
                              ksocknal_conn_decref(conn);
                              return (-EPROTO);
                      }
                      <<<< REF LEAKED >>>>
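
The leak can be illustrated with a small standalone model. This is only a sketch, not Lustre code: `model_conn`, `model_parse()`, `model_process_receive()` and `model_leaked_refs()` are hypothetical names, and the model assumes that on a successful parse the receive path eventually releases the parsing ref (collapsed here into an immediate decref).

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for ksock_conn: only the refcount is modelled. */
struct model_conn {
	int refs;
};

/* The three outcomes of the (modelled) parse discussed in this report. */
enum parse_case { PARSE_OK, PARSE_GARBAGE, PARSE_SHUTDOWN };

void model_conn_addref(struct model_conn *conn)
{
	conn->refs++;
}

void model_conn_decref(struct model_conn *conn)
{
	assert(conn->refs > 0);
	conn->refs--;
}

/* Hypothetical stand-in for lnet_parse(). */
int model_parse(struct model_conn *conn, enum parse_case what)
{
	switch (what) {
	case PARSE_GARBAGE:
		/* Error is reported; the caller releases the parsing ref. */
		return -EPROTO;
	case PARSE_SHUTDOWN:
		/* Behaviour described in this report: success is reported,
		 * but nothing takes over the parsing ref. */
		return 0;
	case PARSE_OK:
	default:
		/* Assumption: the receive path releases the ref later;
		 * modelled as an immediate release. */
		model_conn_decref(conn);
		return 0;
	}
}

/* Mirrors the shape of the snippet above. */
int model_process_receive(struct model_conn *conn, enum parse_case what)
{
	int rc;

	model_conn_addref(conn);	/* ++ref while parsing */
	rc = model_parse(conn, what);
	if (rc < 0) {
		model_conn_decref(conn);
		return -EPROTO;
	}
	return 0;
}

/* Number of refs left over after one receive, starting from one base ref. */
int model_leaked_refs(enum parse_case what)
{
	struct model_conn conn = { .refs = 1 };

	(void)model_process_receive(&conn, what);
	return conn.refs - 1;
}
```

In this model, `model_leaked_refs(PARSE_SHUTDOWN)` returns 1, while the other two cases return 0, which matches the single leaked ref described above.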
      

      This prevents ksocklnd from shutting down, because it waits forever for the associated peer_ni to be destroyed. One symptom is messages like these printed repeatedly to the console log:

      00000800:00000200:0.0:1646425022.404025:0:6589:0:(socklnd.c:2369:ksocknal_shutdown()) waiting for 2 peers to disconnect
      00000800:00000200:0.0:1646425023.404594:0:6589:0:(socklnd.c:2369:ksocknal_shutdown()) waiting for 2 peers to disconnect
      00000800:00000200:0.0:1646425024.405196:0:6589:0:(socklnd.c:2369:ksocknal_shutdown()) waiting for 2 peers to disconnect
      00000800:00000200:0.0:1646425025.405475:0:6589:0:(socklnd.c:2369:ksocknal_shutdown()) waiting for 2 peers to disconnect
      00000800:00000200:0.0:1646425026.406188:0:6589:0:(socklnd.c:2369:ksocknal_shutdown()) waiting for 2 peers to disconnect
      00000800:00000200:0.0:1646425027.407158:0:6589:0:(socklnd.c:2369:ksocknal_shutdown()) waiting for 2 peers to disconnect
      00000800:00000200:0.0:1646425028.407259:0:6589:0:(socklnd.c:2369:ksocknal_shutdown()) waiting for 2 peers to disconnect
      00000800:00000200:0.0:1646425029.407153:0:6589:0:(socklnd.c:2369:ksocknal_shutdown()) waiting for 2 peers to disconnect
      00000800:00000200:0.0:1646425030.407047:0:6589:0:(socklnd.c:2369:ksocknal_shutdown()) waiting for 2 peers to disconnect
      00000800:00000200:0.0:1646425031.407210:0:6589:0:(socklnd.c:2369:ksocknal_shutdown()) waiting for 2 peers to disconnect
      

      It seems more appropriate to have lnet_parse() return a negative value in this case to signal to LNDs that the parse failed. LNDs can then handle it the same way they handle EPROTO and other errors.
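
A minimal sketch of that proposal, again as a standalone model rather than the actual patch: `sketch_conn`, `sketch_parse()`, `sketch_process_receive()` and `sketch_refs_after()` are hypothetical names, `lookup_rc` stands in for the result of the peer NI lookup, and the success path is assumed to release the parsing ref later (modelled as an immediate release).

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for ksock_conn: only the refcount is modelled. */
struct sketch_conn {
	int refs;
};

/* Hypothetical stand-in for lnet_parse() with the proposed behaviour:
 * a failed peer NI lookup, including -ESHUTDOWN, is returned as an error
 * instead of being reported as success. */
int sketch_parse(int lookup_rc)
{
	if (lookup_rc < 0)
		return lookup_rc;
	return 0;
}

/* Caller shaped like the snippet in this report: any negative rc takes
 * the existing error path, so the parsing ref is always released. */
int sketch_process_receive(struct sketch_conn *conn, int lookup_rc)
{
	int rc;

	conn->refs++;			/* ++ref while parsing */
	rc = sketch_parse(lookup_rc);
	if (rc < 0) {
		assert(conn->refs > 0);
		conn->refs--;		/* same handling as for garbage */
		return -EPROTO;
	}

	/* Assumption: the receive path releases the ref later;
	 * modelled as an immediate release. */
	conn->refs--;
	return 0;
}

/* Refcount after one receive, starting from a single base ref. */
int sketch_refs_after(int lookup_rc)
{
	struct sketch_conn conn = { .refs = 1 };

	(void)sketch_process_receive(&conn, lookup_rc);
	return conn.refs;
}
```

With the error propagated, `sketch_refs_after(-ESHUTDOWN)` and `sketch_refs_after(0)` both leave the connection at its base refcount of 1, so the LND needs no shutdown-specific handling.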

            [LU-15618] ksock_conn ref leak on shutdown

            People

              hornc Chris Horn