Details

    • 3
    • 9223372036854775807

    Description

      This is a bug with:

      commit 47b7b319783f27023b0cefe54a2a2eea678284f2
      Author: Doug Oucharek <doug.s.oucharek@intel.com>
      Date:   Wed Mar 2 12:08:00 2016 +0800
      
          LU-8106 lnet: Do not drop message when shutting down LNet
      

      That changes makes it so that if we fail to lookup the peer NI w/ESHUTDOWN, then lnet_parse() returns 0 instead of dropping the message. This can lead to a situation where ksocknal_process_receive() isn't aware that anything went wrong with lnet_parse() and so an extra ref is left on the ksock_conn.

                      ksocknal_conn_addref(conn);     /* ++ref while parsing */
      
      
                      rc = lnet_parse(conn->ksnc_peer->ksnp_ni,
                                      &hdr,
                                      &conn->ksnc_peer->ksnp_id.nid,
                                      conn, 0);
                      if (rc < 0) {
                              /* I just received garbage: give up on this conn */
                              ksocknal_new_packet(conn, 0);
                              ksocknal_close_conn_and_siblings(conn, rc);
                              CDEBUG(D_NET, "pre %p %u\n", conn,
                                     refcount_read(&conn->ksnc_conn_refcount));
                              ksocknal_conn_decref(conn);
                              return (-EPROTO);
                      }
                      <<<< REF LEAKED >>>>
      

      This prevents ksocklnd from shutting down, as it gets stuck waiting forever for the associated peer_ni to be destroyed. A symptom of this could be message like this printed to the console log:

      00000800:00000200:0.0:1646425022.404025:0:6589:0:(socklnd.c:2369:ksocknal_shutdown()) waiting for 2 peers to disconnect
      00000800:00000200:0.0:1646425023.404594:0:6589:0:(socklnd.c:2369:ksocknal_shutdown()) waiting for 2 peers to disconnect
      00000800:00000200:0.0:1646425024.405196:0:6589:0:(socklnd.c:2369:ksocknal_shutdown()) waiting for 2 peers to disconnect
      00000800:00000200:0.0:1646425025.405475:0:6589:0:(socklnd.c:2369:ksocknal_shutdown()) waiting for 2 peers to disconnect
      00000800:00000200:0.0:1646425026.406188:0:6589:0:(socklnd.c:2369:ksocknal_shutdown()) waiting for 2 peers to disconnect
      00000800:00000200:0.0:1646425027.407158:0:6589:0:(socklnd.c:2369:ksocknal_shutdown()) waiting for 2 peers to disconnect
      00000800:00000200:0.0:1646425028.407259:0:6589:0:(socklnd.c:2369:ksocknal_shutdown()) waiting for 2 peers to disconnect
      00000800:00000200:0.0:1646425029.407153:0:6589:0:(socklnd.c:2369:ksocknal_shutdown()) waiting for 2 peers to disconnect
      00000800:00000200:0.0:1646425030.407047:0:6589:0:(socklnd.c:2369:ksocknal_shutdown()) waiting for 2 peers to disconnect
      00000800:00000200:0.0:1646425031.407210:0:6589:0:(socklnd.c:2369:ksocknal_shutdown()) waiting for 2 peers to disconnect
      

      It seems more appropriate to have lnet_parse() return <0 in this case to signal to LNDs that the parse failed. They can handle it in the same way as EPROTO errors, etc.

      Attachments

        Issue Links

          Activity

            [LU-15618] ksock_conn ref leak on shutdown

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49259/
            Subject: LU-15618 lnet: Return ESHUTDOWN in lnet_parse()
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set:
            Commit: bd7d65bf7a868d914484e43f7fbcb0b42d7d9e25

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49259/ Subject: LU-15618 lnet: Return ESHUTDOWN in lnet_parse() Project: fs/lustre-release Branch: b2_15 Current Patch Set: Commit: bd7d65bf7a868d914484e43f7fbcb0b42d7d9e25

            "Olaf Faaland <faaland1@llnl.gov>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49259
            Subject: LU-15618 lnet: Return ESHUTDOWN in lnet_parse()
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set: 1
            Commit: 28788d5e91eae26dc562f085f8577fd8c2813718

            gerrit Gerrit Updater added a comment - "Olaf Faaland <faaland1@llnl.gov>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49259 Subject: LU-15618 lnet: Return ESHUTDOWN in lnet_parse() Project: fs/lustre-release Branch: b2_15 Current Patch Set: 1 Commit: 28788d5e91eae26dc562f085f8577fd8c2813718
            pjones Peter Jones added a comment -

            Landed for 2.16

            pjones Peter Jones added a comment - Landed for 2.16

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/46711/
            Subject: LU-15618 lnet: Return ESHUTDOWN in lnet_parse()
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 4fbd0705a3d25bbc85e953f81e697e5006b215ce

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/46711/ Subject: LU-15618 lnet: Return ESHUTDOWN in lnet_parse() Project: fs/lustre-release Branch: master Current Patch Set: Commit: 4fbd0705a3d25bbc85e953f81e697e5006b215ce

            Chris, any idea why this started failing recently, even though the patch is 6y old?

            adilger Andreas Dilger added a comment - Chris, any idea why this started failing recently, even though the patch is 6y old?
            hornc Chris Horn added a comment -

            If I run the reproducer on o2iblnd with the fix applied then I can't reproduce the symptoms. So it seems like the fix applies for ko2iblnd as well.

            hornc Chris Horn added a comment - If I run the reproducer on o2iblnd with the fix applied then I can't reproduce the symptoms. So it seems like the fix applies for ko2iblnd as well.
            hornc Chris Horn added a comment -

            If I run the reproducer with o2iblnd I see similar symptom, but I haven't root caused it to the same bug.

            hornc Chris Horn added a comment - If I run the reproducer with o2iblnd I see similar symptom, but I haven't root caused it to the same bug.
            hornc Chris Horn added a comment -

            The change impacts all LNDs, but I haven't observed or figured out if this bug causes a problem with other LNDs.

            hornc Chris Horn added a comment - The change impacts all LNDs, but I haven't observed or figured out if this bug causes a problem with other LNDs.

            Is this only for ksock?

            cbordage Cyril Bordage added a comment - Is this only for ksock?

            "Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/46711
            Subject: LU-15618 lnet: Return ESHUTDOWN in lnet_parse()
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 88092ffe31af313abefd70df88d419a8f0651954

            gerrit Gerrit Updater added a comment - "Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/46711 Subject: LU-15618 lnet: Return ESHUTDOWN in lnet_parse() Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 88092ffe31af313abefd70df88d419a8f0651954

            People

              hornc Chris Horn
              hornc Chris Horn
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: