Lustre / LU-16191

ksocklnd tries to open connections forever if conns_per_peer differs between peers

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.16.0

    Description

      If two peers are configured with different conns_per_peer values, the peer with the larger value continually tries to create additional connections to the peer with the smaller value. The peer with the smaller value rejects these connection requests.

      In this test "n00" is running 2.15 and "n03" is running 2.12. I issue a single ping from n00 to n03. Lustre is not in use, so this single ping is the only LNet traffic:

      cassini-hosta:~ # pdsh -w n0[0,3] 'lctl --net tcp conn_list'
      n03: <no connections>
      n00: <no connections>
      cassini-hosta:~ # lctl ping 172.18.2.4@tcp
      12345-0@lo
      12345-172.18.2.4@tcp
      cassini-hosta:~ # pdsh -w n0[0,3] 'lctl --net tcp conn_list' | dshbak -c
      ----------------
      n00
      ----------------
      12345-172.18.2.4@tcp O[2]172.18.2.1->172.18.2.4:988 332800/131072 nonagle
      12345-172.18.2.4@tcp I[2]172.18.2.1->172.18.2.4:988 332800/131072 nonagle
      12345-172.18.2.4@tcp C[2]172.18.2.1->172.18.2.4:988 332800/131072 nonagle
      ----------------
      n03
      ----------------
      12345-172.18.2.1@tcp I[3]s-lmo-gaz38b->172.18.2.1:1021 332800/235392 nonagle
      12345-172.18.2.1@tcp O[3]s-lmo-gaz38b->172.18.2.1:1022 332800/235392 nonagle
      12345-172.18.2.1@tcp C[3]s-lmo-gaz38b->172.18.2.1:1023 332800/235392 nonagle
      cassini-hosta:~ # lctl dk > /tmp/dk.log
      cassini-hosta:~ # grep -c ksocknal_create_conn /tmp/dk.log
      131
      cassini-hosta:~ # sleep 30; lctl dk > /tmp/dk.log2
      cassini-hosta:~ # grep -c ksocknal_create_conn /tmp/dk.log2
      350
      cassini-hosta:~ #
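
The manual dump-and-grep steps above can be wrapped in a small loop to watch whether connection attempts keep appearing over time. This is a minimal sketch, not part of the original report: the function names count_calls and sample_calls and the DUMP_CMD override are hypothetical, and the only commands assumed are "lctl dk", grep, and sleep as used in the transcript. Each printed count covers whatever that particular dump contains.

```shell
#!/bin/sh
# Count debug-log lines that mention a function name.
# $1 = dump file, $2 = function name (defaults to ksocknal_create_conn).
count_calls() {
    # grep -c prints 0 but exits non-zero when nothing matches; mask that status.
    grep -c "${2:-ksocknal_create_conn}" "$1" || true
}

# Take ROUNDS dumps, INTERVAL seconds apart, and print the count found in each.
# DUMP_CMD is a hypothetical override for dry runs; it defaults to "lctl dk".
sample_calls() {
    interval=$1
    rounds=$2
    i=1
    while [ "$i" -le "$rounds" ]; do
        dump=$(mktemp)
        ${DUMP_CMD:-lctl dk} > "$dump"
        printf 'dump %d: %s\n' "$i" "$(count_calls "$dump")"
        rm -f "$dump"
        if [ "$i" -lt "$rounds" ]; then
            sleep "$interval"
        fi
        i=$((i + 1))
    done
}
```

For example, "sample_calls 30 3" would take three dumps 30 seconds apart. Counts that stay well above zero with no LNet traffic point to the retry behaviour described in this report.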
      

      conns_per_peer on n00 (the 2.15 node) is left at its default of 0, which resolves to 4 because of the link speed:

      cassini-hosta:~ # lnetctl net show --net tcp -v | grep -e net -e conns
      net:
          - net type: tcp
                    conns_per_peer: 4
      cassini-hosta:~ #
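
To make a mismatch visible before it shows up in the debug log, the effective value can be pulled out of the lnetctl output and compared against the value expected on the other node. This is a rough sketch: get_conns_per_peer and report_mismatch are hypothetical helper names, the parsing assumes the "conns_per_peer: N" line format shown above, and the remote value is assumed to be supplied by hand, since this report does not show how the 2.12 node exposes it.

```shell
#!/bin/sh
# Print every conns_per_peer value found in `lnetctl net show -v` output on stdin.
# With several network interfaces this may print more than one line.
get_conns_per_peer() {
    awk '$1 == "conns_per_peer:" { print $2 }'
}

# Compare a local and a remote value and report whether they differ.
report_mismatch() {
    if [ "$1" != "$2" ]; then
        echo "mismatch: local=$1 remote=$2"
    else
        echo "match: $1"
    fi
}
```

A possible use on the 2.15 node is "report_mismatch "$(lnetctl net show --net tcp -v | get_conns_per_peer)" 1", where 1 stands in for the value assumed on the peer.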
      

      The debug log on n00 (with +net and +malloc enabled) consists mostly of this sequence repeated:

      00000800:00000200:24.0:1663961292.792229:0:17357:0:(socklnd.c:946:ksocknal_create_conn()) ksocknal_send_hello conn 0000000089a5bbf4 returned 0
      00000800:00000200:24.0:1663961292.792396:0:17357:0:(socklnd.c:958:ksocknal_create_conn()) ksocknal_recv_hello conn 0000000089a5bbf4 returned 114
      00000800:00000200:24.0:1663961292.792405:0:17357:0:(socklnd.c:1237:ksocknal_create_conn()) Not creating conn 12345-172.18.2.4@tcp(0000000089a5bbf4) type 2: lost conn race
      00000800:00000010:24.0:1663961292.792407:0:17357:0:(socklnd.c:1267:ksocknal_create_conn()) kfreed 'hello': 144 at 0000000051a6b08c (tot 1492135).
      00000800:00000010:24.0:1663961292.792409:0:17357:0:(socklnd.c:1269:ksocknal_create_conn()) kfreed 'conn': 4712 at 0000000089a5bbf4 (tot 1487423).
      
      cassini-hosta:~ # grep -c 'Not creating conn' /tmp/dk.log2
      50
      cassini-hosta:~ # grep 'Not creating conn' /tmp/dk.log2 | grep -c 'type 2:'
      50
      cassini-hosta:~ #
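
The two grep counts above can be generalized into a tally of rejected attempts per peer and connection type. The sketch below assumes the "Not creating conn PEER(ptr) type N: reason" message layout seen in the log excerpt; tally_rejections is a hypothetical name, not an existing tool.

```shell
#!/bin/sh
# Tally "Not creating conn" messages in a debug dump by peer and connection type.
# $1 = dump file. Output lines have the form: COUNT PEER type N
tally_rejections() {
    awk '/Not creating conn/ {
        for (i = 2; i < NF; i++) {
            if ($i == "type") {
                t = $(i + 1); sub(/:$/, "", t)          # "2:" -> "2"
                peer = $(i - 1); sub(/\(.*$/, "", peer) # drop "(pointer)" suffix
                n[peer " type " t]++
                break
            }
        }
    }
    END { for (k in n) print n[k], k }' "$1" | sort
}
```

Run against a dump such as /tmp/dk.log2, this prints one line per peer and type, which makes it easy to see whether all rejections fall on a single connection type as they do in this test.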
      

      If I repeat the test with conns_per_peer=1 on the 2.15 node, the extra calls to ksocknal_create_conn() no longer appear:

      cassini-hosta:~ # lctl ping 172.18.2.4@tcp
      12345-0@lo
      12345-172.18.2.4@tcp
      cassini-hosta:~ # lctl dk > /tmp/dk.log4
      cassini-hosta:~ # sleep 30; lctl dk > /tmp/dk.log5
      cassini-hosta:~ # grep -c ksocknal_create_conn /tmp/dk.log5
      0
      cassini-hosta:~ # lnetctl net show --net tcp -v | grep -e net -e conns
      net:
          - net type: tcp
                    conns_per_peer: 1
      cassini-hosta:~ #
      

          People

            ssmirnov Serguei Smirnov
            hornc Chris Horn