Details
Bug
Resolution: Fixed
Minor
Description
If two peers are configured with different conns_per_peer values, the peer with the larger value continually tries to create additional connections to the peer with the smaller value. The peer with the smaller conns_per_peer rejects these connection requests.
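The effect can be illustrated with a toy model. This is only a sketch under stated assumptions, not the socklnd logic: it assumes the requesting peer retries once per round for every connection it is still missing, and that the remote peer accepts connections only up to its own limit. The function name `simulate` and its parameters are hypothetical.

```python
def simulate(wanted, accepted, rounds):
    """Toy model of a conns_per_peer mismatch (not the real socklnd code).

    wanted   -- connections the requesting peer tries to maintain (assumption)
    accepted -- connections the remote peer is willing to hold (assumption)
    rounds   -- number of retry rounds to simulate
    Returns (established, rejected).
    """
    established = 0
    rejected = 0
    for _ in range(rounds):
        # Each round the requester tries to fill the gap up to `wanted`.
        for _ in range(wanted - established):
            if established < accepted:
                established += 1
            else:
                # The remote peer refuses anything beyond its own limit.
                rejected += 1
    return established, rejected


if __name__ == "__main__":
    print("mismatch (4 vs 1):", simulate(4, 1, 10))
    print("matched  (1 vs 1):", simulate(1, 1, 10))
```

In this model the rejected count grows with every round when the limits differ and stays at zero when they match, which is the same qualitative pattern as the growing ksocknal_create_conn() count reported in this ticket.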
In this test "n00" is running 2.15 and "n03" is running 2.12. I issue a single ping from n00 to n03. Lustre is not running, so this ping is the only LNet traffic:
cassini-hosta:~ # pdsh -w n0[0,3] 'lctl --net tcp conn_list'
n03: <no connections>
n00: <no connections>
cassini-hosta:~ # lctl ping 172.18.2.4@tcp
12345-0@lo
12345-172.18.2.4@tcp
cassini-hosta:~ # pdsh -w n0[0,3] 'lctl --net tcp conn_list' | dshbak -c
----------------
n00
----------------
12345-172.18.2.4@tcp O[2]172.18.2.1->172.18.2.4:988 332800/131072 nonagle
12345-172.18.2.4@tcp I[2]172.18.2.1->172.18.2.4:988 332800/131072 nonagle
12345-172.18.2.4@tcp C[2]172.18.2.1->172.18.2.4:988 332800/131072 nonagle
----------------
n03
----------------
12345-172.18.2.1@tcp I[3]s-lmo-gaz38b->172.18.2.1:1021 332800/235392 nonagle
12345-172.18.2.1@tcp O[3]s-lmo-gaz38b->172.18.2.1:1022 332800/235392 nonagle
12345-172.18.2.1@tcp C[3]s-lmo-gaz38b->172.18.2.1:1023 332800/235392 nonagle
cassini-hosta:~ # lctl dk > /tmp/dk.log
cassini-hosta:~ # grep -c ksocknal_create_conn /tmp/dk.log
131
cassini-hosta:~ # sleep 30; lctl dk > /tmp/dk.log2
cassini-hosta:~ # grep -c ksocknal_create_conn /tmp/dk.log2
350
cassini-hosta:~ #
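To check how many connections of each type a node actually holds, the conn_list output above can be tallied with a small script. This is a minimal sketch: the regular expression is inferred only from the lines shown in this ticket, it does not interpret the bracketed number, and the names `CONN_RE` and `count_conns` are hypothetical.

```python
import re
from collections import Counter

# Pattern inferred from the conn_list lines in this ticket (an assumption,
# not a documented format): <peer> <type letter>[<n>]<local>-><remote>:<port> ...
CONN_RE = re.compile(
    r"^(?P<peer>\S+)\s+(?P<type>[A-Z])\[(?P<idx>\d+)\]"
    r"(?P<local>\S+)->(?P<remote>\S+):(?P<port>\d+)\s"
)


def count_conns(lines):
    """Count connections per (peer, type letter); non-matching lines are skipped."""
    counts = Counter()
    for line in lines:
        m = CONN_RE.match(line.strip())
        if m:
            counts[(m.group("peer"), m.group("type"))] += 1
    return counts


# The n00 section of the dshbak output, copied from the transcript.
sample = [
    "----------------",
    "n00",
    "----------------",
    "12345-172.18.2.4@tcp O[2]172.18.2.1->172.18.2.4:988 332800/131072 nonagle",
    "12345-172.18.2.4@tcp I[2]172.18.2.1->172.18.2.4:988 332800/131072 nonagle",
    "12345-172.18.2.4@tcp C[2]172.18.2.1->172.18.2.4:988 332800/131072 nonagle",
]

if __name__ == "__main__":
    for (peer, conn_type), n in sorted(count_conns(sample).items()):
        print(peer, conn_type, n)
```

For the sample this yields one connection per type letter, so n00 holds only three connections to n03 even though it keeps calling ksocknal_create_conn().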
conns_per_peer on n00 (the 2.15 node) is left at its default of 0, which resolves to 4 because of the link speed:
cassini-hosta:~ # lnetctl net show --net tcp -v | grep -e net -e conns
net:
    - net type: tcp
      conns_per_peer: 4
cassini-hosta:~ #
The debug log on n00 (with +net and +malloc) is mostly this sequence repeated:
00000800:00000200:24.0:1663961292.792229:0:17357:0:(socklnd.c:946:ksocknal_create_conn()) ksocknal_send_hello conn 0000000089a5bbf4 returned 0
00000800:00000200:24.0:1663961292.792396:0:17357:0:(socklnd.c:958:ksocknal_create_conn()) ksocknal_recv_hello conn 0000000089a5bbf4 returned 114
00000800:00000200:24.0:1663961292.792405:0:17357:0:(socklnd.c:1237:ksocknal_create_conn()) Not creating conn 12345-172.18.2.4@tcp(0000000089a5bbf4) type 2: lost conn race
00000800:00000010:24.0:1663961292.792407:0:17357:0:(socklnd.c:1267:ksocknal_create_conn()) kfreed 'hello': 144 at 0000000051a6b08c (tot 1492135).
00000800:00000010:24.0:1663961292.792409:0:17357:0:(socklnd.c:1269:ksocknal_create_conn()) kfreed 'conn': 4712 at 0000000089a5bbf4 (tot 1487423).
cassini-hosta:~ # grep -c 'Not creating conn' /tmp/dk.log2
50
cassini-hosta:~ # grep 'Not creating conn' /tmp/dk.log2 | grep -c 'type 2:'
50
cassini-hosta:~ #
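The two grep counts above can be generalized into a breakdown of rejected connections by peer, type, and reason. The following is a minimal sketch that assumes the "Not creating conn" message keeps the format shown in this log; the names `REJECT_RE` and `tally_rejections` are hypothetical.

```python
import re
from collections import Counter

# Matches the "Not creating conn" message as it appears in the log above:
#   Not creating conn <peer>(<conn pointer>) type <n>: <reason>
REJECT_RE = re.compile(
    r"Not creating conn (\S+?)\([0-9a-f]+\) type (\d+): (.+)$"
)


def tally_rejections(lines):
    """Count rejected connection attempts per (peer, type, reason)."""
    counts = Counter()
    for line in lines:
        m = REJECT_RE.search(line.rstrip())
        if m:
            peer, conn_type, reason = m.group(1), int(m.group(2)), m.group(3)
            counts[(peer, conn_type, reason)] += 1
    return counts


# Two lines copied from the debug log in this ticket; only the second
# one is a rejection.
sample = [
    "00000800:00000200:24.0:1663961292.792396:0:17357:0:"
    "(socklnd.c:958:ksocknal_create_conn()) "
    "ksocknal_recv_hello conn 0000000089a5bbf4 returned 114",
    "00000800:00000200:24.0:1663961292.792405:0:17357:0:"
    "(socklnd.c:1237:ksocknal_create_conn()) "
    "Not creating conn 12345-172.18.2.4@tcp(0000000089a5bbf4) type 2: lost conn race",
]

if __name__ == "__main__":
    for key, n in tally_rejections(sample).items():
        print(key, n)
```

To run it against a real dump, pass an open file instead of `sample`, for example `tally_rejections(open("/tmp/dk.log2"))`.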
If I repeat the test with conns_per_peer=1 on the 2.15 node, the extra calls to ksocknal_create_conn() no longer appear:
cassini-hosta:~ # lctl ping 172.18.2.4@tcp
12345-0@lo
12345-172.18.2.4@tcp
cassini-hosta:~ # lctl dk > /tmp/dk.log4
cassini-hosta:~ # sleep 30; lctl dk > /tmp/dk.log5
cassini-hosta:~ # grep -c ksocknal_create_conn /tmp/dk.log5
0
cassini-hosta:~ # lnetctl net show --net tcp -v | grep -e net -e conns
net:
    - net type: tcp
      conns_per_peer: 1
cassini-hosta:~ #