[LU-16191] ksocklnd tries to open connections forever if there is a mismatch between conns_per_peer Created: 26/Sep/22 Updated: 17/Feb/23 Resolved: 17/Feb/23 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Chris Horn | Assignee: | Serguei Smirnov |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
If there is a mismatch in conns_per_peer between two peers, the peer with the larger conns_per_peer will continually try to create additional connections to the peer with the lower conns_per_peer. These connection requests are rejected by the peer with the lower conns_per_peer.

In this test "n00" is running 2.15 and "n03" is running 2.12. I'm issuing a single ping from n00 to n03. There is no Lustre mounted, so there is no LNet traffic other than this single ping:

cassini-hosta:~ # pdsh -w n0[0,3] 'lctl --net tcp conn_list'
n03: <no connections>
n00: <no connections>
cassini-hosta:~ # lctl ping 172.18.2.4@tcp
12345-0@lo
12345-172.18.2.4@tcp
cassini-hosta:~ # pdsh -w n0[0,3] 'lctl --net tcp conn_list' | dshbak -c
----------------
n00
----------------
12345-172.18.2.4@tcp O[2]172.18.2.1->172.18.2.4:988 332800/131072 nonagle
12345-172.18.2.4@tcp I[2]172.18.2.1->172.18.2.4:988 332800/131072 nonagle
12345-172.18.2.4@tcp C[2]172.18.2.1->172.18.2.4:988 332800/131072 nonagle
----------------
n03
----------------
12345-172.18.2.1@tcp I[3]s-lmo-gaz38b->172.18.2.1:1021 332800/235392 nonagle
12345-172.18.2.1@tcp O[3]s-lmo-gaz38b->172.18.2.1:1022 332800/235392 nonagle
12345-172.18.2.1@tcp C[3]s-lmo-gaz38b->172.18.2.1:1023 332800/235392 nonagle
cassini-hosta:~ # lctl dk > /tmp/dk.log
cassini-hosta:~ # grep -c ksocknal_create_conn /tmp/dk.log
131
cassini-hosta:~ # sleep 30; lctl dk > /tmp/dk.log2
cassini-hosta:~ # grep -c ksocknal_create_conn /tmp/dk.log2
350
cassini-hosta:~ #

conns_per_peer on n00 (the 2.15 node) is default 0 and ends up at 4 because of link speed:

cassini-hosta:~ # lnetctl net show --net tcp -v | grep -e net -e conns
net:
- net type: tcp
conns_per_peer: 4
cassini-hosta:~ #
The debug log on n00 is mostly just this, repeated (with +net and +malloc):

00000800:00000200:24.0:1663961292.792229:0:17357:0:(socklnd.c:946:ksocknal_create_conn()) ksocknal_send_hello conn 0000000089a5bbf4 returned 0
00000800:00000200:24.0:1663961292.792396:0:17357:0:(socklnd.c:958:ksocknal_create_conn()) ksocknal_recv_hello conn 0000000089a5bbf4 returned 114
00000800:00000200:24.0:1663961292.792405:0:17357:0:(socklnd.c:1237:ksocknal_create_conn()) Not creating conn 12345-172.18.2.4@tcp(0000000089a5bbf4) type 2: lost conn race
00000800:00000010:24.0:1663961292.792407:0:17357:0:(socklnd.c:1267:ksocknal_create_conn()) kfreed 'hello': 144 at 0000000051a6b08c (tot 1492135).
00000800:00000010:24.0:1663961292.792409:0:17357:0:(socklnd.c:1269:ksocknal_create_conn()) kfreed 'conn': 4712 at 0000000089a5bbf4 (tot 1487423).

cassini-hosta:~ # grep -c 'Not creating conn' /tmp/dk.log2
50
cassini-hosta:~ # grep 'Not creating conn' /tmp/dk.log2 | grep -c 'type 2:'
50
cassini-hosta:~ #

If I repeat the test with conns_per_peer=1 on the 2.15 node, we don't see the extra calls to ksocknal_create_conn():

cassini-hosta:~ # lctl ping 172.18.2.4@tcp
12345-0@lo
12345-172.18.2.4@tcp
cassini-hosta:~ # lctl dk > /tmp/dk.log4
cassini-hosta:~ # sleep 30; lctl dk > /tmp/dk.log5
cassini-hosta:~ # grep -c ksocknal_create_conn /tmp/dk.log5
0
cassini-hosta:~ # lnetctl net show --net tcp -v | grep -e net -e conns
net:
- net type: tcp
conns_per_peer: 1
cassini-hosta:~ #
|
| Comments |
| Comment by Andreas Dilger [ 26/Sep/22 ] |
|
It definitely doesn't make sense to try this hard to add extra connections. In addition to the conns_per_peer mismatch (which might also happen in the future if the speed-to-conns heuristic changes), there could be TCP port limits if there are too many peers.
If the second or later connection to a peer fails to be established then the node shouldn't loop trying to reconnect. Maybe retry with exponential backoff? |
| Comment by Gerrit Updater [ 26/Sep/22 ] |
|
"Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/48664 |
| Comment by Gerrit Updater [ 10/Oct/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/48664/ |
| Comment by Peter Jones [ 10/Oct/22 ] |
|
Landed for 2.16 |
| Comment by James A Simmons [ 18/Oct/22 ] |
|
With the latest 2.15 LTS with these patches we still hit this bug on our production machine:

[408577.737098] LNetError: 16137:0:(socklnd_cb.c:1985:ksocknal_connect()) ASSERTION( (wanted & (1 << 3)) != 0 ) failed:
[408577.748695] LNetError: 16137:0:(socklnd_cb.c:1985:ksocknal_connect()) LBUG
[408577.756620] Kernel panic - not syncing: LBUG in interrupt.
[408577.765600] CPU: 14 PID: 16137 Comm: socknal_cd01 Kdump: loaded Tainted: P OE ------------ T 3.10.0-1160.71.1.el7.x86_64 #1
[408577.779586] Hardware name: Dell Inc. PowerEdge R640/0RGP26, BIOS 2.14.2 03/21/2022
[408577.788151] Call Trace:
[408577.791600] [<ffffffffa35865c9>] dump_stack+0x19/0x1b
[408577.797718] [<ffffffffa35802d1>] panic+0xe8/0x21f
[408577.803495] [<ffffffffc0d0b8bd>] lbug_with_loc+0x8d/0xa0 [libcfs]
[408577.810638] [<ffffffffc0e2ffbb>] ksocknal_connd+0xc4b/0xd80 [ksocklnd]
[408577.818201] [<ffffffffa2edaf20>] ? wake_up_state+0x20/0x20
[408577.824710] [<ffffffffc0e2f370>] ? ksocknal_thread_fini+0x30/0x30 [ksocklnd]
[408577.832775] [<ffffffffa2ec5f91>] kthread+0xd1/0xe0
[408577.838575] [<ffffffffa2ec5ec0>] ? insert_kthread_work+0x40/0x40
[408577.845571] [<ffffffffa3599ddd>] ret_from_fork_nospec_begin+0x7/0x21
[408577.852903] [<ffffffffa2ec5ec0>] ? insert_kthread_work+0x40/0x40 |
| Comment by Chris Horn [ 18/Oct/22 ] |
|
jsimmons That looks like a new issue. Are you saying it is a regression with this patch? |
| Comment by James A Simmons [ 18/Oct/22 ] |
|
We saw this before the patches of this work landed. Since this work touched the same area, I was hoping it would be fixed here. Do you think it's a separate issue? |
| Comment by Chris Horn [ 18/Oct/22 ] |
|
Yes, this ticket is about behavior when conns_per_peer does not match between two peers. It is not about any ASSERTION. |
| Comment by Serguei Smirnov [ 18/Oct/22 ] |
|
James, does the build you're using include this fix?: https://review.whamcloud.com/47361 |
| Comment by James A Simmons [ 18/Oct/22 ] |
|
Let me try that. Will report back. |
| Comment by James A Simmons [ 18/Oct/22 ] |
|
Doesn't apply for 2.12. The dynamic conns_per_peer doesn't exist in 2.12. |
| Comment by Chris Horn [ 17/Feb/23 ] |
|
pjones I think this can be closed |
| Comment by Peter Jones [ 17/Feb/23 ] |
|
ok thanks - feel free to close things yourself in future though - you've got permissions |