[LU-16191] ksocklnd tries to open connections forever if there is a conns_per_peer mismatch between peers Created: 26/Sep/22  Updated: 17/Feb/23  Resolved: 17/Feb/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Minor
Reporter: Chris Horn Assignee: Serguei Smirnov
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

If two peers have mismatched conns_per_peer settings, the peer with the larger conns_per_peer will continually try to create additional connections to the peer with the lower conns_per_peer, and those connection requests are rejected by the peer with the lower value.
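
As a rough standalone model of that loop (an illustration only, not actual ksocklnd code, with made-up limits), the connecting side keeps computing how many connections it still wants and retries the difference, while the remote side never accepts more than its own limit, so the count never converges:

/* toy model of the retry loop described above; not ksocklnd code */
#include <stdio.h>

int main(void)
{
    int local_conns_per_peer = 4;   /* the side with the larger setting */
    int remote_limit = 3;           /* the side that accepts fewer conns */
    int established = 0;
    int attempts = 0;

    for (int round = 0; round < 10; round++) {
        int wanted = local_conns_per_peer - established;

        for (int i = 0; i < wanted; i++) {
            attempts++;
            if (established < remote_limit)
                established++;      /* accepted */
            /* else: rejected by the remote, retried next round */
        }
    }
    printf("established=%d after %d connect attempts (and still retrying)\n",
           established, attempts);
    return 0;
}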

In this test "n00" is running 2.15 and "n03" is running 2.12. I'm issuing a single ping from n00 to n03. There is no Lustre, so there is no other LNet traffic other than this single ping:

cassini-hosta:~ # pdsh -w n0[0,3] 'lctl --net tcp conn_list'
n03: <no connections>
n00: <no connections>
cassini-hosta:~ # lctl ping 172.18.2.4@tcp
12345-0@lo
12345-172.18.2.4@tcp
cassini-hosta:~ # pdsh -w n0[0,3] 'lctl --net tcp conn_list' | dshbak -c
----------------
n00
----------------
12345-172.18.2.4@tcp O[2]172.18.2.1->172.18.2.4:988 332800/131072 nonagle
12345-172.18.2.4@tcp I[2]172.18.2.1->172.18.2.4:988 332800/131072 nonagle
12345-172.18.2.4@tcp C[2]172.18.2.1->172.18.2.4:988 332800/131072 nonagle
----------------
n03
----------------
12345-172.18.2.1@tcp I[3]s-lmo-gaz38b->172.18.2.1:1021 332800/235392 nonagle
12345-172.18.2.1@tcp O[3]s-lmo-gaz38b->172.18.2.1:1022 332800/235392 nonagle
12345-172.18.2.1@tcp C[3]s-lmo-gaz38b->172.18.2.1:1023 332800/235392 nonagle
cassini-hosta:~ # lctl dk > /tmp/dk.log
cassini-hosta:~ # grep -c ksocknal_create_conn /tmp/dk.log
131
cassini-hosta:~ # sleep 30; lctl dk > /tmp/dk.log2
cassini-hosta:~ # grep -c ksocknal_create_conn /tmp/dk.log2
350
cassini-hosta:~ #

conns_per_peer on n00 (the 2.15 node) is left at the default of 0, which resolves to 4 based on link speed:

cassini-hosta:~ # lnetctl net show --net tcp -v | grep -e net -e conns
net:
    - net type: tcp
              conns_per_peer: 4
cassini-hosta:~ #
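
A purely hypothetical sketch of what such a speed-based default might look like (the thresholds below are invented for illustration and are not the real ksocklnd heuristic): with the tunable left at 0, a value is derived from the interface link speed, which is how n00 arrives at 4 without any explicit setting.

#include <stdio.h>

/* hypothetical speed -> conns_per_peer mapping, for illustration only */
static int default_conns_per_peer(unsigned int speed_mbps)
{
    if (speed_mbps >= 100000)       /* invented threshold */
        return 4;
    if (speed_mbps >= 40000)        /* invented threshold */
        return 2;
    return 1;
}

int main(void)
{
    printf("100GbE -> %d\n", default_conns_per_peer(100000));
    printf("10GbE  -> %d\n", default_conns_per_peer(10000));
    return 0;
}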

The debug log on n00 (with +net and +malloc enabled) is mostly this sequence repeated:

00000800:00000200:24.0:1663961292.792229:0:17357:0:(socklnd.c:946:ksocknal_create_conn()) ksocknal_send_hello conn 0000000089a5bbf4 returned 0
00000800:00000200:24.0:1663961292.792396:0:17357:0:(socklnd.c:958:ksocknal_create_conn()) ksocknal_recv_hello conn 0000000089a5bbf4 returned 114
00000800:00000200:24.0:1663961292.792405:0:17357:0:(socklnd.c:1237:ksocknal_create_conn()) Not creating conn 12345-172.18.2.4@tcp(0000000089a5bbf4) type 2: lost conn race
00000800:00000010:24.0:1663961292.792407:0:17357:0:(socklnd.c:1267:ksocknal_create_conn()) kfreed 'hello': 144 at 0000000051a6b08c (tot 1492135).
00000800:00000010:24.0:1663961292.792409:0:17357:0:(socklnd.c:1269:ksocknal_create_conn()) kfreed 'conn': 4712 at 0000000089a5bbf4 (tot 1487423).

cassini-hosta:~ # grep -c 'Not creating conn' /tmp/dk.log2
50
cassini-hosta:~ # grep 'Not creating conn' /tmp/dk.log2 | grep -c 'type 2:'
50
cassini-hosta:~ #

If I repeat the test with conns_per_peer=1 on the 2.15 node, we don't see the extra calls to ksocknal_create_conn():

cassini-hosta:~ # lctl ping 172.18.2.4@tcp
12345-0@lo
12345-172.18.2.4@tcp
cassini-hosta:~ # lctl dk > /tmp/dk.log4
cassini-hosta:~ # sleep 30; lctl dk > /tmp/dk.log5
cassini-hosta:~ # grep -c ksocknal_create_conn /tmp/dk.log5
0
cassini-hosta:~ # lnetctl net show --net tcp -v | grep -e net -e conns
net:
    - net type: tcp
              conns_per_peer: 1
cassini-hosta:~ #


 Comments   
Comment by Andreas Dilger [ 26/Sep/22 ]

It definitely doesn't make sense to try too hard to add extra connections. In addition to the conns_per_peer mismatch (which might also happen in the future if the speed->conns heuristic changes), there could be TCP port limits if there are too many peers.

If the second or a later connection to a peer fails to be established, the node shouldn't loop trying to reconnect. Maybe retry with exponential backoff?
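
For illustration, a minimal standalone sketch of that exponential-backoff idea (hypothetical names and values, not the actual ksocklnd change): each failed attempt to add an extra connection doubles the delay before the next attempt up to a cap, and the delay resets once a connection is established.

#include <stdio.h>

#define MIN_RETRY_DELAY_SEC 1
#define MAX_RETRY_DELAY_SEC 64

struct peer_retry_state {
    unsigned int next_delay;        /* seconds to wait before retrying */
};

static void note_connect_failed(struct peer_retry_state *p)
{
    if (p->next_delay == 0)
        p->next_delay = MIN_RETRY_DELAY_SEC;
    else if (p->next_delay < MAX_RETRY_DELAY_SEC)
        p->next_delay *= 2;
}

static void note_connect_succeeded(struct peer_retry_state *p)
{
    p->next_delay = 0;              /* start over once a connection sticks */
}

int main(void)
{
    struct peer_retry_state p = { 0 };

    for (int i = 0; i < 8; i++) {
        note_connect_failed(&p);
        printf("attempt %d failed, next retry in %us\n", i + 1, p.next_delay);
    }
    note_connect_succeeded(&p);
    printf("connection established, delay reset to %us\n", p.next_delay);
    return 0;
}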

Comment by Gerrit Updater [ 26/Sep/22 ]

"Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/48664
Subject: LU-16191 socklnd: limit retries on conns_per_peer mismatch
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: ca622a6d036b5c2f3272184e10d30906370227ea

Comment by Gerrit Updater [ 10/Oct/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/48664/
Subject: LU-16191 socklnd: limit retries on conns_per_peer mismatch
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: da893c6c9707ca3b2e7532d05f754fccf1cffc74

Comment by Peter Jones [ 10/Oct/22 ]

Landed for 2.16

Comment by James A Simmons [ 18/Oct/22 ]

With the latest 2.15 LTS plus these patches, we still hit this bug on our production machine:

[408577.737098] LNetError: 16137:0:(socklnd_cb.c:1985:ksocknal_connect()) ASSERTION( (wanted & (1 << 3)) != 0 ) failed:
[408577.748695] LNetError: 16137:0:(socklnd_cb.c:1985:ksocknal_connect()) LBUG
[408577.756620] Kernel panic - not syncing: LBUG in interrupt.
[408577.765600] CPU: 14 PID: 16137 Comm: socknal_cd01 Kdump: loaded Tainted: P OE ------------ T 3.10.0-1160.71.1.el7.x86_64 #1
[408577.779586] Hardware name: Dell Inc. PowerEdge R640/0RGP26, BIOS 2.14.2 03/21/2022
[408577.788151] Call Trace:
[408577.791600] [<ffffffffa35865c9>] dump_stack+0x19/0x1b
[408577.797718] [<ffffffffa35802d1>] panic+0xe8/0x21f
[408577.803495] [<ffffffffc0d0b8bd>] lbug_with_loc+0x8d/0xa0 [libcfs]
[408577.810638] [<ffffffffc0e2ffbb>] ksocknal_connd+0xc4b/0xd80 [ksocklnd]
[408577.818201] [<ffffffffa2edaf20>] ? wake_up_state+0x20/0x20
[408577.824710] [<ffffffffc0e2f370>] ? ksocknal_thread_fini+0x30/0x30 [ksocklnd]
[408577.832775] [<ffffffffa2ec5f91>] kthread+0xd1/0xe0
[408577.838575] [<ffffffffa2ec5ec0>] ? insert_kthread_work+0x40/0x40
[408577.845571] [<ffffffffa3599ddd>] ret_from_fork_nospec_begin+0x7/0x21
[408577.852903] [<ffffffffa2ec5ec0>] ? insert_kthread_work+0x40/0x40

Comment by Chris Horn [ 18/Oct/22 ]

jsimmons That looks like a new issue. Are you saying it is a regression with this patch?

Comment by James A Simmons [ 18/Oct/22 ]

We saw this before the patches from this work landed. Since the work touches this area, I was hoping it would be fixed. Do you think it's a separate issue?

Comment by Chris Horn [ 18/Oct/22 ]

Yes, this ticket is about behavior when conns_per_peer does not match between two peers. It is not about any ASSERTION.

Comment by Serguei Smirnov [ 18/Oct/22 ]

James, does the build you're using include this fix: https://review.whamcloud.com/47361?

Comment by James A Simmons [ 18/Oct/22 ]

Let me try that. Will report back.

Comment by James A Simmons [ 18/Oct/22 ]

Doesn't apply to 2.12. The dynamic conns_per_peer doesn't exist in 2.12.

Comment by Chris Horn [ 17/Feb/23 ]

pjones I think this can be closed

Comment by Peter Jones [ 17/Feb/23 ]

ok thanks - feel free to close things yourself in future though - you've got permissions
