socklnd needs improved interface selection and configuration (LU-14064)

[LU-12815] Create multiple TCP sockets per SockLND Created: 27/Sep/19  Updated: 20/Sep/22  Resolved: 18/Aug/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.15.0, Lustre 2.12.10

Type: Technical task Priority: Minor
Reporter: Andreas Dilger Assignee: Amir Shehata (Inactive)
Resolution: Fixed Votes: 2
Labels: performance

Issue Links:
Duplicate
is duplicated by LU-14293 Poor lnet/ksocklnd(?) performance on ... Resolved
Related
is related to LU-14676 Better hash distribution to different... Resolved
Rank (Obsolete): 9223372036854775807

 Description   

For high-bandwidth Ethernet interfaces (e.g. 100GigE), it would be useful to create multiple TCP connections per interface for bulk transfers in order to maximize performance (i.e. conns_per_peer=4 for socklnd, in addition to o2iblnd). socklnd already creates three separate TCP connections per peer - read, write, and small message.

For large clusters this may be problematic because of the number of TCP connections to a server, but for smaller configurations this could be very useful.
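As a sketch of the intended usage once such a tunable exists (the module parameter name matches the patches referenced later in this ticket; the value shown is only an illustrative example):

# /etc/modprobe.d/lustre.conf (illustrative value, assuming the ksocklnd
# conns_per_peer module parameter discussed in the comments below)
options ksocklnd conns_per_peer=4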



 Comments   
Comment by Amir Shehata (Inactive) [ 27/Sep/19 ]

Yes, conns_per_peer would be a good parameter to use.

I looked at ksocklnd, and currently there can be only one unique route between two peers, which in effect translates to one TCP connection between them. I don't see a reason, though, why we can't create multiple TCP connections per peer and iterate over them when selecting which connection to send on.

However, come to think of it, there is already a way to do this, albeit with a bit more configuration: create multiple virtual interfaces that use the same physical interface, then use Multi-Rail to group them. The result is that socklnd will create multiple connections, one to each of the virtual interfaces.

Ex:

ifconfig eth0:0 <ip>
ifconfig eth0:1 <ip>
ifconfig eth0:2 <ip>
ifconfig eth0:3 <ip>

lnetctl net add --net tcp --if eth0:0,eth0:1,eth0:2,eth0:3

The rest will be taken care of by the Multi-Rail algorithm.

Would that be a sufficient solution?

Comment by Shuichi Ihara [ 28/Sep/19 ]

Yes, that workaround is exactly what I did, and it did confirm the performance improvement, but setting up four logical interfaces on all clients was very annoying, and it is not a good idea to create logical interfaces only for Lustre. conns_per_peer would simplify the configuration and improve performance.

Comment by Amir Shehata (Inactive) [ 09/Oct/20 ]

Serguei is currently looking at this.

Comment by Chris Hunter (Inactive) [ 14/Oct/20 ]

Some Ethernet drivers allow alternate hashing methods to better utilize adapter receive queues for a small number of incoming TCP streams:

https://docs.mellanox.com/display/MLNXOFEDv473290/RSS+Support

Comment by Raphael Druon [ 11/Dec/20 ]

Any update on this issue?

Comment by Amir Shehata (Inactive) [ 11/Dec/20 ]

It's currently under development.

Comment by Gerrit Updater [ 19/Dec/20 ]

Serguei Smirnov (ssmirnov@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41056
Subject: LU-12815 socklnd: add conns_per_peer parameter
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: d69314d930b84927bd96351c185b7cef42073d3b

Comment by Gerrit Updater [ 08/Jan/21 ]

James Simmons (jsimmons@infradead.org) uploaded a new patch: https://review.whamcloud.com/41181
Subject: LU-12815 socklnd: add conns_per_peer parameter
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 88dd982736c6e681d395c8d5509f030aec1a3289

Comment by Andreas Dilger [ 11/Jan/21 ]

Amir or Serguei, can you please send an email to lustre-discuss (CC lustre-devel) asking if anyone there (or their users) is using the use_tcp_bonding option in production?

Comment by Andreas Dilger [ 11/Jan/21 ]

Does it make sense (in a later patch) to dynamically tune the conns_per_peer value depending on the network performance? It is always better to avoid the need for tuning if possible.

Is it possible to detect from socklnd what the underlying Ethernet device is (e.g. 100GigE) and set conns_per_peer automatically (either at startup or at runtime) unless it is otherwise specified?

Comment by Andreas Dilger [ 21/Jan/21 ]

Running ethtool on the Ethernet device reports the available and current interface speed, so it seems at least possible that we could get this same information in ksocklnd.c to set the default value of conns_per_peer based on the link speed:

# ethtool enp0s3
Settings for enp0s3:
        Supported link modes:   10baseT/Half 10baseT/Full 
                                100baseT/Half 100baseT/Full 
                                1000baseT/Full 
        Advertised link modes:  10baseT/Half 10baseT/Full 
                                100baseT/Half 100baseT/Full 
                                1000baseT/Full 
        Speed: 1000Mb/s

Running strace ethtool shows it calling ioctl(SIOCETHTOOL), which is also accessible internally via dev_ioctl():

static int ethtool_ioctl(struct net *net, struct compat_ifreq __user *ifr32)
{
        :
        ret = dev_ioctl(net, SIOCETHTOOL, &ifr, NULL);

int dev_ioctl(struct net *net, unsigned int cmd, struct ifreq *ifr, bool *need_copyout)
{
        :
        case SIOCETHTOOL:
                dev_load(net, ifr->ifr_name);
                rtnl_lock();
                ret = dev_ethtool(net, ifr);
                rtnl_unlock();

but there may be some more fine-grained method in the kernel to determine the current speed of the interface.
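
For reference, the link speed is also available to kernel code through the ethtool_link_ksettings API; a minimal sketch (the helper name and the fallback behaviour are my assumptions, not the actual patch) would look something like:

#include <linux/netdevice.h>    /* struct net_device */
#include <linux/ethtool.h>      /* __ethtool_get_link_ksettings(), SPEED_UNKNOWN */
#include <linux/rtnetlink.h>    /* rtnl_lock()/rtnl_unlock() */

/* Return the current link speed in Mb/s, or 0 if it cannot be determined. */
static unsigned int ksocknal_link_speed_mbps(struct net_device *dev)
{
        struct ethtool_link_ksettings cmd;
        int rc;

        rtnl_lock();            /* __ethtool_get_link_ksettings() requires RTNL */
        rc = __ethtool_get_link_ksettings(dev, &cmd);
        rtnl_unlock();

        if (rc < 0 || cmd.base.speed == SPEED_UNKNOWN)
                return 0;       /* caller falls back to the static default */

        return cmd.base.speed;  /* e.g. 100000 for 100GbE */
}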

I'm thinking, based on the stats from Shuichi above and LU-14293, we want about conns_per_peer=4-6 for 100GbE. The following candidate formulas provide a reasonable default value for conns_per_peer (calculations done manually):

Speed     ilog2(Gbps)   ilog2(Gbps)/2   (ilog2(Gbps)+1)/2   ilog2(Gbps/2)   ilog2(Gbps)/2+1
1Gbps     0             0               0                   0               1
2Gbps     1             0               1                   0               1
4Gbps     2             1               1                   1               2
8Gbps     3             1               2                   2               2
10Gbps    3             1               2                   2               2
16Gbps    4             2               2                   3               3
32Gbps    5             2               3                   4               3
50Gbps    5             2               3                   4               3
64Gbps    6             3               4                   5               4
100Gbps   6             3               4                   5               4
128Gbps   7             3               4                   6               4
200Gbps   7             3               4                   6               4
256Gbps   8             4               5                   7               5

I believe conns_per_peer=0 and conns_per_peer=1 are functionally equivalent. In any case, I think either the third, the fourth ("log2(speed in multiples of 2Gbps)"), or the fifth formula ("log4(Gbps)+1") provides a good starting point. That would give us conns_per_peer=4 or 5 at 100Gbps. We should probably prefer one of the lower-valued formulas ((ilog2(Gbps) + 1) / 2 or ilog2(Gbps) / 2 + 1), since we have to balance single-client performance against the number of sockets created to a server. Users can always specify a better value if they have a preference, but typically they will never touch it, so it is better to have something useful (even if not perfect for every situation) than repeated complaints about 100GbE being slow.
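
A minimal sketch of how the fifth formula could be computed in the module (the function name, the 1GbE floor, and the upper clamp are illustrative assumptions, not the landed patch):

#include <linux/log2.h>         /* ilog2() */
#include <linux/kernel.h>       /* min_t() */

/* Map link speed (Mb/s) to a default conns_per_peer using ilog2(Gbps)/2 + 1. */
static unsigned int ksocknal_speed_to_conns_per_peer(unsigned int speed_mbps)
{
        unsigned int gbps = speed_mbps / 1000;

        if (gbps <= 1)
                return 1;       /* 1GbE and below: one connection of each type */

        /* e.g. 10GbE -> 3/2 + 1 = 2, 100GbE -> 6/2 + 1 = 4, 256GbE -> 5 */
        return min_t(unsigned int, ilog2(gbps) / 2 + 1, 16);
}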

Comment by Chris Hunter (Inactive) [ 21/Jan/21 ]

The Linux kernel implements features to distribute network packets/TCP segments over multiple CPU cores. Usually the distribution is decided via a hash function based on the incoming TCP port and the sender IP address.

In theory a network interface should achieve line rate on a single TCP port with streams from multiple IP addresses.
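
For reference, the receive-side hashing behaviour can be inspected from userspace with ethtool (interface name is illustrative; output depends on the driver):

# which header fields are hashed for TCP/IPv4 flows
ethtool -n eth0 rx-flow-hash tcp4
# RX flow hash indirection table (mapping of hash buckets to receive queues)
ethtool -x eth0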

Comment by James A Simmons [ 21/Jan/21 ]

For ORNL we found conns_per_peer=8 gave the best results for 100Gbps.

Comment by Andreas Dilger [ 21/Jan/21 ]

Chris, the problem here is that without conns_per_peer there is only a single port for socklnd on a single IP address. Creating multiple sockets on the client avoids that issue, and is much less complex than creating multiple virtual interfaces. With multiple clients a server is less likely to have a problem, but if only a single client is reading/writing (which some workloads do), then performance would again be limited without multiple sockets.

Comment by Andreas Dilger [ 21/Jan/21 ]

James, looking at LU-14293 it seems like the peak performance could be hit with 6 connections, but no results were shown between 4 and 8. Also note that socklnd creates 3x TCP connections per peer (read, write, small message), to allow different tunings and avoid congestion.

It might be worthwhile to see whether multiple small message connections are useful or not, so that the total number of connections would be (2 x conns_per_peer + 1), but that optimization is probably only needed if we start running out of ports (2500 clients with 8x3 connections). I'm hoping nobody is building a giant cluster with that many Ethernet cards and not using RoCE or similar.

Comment by James A Simmons [ 21/Jan/21 ]

You mean like our next machine

Comment by Serguei Smirnov [ 21/Jan/21 ]

Andreas, 

To clarify, the socklnd conns_per_peer currently results in 2 x conns_per_peer + 1 TCP connections per peer, as only the bulk_in and bulk_out conn types are multiplied.
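(For example, with conns_per_peer=4 that is 2*4 + 1 = 9 TCP connections per peer: 8 bulk plus 1 small-message connection.)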

Comment by Chris Hunter (Inactive) [ 22/Jan/21 ]

Chris, the problem here is that without conns_per_peer there is only a single port for socklnd on a single IP address. Creating multiple sockets on the client avoids that issue, and is much less complex than creating multiple virtual interfaces.
With multiple clients a server is less likely to have a problem, but if only a single client is reading/writing (which some workloads do), then performance would again be limited without multiple sockets.

Thanks Andreas, as stated in the description this feature is intended for systems with a small number of clients. It does not appear to benefit systems at scale.

Is it possible to know ahead of time which TCP ports conns_per_peer will use (i.e. to adjust firewalls)?

The problem described, insufficient incoming TCP streams to achieve network line rate, is not unique to Lustre.

Comment by Andreas Dilger [ 23/Jan/21 ]

The target port for connections will always be the same, 988, as it is for all new connections, and the actually-assigned source port is mostly irrelevant. This is no different from multiple clients connecting separately.

Comment by Aurelien Degremont (Inactive) [ 25/Jan/21 ]

as it is for all new connections, and the actually-assigned source port is mostly irrelevant. This is no different from multiple clients connecting separately.

If I remember correctly, this is not totally true. There is still a rare behavior where a TCP socket is broken for some reason and the first node to need it is the server, trying to send an LDLM callback. When the server detects that the TCP connection for the reverse import is broken, it re-establishes it itself, creating a server->client socket.

(If the client needs this connection first (i.e. for obd_ping), it will re-establish it normally and the server will use this connection as usual.) This likely impacts the metadata socket, not the bulk I/O sockets.

Comment by Gerrit Updater [ 09/Feb/21 ]

Serguei Smirnov (ssmirnov@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41463
Subject: LU-12815 socklnd: allow dynamic setting of conns_per_peer
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: fe3db0979f34afc5139fdc1b6b9ab6eace5cfde4

Comment by Andreas Dilger [ 10/Feb/21 ]

There is still a rare behavior where a TCP socket is broken for some reason and the first node to need it is the server, trying to send an LDLM callback. When the server detects that the TCP connection for the reverse import is broken, it re-establishes it itself, creating a server->client socket.

That is true, but in this case I still believe that the server will use target port 988 on the client (not sure of the source port), and the client will need to allow new connections on port 988 for the most reliable behavior. In many cases it is possible for the client to function properly without allowing any incoming connections, but as you write there may be rare cases where the server needs to initiate a connection, and without that the client may occasionally be evicted. For some sites that may be preferable to having an open port in the firewall. IIRC, there may even be a parameter to disable server->client connections, but I don't recall the details.
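
For illustration, a client site choosing to allow such server-initiated connections would open the well-known LNet port in its firewall; an iptables rule of roughly this form (hypothetical example, adapt to the site's firewall and restrict to server addresses as desired):

# accept incoming socklnd connections on the well-known LNet port 988
iptables -A INPUT -p tcp --dport 988 -j ACCEPT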

Comment by Aurelien Degremont (Inactive) [ 11/Feb/21 ]

That's correct. I wanted to warn about this often unknown use case (I was really surprised when I discovered it), with any potential impact of this feature.

Agreed that some sites could prefer risking evictions over changing firewall rules. However, this is surprising behavior; sites tend to reduce evictions as much as possible, and having a "normal case" where evictions happen and are expected, just because of unknown traffic rules, is not the best option. This could cause additional JIRA tickets. But this problem already exists and is independent of this ticket; I just wanted to avoid making it worse. Will the multiple TCP sockets feature only apply to bulk I/O sockets?

 

Comment by Andreas Dilger [ 11/Feb/21 ]

Will the multiple TCP sockets feature only apply to bulk I/O sockets?

Correct, this is only for bulk sockets.

Comment by Gerrit Updater [ 05/May/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/41056/
Subject: LU-12815 socklnd: add conns_per_peer parameter
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 71b2476e4ddb95aa42f4a0ea3f23b1826017bfa5

Comment by Andreas Dilger [ 13/May/21 ]

There may still be work needed to distribute RPCs from a single client to multiple CPTs on the server, in order to get the best performance for real IO workloads.

Otherwise, a client with a single interface (NID) will have all of its RPCs handled by cores in a single CPT, which is not quite the same as having multiple real interfaces on the client. Discussion is ongoing in LU-14676.

Comment by Gerrit Updater [ 28/Jul/21 ]

Serguei Smirnov (ssmirnov@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/44417
Subject: LU-12815 socklnd: set conns_per_peer based on link speed
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: ca2f4fed6d85d2e4506958fcc2e1c6c98eb2d020

Comment by Gerrit Updater [ 18/Aug/21 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/41463/
Subject: LU-12815 socklnd: allow dynamic setting of conns_per_peer
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: a5cbe7883db6d77b82fbd83ad4c662499421d229

Comment by Gerrit Updater [ 18/Aug/21 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/44417/
Subject: LU-12815 socklnd: set conns_per_peer based on link speed
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: c44afcfb72a1c2fd8392bfab3143c3835b146be6

Comment by Peter Jones [ 18/Aug/21 ]

Looks like everything has landed for 2.15

Comment by Gerrit Updater [ 09/May/22 ]

"Cyril Bordage <cbordage@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/47252
Subject: LU-12815 socklnd: add conns_per_peer parameter
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: b370ffba7b778f9b5fee325a8c67228ca2454137

Comment by Gerrit Updater [ 20/Sep/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47252/
Subject: LU-12815 socklnd: add conns_per_peer parameter
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: be2c4bb928b5cf6b428d7974e8fd89ea177fa2df
