Details
- Type: Improvement
- Priority: Major
- Resolution: Unresolved
- Versions: Lustre 2.14.0, Lustre 2.16.0
Description
If there is a mismatch between conns_per_peer on a client and server (e.g. due to different Ethernet network speeds across Ethernet switches, or for other reasons described below), then each side will try to establish a different number of TCP sockets for the peer. LU-17258 handles this by "giving up" on establishing more peer connections, as long as at least one connection could be established for each type.
When this happens, the client should save the conn_count as the new in-memory conns_per_peer value for the remote peer NID (until the next unmount/remount), so that it doesn't keep trying to establish more connections whenever there is a problem.
Otherwise, the server will have to handle and reject these extra connections on a regular basis, which can look like a DDoS if 10000 clients are all trying to (re-)establish thousands of connections at mount, during recovery, or whenever there is a network hiccup. Saving the negotiated value also makes the configuration more "hands off", without the need to tune conns_per_peer explicitly (and in coordination) across all nodes.
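A rough user-space sketch of what saving the negotiated value might look like (the struct and function names here are made up for illustration, not the real ksocklnd code):

#include <stdio.h>

struct ksock_peer_state {
        unsigned long long ps_nid;      /* remote peer NID */
        int ps_conns_per_peer;          /* value currently in use for this NID */
        int ps_conns_established;       /* connections the peer actually accepted */
};

/*
 * Called when connection setup "gives up" (per LU-17258) after the peer
 * accepted fewer connections than requested.  Remember the accepted count
 * as the in-memory conns_per_peer for this NID so that later reconnects do
 * not retry the higher count; the saved value is lost at unmount/remount.
 */
void peer_save_negotiated_conns(struct ksock_peer_state *peer)
{
        if (peer->ps_conns_established >= 1 &&
            peer->ps_conns_established < peer->ps_conns_per_peer) {
                peer->ps_conns_per_peer = peer->ps_conns_established;
                printf("peer %#llx: limiting conns_per_peer to %d until remount\n",
                       peer->ps_nid, peer->ps_conns_per_peer);
        }
}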
It is likely that the servers also need to dynamically shrink conns_per_peer when they start having a lot of connected peers, to avoid the need to explicitly tune this for large clusters (and to avoid us getting involved to fix the system after it breaks). This will (eventually) cause the remote peers to also shrink their connection count over time due to their backoff of failed connections. I'm thinking of something simple like shrinking conns_per_peer by 1 as the number of established peer connections grows past 20000, and again at 40000 (if it hadn't already started shrinking the number of connections when passing 20000). It could never be shrunk below 1.
It could print a console message when it shrinks the value, suggesting something like "set 'options socklnd conns_per_peer=N' in /etc/modprobe.d/lustre.conf to avoid this in the future", but at least the system would continue to work.
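For illustration, a hedged sketch of that shrink-and-warn logic (the thresholds, names, and message format are assumptions, not existing socklnd code):

#include <stdio.h>

#define SHRINK_THRESHOLD_1      20000   /* total established peer connections */
#define SHRINK_THRESHOLD_2      40000

/*
 * Map the admin-configured conns_per_peer to the value actually used,
 * shrinking by 1 past each threshold, but never below 1 and never
 * growing again at runtime.
 */
int effective_conns_per_peer(int configured, int total_conns)
{
        int target = configured;

        if (total_conns > SHRINK_THRESHOLD_1)
                target--;
        if (total_conns > SHRINK_THRESHOLD_2)
                target--;

        return target < 1 ? 1 : target;
}

/* Called as connections are added; *cur holds the value currently in use. */
void maybe_shrink_conns_per_peer(int configured, int total_conns, int *cur)
{
        int target = effective_conns_per_peer(configured, total_conns);

        if (target < *cur) {
                printf("socklnd: reducing conns_per_peer from %d to %d; set 'options socklnd conns_per_peer=%d' in /etc/modprobe.d/lustre.conf to avoid this in the future\n",
                       *cur, target, target);
                *cur = target;
        }
}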
I don't know if the server would need to actively disconnect client connections > conns_per_peer, but that might be needed if the number of connections continues to grow (e.g. > 50000).
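If active disconnection did turn out to be needed, it might look roughly like the following sketch; the connection list structure and close callback are purely illustrative:

#include <stddef.h>

struct conn {
        struct conn *next;
        void (*close)(struct conn *);
};

/*
 * Walk a peer's connection list and close any connections beyond the
 * (possibly shrunken) conns_per_peer limit, e.g. once the server-wide
 * total grows past ~50000.
 */
void close_excess_conns(struct conn *conns, int conns_per_peer)
{
        struct conn *c = conns;
        int kept = 0;

        while (c != NULL) {
                struct conn *next = c->next;    /* close may free the conn */

                if (++kept > conns_per_peer)
                        c->close(c);
                c = next;
        }
}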
It would never increase conns_per_peer again until the system is restarted, or perhaps only if it is explicitly set from userspace again because the admin really thinks they know better.
I've also filed LU-17514 for tracking an "expected_clients" tunable that can be used to set a ballpark figure for the number of clients, so that various runtime parameters like conns_per_peer could be set appropriately early in the cluster mount process.
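Purely as an illustration of how such a hint might feed into conns_per_peer early in the mount process (the formula and names here are my own assumptions, not anything specified in LU-17514):

/*
 * Pick an initial conns_per_peer from the admin-supplied expected_clients
 * hint, keeping the projected total connection count near the same
 * threshold used for runtime shrinking.
 */
int initial_conns_per_peer(int configured, int expected_clients)
{
        const long budget = 20000;      /* matches the first shrink threshold */
        int target = configured;

        while (target > 1 && (long)expected_clients * target > budget)
                target--;

        return target;
}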
I was thinking about this further, and I'm wondering whether the number of connections per peer should be more dynamic at runtime, rather than "establish N connections immediately at mount time"?
Essentially, conns_per_peer would be treated as the "maximum number of peer connections", and ksocklnd would start with only 1 connection per peer (maybe not even per peer NID) until there was a substantial amount of traffic flowing to the peer. Then the node would dynamically add new connections as long as this increased the real message transfer rate, the server did not reject the connection with -EALREADY, and the total did not exceed conns_per_peer. Once the client has finished its IO burst, it would dynamically drop idle connections again in the background.
That would allow the "single busy client" case to get peak bandwidth, while the "many clients" case would immediately be handled by 1 initial connection, and the server would just not allow it to escalate if it was busy or had too many connections. Depending on how long it takes to establish a new connection, we might even consider dropping idle bulk read/write connections to 0 and keeping only the control connection for pings and small messages, but I'm not sure that is necessary.
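A sketch of those grow/shrink decisions, with all names, fields, and thresholds invented for illustration:

#include <stdbool.h>

struct peer_conn_state {
        int active_conns;               /* connections currently open */
        int conns_per_peer;             /* upper bound, never exceeded */
        long bytes_per_sec;             /* rate measured over the last interval */
        long prev_bytes_per_sec;        /* rate before the last conn was added */
        long idle_secs;                 /* time since the last bulk message */
};

/*
 * Open one more connection to this peer only if traffic increased
 * meaningfully since the previous connection was added, the peer has not
 * rejected a duplicate attempt (-EALREADY), and the conns_per_peer
 * ceiling has not been reached.
 */
bool should_add_conn(const struct peer_conn_state *p, bool peer_rejected)
{
        if (peer_rejected || p->active_conns >= p->conns_per_peer)
                return false;

        return p->bytes_per_sec >
               p->prev_bytes_per_sec + p->prev_bytes_per_sec / 10;
}

/*
 * Drop one idle bulk connection in the background once the IO burst is
 * over, but always keep at least the control connection.
 */
bool should_drop_conn(const struct peer_conn_state *p)
{
        return p->active_conns > 1 && p->idle_secs > 60;
}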
I think that behavior gives us the best of both worlds: peak bandwidth when a single client can drive it, without overloading the server when its network/storage is the limit.