[LU-17514] parameter hint for expected number of connected clients Created: 07/Feb/24  Updated: 07/Feb/24

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.17.0
Fix Version/s: None

Type: Improvement Priority: Minor
Reporter: Andreas Dilger Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
is related to LU-17513 how does 'conns_per_peer' apply with ... Open
is related to LU-17515 dynamically shrink 'conns_per_peer' a... Open
is related to LU-12064 Adaptive timeout at_min adjustment & ... Reopened
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

It would be useful to have a variable that can be set early in startup (e.g. a libcfs module parameter that is also a tunable parameter) that gives the system a hint about how many clients will be mounting the filesystem.

While we try to tune filesystem parameters dynamically, that is sometimes complex to get right from the start, and by the time 1000 or 10000 clients have connected, the values used when 10 or 100 clients had mounted are no longer optimal. For example, conns_per_peer at the LNet level could be 4 for 100 clients but should be 1 for 10000 clients, or the servers can run out of TCP ports and have too many open sockets. Similarly, at_min should be low (5s) for smaller clusters for faster recovery, but larger (15s+) with 10000 clients. Since it is only set at mount time, there is no easy way to update it on remote clients afterward.

Having a simple parameter set in /etc/modprobe.d/lustre.conf, like:

options libcfs expected_clients=10000

gives us a ballpark figure to work with. That doesn't obviate the need for dynamic tuning of parameters at runtime, but establishes some expectations for what is not obvious when the first 10 (of 10000) clients are mounting the filesystem.
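As a rough illustration of how such a hint might feed initial defaults, here is a minimal Python sketch. It is not Lustre code: the function name and the intermediate tiers are assumptions made for illustration. Only the endpoints (conns_per_peer of 4 at 100 clients and 1 at 10000 clients, at_min of 5s for smaller clusters and 15s at 10000 clients) come from the description above.

```python
def suggest_defaults(expected_clients: int) -> dict:
    """Map the expected_clients hint to hypothetical starting values.

    Only the 100-client and 10000-client data points come from the
    ticket; the tiers in between are assumptions for illustration.
    """
    if expected_clients < 1:
        raise ValueError("expected_clients must be a positive integer")

    # conns_per_peer: 4 at ~100 clients, 1 at ~10000 clients (from the ticket).
    # The middle value of 2 is an assumed interpolation.
    if expected_clients <= 100:
        conns_per_peer = 4
    elif expected_clients < 10000:
        conns_per_peer = 2
    else:
        conns_per_peer = 1

    # at_min (seconds): 5 for smaller clusters, 15 at ~10000 clients (from the ticket).
    # The 1000-client cut-off and the middle value of 10 are assumptions.
    if expected_clients < 1000:
        at_min = 5
    elif expected_clients < 10000:
        at_min = 10
    else:
        at_min = 15

    return {"conns_per_peer": conns_per_peer, "at_min": at_min}


if __name__ == "__main__":
    for hint in (100, 1000, 10000):
        print(hint, suggest_defaults(hint))
```

Values chosen this way would only be starting points; dynamic tuning at runtime would still be expected to adjust them as clients actually connect.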



 Comments   
Comment by Andreas Dilger [ 07/Feb/24 ]

Note that expected_clients should never be mistaken for the "maximum possible number of clients". It is only a ballpark figure for how many clients might be expected, so that parameters can be tuned automatically based on that expectation.

There may be significantly more or fewer clients mounting the filesystem:

  • 2/3/4x as many mountpoints because of subdirectory mounting
  • 2/4/8x as many socklnd connections because of multi-rail
  • far fewer mounts because of intermittent client mounting
  • far fewer connections because of LNet routing
The value is only intended to be an order-of-magnitude guideline and is not cast in stone.
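One possible reading of "order-of-magnitude guideline", sketched in Python under the assumption that tuning logic would key off the decade of the hint rather than its exact value. The helper name is hypothetical. With this approach, a deviation of 2x to 8x from the hint lands in the same or the adjacent tier.

```python
def magnitude_tier(expected_clients: int) -> int:
    """Return the decimal order of magnitude of the hint (hypothetical helper).

    Uses integer digit counting rather than floating-point log10 so that
    exact powers of ten are classified reliably.
    """
    if expected_clients < 1:
        raise ValueError("expected_clients must be a positive integer")
    return len(str(expected_clients)) - 1


if __name__ == "__main__":
    hint = 10000
    # Actual counts may differ from the hint by the factors listed above.
    for factor in (1, 2, 4, 8):
        actual = hint * factor
        print(actual, "-> tier", magnitude_tier(actual))
```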
Generated at Sat Feb 10 03:36:03 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.