[LU-17296] NRS TBF default rules Created: 17/Nov/23  Updated: 05/Feb/24

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.16.0
Fix Version/s: Lustre 2.17.0

Type: Improvement Priority: Major
Reporter: Andreas Dilger Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Attachments: PDF File SC23-ThemisIO.pdf    
Issue Links:
Related
is related to LU-17503 IO500: improve NRS TBF to sort reques... Open
is related to LU-16007 NRS Jobid default RPC aggregation Open
is related to LU-17044 TBF: minimum guarantee RPC rate when ... Open
is related to LU-8433 Maximizing Bandwidth utilization by T... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

Description

One issue with Lustre in highly-contended multi-application workloads is the "noisy neighbor" problem, where one application's IO can be negatively affected by another application's IO that is submitted at the same time.

This particularly affects users on login nodes who are running interactive workloads (e.g. "ls -l" or compiling applications). These workloads submit only a few RPCs at a time, and those RPCs can be blocked behind thousands of RPCs from a large job running on hundreds of nodes.

It is possible to create NRS TBF rules that limit the rate at which the servers process RPCs for a given class of requests, but most users do not implement TBF rules, and so suffer from this issue.

It would be useful to create a set of default TBF rules that could be applied to all systems, either at installation time or afterward, that provide a "best practice" result for a wide variety of use cases. It should be possible to override the default rules, but for many situations the default should avoid the majority of imbalance between jobs.

For example, applying a very high token limit (e.g. 1M) to jobs by GID, UID, or JobID should not constrain a job's RPC processing rate when there is no contention. When there is IO contention on a server, however, it should balance the RPC rate fairly between the jobs, instead of using FIFO order, which often blocks the processing of small (e.g. interactive user) RPCs.
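The ordering argument above can be illustrated with a toy model. This is only a sketch under stated assumptions, not Lustre's NRS TBF implementation: every job gets its own queue and the same (very high) token rate, the server always serves the queue with the earliest token deadline, and the job names and RPC counts are made up for illustration. Because the rate is far above what the server can process, it never throttles anyone; it only decides the order in which backlogged jobs are served.

```python
import heapq
from collections import deque
from fractions import Fraction


def fifo_order(arrivals):
    """FIFO: RPCs are served strictly in arrival order."""
    return list(arrivals)


def tbf_like_order(arrivals, rate=1_000_000):
    """Toy TBF-like scheduler (an assumption, not Lustre code).

    Each job has its own queue. A job's next RPC becomes eligible
    1/rate after its previous one, and the server always picks the
    backlogged job with the earliest deadline.
    """
    queues = {}
    for job in arrivals:
        queues.setdefault(job, deque()).append(job)

    step = Fraction(1, rate)  # exact arithmetic, so equal rates tie exactly
    # Heap entries: (deadline, tie-breaker by first arrival, job name).
    heap = [(Fraction(0), idx, job) for idx, job in enumerate(queues)]
    heapq.heapify(heap)

    order = []
    while heap:
        deadline, idx, job = heapq.heappop(heap)
        order.append(queues[job].popleft())
        if queues[job]:  # job still has queued RPCs: schedule its next token
            heapq.heappush(heap, (deadline + step, idx, job))
    return order


def last_position(order, job):
    """1-based position at which the last RPC of `job` is served."""
    return max(i for i, j in enumerate(order, start=1) if j == job)


if __name__ == "__main__":
    # Hypothetical workload: a large job queues 1000 RPCs, then a login
    # node user queues 5 RPCs behind them.
    arrivals = ["bigjob"] * 1000 + ["login"] * 5
    print("FIFO:", last_position(fifo_order(arrivals), "login"))
    print("TBF-like:", last_position(tbf_like_order(arrivals), "login"))
```

In this model the interactive user's five RPCs finish after 1005 served RPCs under FIFO, but after 10 under per-job deadlines, while the large job loses only five slots. The total work done is the same in both cases.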

Adding a default JobID rule with a higher TBF allocation for jobs on login nodes (e.g. with "*login*" in the job name) would also help boost interactive performance, and would have at least a reasonable chance of working out of the box. If it does not, the documentation could make clear that this rule needs to be customized on deployment.
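As a rough sketch of what such a default rule might look like, the helper below builds the lctl command strings for enabling the JobID TBF policy and starting a rule for login-node jobs. The command form follows the nrs_policies / nrs_tbf_rule syntax as I understand it from the Lustre Operations Manual and should be verified against the installed Lustre version. The service path (ost.OSS.ost_io), the rule name (login_boost), and the rate value are illustrative assumptions, not values proposed in this ticket.

```python
def tbf_enable_command(service: str, tbf_type: str = "jobid") -> str:
    """Build the command that switches an NRS service to a TBF policy.

    `service` is a parameter prefix such as "ost.OSS.ost_io" (assumed here).
    """
    return f'lctl set_param {service}.nrs_policies="tbf {tbf_type}"'


def tbf_start_jobid_rule_command(service: str, rule_name: str,
                                 jobid_pattern: str, rate: int) -> str:
    """Build the command that starts a JobID-based TBF rule.

    `rule_name`, `jobid_pattern`, and `rate` are site-specific; the values
    used in the demo below are placeholders only.
    """
    if rate <= 0:
        raise ValueError("rate must be a positive number of RPCs per second")
    # Doubled braces emit the literal {...} that wraps the JobID pattern.
    return (f"lctl set_param {service}.nrs_tbf_rule="
            f'"start {rule_name} jobid={{{jobid_pattern}}} rate={rate}"')


if __name__ == "__main__":
    service = "ost.OSS.ost_io"  # assumed service; other services differ
    print(tbf_enable_command(service))
    # Hypothetical rule name and rate for jobs with "login" in the JobID.
    print(tbf_start_jobid_rule_command(service, "login_boost", "*login*", 20000))
```

Generating the commands rather than hard-coding them makes it easier for a site to review and override the defaults before they are applied, which matches the requirement that the default rules remain overridable.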

