Details
- Improvement
- Resolution: Unresolved
- Major
- Lustre 2.16.0
- 3
Description
One issue with Lustre in highly contended multi-application workloads is the "noisy neighbor" problem, where one application's IO is slowed by another application's IO that is submitted at the same time.
This particularly affects users on login nodes who are running interactive workloads (e.g. "ls -l" or compiling applications) that submit only a few RPCs at a time. Those RPCs can be blocked behind thousands of RPCs from a large job running on hundreds of nodes.
It is possible to create NRS TBF rules to limit the RPC processing rate on the servers, but most users do not implement TBF rules and so suffer from this issue.
It would be useful to create a set of default TBF rules that could be applied to all systems, either at installation time or afterward, and that provide a "best practice" result for a wide variety of use cases. It should be possible to override the default rules, but in many situations the defaults should avoid most of the imbalance between jobs.
For example, applying a very high token limit (e.g. 1M) to jobs by GID, UID, or JobID should not constrain a job's RPC processing rate when there is no contention. When there is IO contention on a server, however, it should balance the RPC rate fairly between the jobs, instead of using FIFO order, which often blocks the processing of small (e.g. interactive user) RPCs.
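The reasoning above can be illustrated with a toy model. The sketch below is not the Lustre NRS TBF implementation; it assumes a server that handles one RPC per tick and a simplified scheduler in which each job class becomes eligible for its next RPC at `served / rate` and the eligible class with the earliest such deadline is served first. The function names, job names, and RPC counts are all hypothetical.

```python
def fifo_finish_ticks(arrivals):
    """Serve RPCs strictly in arrival order, one RPC per tick.

    arrivals: list of (job, rpc_count) in the order the jobs queued up.
    Returns the tick at which each job's last RPC completes.
    """
    done, tick = {}, 0
    for job, count in arrivals:
        tick += count
        done[job] = tick
    return done


def tbf_like_finish_ticks(arrivals, rates):
    """Toy per-class token scheduling (an assumption, not Lustre's code).

    rates: tokens (RPCs) per tick allowed for each job class.
    A class is eligible once its deadline (served / rate) has passed;
    among eligible classes, the earliest deadline is served first.
    """
    queued = {job: count for job, count in arrivals}
    served = {job: 0 for job in queued}
    done, tick = {}, 0
    while any(queued.values()):
        eligible = [j for j in queued
                    if queued[j] > 0 and served[j] / rates[j] <= tick]
        if eligible:
            job = min(eligible, key=lambda j: served[j] / rates[j])
            queued[job] -= 1
            served[job] += 1
            if queued[job] == 0:
                done[job] = tick + 1
        tick += 1  # the server handles at most one RPC per tick
    return done


if __name__ == "__main__":
    arrivals = [("batch", 1000), ("interactive", 3)]
    high = {"batch": 1_000_000, "interactive": 1_000_000}
    print("FIFO:    ", fifo_finish_ticks(arrivals))
    print("TBF-like:", tbf_like_finish_ticks(arrivals, high))
    print("Alone:   ", tbf_like_finish_ticks([("batch", 1000)], high))
```

In this model the interactive job's three RPCs finish at tick 6 instead of tick 1003 under FIFO, while the batch job running alone still finishes at tick 1000, because the 1M rate never becomes the limiting factor.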
Putting in a default JobID rule with a higher TBF allocation for jobs on login nodes (e.g. with "*login*" in the job name) would also help boost interactive performance, and would at least have a reasonable chance of working out of the box. Where it does not, the documentation could make clear that this rule needs to be customized on deployment.
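One way to picture such a default rule set is as an ordered table in which the first matching pattern wins and site-specific rules are placed ahead of the defaults. The sketch below is only an illustration of that idea: the rule names, rates, and JobID strings are hypothetical, and it uses Python glob matching, which may differ from how Lustre matches JobID expressions in TBF rules.

```python
from fnmatch import fnmatchcase

# Hypothetical default rules: (rule name, JobID glob, RPC rate).
# The rates are illustrative values, not recommendations.
DEFAULT_RULES = [
    ("login_boost", "*login*", 2_000_000),  # higher allocation for login nodes
    ("default_job", "*", 1_000_000),        # catch-all for every other job
]


def rate_for(jobid, rules=DEFAULT_RULES):
    """Return (rule name, rate) for the first rule whose glob matches jobid."""
    for name, pattern, rate in rules:
        if fnmatchcase(jobid, pattern):
            return name, rate
    raise LookupError(f"no rule matches {jobid!r}")


if __name__ == "__main__":
    # A site whose login nodes are named fe0, fe1, ... overrides the default
    # by placing its own rule ahead of the defaults.
    site_rules = [("site_login", "*fe[0-9]*", 2_000_000)] + DEFAULT_RULES
    for jobid in ("bash.login01.1000", "ior.12345", "vim.fe2.500"):
        print(jobid, rate_for(jobid, site_rules))
```

Keeping the catch-all rule last means every job falls into some class, which is what lets the defaults work without per-site tuning, while a site whose login nodes do not contain "login" in the job name can add its own pattern in front.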
Issue Links
- is related to
  - LU-8433 Maximizing Bandwidth utilization by TBF Rule with Dependency (Open)
  - LU-13037 print stats for NRS TBF rules (Open)
- is related to
  - LU-17920 Add permanent TBF rules (Open)
  - LU-16007 NRS Jobid default RPC aggregation (Open)
  - LU-14501 NRS TBF UID: limit per "any" user? (Open)
  - LU-17044 TBF: minimum guarantee RPC rate when the server is overloaded (Open)
  - LU-17503 IO500: improve NRS TBF to sort requests by object offset for ior-hard-write (Open)
  - LU-17512 add conditional operator for 'jobid_name' (Resolved)