Details
- New Feature
- Resolution: Unresolved
- Minor
Description
Currently, it appears that the round-robin NRS policies (CRRN, TRR, and ORR) must be selected separately from the Token Bucket Filter (TBF) policy. It would be an improvement if the Lustre servers could apply round-robin scheduling to client requests while TBF rules are also in effect.
Issue Links
- is related to
  - LU-18180 UID in req_buffer_history (Open)
  - LU-17902 add NRS TBF policy for nodemap (Open)
  - LU-18183 shared rate limit for a TBF rule (Open)
- is related to
  - LU-8433 Maximizing Bandwidth utilization by TBF Rule with Dependency (Open)
  - LU-14501 NRS TBF UID: limit per "any" user? (Open)
  - LU-17166 add NRS TBF rule for projid (Open)
  - LU-17503 IO500: improve NRS TBF to sort requests by object offset for ior-hard-write (Open)
My strong preference would be to change the NRS TBF implementation so that a simple rule can divide users/jobs into buckets (e.g. by UID, nodemap, projid, or jobid) and "fair share" can then be enforced between those buckets. That would let the NRS code make a more informed decision about which client is a "noisy neighbor", and then directly throttle that neighbor's RPCs when other users' RPCs are being impacted (i.e. turn down its volume until the less noisy neighbors can be heard).
This might need some changes to the core of how NRS TBF is implemented, so that it balances "fair share" between neighbors when there is resource contention but does not throttle when there is enough capacity for the incoming RPCs. Currently, when a user is held back by contention (i.e. it could not consume all of its tokens in that time slice), its delayed RPCs are boosted in priority, because TBF was originally developed for real-time network flow control.
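One way to picture the boost described above is a generic token bucket in which unused tokens carry over. The sketch below is a textbook-style model, not the Lustre NRS TBF code; the class name, the `rate` and `depth` parameters, and the numbers are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Generic token bucket (illustrative only; not the Lustre NRS TBF code)."""
    rate: float          # tokens added per second (hypothetical parameter)
    depth: float         # maximum tokens the bucket can hold (hypothetical)
    tokens: float = 0.0  # tokens currently available
    last: float = 0.0    # time of the last refill, in seconds

    def refill(self, now: float) -> None:
        # Unused tokens carry over, capped at the bucket depth.
        self.tokens = min(self.depth, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def take(self, now: float, wanted: int) -> int:
        """Return how many of `wanted` requests may be sent at time `now`."""
        self.refill(now)
        granted = min(wanted, int(self.tokens))
        self.tokens -= granted
        return granted


bucket = TokenBucket(rate=10.0, depth=30.0)
# The server is busy for 2 s, so none of this user's queued RPCs are served.
# Once it frees up, the saved tokens let the backlog go out as one burst.
burst = bucket.take(now=2.0, wanted=100)
# With the bucket drained, the user is back to its steady rate.
steady = bucket.take(now=2.5, wanted=100)
```

In this model the catch-up burst is harmless after a one-off stall, but a user that always has a backlog gets the same treatment every time the server frees up.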
However, a noisy neighbor under constrained server resources will always exceed the number of tokens for its time slice, and its RPCs should be deprioritized relative to those of other users. That allows "quiet neighbors" to have their requests processed first (with lower latency), while the noisy neighbor continues to chug along at whatever rate it can (at slightly higher latency, but not enough to make a difference). Boosting a user's delayed RPCs only makes sense for a one-off condition, not when that user continually exceeds its fraction of the resources.
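The fair-share behavior proposed here could look roughly like the following sketch. It is a minimal, assumed design rather than anything in the Lustre tree: `FairShareScheduler`, its methods, and the use of a UID as the bucket key are hypothetical names for illustration. Under contention it serves the least-served bucket first, and when only one bucket has pending requests that bucket gets all of the capacity, so nothing is throttled while the server is idle.

```python
from collections import deque


class FairShareScheduler:
    """Work-conserving fair share between buckets (hypothetical sketch)."""

    def __init__(self):
        self.queues = {}  # bucket key (e.g. UID) -> deque of pending requests
        self.served = {}  # bucket key -> number of requests served so far

    def enqueue(self, key, request):
        self.queues.setdefault(key, deque()).append(request)
        self.served.setdefault(key, 0)

    def dispatch(self, capacity):
        """Serve up to `capacity` requests, least-served bucket first."""
        out = []
        for _ in range(capacity):
            ready = [k for k, q in self.queues.items() if q]
            if not ready:
                break  # nothing pending: leftover capacity is simply unused
            key = min(ready, key=lambda k: self.served[k])
            out.append((key, self.queues[key].popleft()))
            self.served[key] += 1
        return out


sched = FairShareScheduler()
for i in range(8):
    sched.enqueue(1000, f"noisy-{i}")  # noisy neighbor floods the server
for i in range(2):
    sched.enqueue(2000, f"quiet-{i}")  # quiet neighbor sends two requests

# Contended tick: capacity is split, so both quiet requests are served.
first = [key for key, _ in sched.dispatch(4)]
# Uncontended tick: only the noisy bucket is pending, so it gets everything.
second = [key for key, _ in sched.dispatch(4)]
```

A real implementation would presumably decay the `served` counters over time so that a bucket is not penalized indefinitely for past activity; that detail is left out of the sketch.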