
Implementation of Round-Robin/Fair Share response with Token Bucket Filters

Details

    • New Feature
    • Resolution: Unresolved
    • Minor

    Description

      Currently, it appears that the CRRN, TRR, and ORR policies must be selected separately from Token Bucket Filters (TBF). It would be an improvement if the Lustre servers could apply round-robin ordering to their responses to clients while also enforcing TBF rules.

      Attachments

        Issue Links

          Activity

            [LU-18179] Implementation of Round-Robin/Fair Share response with Token Bucket Filters

            My strong preference would be to change the NRS TBF implementation in some way to make it easy to configure a simple rule to divide users/jobs into buckets (e.g. by UID, nodemap, projid, jobid) and then allow "fair share" between buckets to be implemented. That would allow the NRS code to make a more informed decision about what a "noisy neighbor" is, and then directly throttle that neighbor's RPCs when other users' RPCs are being impacted (i.e. turn down its volume until less noisy neighbors can be heard).
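
To illustrate the "divide into buckets" step only, here is a minimal Python sketch. It is not Lustre NRS code: the `Rpc` descriptor, its field names, and the `classify` helper are all hypothetical, chosen to mirror the keys mentioned above (UID, nodemap, projid, jobid).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Rpc:
    # Hypothetical request descriptor; the field names are illustrative only.
    uid: int
    jobid: str
    projid: int
    nodemap: str


def classify(rpcs, key):
    """Group RPCs into buckets keyed by one attribute, e.g. "uid" or "jobid"."""
    buckets = defaultdict(list)
    for rpc in rpcs:
        buckets[getattr(rpc, key)].append(rpc)
    return dict(buckets)
```

Calling `classify(rpcs, "uid")` would give one bucket per user, and `classify(rpcs, "jobid")` one bucket per job. A real rule would presumably be expressed in the NRS configuration rather than in code like this.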

            This might need some changes to the core of how NRS TBF is implemented, to allow "fair share" balancing between neighbors when there is resource contention, but not throttle when there is enough capacity for incoming RPCs. Currently, when a user is throttled by contention (i.e. it couldn't consume all of its tokens in that time slice) then the delayed RPCs are boosted in priority, because TBF was originally developed for real-time network flow control.

            However, a noisy neighbor under constrained server resources will always exceed the number of tokens for its time slice, and its excess RPCs should be deprioritized compared to other users' RPCs. That allows "quiet neighbors" to have their requests processed first (with lower latency), while the noisy neighbor continues to chug along at whatever rate it can (at some slightly higher latency, but not enough to make a difference). The boost for a user's delayed RPCs only makes sense for a one-off condition, not when the user continually exceeds its fraction of the resources.
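
The behavior described above (no throttling when there is spare capacity, and deprioritizing whichever bucket has consumed the most under contention) can be sketched in Python roughly as follows. This is an assumption-laden illustration, not the NRS TBF implementation: the `FairShareScheduler` class, its per-window `served` counter, and the "least served goes first" rule are hypothetical stand-ins for whatever accounting a real policy would use.

```python
from collections import deque


class FairShareScheduler:
    """Hypothetical work-conserving fair-share queue between buckets."""

    def __init__(self):
        self.queues = {}  # bucket key -> deque of pending RPCs
        self.served = {}  # bucket key -> RPCs served in the current window

    def enqueue(self, bucket, rpc):
        self.queues.setdefault(bucket, deque()).append(rpc)
        self.served.setdefault(bucket, 0)

    def dequeue(self):
        # Only buckets with pending RPCs compete, so a lone bucket is
        # never throttled: it is served whenever a slot is available.
        pending = [b for b, q in self.queues.items() if q]
        if not pending:
            return None
        # Under contention, the bucket that has been served the least in
        # this window goes first, which pushes the noisy bucket back.
        bucket = min(pending, key=lambda b: self.served[b])
        self.served[bucket] += 1
        return bucket, self.queues[bucket].popleft()

    def new_window(self):
        # Reset the accounting at the start of each time slice.
        for bucket in self.served:
            self.served[bucket] = 0
```

In this sketch a bucket that has already had many RPCs served in the window keeps running at full speed while it is alone, but as soon as another bucket queues a request, that request is dequeued ahead of the noisy bucket's backlog. The noisy bucket's delayed RPCs are never boosted, in contrast with the current behavior described above.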

            adilger Andreas Dilger added the comment above.

            This is very similar to the improvements discussed in LU-17503.

            adilger Andreas Dilger added the comment above.

            People

              mjaguil Michael Aguilar