Details

    • New Feature
    • Resolution: Fixed
    • Minor
    • Lustre 2.6.0
    • None
    • 8963

    Description

      NRS (Network Request Scheduler) lets services schedule RPCs in different ways, and a number of policies have already been implemented on top of the main framework. Most of them aim to improve throughput or serve similar purposes. We are trying to implement policies for a different kind of purpose: QoS.

      TBF (Token Bucket Filter) is one of the policies that we implemented for traffic control. It enforces an RPC rate limit on every client according to its NID. The handling of an RPC is delayed until there are enough tokens for the client. Different clients are scheduled according to their deadlines, so that none of them is starved even when the service cannot satisfy the RPC rate requirements of all clients. RPCs from the same client are queued in FIFO order.

      Early tests show that the policy does enforce the RPC rate limit, but more tests, benchmarks and analyses are needed to establish its correctness and efficiency.
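
      The behaviour described above can be sketched as a small Python model. This is only an illustration, not the Lustre implementation: the class name TbfScheduler, its methods, and the accounting of one token per 1/rate seconds are assumptions made for the example. Each NID gets a FIFO queue and a deadline (the time at which its next token becomes available). The scheduler serves the NID with the earliest deadline and does not serve a NID whose deadline is still in the future.

```python
import heapq
from collections import deque


class TbfScheduler:
    """Toy per-NID token bucket scheduler (hypothetical, not Lustre code)."""

    def __init__(self, default_rate):
        self.default_rate = default_rate  # RPCs per second for unlisted NIDs
        self.rates = {}        # nid -> RPC rate limit (RPCs per second)
        self.queues = {}       # nid -> FIFO of pending RPCs
        self.next_token = {}   # nid -> time the next token becomes available
        self.heap = []         # (deadline, nid) for NIDs with pending RPCs

    def set_rate(self, nid, rate):
        self.rates[nid] = rate

    def enqueue(self, nid, rpc, now):
        queue = self.queues.setdefault(nid, deque())
        queue.append(rpc)
        if len(queue) == 1:
            # The NID was idle: its first RPC is eligible as soon as a
            # token is available, but never earlier than its arrival.
            deadline = max(now, self.next_token.get(nid, now))
            heapq.heappush(self.heap, (deadline, nid))

    def dequeue(self, now):
        """Return (nid, rpc) for the earliest deadline, or None if throttled."""
        if not self.heap or self.heap[0][0] > now:
            return None  # nothing queued, or no NID has a token yet
        deadline, nid = heapq.heappop(self.heap)
        queue = self.queues[nid]
        rpc = queue.popleft()  # FIFO within one NID
        interval = 1.0 / self.rates.get(nid, self.default_rate)
        self.next_token[nid] = deadline + interval
        if queue:
            heapq.heappush(self.heap, (deadline + interval, nid))
        return nid, rpc


sched = TbfScheduler(default_rate=1)
sched.set_rate("a@tcp", 2)             # NID strings are made up
sched.enqueue("a@tcp", "a1", now=0.0)
sched.enqueue("a@tcp", "a2", now=0.0)
sched.enqueue("b@tcp", "b1", now=0.0)
first = sched.dequeue(now=0.0)         # a@tcp and b@tcp are both eligible
```

      In this model, when the service falls behind, the deadlines in the heap simply lie in the past and are still served in deadline order, so every NID keeps making progress at a rate proportional to its limit.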

          Activity

            [LU-3558] NRS TBF policy for QoS purposes

            lixi Li Xi (Inactive) added a comment -

            Hi Andreas,

            That sounds really interesting. But in order to implement multi-layered policies, I guess the framework of NRS needs a lot of changes, right?

            lixi Li Xi (Inactive) added a comment -

            Hi Nathan,

            Yes, you are right. "Deadline" means different things in the two statements.

            The former refers to the time point by which one of the NID's RPCs should be handled in order to achieve the RPC rate the NID is entitled to. If that deadline is missed, nothing bad happens, except that the RPC rate is lower than the NID expects.

            The latter refers to the traditional deadline of an RPC.

            Sorry for my misuse of the term.
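
            As a rough illustration of the first kind of deadline, the time points for one NID can be derived from its rate limit alone. The function names token_deadlines and achieved_rate below are hypothetical and only serve this example. Handling the RPCs later than these time points does not fail anything; it only lowers the rate the NID actually observes.

```python
def token_deadlines(rate, count, start=0.0):
    """Time points at which `count` queued RPCs of one NID become due."""
    interval = 1.0 / rate  # one token every 1/rate seconds
    return [start + i * interval for i in range(count)]


def achieved_rate(handled_at):
    """RPCs per second between the first and the last handled RPC."""
    span = handled_at[-1] - handled_at[0]
    return (len(handled_at) - 1) / span


due = token_deadlines(rate=4, count=3)          # a limit of 4 RPCs per second
on_time = achieved_rate(due)                    # every deadline is met
late = achieved_rate([0.0, 0.5, 1.0])           # every deadline after the first is missed
```

            Here on_time equals the configured limit, while late is only half of it. That is the sense in which a missed TBF deadline is harmless: it is a pacing target, not an RPC timeout.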

            adilger Andreas Dilger added a comment -

            My preference for the long term is that we have a single "super" NRS policy that does many things at one time, or we use layers of NRS policies at the same time to achieve the optimum result.

            In the case of TBF it would be possible to have it provide QoS guarantees by giving out credits to exports in an unfair manner, so clients with guaranteed bandwidth will get it, and clients with bandwidth limits will be throttled as needed.

            Combining TBF and ORR seems possible by using TBF as a first-layer filter to avoid massive client unfairness, then passing the "unthrottled" requests through to the ORR filter to sort them within and between objects.
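
            The layering idea above could look roughly like the following Python sketch. It is not how NRS is structured today: the Rpc tuple, tbf_filter, and orr_like_order are names invented for the example, and the second layer only imitates the object and offset ordering of ORR rather than reproducing the real policy.

```python
from collections import namedtuple

# Hypothetical RPC descriptor: client NID, backend object, offset in object.
Rpc = namedtuple("Rpc", ["nid", "obj", "offset"])


def tbf_filter(pending, tokens):
    """First layer: pass RPCs whose NID still has tokens, in arrival order."""
    budget = dict(tokens)  # copy so the caller's token counts are untouched
    passed, throttled = [], []
    for rpc in pending:
        if budget.get(rpc.nid, 0) > 0:
            budget[rpc.nid] -= 1
            passed.append(rpc)
        else:
            throttled.append(rpc)
    return passed, throttled


def orr_like_order(rpcs):
    """Second layer: batch by object, sort each batch by offset."""
    groups = {}
    for rpc in rpcs:
        groups.setdefault(rpc.obj, []).append(rpc)
    ordered = []
    for batch in groups.values():  # objects in order of first appearance
        ordered.extend(sorted(batch, key=lambda r: r.offset))
    return ordered


pending = [
    Rpc("a", "o1", 8), Rpc("b", "o2", 0), Rpc("a", "o1", 0),
    Rpc("a", "o2", 4), Rpc("b", "o1", 4),
]
passed, throttled = tbf_filter(pending, tokens={"a": 2, "b": 2})
ordered = orr_like_order(passed)
```

            Only the requests that survive the token check reach the sorting layer, so throttling decides who may issue I/O and the second layer decides in which order that I/O is submitted.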

            nrutman Nathan Rutman added a comment -

            "it schedules the handling of RPCs between different NIDs according to their deadlines"

            "the current code does not consider the deadline of an RPC"

            I am having trouble reconciling those two statements. Does the first refer to a different kind of deadline?

            lixi Li Xi (Inactive) added a comment -

            Hi Nathan,

            The current code does not consider the deadline of an RPC yet. But yes, I think more cases should be tested to make sure this is not a big problem.

            lixi Li Xi (Inactive) added a comment -

            Hi Andreas,

            The policy throttles RPCs based on the TBF algorithm, but it schedules the handling of RPCs between different NIDs according to their deadlines, so it looks more like a CRR-N policy than an ORR policy. Yes, we are trying to change the policy in order to limit the RPC rate of different users/groups/jobs. But I haven't got any idea how to combine it with the ORR NRS policy. Any advice?

            Sure, we are running benchmarks and writing a description of it. The test results and documentation will come along with code improvements soon.

            nrutman Nathan Rutman added a comment -

            Cool!
            How does the deadline scheduling interact with the tokens in the case of a conflict? I.e. if there are not enough tokens yet the deadline is imminent?

            Andreas, I just realized a possible side effect of any NRS policy is that it may oddly affect adaptive timeouts by skewing the measured RPC processing time to the maximum delay induced by the policy. I suppose the worst fallout from this would be slower recovery, so maybe not so horrible.
            pjones Peter Jones added a comment -

            Lai

            Could you please review the supplied patch and offer advice as appropriate?

            Thanks

            Peter


            adilger Andreas Dilger added a comment -

            Description of Token Bucket Filter - http://en.wikipedia.org/wiki/Token_bucket_filter

            It would also be useful to test TBF in conjunction with the ORR NRS policy, so that RPCs from clients are sorted before IO and have a better chance to have a more optimal ordering when submitted to the backing storage.

            Before this can be landed, there will need to be a much better description of how this policy is used, and the performance results. As well, an update is needed for the Lustre Manual with details of how to use the policy and set limits for the NIDs.
            lixi Li Xi (Inactive) added a comment - Here is the patch. http://review.whamcloud.com/#/c/6901/

            People

              laisiyao Lai Siyao
              lixi Li Xi (Inactive)
              Votes: 0
              Watchers: 13