[LU-3558] NRS TBF policy for QoS purposes - Whamcloud Community JIRA

Details

Type: New Feature
Resolution: Fixed
Priority: Minor
Fix Version/s: Lustre 2.6.0
Affects Version/s: None
Labels:
- patch
- ptr

Rank (Obsolete):
8963

Description

NRS (Network Request Scheduler) enables the services to schedule the RPCs in different manners. And there have been a bunch of policies implemented over the main framework. Most of them are aimed at improving throughput rate or similar purposes. But we are trying to implement policies for a differnt kind of purpose, QoS.

The TBF (Token Bucket Filter) is one of the policies that we implemented for traffic control. It enforces a RPC rate limit on every client according to the NID. The handling of a RPC will be delayed until there are enough tokens for the client. Different clients are scheduled according to their deadlines, so that none of them will be starving even though the service does not have the ability to satisfy all the RPC rate requirments of clients. The RPCs from the the same clients are queued in a FIFO manner.)

Early tests show that the policy works to enforce the RPC rate limit. But more tests, bechmarks and analyses is needed for its correctness and efficiency.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

NRS-initial-test-result.xlsx
35 kB
14/Jul/13 2:46 PM
TBF-design-1.0.pdf
274 kB
29/Jul/13 10:39 AM

Issue Links

is blocking

LU-11431 Global QoS management based on TBF

Closed

is related to

LU-5717 Dead lock of nrs_tbf_timer_cb

Resolved

LU-6668 Add tests for TBF

Resolved

LU-5379 Get error when has many rules in nrs tbf policy

Resolved

LU-9227 Changing rate of a TBF rule loses control in some testcases

Resolved

LU-5580 Switch between 'JOBID' and 'NID' directly in NRS TBF

Resolved

LU-5620 nrs tbf policy based on opcode

Resolved

LUDOC-221 Document Token Bucket Filter (TBF) NRS policy

Closed

LU-4586 build failure in nrs_tbf_ctl()

Resolved

is related to

LUDOC-328 documentation updates for complex TBF policies

Open

LU-8008 Can't enable or add rules to TBF

Resolved

LU-5620 nrs tbf policy based on opcode

Resolved

LU-7470 Extend TBF policy with NID/JobID expressions

Resolved

LU-9228 Hard TBF Token Compensation under congestion

Resolved

LU-3266 Regression tests for NRS policies

Resolved

LU-8006 Specify ordering of TBF policy rules

Resolved

LU-8236 Wild-card in jobid TBF rule

Resolved

(4 is related to, 8 is related to )

Activity

[LU-3558] NRS TBF policy for QoS purposes

Li Xi (Inactive) added a comment - 04/Sep/13 2:27 PM

The NID based TBF policy works well. But we found a problem of JobID based TBF policy and have to ask for help.

The JobID based TBF policy classifies RPCs according to the Job Stat informantion of each RPC. The simplest Job Stat informantion is 'procname_uid' which can be enabled by 'lctl conf_param server1.sys.jobid_var=procname_uid'. With TBF policy, we are able to set rate limits to different kinds of RPCs. We set the RPC rate of 'dd.0' to 1 RPC/s and the RPC rate 'dd.500' to 1000 RPC/s. If TBF policy works well, when root user ran 'dd' command, an OSS service partition will never handle more than 1 RPC of it. And when user 500 ran 'dd' command, an OSS service partition will never handle more than 1000 RPC of it. Actually this works well except following condition.

When we ran 'dd' using user root and user 500 at the same time, on the same client, writing to the same OST, the performance of user 500 will decline dramatically, i.e. the performance of user 500 is highly affected by the user root.

Here is the result that we got running following command.
dd if=/dev/zero of=/mnt/lustre/fileX bs=1048576 count=XXXX

1. When user 500 ran 'dd' alone, the performance is about 80 MB/s. This is normal because the OSS's performance has an upper limit of about 80 MB/s

2. When user root ran 'dd' alone, the performance is about 2 MB/s. This is normal too, because the OSS has two partition and each has a limit of 1 RPC/s. 1 MB/RPC * 1 RPC/s * 2 = 2 MB/s

3. When user root ran 'dd', and user 500 ran 'dd' on another client, user 500 will get performance of about 80 MB/s and user root will get performance of about 2 MB/s. Please not that different processes writes to differnt files. No matter what the stripes of the files are, we get similar results. There are expected normal results.

4. When user root ran 'dd', and user 500 ran 'dd' on another client,
user 500 will get performance of about 80 MB/s and user root will get 2 MB/s. That's normal too.

5. When user root ran 'dd', and user 500 ran 'dd' on the same client, but they write to different OSTs (i.e. the stripe indexes of these files are different), user 500 will get performance of about 80 MB/s and user root will get 2 MB/s. That's normal too.

6. When user root ran 'dd', and user 500 ran 'dd' on the same client, and they write to the same OST (i.e. the stripe indexes of these files are the same), the performance of user 500 will declines to about 2 MB/s when user root is writing too. The performance of user 500 will go up immediately to 80 MB/s after user root completes its writing.

The result 6 is really strange. We think it is not likely that server side codes cause the problem since result 4 is normal. And result 5 implies that it is the OSC ranther than the OSS throttles RPC rate wrongly. Maybe when some RPCs from an OSC are appending, the OSC does not send any more RPCs? I guess maybe some mechanisms of OSC make it works like this, e.g. max RPC in flight limit? I've tried to enlarge max_rpc_in_flight argument of OSCs but got no luck.

Any suggestions you could provide to us would be greatly appreciated! Thank you in advance!

Li Xi (Inactive) added a comment - 04/Sep/13 2:27 PM The NID based TBF policy works well. But we found a problem of JobID based TBF policy and have to ask for help. The JobID based TBF policy classifies RPCs according to the Job Stat informantion of each RPC. The simplest Job Stat informantion is 'procname_uid' which can be enabled by 'lctl conf_param server1.sys.jobid_var=procname_uid'. With TBF policy, we are able to set rate limits to different kinds of RPCs. We set the RPC rate of 'dd.0' to 1 RPC/s and the RPC rate 'dd.500' to 1000 RPC/s. If TBF policy works well, when root user ran 'dd' command, an OSS service partition will never handle more than 1 RPC of it. And when user 500 ran 'dd' command, an OSS service partition will never handle more than 1000 RPC of it. Actually this works well except following condition. When we ran 'dd' using user root and user 500 at the same time, on the same client, writing to the same OST, the performance of user 500 will decline dramatically, i.e. the performance of user 500 is highly affected by the user root. Here is the result that we got running following command. dd if=/dev/zero of=/mnt/lustre/fileX bs=1048576 count=XXXX 1. When user 500 ran 'dd' alone, the performance is about 80 MB/s. This is normal because the OSS's performance has an upper limit of about 80 MB/s 2. When user root ran 'dd' alone, the performance is about 2 MB/s. This is normal too, because the OSS has two partition and each has a limit of 1 RPC/s. 1 MB/RPC * 1 RPC/s * 2 = 2 MB/s 3. When user root ran 'dd', and user 500 ran 'dd' on another client, user 500 will get performance of about 80 MB/s and user root will get performance of about 2 MB/s. Please not that different processes writes to differnt files. No matter what the stripes of the files are, we get similar results. There are expected normal results. 4. When user root ran 'dd', and user 500 ran 'dd' on another client, user 500 will get performance of about 80 MB/s and user root will get 2 MB/s. That's normal too. 5. When user root ran 'dd', and user 500 ran 'dd' on the same client, but they write to different OSTs (i.e. the stripe indexes of these files are different), user 500 will get performance of about 80 MB/s and user root will get 2 MB/s. That's normal too. 6. When user root ran 'dd', and user 500 ran 'dd' on the same client, and they write to the same OST (i.e. the stripe indexes of these files are the same), the performance of user 500 will declines to about 2 MB/s when user root is writing too. The performance of user 500 will go up immediately to 80 MB/s after user root completes its writing. The result 6 is really strange. We think it is not likely that server side codes cause the problem since result 4 is normal. And result 5 implies that it is the OSC ranther than the OSS throttles RPC rate wrongly. Maybe when some RPCs from an OSC are appending, the OSC does not send any more RPCs? I guess maybe some mechanisms of OSC make it works like this, e.g. max RPC in flight limit? I've tried to enlarge max_rpc_in_flight argument of OSCs but got no luck. Any suggestions you could provide to us would be greatly appreciated! Thank you in advance!

Kit Westneat (Inactive) added a comment - 21/Aug/13 3:25 PM

Would it be possible to get the fix-version set to 2.5? We want to make sure it doesn't slip off the radar at the last minute or anything. I think that's the procedure we talked about on the CDWG call last week.

Thanks.

Kit Westneat (Inactive) added a comment - 21/Aug/13 3:25 PM Would it be possible to get the fix-version set to 2.5? We want to make sure it doesn't slip off the radar at the last minute or anything. I think that's the procedure we talked about on the CDWG call last week. Thanks.

Li Xi (Inactive) added a comment - 16/Aug/13 7:44 AM

The current Lustre NRS TBF codes change a lot on this basis of the last version. First, we optimize the main framework of TBF policy, which makes it easier to add new supports, e.g. Job ID support, UID/GID support. Based on that, we add the support of Job ID. Now we can utilize Job Stat mechanism to set limit to different jobs. And ofcourse, we fix some defects of TBF codes too.

Li Xi (Inactive) added a comment - 16/Aug/13 7:44 AM The current Lustre NRS TBF codes change a lot on this basis of the last version. First, we optimize the main framework of TBF policy, which makes it easier to add new supports, e.g. Job ID support, UID/GID support. Based on that, we add the support of Job ID. Now we can utilize Job Stat mechanism to set limit to different jobs. And ofcourse, we fix some defects of TBF codes too.

Shuichi Ihara (Inactive) added a comment - 29/Jul/13 10:39 AM

This is initial version of Lustre NRS TBF design document.

Shuichi Ihara (Inactive) added a comment - 29/Jul/13 10:39 AM This is initial version of Lustre NRS TBF design document.

Shuichi Ihara (Inactive) added a comment - 14/Jul/13 2:46 PM

This is an inital benchmark resutls we did a couple of weeks ago. Tested on 16 clients with various token policy for few clients. We will test again on the latest codes and push results on here later.

Shuichi Ihara (Inactive) added a comment - 14/Jul/13 2:46 PM This is an inital benchmark resutls we did a couple of weeks ago. Tested on 16 clients with various token policy for few clients. We will test again on the latest codes and push results on here later.

Li Xi (Inactive) added a comment - 13/Jul/13 8:22 AM

Hi Andreas,

That sounds really interesting. But in order to implement multi-layered policies, I guess the framework of NRS needs a lot of changes, right?

Li Xi (Inactive) added a comment - 13/Jul/13 8:22 AM Hi Andreas, That sounds really interesting. But in order to implement multi-layered policies, I guess the framework of NRS needs a lot of changes, right?

Li Xi (Inactive) added a comment - 13/Jul/13 8:18 AM

Hi Nathan,

Yes, you are right. The deadlines in the two statements mean different.

The former one refers to a time point that one of the NID's RPCs should be handled in order to achieve the deserved RPC rate. If the deadline is missed, nothing bad will happen, except the RPC rate is slower than the NID expects.

The latter one refers to traditional deadline of a RPC.

Sorry for my misusing.

Li Xi (Inactive) added a comment - 13/Jul/13 8:18 AM Hi Nathan, Yes, you are right. The deadlines in the two statements mean different. The former one refers to a time point that one of the NID's RPCs should be handled in order to achieve the deserved RPC rate. If the deadline is missed, nothing bad will happen, except the RPC rate is slower than the NID expects. The latter one refers to traditional deadline of a RPC. Sorry for my misusing.

Andreas Dilger added a comment - 13/Jul/13 7:05 AM

My preference for the long term is that we have a single "super" NRS policy that does many things at one time, or we use layers of NRS policies at the same time to achieve the optimum result.

In the case of TBF it would be possible to have it provide QOS guarantees be giving out credits to exports in an unfair manner, so clients with guaranteed bandwidth will get it, and clients with bandwidth limits will be throttled as needed.

Combining TBF and ORR seems possible, either by using TBF as a first layer filter to avoid massive client unfairness, passing the "unthrottled" requests through to the ORR filter to sort it within and between objects.

Andreas Dilger added a comment - 13/Jul/13 7:05 AM My preference for the long term is that we have a single "super" NRS policy that does many things at one time, or we use layers of NRS policies at the same time to achieve the optimum result. In the case of TBF it would be possible to have it provide QOS guarantees be giving out credits to exports in an unfair manner, so clients with guaranteed bandwidth will get it, and clients with bandwidth limits will be throttled as needed. Combining TBF and ORR seems possible, either by using TBF as a first layer filter to avoid massive client unfairness, passing the "unthrottled" requests through to the ORR filter to sort it within and between objects.

Nathan Rutman added a comment - 12/Jul/13 9:00 PM

it schedules the handling of RPC between different NIDs according to their deadline

current codes does not consider deadline of a RPC

I am having trouble reconciling those two statements. Does the first refer to a different kind of deadline?

Nathan Rutman added a comment - 12/Jul/13 9:00 PM it schedules the handling of RPC between different NIDs according to their deadline current codes does not consider deadline of a RPC I am having trouble reconciling those two statements. Does the first refer to a different kind of deadline?

Li Xi (Inactive) added a comment - 12/Jul/13 1:46 PM

Hi Nathan,

The current codes does not consider deadline of a RPC yet. But yeah, I think more cases should be tested to make sure it is not a big problem.

Li Xi (Inactive) added a comment - 12/Jul/13 1:46 PM Hi Nathan, The current codes does not consider deadline of a RPC yet. But yeah, I think more cases should be tested to make sure it is not a big problem.

Li Xi (Inactive) added a comment - 12/Jul/13 1:39 PM

Hi Andreas,

The policy throttles RPC based on TBF algorithm. But it schedules the handling of RPC between different NIDs according to their deadline, so it looks more like a CRR-N policy rather than ORR policy. Yes, we are trying change the policy in order to limit RPC rate of different users/groups/jobs. But I havn't got any idea about how to conjunct it with the ORR NRS policy. Any advice?

Sure, we are running benchmarks and writing description about it. The test results and documents will come along with code improvement soon.

Li Xi (Inactive) added a comment - 12/Jul/13 1:39 PM Hi Andreas, The policy throttles RPC based on TBF algorithm. But it schedules the handling of RPC between different NIDs according to their deadline, so it looks more like a CRR-N policy rather than ORR policy. Yes, we are trying change the policy in order to limit RPC rate of different users/groups/jobs. But I havn't got any idea about how to conjunct it with the ORR NRS policy. Any advice? Sure, we are running benchmarks and writing description about it. The test results and documents will come along with code improvement soon.

People

Assignee:: Lai Siyao

Reporter:: Li Xi (Inactive)

Votes:: 0 Vote for this issue

Watchers:: 13 Start watching this issue

Dates

Created:: 05/Jul/13 1:09 AM

Updated:: 26/Sep/18 7:25 AM

Resolved:: 07/Mar/14 10:07 AM