1.
Conceptually, in a traditional TBF, tokens are inserted into the bucket every 1/r seconds, where r is the token rate set by the administrator. However, since there may be many buckets in the system with different rate settings, this method is not very efficient. In some implementations (e.g. the paper "IOFlow: A Software-Defined Storage Architecture"), tokens are replenished into the buckets by an interval timer with period T (e.g. T = 10ms), adding T*r tokens on each expiration. But this method does not work in all cases. For example, when r < 1/T, the calculated replenishment T*r is less than one token, resulting in incorrect rate control.
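As a rough illustration only (plain C, not Lustre code), the snippet below computes the per-tick replenishment T*r for two example rates and shows the case where it rounds down to zero whole tokens:

#include <stdio.h>

int main(void)
{
	double T = 0.010;                 /* replenishment interval: 10ms */
	double rates[] = { 320.0, 50.0 }; /* token rates r, in tokens/second */

	for (int i = 0; i < 2; i++) {
		double added = T * rates[i];
		/* With whole tokens, the bucket gains (long)added per tick. */
		printf("r=%.0f: T*r=%.2f -> %ld whole token(s) per tick\n",
		       rates[i], added, (long)added);
	}
	return 0;
}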
Our TBF is not a strict TBF algorithm. In our actual implementation we do not insert tokens every 1/r seconds, which would be rather inefficient. Instead, the tokens in a bucket are only updated, based on the elapsed time and the token rate, when the class queue actually gets a chance to dequeue an RPC request. Our policy uses a single global timer whose expiration time is always set to the earliest deadline among the classes (buckets), and all classes are sorted by their deadlines. When the timer expires, the class with the smallest deadline is selected, and the first RPC request in that class's queue is dequeued and handled by an idle service thread.
In the current implementation:
class.deadline = class.last_check_time + 1/r;
Our TBF assigns deadlines spaced by increments of 1/r to successive requests. If all requests are scheduled in order of their deadline values, the application receives service in proportion to r. However, this cannot guarantee that sequential requests are scheduled as a batch; it may even destroy the sequentiality of the I/O requests, resulting in a performance regression.
So I am considering a new way to set the deadline of a class, which makes the number of requests served per batch configurable in the time domain:
start RuleName jobid={dd.0} batch_num=32 rate=320
class.deadline = class.last_check_time + batch_num * 1/r
While dequeuing requests, we make sure the TBF policy keeps selecting the class served last as long as it still has available tokens. In this way, the batch_num requests belonging to a class can be handled as one batch per round while the rate control is still maintained.
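Below is a minimal sketch of this batched deadline update. The struct fields and function name are hypothetical and only illustrate the idea; with batch_num = 1 it reduces to the per-request deadline spacing shown earlier.

#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Hypothetical per-class state; field names are illustrative only. */
struct tbf_class {
	uint64_t last_check_time;  /* ns */
	uint64_t deadline;         /* ns */
	uint64_t rate;             /* RPCs per second (r) */
	uint32_t batch_num;        /* requests served per deadline period */
	uint32_t batch_left;       /* requests remaining in current batch */
};

/* Called when this class has the smallest deadline and is allowed to
 * dequeue: serve up to batch_num requests back to back, then push the
 * deadline forward by batch_num/r so the long-term rate is still r. */
void tbf_class_advance_deadline(struct tbf_class *cls, uint64_t now)
{
	if (cls->batch_left == 0) {
		cls->last_check_time = now;
		cls->deadline = now + cls->batch_num * NSEC_PER_SEC / cls->rate;
		cls->batch_left = cls->batch_num;
		/* caller re-inserts the class into the deadline binheap here */
	}
	cls->batch_left--;	/* one request of the batch is dequeued */
}

With the rule above (batch_num=32, rate=320), a class would get 32 back-to-back requests roughly every 100ms.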
2.
During TBF evaluation, we found that:
When the sum of the I/O bandwidth requirements of all classes exceeds the system capacity, every class gets less bandwidth than configured, more or less evenly. Under heavy load on a congested server this results in missed deadlines for some classes. In this case we can set class properties, e.g. a flag "rt" meaning that requests belonging to a class with a high realtime requirement are ensured to be handled as promptly as possible.
batch_num = 1
class.token = (now - class.last_check_time) * r
In the current implementation, when dequeuing a request, if the calculated number of tokens is larger than 1, we simply reset class.last_check_time to the current time, reset the deadline of the class, re-sort the class in the class sorter (binheap), consume one token and then handle the request. But I think we could keep class.deadline unchanged, just consume the token and keep the class at the front of the class sorter. Then the next idle I/O thread will also select this class to serve, until all available tokens are exhausted (similar to the TBF batch scheduling above).
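A rough sketch of this alternative dequeue path, again with illustrative names rather than the real implementation: tokens are replenished lazily from the elapsed time, and the deadline is deliberately left untouched so the class stays at the front of the binheap until its tokens run out.

#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Hypothetical per-class state; names are illustrative only. */
struct tbf_class {
	uint64_t last_check_time;  /* ns */
	uint64_t deadline;         /* ns, deliberately left unchanged here */
	uint64_t rate;             /* RPCs per second (r) */
	uint64_t tokens;           /* whole tokens currently available */
};

/* Returns 1 if a request of this class may be dequeued now, 0 otherwise. */
int tbf_class_consume_token(struct tbf_class *cls, uint64_t now)
{
	if (cls->tokens == 0) {
		/* Lazy replenishment from the elapsed time. */
		uint64_t elapsed = now - cls->last_check_time;
		uint64_t earned = elapsed * cls->rate / NSEC_PER_SEC;

		if (earned == 0)
			return 0;          /* not yet eligible to dequeue */
		cls->tokens = earned;
		cls->last_check_time = now;
		/*
		 * cls->deadline is NOT reset here, so the class keeps its
		 * place at the front of the class sorter (binheap) and the
		 * next idle I/O thread keeps picking it until the tokens
		 * run out, similar to the batch scheduling in section 1.
		 */
	}
	cls->tokens--;
	return 1;
}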
3.
How to implement a min. bandwidth guarantee.
Rule format:
start RuleName jobid={dd.0} reserve=100 rate=300
where reserve is the min. bandwidth guarantee for the classes matching the rule.
As mentioned above, when the total bandwidth requirement of all classes exceeds the system capacity, deadlines are missed and the obtained rate becomes lower than the configured one. The key/value pair "reserve=100" in the above rule defines the min. bandwidth guarantee of the class.
The principle is simple: we maintain the bandwidth requirement given by the reserve parameter as much as possible. There is a fallback class with a very low bandwidth limit (e.g. 5% of the system capacity). Each class measures its own bandwidth every second; if the measured IOPS stays below the reserve value for a certain period, the class sets a 'congested' flag. Requests belonging to classes without a reserve setting are then gradually shifted into the fallback class to ensure the minimum bandwidth guarantee. When all classes with a reserve setting recover to speeds above their reserved values, the 'congested' flag is cleared and the requests of classes without a reserve setting are handled as normal again.
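A hedged sketch of the congestion check described above, with made-up names and a made-up threshold; it assumes a once-per-second callback that compares each reserve class's measured IOPS against its reserve value.

#include <stdbool.h>
#include <stdint.h>

/* Illustrative fields only; not the real class structure. */
struct tbf_class {
	uint64_t reserve;          /* min. guaranteed IOPS, 0 if none */
	uint64_t measured_iops;    /* IOPS measured over the last second */
	uint32_t slow_seconds;     /* consecutive seconds below reserve */
	bool     congested;
};

#define CONGESTION_THRESHOLD 3     /* seconds below reserve before reacting */

/* Called once per second for each class that carries a reserve setting. */
void tbf_reserve_check(struct tbf_class *cls)
{
	if (cls->reserve == 0)
		return;

	if (cls->measured_iops < cls->reserve) {
		if (++cls->slow_seconds >= CONGESTION_THRESHOLD)
			cls->congested = true;   /* start shifting requests of
						  * non-reserve classes into
						  * the fallback class */
	} else {
		cls->slow_seconds = 0;
		cls->congested = false;          /* non-reserve classes are
						  * handled as normal again */
	}
}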
4.
The TBF policy is a non-work-conserving scheduler. I have not yet figured out how to avoid wasting resources on the server.
But we could implement a new classful work-conserving scheduler similar to the Linux CFQ scheduler, with many more features, reusing a lot of the code of the TBF policy.
It can support:
Different I/O priorities: RT, BE, IDLE.
RT: Used for realtime I/O. This scheduling level is given higher priority than any other in the system.
BE: The best-effort scheduling level, which is the default for all classes that have not set a specific I/O priority.
IDLE: The idle scheduling level; a class set to this level only gets I/O service when no one else needs the disk.
Different scheduling types within a class: FIFO, ORR (necessary?)
Different quantum setting policies:
a. timeslice (e.g. serve requests belonging to a class as a batch for a timeslice of 20ms)
b. I/O size (e.g. serve requests belonging to a class as a batch until a total I/O size quantum, e.g. 32MB, is used up)
c. RPC number (e.g. serve requests belonging to a class as a batch until a total RPC number quantum, e.g. 32 RPCs, is used up)
I/O service is preemptable: a class with a higher priority can preempt the currently served class with a lower priority.
The command format:
lctl set_param ost.OSS.ost_io.nrs_policies="CFQ quantum=time"
start ruleName jobid={dd.0} time=20ms type=rt
Or
lctl set_param ost.OSS.ost_io.nrs_policies="CFQ quantum=size"
start ruleName jobid={dd.0} size=32M type=idle
Or
lctl set_param ost.OSS.ost_io.nrs_policies="CFQ quantum=rpcs"
start ruleName jobid={dd.0} rpcs=32 type=be
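To make the quantum idea concrete, here is a small hypothetical sketch (all names invented, not an existing interface) of how a class's batch could be terminated depending on which quantum type the policy was started with.

#include <stdbool.h>
#include <stdint.h>

/* Invented names; only meant to illustrate the three quantum types. */
enum cfq_quantum_type {
	CFQ_QUANTUM_TIME,   /* quantum=time:  e.g. time=20ms */
	CFQ_QUANTUM_SIZE,   /* quantum=size:  e.g. size=32M  */
	CFQ_QUANTUM_RPCS,   /* quantum=rpcs:  e.g. rpcs=32   */
};

struct cfq_batch {
	enum cfq_quantum_type type;
	uint64_t quantum;   /* ns, bytes or RPC count, depending on type */
	uint64_t used;      /* amount of the quantum consumed so far */
};

/* Account one dequeued RPC against the current batch; returns true if the
 * class should keep being served, false if it must yield to the next class. */
bool cfq_batch_charge(struct cfq_batch *b, uint64_t rpc_bytes,
		      uint64_t service_ns)
{
	switch (b->type) {
	case CFQ_QUANTUM_TIME:
		b->used += service_ns;
		break;
	case CFQ_QUANTUM_SIZE:
		b->used += rpc_bytes;
		break;
	case CFQ_QUANTUM_RPCS:
		b->used += 1;
		break;
	}
	return b->used < b->quantum;
}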
This scheduler can implement proportional fair scheduling on a heavily congested server (RPC queue depth reaching thousands or tens of thousands), but it is not very suitable for a lightly loaded server.
Any suggestion?
Landed for 2.10