[LU-5620] nrs tbf policy based on opcode Created: 15/Sep/14 Updated: 20/Apr/17 Resolved: 10/Feb/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.10.0 |
| Type: | New Feature | Priority: | Minor |
| Reporter: | wu libin (Inactive) | Assignee: | wu libin (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | patch |
| Issue Links: |
|
| Epic/Theme: | Performance |
| Rank (Obsolete): | 15725 |
| Description |
|
Sometimes it is necessary to limit read or write operations: limiting the RPC rate of one type of operation can improve the performance of the others. So, we want to be able to limit performance per operation (opcode). |
| Comments |
| Comment by wu libin (Inactive) [ 15/Sep/14 ] |
|
Here is the patch: http://review.whamcloud.com/11918 |
| Comment by wu libin (Inactive) [ 15/Sep/14 ] |
|
It can be used by starting TBF rules that each match an opcode and set a rate, e.g. 200 RPCs/s per rule. |
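For reference, opcode-based TBF rules take roughly this shape (syntax as documented for the TBF policy in the Lustre manual after this patch landed; the rule names `ost_r`/`ost_w` here are illustrative, not from this ticket):

```shell
# Enable the TBF policy keyed on opcode for the ost_io service
lctl set_param ost.OSS.ost_io.nrs_policies="tbf opcode"

# Limit reads and writes to 200 RPCs/s each (rule names are examples)
lctl set_param ost.OSS.ost_io.nrs_tbf_rule="start ost_r opcode={ost_read} rate=200"
lctl set_param ost.OSS.ost_io.nrs_tbf_rule="start ost_w opcode={ost_write} rate=200"
```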
| Comment by Peter Jones [ 15/Sep/14 ] |
|
Emoly, can you please review this patch? Thanks, Peter |
| Comment by Andreas Dilger [ 16/Dec/16 ] |
|
Rather than specifying actual RPC rate limits for TBF rules, would it be possible to have the specified rates be relative to the actual performance of the device? This would make the TBF rules work better in cases where the actual performance doesn't match the total number of RPCs specified for the different TBF rules. For example, if there is an OST with 20000 read IOPS but only 12000 write IOPS, it isn't very easy for users/admins to set up rules that spread the IOPS evenly between jobs/nodes/opcodes. This gets even more complex if e.g. a write takes much more time than a setattr, and a read takes more time than a getattr.

What I'm thinking is that the number of RPCs handled by each bucket is proportional to the weight of the rules that match, and if all RPCs matching the buckets are handled, then any remaining RPCs in the queue are opportunistically handled until it is time to refresh the buckets. That avoids wasting resources on the server.

This would need to be an option, so that TBF limits are either "hard" (as they are currently) and cannot be exceeded, or "soft" as I propose, providing only proportional limits between classes rather than absolute limits. It may be desirable for some users to have a global setting to switch between hard and soft limits. |
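A minimal sketch of this "soft" proportional scheme (class and function names are my own, not from any patch): each refresh interval splits the RPCs the device actually completed across classes by weight, then spends any leftover capacity opportunistically so the server never sits idle while work is queued:

```python
class SoftTBFClass:
    """One TBF class with a relative weight instead of an absolute RPC rate."""
    def __init__(self, name, weight):
        self.name = name
        self.weight = weight
        self.queue = []          # pending RPCs

def refill_budgets(classes, total_rpcs):
    """Split the RPCs the device actually completed last interval across
    the classes that have queued work, in proportion to their weights."""
    total_weight = sum(c.weight for c in classes if c.queue) or 1
    return {c.name: (total_rpcs * c.weight // total_weight if c.queue else 0)
            for c in classes}

def schedule(classes, total_rpcs):
    """Dequeue up to each class's proportional budget, then hand out any
    remaining capacity opportunistically (the work-conserving phase)."""
    budgets = refill_budgets(classes, total_rpcs)
    served = []
    for c in classes:
        n = min(budgets[c.name], len(c.queue))
        served += [c.queue.pop(0) for _ in range(n)]
    leftover = total_rpcs - len(served)
    for c in classes:
        while leftover > 0 and c.queue:
            served.append(c.queue.pop(0))
            leftover -= 1
    return served
```

With weights 2:1 and 9 completed RPCs per interval, the two classes get 6 and 3 slots; a class with an empty queue forfeits its share to the others, which is the "soft" behaviour as opposed to a hard rate cap.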
| Comment by Qian Yingjin (Inactive) [ 19/Dec/16 ] |
|
1. Our TBF is not a strict TBF algorithm. In our actual implementation, we do not need to insert tokens every 1/r seconds, which would be rather inefficient. Instead, the tokens in a bucket are only updated, according to the elapsed time and the token rate, when the class queue is actually able to dequeue an RPC request. Our policy uses a global timer whose expiration time is always set to the earliest deadline among the classes (buckets), and all classes are sorted according to their deadlines. When the timer expires, the class with the smallest deadline is selected, and the first RPC request in that class queue is dequeued and handled by an idle service thread. 2. Different scheduler types within a class: FIFO, ORR (necessary?). 3. The I/O service is preemptable: a class with high priority can preempt a serving class with lower priority. 4. This scheduler can implement proportional fair scheduling on a heavily congested server (RPC queue depth reaching thousands or tens of thousands), but it is not very suitable for a lightly loaded server. Any suggestions?
|
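The lazy token update in point 1 can be sketched as follows (a simplified model for illustration, not the actual ptlrpc code; names and the burst-depth parameter are assumptions):

```python
class LazyTokenBucket:
    """Tokens are not inserted every 1/r seconds; instead the balance is
    recomputed from the elapsed time whenever a dequeue is attempted."""
    def __init__(self, rate, depth, now=0.0):
        self.rate = rate      # tokens (RPCs) per second
        self.depth = depth    # burst size: maximum tokens accumulated
        self.tokens = depth
        self.last = now

    def try_dequeue(self, now):
        """Refresh the balance lazily, then dequeue one RPC if possible."""
        self.tokens = min(self.depth,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

    def next_deadline(self, now):
        """Earliest time this class can dequeue again; the global timer
        is armed to the smallest such deadline over all classes."""
        if self.tokens >= 1.0:
            return now
        return now + (1.0 - self.tokens) / self.rate
```

A class that is throttled simply reports a deadline in the future; the scheduler sorts classes by deadline and sleeps until the smallest one, which is the timer behaviour described above.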
| Comment by Andreas Dilger [ 10/Jan/17 ] |
|
It should be noted that the kernel CFQ IO scheduler hurts IO performance significantly, so it should not necessarily be used as the basis for NRS. Otherwise, this proposal sounds interesting. I think there definitely should be some (optional) RPC reordering (like ORR) within a class to improve disk I/O ordering and increase the size of disk IO. Ideally, NRS would be able to merge multiple small read or write RPCs from different clients into a single disk IO so that the RAID device sees a single large IO submission. |
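The RPC-merging idea can be sketched as a simple sort-and-coalesce pass over queued (offset, length) extents (purely illustrative; the real NRS/ORR code orders requests by backend object offset inside the service queue):

```python
def merge_rpcs(rpcs, max_io):
    """Sort queued read/write RPCs by offset and coalesce adjacent or
    overlapping ranges into larger IOs, capped at max_io bytes, so the
    RAID device sees fewer, bigger submissions.
    Each RPC is an (offset, length) tuple."""
    merged = []
    for off, length in sorted(rpcs):
        if merged:
            last_off, last_len = merged[-1]
            # contiguous/overlapping with the previous IO and still under the cap?
            if off <= last_off + last_len and (off + length) - last_off <= max_io:
                merged[-1] = (last_off,
                              max(last_off + last_len, off + length) - last_off)
                continue
        merged.append((off, length))
    return merged
```

For example, four 4 KB RPCs from different clients at offsets 0, 4, 8 and 16 (units of KB, gap before 16) collapse into one 12 KB IO plus one 4 KB IO.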
| Comment by Qian Yingjin (Inactive) [ 11/Jan/17 ] |
|
I think the main reason that kernel CFQ hurts IO performance is that it makes the disk idle while waiting for the next request on a certain CFQ queue. The corresponding tuning parameter is slice_idle, which has a non-zero value by default. See the following reference in kernel/Documentation/block/cfq-iosched.txt for details:

slice_idle: By default slice_idle is a non-zero value. That means by default we idle on queues/service trees. This can be very helpful on highly seeky media like single spindle SATA/SAS disks where we can cut down on overall number of seeks and see improved throughput. Setting slice_idle to 0 will remove all the idling on queues/service tree level and one should see an overall improved throughput on faster storage.

So depending on storage and workload, it might be useful to set slice_idle=0. In general I think for SATA/SAS disks and software RAID of SATA/SAS disks |
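For reference, the tunable lives under the block device's iosched directory when CFQ is the active scheduler (the device name `sda` below is illustrative; writing the value requires root):

```shell
# Check the active scheduler and the current idle slice
cat /sys/block/sda/queue/scheduler
cat /sys/block/sda/queue/iosched/slice_idle

# Disable queue idling, as suggested above for fast or RAID-backed storage
echo 0 > /sys/block/sda/queue/iosched/slice_idle
```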
| Comment by Gerrit Updater [ 10/Feb/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/11918/ |
| Comment by Peter Jones [ 10/Feb/17 ] |
|
Landed for 2.10 |