Details
- Type: Improvement
- Resolution: Unresolved
- Priority: Major
Description
It would be useful to have a default NRS TBF rule that batched IOs on the servers based on their JobID, in a manner similar to how ORR or CRR batch IOs based on the Object ID or client NID.
Aggregating and completing IOs from a specific job is beneficial for aggregate IO throughput optimization. While it is locally unfair for one job to be prioritized over another, this avoids both jobs being slowed down while their IOs compete with each other and are interleaved "fairly" to the storage. Prioritizing any one JobID would at least allow that job to finish its IO first and get on with computation (presumably no longer generating IO), after which the other JobIDs can complete their IO with less contention. The IO completion time would probably be comparable for the last JobID, but may even be improved if the reduction in contention allows the IO to be more efficient (i.e. a read- or write-only workload vs. mixed read/write from multiple jobs) and to have better allocation at the OSD level if there is less concurrency.
An implementation challenge would be ensuring that the same JobID is prioritized across all MDTs/OSTs, so that one job actually finishes its IO first and does not have uneven completion times across targets. Self-balancing systems might do something (arbitrary) like prioritizing IO based on the lowest JobID name, since this can be determined uniformly across targets without any central control. De-prioritized jobs would earn a "credit" toward a later priority boost (e.g. GIFT: A Coupon Based Throttle-and-Reward Mechanism for Fair and Efficient I/O Bandwidth Management on Parallel Storage Systems) so that overall the IO is fair and "cp" does not always have priority over "dd" when the JobID is "procname_uid".
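As a rough illustration of the GIFT-style coupon idea, the sketch below (hypothetical names, not existing Lustre code) keeps a per-JobID credit counter: a job that loses an arbitration round earns a coupon, accumulated coupons raise its effective priority in later rounds, and a winning job spends its credit, so the same JobID cannot win indefinitely.

```c
#include <stdint.h>

/* Hypothetical per-JobID accounting for a GIFT-style coupon scheme.
 * None of these names exist in Lustre; this is only a sketch of the
 * throttle-and-reward bookkeeping described above. */
struct tbf_job_credit {
	uint64_t tjc_credit;	/* coupons earned while deprioritized */
};

/* Effective priority: base priority plus earned credit, so a job that
 * has repeatedly lost arbitration eventually outranks the usual winner. */
static uint64_t tbf_job_prio(uint64_t base, const struct tbf_job_credit *jc)
{
	return base + jc->tjc_credit;
}

/* Called once per arbitration round on each job: the winner spends its
 * accumulated credit, each loser earns one coupon. */
static void tbf_job_account(struct tbf_job_credit *jc, int won)
{
	if (won)
		jc->tjc_credit = 0;
	else
		jc->tjc_credit++;
}
```

With equal base priorities, a job that loses one round immediately outranks the previous winner in the next round, which is the "overall fair" behavior the description asks for.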
Another approach would be to coordinate scheduling of the JobID based on the current time, which should be synchronized across at least server nodes with NTP. Something like hash(jobid) % 10 == ktime_get_real() % 10 to approximately distribute JobIDs uniformly across 10 time slices per second and all servers would prioritize the same JobID at the same time. This is not in itself sufficient to handle the general case, but gives some idea of a potential solution. It would likely need some other hash/modulus to order JobIDs within a time slice if there are hash collisions, so that the servers still schedule the same JobID at the same time, and backfill empty time slices in the same manner.
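The time-slice idea above can be sketched as follows. This is illustrative user-space code, not existing Lustre code: the hash functions and the 100 ms slice width are assumptions, chosen so that NTP-synchronized servers computing from the same wall-clock time and the same JobID string independently agree on which job is currently prioritized, with a second hash as a deterministic tiebreak within a slice.

```c
#include <stdint.h>

#define NRS_TBF_NSLICES	10	/* assumed: 10 time slices per second */

/* djb2 string hash; a stand-in for whatever hash would actually be used.
 * Any hash works as long as every server uses the same one. */
static uint32_t jobid_hash(const char *jobid)
{
	uint32_t h = 5381;

	while (*jobid)
		h = h * 33 + (unsigned char)*jobid++;
	return h;
}

/* Current slice index derived from wall-clock nanoseconds (e.g. from
 * ktime_get_real()), using 100 ms slices so that all NTP-synchronized
 * servers compute the same value at the same moment. */
static unsigned int current_slice(uint64_t now_ns)
{
	return (now_ns / 100000000ULL) % NRS_TBF_NSLICES;
}

/* A JobID is prioritized when its hash falls in the current slice,
 * uniformly distributing JobIDs across the slices. */
static int jobid_prioritized(const char *jobid, uint64_t now_ns)
{
	return jobid_hash(jobid) % NRS_TBF_NSLICES == current_slice(now_ns);
}

/* Secondary key (FNV-1a-style, assumed) to order colliding JobIDs within
 * one slice identically on every server, and to pick a backfill JobID
 * for otherwise-empty slices in the same deterministic manner. */
static uint32_t jobid_tiebreak(const char *jobid)
{
	uint32_t h = 2166136261u;

	while (*jobid)
		h = (h ^ (unsigned char)*jobid++) * 16777619u;
	return h;
}
```

Because every input is either globally agreed (wall-clock time) or carried in the request (JobID), no central coordination is needed; the open question remains how to handle slices whose JobIDs have no pending IO.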
Issue Links
- is related to
  - LU-18269 NRS TBF bucket prioritization (Open)
  - LU-18179 Implementation of Round-Robin/Fair Share response with Token Bucket Filters (Open)
  - LU-17296 NRS TBF default rules (Open)
  - LU-20090 Per-Rule Scheduling Class type for NRS TBF (Open)
  - LU-13031 store JobID of program that created file in inodes at create time (Resolved)
  - LU-17503 IO500: improve NRS TBF to sort requests by object offset for ior-hard-write (Open)
  - LU-17166 add NRS TBF rule for projid (Resolved)