Details

Type: New Feature
Resolution: Unresolved
Priority: Medium
Fix Version/s: None
Affects Version/s: None
Labels:
None

Severity:
3
Rank (Obsolete):
9223372036854775807

Description

Introduction

As a core storage system for large-scale High-Performance Computing (HPC), the Lustre file system faces significant challenges in providing Quality of Service (QoS) guarantees in multi-tenant environments. The Lustre Network Request Scheduler (NRS) framework provides the infrastructure for scheduling PTLRPC requests. By reordering, classifying, or throttling requests, it helps prevent client starvation and presents a workload that the backend filesystem can optimize more easily, which serves various QoS objectives. The NRS framework currently supports several policies, including FIFO, Round Robin (RR), and Token Bucket Filter (TBF). Of these, the TBF policy rate-limits RPCs with a token bucket mechanism applied per class, where classes are defined by attributes such as NID, JobID, or Nodemap. However, TBF cannot use idle or surplus server bandwidth, and it lacks multi-dimensional QoS control such as reservation guarantees, proportional weight allocation, and hard limits.
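To make the idle-bandwidth limitation concrete, the following is a minimal sketch of a generic token bucket. It is a simplified model, not Lustre's TBF implementation; the class, its fields, and the helper function are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Generic token bucket (illustrative; not Lustre's TBF code)."""
    rate: float        # tokens (RPCs) added per second
    depth: float       # maximum number of tokens the bucket can hold
    tokens: float = 0.0
    last: float = 0.0  # time of the last refill, in seconds

    def try_consume(self, now: float) -> bool:
        """Refill for the elapsed time, then spend one token if available."""
        elapsed = now - self.last
        self.tokens = min(self.depth, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def served_in_one_second(bucket: TokenBucket, offered: int) -> int:
    """Offer `offered` evenly spaced RPCs over one second; count admissions."""
    return sum(bucket.try_consume(now=(i + 1) / offered) for i in range(offered))


# A single client on an otherwise idle server still gets only about `rate`
# RPCs per second, however many it offers.
alone = TokenBucket(rate=100.0, depth=3.0)
admitted = served_in_one_second(alone, offered=1000)
```

In this model the admitted count is bounded by the configured rate regardless of how much capacity the server has left, which is the behavior the reservation/weight/limit model is meant to improve on.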

mClock is an I/O resource allocation algorithm proposed by Gulati et al. in 2010 (OSDI). Designed for virtualized environments, mClock expresses QoS requirements along three dimensions: Reservation, Weight (Proportion), and Limit. It maintains fair allocation even as the capacity of the underlying resource fluctuates. Its distributed variant, dmClock, extends this to distributed storage environments, maintaining per-client reservations, limits, and proportional fairness globally through schedulers that run independently on each node. It is widely regarded as one of the most elegant and comprehensive algorithms for resolving resource contention in multi-tenant distributed storage, and it has been adopted by mainstream distributed storage systems such as Ceph, which serves as a mature reference for production-grade QoS.
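The three dimensions can be illustrated with a small sketch of mClock-style tagging and two-phase selection, following the scheme described in Gulati et al. (2010). This is an illustration under stated assumptions, not the proposed NRS policy: the `Client` class and `dequeue` function are hypothetical, the idle-client tag adjustment from the paper is omitted, and a linear scan stands in for the heaps a real scheduler would use.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional

R, P, L = 0, 1, 2  # indices into a [reservation, proportional, limit] tag triple


@dataclass
class Client:
    """Per-client QoS state (illustrative names)."""
    name: str
    reservation: float  # minimum requests per second
    weight: float       # relative share of surplus capacity
    limit: float        # maximum requests per second
    last: Optional[list] = None                  # tags of the latest request
    queue: deque = field(default_factory=deque)  # pending tag triples, FIFO

    def enqueue(self, now: float) -> None:
        """Tag an arriving request: each tag advances by 1/parameter from the
        previous tag, but never falls behind the current time."""
        if self.last is None:
            tags = [now, now, now]
        else:
            tags = [
                max(self.last[R] + 1.0 / self.reservation, now),
                max(self.last[P] + 1.0 / self.weight, now),
                max(self.last[L] + 1.0 / self.limit, now),
            ]
        self.last = list(tags)
        self.queue.append(tags)


def dequeue(clients, now):
    """Return the name of the client to serve next, or None if none may run."""
    active = [c for c in clients if c.queue]
    if not active:
        return None
    # Phase 1 (constraint-based): serve the smallest reservation tag if due.
    due = min(active, key=lambda c: c.queue[0][R])
    if due.queue[0][R] <= now:
        due.queue.popleft()
        return due.name
    # Phase 2 (weight-based): among clients under their limit, serve the
    # smallest proportional tag.
    eligible = [c for c in active if c.queue[0][L] <= now]
    if not eligible:
        return None
    chosen = min(eligible, key=lambda c: c.queue[0][P])
    chosen.queue.popleft()
    # Service received in the weight phase should not count against the
    # reservation, so pull the client's reservation tags back by 1/r.
    shift = 1.0 / chosen.reservation
    for tags in chosen.queue:
        tags[R] -= shift
    chosen.last[R] -= shift
    return chosen.name


# Two backlogged clients with equal reservations and weights 1:2.
a = Client("A", reservation=1.0, weight=1.0, limit=1024.0)
b = Client("B", reservation=1.0, weight=2.0, limit=1024.0)
for _ in range(20):
    a.enqueue(0.0)
    b.enqueue(0.0)
order = [dequeue([a, b], 0.5) for _ in range(12)]
```

In this run each client first receives its reserved request, after which the remaining dequeues are split by weight, so B is served about twice as often as A.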

This project aims to integrate the mClock algorithm into the NRS framework to provide Lustre with minimum bandwidth guarantees (Reservation), maximum bandwidth limits (Limit), and weight-based allocation of surplus resources (Proportion).

Goals

Our objective is to design and implement an NRS mClock Scheduler, deeply integrating the mClock algorithm's QoS capabilities with the Lustre NRS framework to achieve the following core goals:

Rich QoS Semantics: Support multiple classification categories, such as NID, Nodemap, Opcodes (primarily read/write), UID/GID/ProjID, or their combinations. Provide three-dimensional QoS control: Reservation, Weight, and Limit.

Guaranteed Minimums and Capped Maximums: Guarantee a minimum IOPS or bandwidth (BPS) for critical tenants while capping the maximum IOPS or bandwidth of others, such as background tasks.

Dynamic Compensation: When the storage backend has idle capacity, allocate the surplus to active jobs in proportion to their weights.

Low Overhead: Maintain a scheduling complexity of O(log N) to ensure high performance in environments with tens of thousands of concurrent RPCs.

Global Fairness: Sustain proportional resource allocation fairness between clients across multi-server and multi-client scenarios.

Adaptive Scheduling: Dynamically adjust scheduling strategies during workload fluctuations to safeguard reserved bandwidth and prevent resource waste.

Manageability and Operability: Support runtime policy switching, dynamic rule configuration, and comprehensive statistical monitoring.
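For the Global Fairness goal, the dmClock variant described in Gulati et al. (2010) adjusts the tag update so that each server accounts for service the client received elsewhere. The sketch below assumes that formulation: `delta` is the number of the client's requests completed across all servers since its previous request to this server, and `rho` is the subset of those served under the reservation phase. How these counters would be carried in Lustre RPCs is not specified here, and the function name and signature are hypothetical.

```python
def next_tags(prev, now, reservation, weight, limit, delta=1, rho=1):
    """Compute (R, P, L) tags for a client's next request at one server.

    prev        -- (R, P, L) tags of the client's previous request here
    delta, rho  -- completion counts reported by the client (assumed inputs);
                   with delta == rho == 1 this is the single-server update
    """
    prev_r, prev_p, prev_l = prev
    return (
        max(prev_r + rho / reservation, now),   # reservation tag
        max(prev_p + delta / weight, now),      # proportional-share tag
        max(prev_l + delta / limit, now),       # limit tag
    )


# Single-server behavior versus a client that was also served elsewhere.
local_only = next_tags((1.0, 1.0, 1.0), now=0.0,
                       reservation=4.0, weight=2.0, limit=8.0)
multi_server = next_tags((1.0, 1.0, 1.0), now=0.0,
                         reservation=4.0, weight=2.0, limit=8.0,
                         delta=4, rho=2)
```

A client that has already been served by other servers gets later tags on this one, so its total service across servers stays close to its configured reservation, weight, and limit without the servers having to synchronize with each other.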

References

Gulati, A., Merchant, A., & Varman, P. (2007). d-clock: distributed QoS in heterogeneous resource environments. ACM SIGMETRICS.
Gulati, A., Merchant, A., & Varman, P. (2010). mClock: Handling Throughput Variability for Hypervisor IO Scheduling. OSDI'10.
Qian, Y., Barton, E., Wang, T., Puntambekar, N., & Dilger, A. (2009). A Novel network request scheduler for a large scale storage system. Computer Science - Research and Development.
Lustre NRS Architecture. (2010). Lustre Wiki.
Wang, Y., & Merchant, A. (2007). Proportional-share scheduling for distributed storage systems. USENIX FAST.

Issue Links

is related to

LU-20107 Global Distributed QoS for Lustre

Open

mClock NRS Scheduler