Loading...

Details

Type: New Feature
Resolution: Unresolved
Priority: Medium
Fix Version/s: None
Affects Version/s: None
Labels:
None

Severity:
3
Rank (Obsolete):
9223372036854775807

Description

Global Distributed QoS for Lustre

1. Overview

The objective of this project is to design a robust Quality of Service (QoS) mechanism for a large-scale distributed Lustre cluster (supporting 10,000+ client nodes and 100+ server nodes). The system ensures Minimum Guarantees (Reservation), Maximum Rate Limiting (Limit), and Dynamic Compensation (Proportion). The design draws inspiration from GPFS Distributed Bandwidth Quota Management and mClock fine-grained scheduling.

2. Core Architecture: The Three-Layer Synergistic Model

To achieve global QoS objectives, the system is divided into three functional layers:

Server Scheduler (SS): NRS scheduer runs on each server node. By default, it employs the mClock scheduling algorithm (with an option for NRS TBF only implementing basic throttling and rate limiting). It enforces QoS parameters assigned by the global coordinator for requests issuing into the NRS schulder on the server-side service partition (CPT). It is responsible for sever-side request dispatch for the service partition, executes the global coordinator's decision, collects I/O statistics, and monitors I/O traffic.
Global Policy Coordinator (GPC): Residing on the control plane (defaulting to the MDT0 server node), the GPC does not process I/O requests directly. Instead, it aggregates I/O statistics from all schedulers, orchestrates and coordinates I/O scheduling across all server-side local schedulers, and distributes the policy decisions. It is a global decision-making layer that, based on the I/O statistics from all local schedulers in the previous period round, It calculates the optimal QoS parameters for the next "Round" based on the global view of the previous round and makes decision for the next coordinated scheduling perioid for each scheduler (i.e. mClock) running on the servers.
Client Congestion Controller (CCC) [Optional]: Upon the completion of an I/O request, the client performs source-end congestion throttling control based on piggybackked information in the reply, such as server load status, backlogged request count and queue depth, to prevent excessive requests buildup at the sercver, overwhelming the server memory and capacity.

3. Periodic Feedback and Quota Pre-allocation

The global coordinator is not involved in the I/O path. It utilizes a "Lease" mechanism similar to GPFS to perform macro-level adjustments across time intervals (Rounds). This lightweight global controller effectively addresses the shortcomings of completely decentralized schemes in terms of global optimality and convergence speed.

3.1 Round-based Scheduling Logic

A "Round" (e.g., 1 second in duration) consists of four distinct phases:

Collection Phase: Each Server Scheduler tracks the actual IOPS/BPS throughput, request backlog depth, disk "Pressure factor," and total bandwidth utilization for every QoS Class Bucket from the previous round. The GPC periodically polls these statistics in batches from all servers using the sliding time window algorithm to obtain the "Demand Intent" of each active class bucket. The statistical information includes:

- IOPS and Bandwidth for each (Server, CPT, Bucket);
- Backlong depth and underlying disk bandwidth utilization for each (Server, CPT, Bucket) during the statistical period.

Calculation Phase (Global Arbitration): The GPC aggregates global demand:

- Under-subscription: If \sum \text{Demand} < \text{Total Capacity}, resources are allocated as requested, and surplus bandwidth is moved to a "Free Pool."
- Over-subscription (Congestion): The GPC adjusts the local Reservation $R_{local}$ and Limit (L_{local}) parameters for each server-side bucket based on global QoS settings (R, W, L) , the number of active buckets and demand expectation of each bucket. Note: the coordinator generally does not adjust the weight value (W) of a class bucket. However, it typically adjusts the local upper limit (L_{local}) based on the reserved bandwidth requirements of each class bucket to ensure the total bandwidth for that class does not exceed the overall limit of L. During this process, if a class bucket remains idle (no any I/O activity) for an extended period, its local R_{local} is set to a minimum value to conserve resources.

Dispatch Phase (Decision Distribution): After receiving the global review, the global coornidator runs a distributed scheduling algorithm to calculate Target IOPS/BPS (R_{local}, W_{local}, L_{local}) for each class bucket on the server-side scheduler for the new period round. It then sends these updated QoS Quota parameters to the local schedulers on each server.

Execution Phase (Local Enforcement): Servers enforce the new local QoS policy for the duration of the round while continuing to collect performance data. These statistics serve as an input for the collection step in the next round, forming a closed-loop feedback control system that allows the system to automically adapt to workload changes.

3.2 Server-side Local mClock Scheduler

Once the mClock scheduler obtains the QoS quota parameters for the current round, it applies the mClock algorithm:

Priority: It prioritizes tags meeting the R_{local} (Reservation) threshold.
Dynamic Proportion: Remaining capacity is allocated to various class buckets based on W (Weight/Proportion).
Throttling: Request traffic exceeding the L_{local} (Limit) for a class bucket is throttled.

4. Key Algorithm Design: Weight Credit Borrowing

This is the core mechanism for Dynamic Compensation:

Idle Detection: If the GPC detects that Class A's actual usage across multiple servers is significantly lower than its global R (Min), it temporarily transfers Class A's "Unused Credits" to a global shared pool.
Dynamic Acceleration: If Class B is marked as "Best-Effort Allowed," the GPC temporarily increases its L_{local} (Limit) for the current round, allowing it to utilize the idle bandwidth capacity of the server.
Instant Reclamation: When the bandwidth demand for a class A increases burstly, the server-side mClock immediately prioritizes Class A via its reservation tags. Concurrently, the GPC will throttle other classes back to its original limit in the subsequent round.

5. Resilience and Scalability

5.1 Emergency Handling

Fault Tolerance: If the connection to the GPC is lost, local schedulers gracefully degrade to standalone mClock mode, and there is TTL (Time To Live) for the QoS parameter (R_{local}, W_{local}, L_{local}), once it expired, it turns back to the default QoS setting, ensuring service continuity while maintaining local fairness, but the global QoS goal may be temporarily broken.
Burst Management: The Reservation mechanism in the buckets can handle short-term bursts by spacing the handling deadline with 1/R_{local}. The GPC can also consider the factor of the Backlog Depth during decision-making, temporarily assigning higher weight to heavily backlogged congested class buckets.

5.2 Optimizing for Large Scale (100+ Server Nodes)

To prevent the GPC from becoming a bottleneck in the distributed global QoS control, the following GPFS-sytle optimizations are adopted:

Hierarchical Aggregation: Servers can be organized into groups (e.g., Divide 100 server nodes into 10 groups). A Group Leader aggregates the group's statistics before reporting to the GPC, reducing the controller's processing overhead. When the number of servers is extremely large, a hierarchical GPC architecture (e.g., cluster-level GQC, rack-level GQC) can be employed to reduce pressure on a single controller.
Batching & Piggybacking: Server pressure indicators and statistics for active class buckets of all service partition schedulers (mClock or TBF) on a server are collected in batches. Preliminary suggestions and QoS decisions for all active class buckets on a server for the next round are also assigned in batches. Server congestion status is piggybacked on RPC replies to allow clients to react in milliseconds, thus clients can self-throttle by tunning the parameter max RPC in flight without waiting for the GPC's period .
Silence Suppression: Idle scheduling class buckets (those with no I/O activity in a time length) are excluded from the telemetry collection to minimize network overhead, reducing statistical overhead.

6. Global QoS Management and Configuration

Administrative Interface: Provides tools or interfaces to configure global QoS rules (just similar to the TBF rule) via command sending to the GPC.
Dynamic Global Rule Management: The GPC handles the distribution, modification, and deletion of rules across all involved server-side mClock/TBF scheduler.
- When adding a new global QoS rule, the GPC could evenly distribute and assign the QoS parameters of the rule among all involved server partition scheduler, adding corresponding local rule with initital assigned Qos parameters to the local mClock scheduler.
- Administrators can modify global QoS rules via an interface or command; The coordinator can dispatch commands to relevant mClock or TBF schedulers for corresponding modifications, or wait for the next scheduling round to reallocate QoS quota parameters.
HPC Integration: The system can coordinate with workload managers like SLURM to ensure that specific job IDs receive guaranteed I/O bandwidth during their execution window.

7. Conclusion

This distributed global QoS architecture solves the convergence challenges of large-scale systems through Global Round Control, protects the storage server from saturation via Client-side Congestion Control, and ensures fair, multi-dimensional resource allocation via Server-side mClock scheduling. By combining the flexibility of mClock with the deterministic control of a GPFS-style coordinator, this design provides a mature, production-grade QoS solution for the most demanding Lustre environments.

Attachments

Issue Links

is related to

LU-20099 mClock NRS Scheduler

Open