# ThemisIO → Lustre NRS: Synthesized Implementation Plan

## Problem Statement

The SC23 paper *"Fine-grained Policy-driven I/O Sharing for Burst Buffers"*
(ThemisIO) demonstrates that statistical-token-based scheduling with opportunity
fairness, transition-matrix composite policies, and δ-delayed global fairness
achieves 13.5–13.7% higher throughput and 59.1–99.8% lower I/O interference
slowdown than FIFO, TBF, and GIFT. Lustre's NRS subsystem already provides the
right request-reordering hook points, but none of the current policies
(FIFO, CRR-N, ORR/TRR, TBF, Delay) implement ThemisIO's core innovations.

The goal: bring ThemisIO's useful ideas into `lustre/ptlrpc/nrs*` as a new
`nrs_fairshare` policy, without breaking NRS semantics or disrupting existing
policies.

## ThemisIO Core Ideas

| Concept | Description | Lustre Analog |
|---------|-------------|---------------|
| **Statistical tokens** | Divide [0,1] into segments proportional to fair-share; draw random number to pick entity. Replaces multi-tier token queues with single flat mechanism. | None—TBF uses deterministic token buckets with admin-set rates. |
| **Transition-matrix composition** | Composite policies (e.g., group→user→size-fair) expressed as chain of transition matrices; product yields flat [0,1] ranges. O(1) runtime per dequeue. | None—TBF rules are flat ordered list with first-match semantics. |
| **Opportunity fairness** | Enforce fairness *only* when demand exceeds capacity. Under-loaded systems pass requests at full speed. | None—TBF always rate-limits. |
| **δ-delayed global fairness** | Periodic all-gather (δ ≈ 50–500 ms) across servers synchronizes job-status table so local fair-share ranges converge to global fairness. | TBF `tbf_global_rate` handles cross-CPT token stealing within one server, but no cross-server mechanism. |
| **Runtime metadata only** | Job-id, user-id, job-size read from I/O request at runtime. No admin-supplied rates or offline profiling. | TBF classifies by jobid/uid/gid/projid/nid/opcode, but rates are admin-configured. |

## Gap Analysis: TBF vs. ThemisIO

1. **Static vs. dynamic allocation** — TBF requires admin-specified RPC/s per
   rule. ThemisIO derives fair shares automatically from active entities.

2. **No composite policies** — TBF rules are a flat ordered list with
   first-match semantics. A combined key `uid+jobid` does NOT encode
   user-then-job fairness correctly—it creates a flat cross-product, not a
   hierarchy. ThemisIO handles this natively via matrix composition.

3. **No opportunity fairness** — TBF always rate-limits, even when idle.

4. **No cross-server coordination** — `tbf_global_rate` handles cross-CPT
   token stealing within one server. No cross-OST/MDS δ-fairness.

5. **Cost model mismatch** — TBF uses `cost_model=rpcs|pages`. ThemisIO uses
   I/O cycles (wall-clock time slicing).

## Architectural Decision: Single New Policy

**Decision:** Build `nrs_fairshare` as a new standalone NRS policy. Do NOT
extend TBF.

**Rationale:**
- TBF's flat first-match rule list cannot express hierarchical fair-share
  without restructuring its core data model. Bolting composite policies onto
  TBF would be nearly as much work as a new policy, while carrying regression
  risk for existing TBF users.
- NRS enforces one-primary-policy-per-head (see `nrs_policy_primary` in
  `struct ptlrpc_nrs`), so "combine fairshare + TBF" means a single new policy,
  not two side-by-side.
- A new additive policy has zero disruption to existing TBF, ORR, CRR-N users.
- The new policy reuses TBF's classifier layer (jobid/uid/gid/projid/nid/opcode
  key extraction) and debugfs patterns, but owns scheduling.

## Design Specification

### File Layout

```
lustre/include/lustre_nrs_fairshare.h   — public header (data structures, enums)
lustre/ptlrpc/nrs_fairshare.c           — policy implementation
```

### Key Data Structures

```c
/*
 * Sharing entity types that can appear in a policy hierarchy.
 */
enum nrs_fs_entity_type {
    NRS_FS_ENTITY_JOBID,
    NRS_FS_ENTITY_UID,
    NRS_FS_ENTITY_GID,
    NRS_FS_ENTITY_PROJID,
    NRS_FS_ENTITY_NID,
    __NRS_FS_ENTITY_TYPE_MAX,
};

/*
 * A sharing entity: one job, one user, one group, etc.
 * Tracks active/inactive state, accumulated I/O cost, and the
 * computed statistical-token range [fs_lo, fs_hi) ⊂ [0,1).
 *
 * Entities are stored in a per-head rhashtable keyed by
 * (entity_type, key_bytes).
 */
struct nrs_fs_entity {
    struct rhash_head       fse_rhash;
    struct rcu_head         fse_rcu;
    refcount_t              fse_ref;

    enum nrs_fs_entity_type fse_type;
    /* Variable-length key (jobid string, uid, nid, etc.) */
    u32                     fse_key_len;
    u8                      fse_key[];          /* flexible array */

    /* Statistical token range assigned by last δ-recalculation.
     * Stored as fixed-point u32 fractions of U32_MAX for fast
     * comparison against get_random_u32().
     */
    u32                     fse_lo;             /* inclusive */
    u32                     fse_hi;             /* exclusive */

    /* Active/backlogged state */
    bool                    fse_active;         /* has queued RPCs */
    ktime_t                 fse_last_seen;      /* heartbeat for liveness */

    /* Per-entity sub-queue of pending requests (FIFO within entity).
     * Optional: offset-ordered sub-queue for ORR-style locality on
     * spinning-disk OSTs (Stage 4 enhancement).
     */
    struct list_head        fse_req_list;
    unsigned long           fse_req_count;

    /* Observability counters */
    u64                     fse_dispatched;     /* total RPCs dispatched */
    u64                     fse_cost_pages;     /* total page cost dispatched */
    u64                     fse_opportunity;    /* RPCs dispatched via opp-fair */
    s64                     fse_fairness_debt;  /* cumulative over/under-share */
};

/*
 * Per-NRS-head (per-CPT) policy state.
 */
struct nrs_fs_head {
    struct ptlrpc_nrs_resource  fsh_res;

    /* Entity hash table */
    struct rhashtable           fsh_entity_hash;
    struct rhashtable_params    fsh_hash_params;

    /* Current policy configuration */
    enum nrs_fs_entity_type     fsh_levels[__NRS_FS_ENTITY_TYPE_MAX];
    int                         fsh_depth;      /* number of hierarchy levels */

    /*
     * Flattened transition-matrix product: array of (entity_ptr, lo, hi)
     * triples representing the [0,1) range assignment for each leaf entity.
     * Recomputed every δ interval. Protected by RCU: the δ-recomputation
     * thread publishes a new array; op_req_get readers dereference under
     * rcu_read_lock.
     */
    struct nrs_fs_range __rcu  *fsh_ranges;
    int                         fsh_range_count;

    /* Opportunity fairness threshold: if nrs_req_queued (from the
     * parent ptlrpc_nrs) is below this, bypass fairshare and dequeue
     * FIFO across all entities.
     */
    unsigned long               fsh_opp_threshold;

    /* δ-recalculation interval in ms (default 100, tunable 10–1000) */
    unsigned int                fsh_delta_ms;

    /* Total active entities on this CPT (used for equal-share default) */
    atomic_t                    fsh_active_count;

    /* Number of CPTs for this service (for cross-CPT normalization) */
    int                         fsh_ncpts;

    /* Sequence counter incremented on each δ-recalculation */
    u64                         fsh_generation;

    /* Cost model: pages (default) or rpcs */
    enum nrs_tbf_cost_model     fsh_cost_model;
};

/*
 * Flattened range entry (leaf of the transition-matrix product).
 */
struct nrs_fs_range {
    struct nrs_fs_entity       *fsr_entity;
    u32                         fsr_lo;
    u32                         fsr_hi;
};

/*
 * Per-request NRS private data (stored in nrq->nr_u).
 */
struct nrs_fs_req {
    struct nrs_fs_entity       *fsr_entity;     /* owning entity */
    struct list_head            fsr_list;        /* linkage in entity sub-queue */
    u64                         fsr_cost;        /* page or RPC cost */
    ktime_t                     fsr_enqueue_time;
};
```

### Scheduling Flow (op_req_get)

```
nrs_fairshare_req_get(policy, peek, force):
    head = policy->pol_private

    1. OPPORTUNITY FAIRNESS CHECK
       If policy->pol_nrs->nrs_req_queued < head->fsh_opp_threshold:
           → Scan all active entities, dequeue the request with the
             earliest enqueue timestamp (global FIFO).
           → Mark the dispatched request as opportunity-dispatched
             (fse_opportunity++).
           → Return request.

    2. STATISTICAL TOKEN DRAW
       sample = get_random_u32()
       ranges = rcu_dereference(head->fsh_ranges)
       Binary search ranges[] for the entry where fsr_lo <= sample < fsr_hi.
       winner = ranges[match].fsr_entity

    3. ENTITY SUB-QUEUE DEQUEUE
       If winner->fse_req_list is non-empty:
           → Dequeue oldest request (FIFO within entity).
           → Update winner->fse_dispatched, fse_cost_pages, fse_fairness_debt.
           → If winner->fse_req_list is now empty, mark fse_active = false.
           → Return request.
       Else (entity exhausted between recomputation and dequeue):
           → Fall through to FIFO dequeue across all entities as fallback.

    4. FORCE / PEEK
       Handle peek (return without removing) and force (ignore fairness,
       return any queued request) per NRS contract.
```

### Transition-Matrix Composition

For a configured policy like `uid_then_jobid_fair`:
- Level 0 matrix M₀: rows = 1 (root), columns = active UIDs.
  Each column gets equal share: 1/num_active_uids.
- Level 1 matrix M₁: rows = UIDs, columns = active jobids per UID.
  Each column within a row gets equal share: 1/num_jobs_for_that_uid.
- Product P = M₀ × M₁ yields a flat vector of (leaf_entity, share) pairs.
- Map shares to [0, U32_MAX) ranges: entity_i gets
  [sum(shares[0..i-1]) × U32_MAX, sum(shares[0..i]) × U32_MAX).

For primitive policies (e.g., `jobid_fair`), depth = 1, no matrix
multiplication needed—just equal partitioning of [0, U32_MAX).

Implementation note: In kernel space, avoid floating-point. Use u64
fixed-point arithmetic (e.g., shares as fractions of U32_MAX). The matrix
product is computed in the δ-recalculation kthread, not on the hot dispatch
path.

### Cost Model

**Decision:** Use page-weighted cost (reuse TBF's `NRS_TBF_CM_PAGES`
infrastructure) as the initial token cost basis.

- For BRW (bulk read/write) RPCs: cost = number of pages in the bulk transfer.
- For non-BRW RPCs: cost = 1.
- This naturally weights large and small I/Os without explicit knobs, and
  approximates ThemisIO's "I/O cycle" concept within Lustre's RPC-oriented
  dispatch model.
- Wall-clock time-slicing (true ThemisIO analog) deferred to a future
  enhancement—it would require tracking per-entity dispatch time in
  `op_req_stop`, which adds overhead to the completion hot path.

### Cross-CPT Fairness (δ-Delayed, Intra-Server)

Within one server, each service partition (CPT) runs its own `nrs_fs_head`.
Without coordination, a 4-CPT server could over-allocate shares.

**Mechanism (analogous to `tbf_global_rate`):**
- A per-service kthread (or hrtimer callback) wakes every `fsh_delta_ms`.
- It reads active entity sets from all CPTs for this service.
- It computes the global transition-matrix product (union of all active
  entities across CPTs, equal shares).
- It publishes the new `fsh_ranges` array to each CPT's `nrs_fs_head` via
  RCU pointer swap (`rcu_assign_pointer`).
- Each CPT's `op_req_get` reads via `rcu_dereference` — zero contention on
  the dispatch hot path.

**Cross-server (cross-OST/MDS) fairness:** Out of scope for v1. Would
require a new control RPC or piggybacking on LNet ping. Documented as a
future phase.

### ORR-Style Locality Hybrid (Stage 4)

Fair-share scheduling may regress I/O locality on spinning-disk OSTs because
requests from different entities are interleaved, breaking sequential access
patterns.

**Mitigation:** Within each entity's sub-queue (`fse_req_list`), optionally
sort BRW requests by object and offset (reusing ORR's key structure from
`nrs_orr.c`). The fairshare policy selects WHICH entity to serve; the entity
sub-queue determines the ORDER within that entity's batch.

- Disabled by default (FIFO sub-queues, suitable for NVMe/SSD).
- Enabled via lprocfs: `echo 1 > .../nrs_fairshare_locality`.
- Only applies to `ost_io` service BRW RPCs.

## Implementation Stages

Each stage is independently shippable and provides standalone value.

### Stage 1: Primitive Fair-Share with Opportunity Fairness

**Goal:** Single-level `jobid_fair` and `uid_fair` policies with opportunity
fairness bypass on the `ost_io` regular queue.

**Deliverables:**
- `lustre_nrs_fairshare.h` with data structures
- `nrs_fairshare.c` with all NRS ops:
  - `op_policy_start/stop` — allocate/free `nrs_fs_head`, start/stop δ-kthread
  - `op_res_get` — extract key from RPC (reuse TBF key extraction for
    jobid/uid/gid/projid/nid), look up or create `nrs_fs_entity`
  - `op_req_enqueue` — append to entity's sub-queue, mark active
  - `op_req_get` — opportunity fairness check → statistical token draw →
    entity sub-queue FIFO dequeue
  - `op_req_dequeue` — remove from entity sub-queue
  - `op_req_stop` — update entity cost counters, check liveness
  - `op_lprocfs_init/fini` — basic debugfs entries
- Kbuild wiring in `lustre/ptlrpc/Makefile`
- Registration in `nrs.c` via `ptlrpc_nrs_policy_register(&nrs_conf_fairshare)`
- δ-recalculation kthread for single-CPT (local entity recount + range rebuild)
- lprocfs interface:
  - `nrs_fairshare_mode` — read/write: `jobid_fair`, `uid_fair`, `gid_fair`
  - `nrs_fairshare_delta_ms` — δ interval (default 100)
  - `nrs_fairshare_opp_threshold` — opportunity fairness threshold
  - `nrs_fairshare_stats` — YAML per-entity stats (dispatched, cost, opportunity,
    fairness_debt, queue_depth)
- Tests in `conf-sanity.sh`:
  - Policy start/stop/restart persistence
  - Parameter read/write validation
  - Fallback to FIFO when fairshare is stopped
- Tests in `sanityn.sh`:
  - Two-client contention: verify fair dispatch ratio within ±10%
  - Under-loaded: verify opportunity fairness (no throughput degradation vs FIFO)

### Stage 2: Composite Policies via Transition Matrices

**Goal:** Hierarchical policies like `uid_then_jobid_fair`,
`gid_then_uid_fair`.

**Deliverables:**
- Transition-matrix computation in δ-kthread (u64 fixed-point)
- Extended `nrs_fairshare_mode` syntax: `uid_then_jobid_fair`,
  `gid_then_uid_then_jobid_fair`
- Internal entity tree: root → level-0 entities → level-1 entities → ...
  The flattened range array is the product of per-level share vectors.
- Tests in `sanityn.sh`:
  - User-then-job-fair: two users, one with 2 jobs, one with 4 jobs.
    Verify user-level equal split, then within-user equal split among jobs.
  - Three-tier group-user-jobid: verify hierarchical share allocation.

### Stage 3: Cross-CPT δ-Synchronization

**Goal:** Global (server-wide) fairness across all CPTs.

**Deliverables:**
- Promote δ-kthread to service-level (not per-CPT): gather active entities
  from all CPTs, compute unified transition-matrix product, publish via RCU
  to each `nrs_fs_head`.
- Handle entity migration: an entity active on CPT 0 but not CPT 2 still
  gets its global share; the per-CPT range array includes all global entities,
  not just locally-active ones.
- lprocfs: `nrs_fairshare_cross_cpt` — enable/disable (default: enabled)
- Tests in `sanityn.sh`:
  - Multi-CPT fairness: pin clients to different CPTs, verify global fair split.
  - Cross-CPT entity appears/disappears: verify range recomputation.

### Stage 4: ORR Locality Hybrid and Refinements

**Goal:** Preserve I/O locality on spinning-disk OSTs; polish for production.

**Deliverables:**
- Offset-ordered sub-queues within entities (reuse ORR key extraction)
- `nrs_fairshare_locality` lprocfs knob
- Weight/priority support: `echo "uid_fair weight=uid:1000:2,uid:1001:3"` for
  proportional (non-equal) sharing
- Wall-clock time-slice cost model (optional advanced mode)
- Extended YAML stats: borrowed share, opportunity usage ratio, per-entity
  fairness debt histogram
- Performance benchmarking vs FIFO, TBF, CRR-N on IOR, mdtest, real apps
- Documentation in `Documentation/lustre/nrs_fairshare.txt`

## Observability and Operator Interface

### lprocfs/debugfs Entries (under each service)

| Entry | R/W | Description |
|-------|-----|-------------|
| `nrs_fairshare_mode` | RW | Sharing policy: `jobid_fair`, `uid_fair`, `gid_fair`, `uid_then_jobid_fair`, etc. |
| `nrs_fairshare_delta_ms` | RW | δ-recalculation interval (default 100, range 10–1000) |
| `nrs_fairshare_opp_threshold` | RW | Queued-request threshold for opportunity fairness bypass |
| `nrs_fairshare_cost_model` | RW | `pages` (default) or `rpcs` |
| `nrs_fairshare_locality` | RW | `0` (FIFO sub-queues, default) or `1` (ORR-style offset ordering) |
| `nrs_fairshare_cross_cpt` | RW | `0` or `1` (default 1): cross-CPT entity synchronization |
| `nrs_fairshare_stats` | RO | YAML dump of per-entity statistics |
| `nrs_fairshare_stats_reset` | WO | Reset per-entity counters |

### YAML Stats Format

```yaml
- entity: jobid:batch_sim.12345
  type: jobid
  active: true
  range: [0.000, 0.333)
  dispatched: 148290
  cost_pages: 592160
  opportunity: 3041
  fairness_debt: -12.4
  queue_depth: 7
- entity: jobid:ml_train.67890
  ...
```

## Metadata Gap: Job Size

ThemisIO's `size-fair` policy requires job-size (node count), which is not
currently embedded in Lustre RPCs.

**Staged approach:**
1. **v1 (Stages 1–3):** Implement `jobid_fair`, `uid_fair`, `gid_fair` and
   composite variants. These do NOT require job-size. This covers the majority
   of useful policies.
2. **v1 approximation for size-fair:** Count distinct client NIDs per jobid
   within a sliding window as a proxy for job size. Imperfect (some NIDs may
   not issue I/O in every window) but functional.
3. **v2 (future):** Add job-size field to `ptlrpc_body` or embed in jobid
   sub-fields (e.g., `jobid.nodecount`). Requires client-side changes and
   wire-protocol versioning.

## Testing Strategy

### conf-sanity.sh (policy lifecycle)
- Test: start fairshare, verify active via `nrs_policies` read-back
- Test: stop fairshare, verify fallback to FIFO
- Test: set/get all tunable parameters (delta_ms, opp_threshold, mode,
  cost_model, locality, cross_cpt)
- Test: parameter persistence across policy restart
- Test: invalid parameter rejection (delta_ms=0, unknown mode string)

### sanityn.sh (fairness under contention)
- Test: two clients, `jobid_fair` — verify ~50/50 dispatch ratio (±10%)
- Test: three clients, `uid_fair`, two clients same UID — verify UID-level
  equal share, not client-level
- Test: under-loaded single client — verify no throughput regression vs FIFO
  (opportunity fairness)
- Test: `uid_then_jobid_fair` composite — verify hierarchical share split
- Test: cross-CPT fairness — pin two clients to different CPTs, verify
  global 50/50 split
- Test: entity arrival/departure — start two jobs, add third mid-test,
  verify shares rebalance within 2×δ

### Unit Tests (in-kernel or via test harness)
- Transition-matrix product correctness for known inputs
- Fixed-point arithmetic overflow/underflow edge cases
- Entity hash table create/lookup/remove under concurrent access
- Range binary search correctness (boundary cases: sample=0, sample=U32_MAX-1)

## Notes and Considerations

1. **Burst-buffer vs. kernel translation:** ThemisIO operates in user-space on
   dedicated I/O nodes. Lustre NRS is in-kernel on OSS/MDS. The statistical
   random-number approach (`get_random_u32()`) translates directly. The
   δ-delayed global fairness kthread translates to a kernel worker. The main
   difference is that `op_req_get` runs under `scp_req_lock` (spinlock), so
   the hot path must be O(log N) or better — binary search over the flattened
   range array satisfies this.

2. **Statistical convergence:** ThemisIO's effectiveness depends on sufficient
   I/O volume per entity for the statistical approach to converge (paper §3
   notes this limitation). For overloaded servers with many entities each
   issuing few RPCs, the random draw may produce unfair short-term allocation.
   Mitigation: the δ-recalculation incorporates fairness-debt tracking; entities
   that were under-served get a slightly expanded range in the next interval
   (proportional-integral correction). This is a refinement over base ThemisIO.

3. **Locking on hot path:** `op_req_get` runs under `scp_req_lock`. The
   range-array lookup is read-only (RCU-protected). Entity sub-queue dequeue
   modifies `fse_req_list` — but since each NRS head is per-CPT and dispatch
   is serialized by `scp_req_lock`, no additional lock is needed for the
   sub-queue.

4. **NRS single-policy constraint:** Only one primary policy can be active per
   NRS head. "Combine ORR plus fairshare" means the ORR-style offset ordering
   is integrated INTO the fairshare policy's entity sub-queues, not run as a
   separate policy.

5. **Key extraction reuse:** TBF's key extraction functions
   (`nrs_tbf_jobid_str`, `nrs_tbf_nid_str`, `nrs_tbf_id_str`, etc.) and the
   `nrs_tbf_field` enum are good candidates for extraction into a shared helper
   or direct reuse. If extraction is too invasive for upstream review, the
   fairshare policy can duplicate the ~50 lines of key extraction logic.

6. **Existing TBF users are unaffected.** The new policy is purely additive.
   Operators opt-in via `echo fairshare > .../nrs_policies`. TBF continues to
   work exactly as before.
