-
Improvement
-
Resolution: Unresolved
-
Medium
-
Lustre 2.18.0
Allow an administrator to quiesce a target's clients before a server upgrade so that the post-remount recovery window is minimized. This avoids minutes-long recovery windows and application stalls without evicting active clients. The feature is accessed through a new "lctl maintenance start/stop" command-line interface and generic Netlink. The quiesce confirms that clients are clean (no cached writes or replay backlog) before the reboot, so on remount recovery only has to wait for clients to reconnect, not for replay.
Client Side Changes
Two reverse-import directives are sent over a dedicated RQF_LDLM_MAINTENANCE RPC using struct ldlm_maintenance_desc. They are sent only to clients that negotiated the OBD_CONNECT2_MAINTENANCE feature at connect time, so pre-feature clients on a mixed-version cluster are never targeted (and so never wrongly evicted). There are three phases to the maintenance quiesce operation:
Client Pre-Recovery Warm-up
The client cancels unused locks (flushing their dirty data) and resets its adaptive-timeout estimates (at_reset on net-latency and per-portal service estimates to CONNECTION_SWITCH_MIN), so on the next mount it reconnects without first waiting out an inflated service estimate. This does not by itself necessarily shorten the import reconnect backoff; prompt reconnect relies on MGS-driven Imperative Recovery nudging clients as soon as the target accepts connections.
Synchronous RPC Operations
The client switches OST and MDT RPCs to write-through (via a new cl_maint_sync flag stored in the client import and checked in osc_enter_cache()) so that no new replay backlog accumulates before server shutdown. Existing dirty data and uncommitted RPCs drain via normal writeback. The mode self-reverts on reconnect to a new target instance (a new ocd_instance from the server), so maintenance mode is tied directly to the restart and needs no external intervention to limit how long clients perform sync RPCs.
This will temporarily slow down metadata changes and data writes, but they will continue to make forward progress during this window without halting client RPCs completely. Read-only operations will continue normally so that applications are not totally hung as with a full RPC barrier.
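The write-through behaviour and its self-revert can be illustrated with a toy model. This is only a sketch under stated assumptions, not the patch itself: the ClientImport class, its fields, and its methods are hypothetical stand-ins for the client import, cl_maint_sync, ocd_instance, and the check in osc_enter_cache().

```python
from dataclasses import dataclass, field


@dataclass
class ClientImport:
    """Toy model of a client import (hypothetical, not the Lustre struct)."""
    server_instance: int                    # stands in for ocd_instance
    maint_sync: bool = False                # stands in for cl_maint_sync
    dirty_pages: list = field(default_factory=list)
    committed: list = field(default_factory=list)

    def enter_cache(self, page: str) -> bool:
        """Return True if the write was cached, False if it went write-through."""
        if self.maint_sync:
            self.committed.append(page)     # sent synchronously: no new backlog
            return False
        self.dirty_pages.append(page)       # normal cached write
        return True

    def writeback(self) -> None:
        """Normal writeback drains data cached before the quiesce."""
        self.committed.extend(self.dirty_pages)
        self.dirty_pages.clear()

    def reconnect(self, new_instance: int) -> None:
        """A changed instance means the target restarted: leave sync mode."""
        if new_instance != self.server_instance:
            self.server_instance = new_instance
            self.maint_sync = False
```

In this model a reconnect to the same instance leaves the flag set, while a reconnect to a new instance clears it, which is the property that ties the mode to the restart rather than to an explicit "stop" from the administrator.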
Each directive evicts any client that does not ACK the state change, so once the quiesce completes every remaining client is accounted for – in-sync or pre-evicted.
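The targeting and ACK-or-evict rule described above can be sketched as a simple classification. The bit value, the send_directive function, and its arguments are hypothetical, chosen only for illustration; the real value of OBD_CONNECT2_MAINTENANCE is not given in this ticket, and how untargeted pre-feature clients are treated beyond "never targeted" is not modelled.

```python
# Hypothetical bit value; the real OBD_CONNECT2_MAINTENANCE value is not stated here.
MAINTENANCE_FEATURE = 0x1


def send_directive(clients, acked_within_grace):
    """Classify clients after one maintenance directive.

    clients: dict of client name -> negotiated connect flags (int)
    acked_within_grace: set of names that ACKed before the grace period ended
    Returns (in_sync, evicted, untargeted) as sorted lists of names.
    """
    in_sync, evicted, untargeted = [], [], []
    for name, flags in sorted(clients.items()):
        if not flags & MAINTENANCE_FEATURE:
            untargeted.append(name)   # pre-feature client: never sent the RPC
        elif name in acked_within_grace:
            in_sync.append(name)      # acknowledged the state change
        else:
            evicted.append(name)      # targeted but silent past the grace period
    return in_sync, evicted, untargeted
```

For example, with clients `{"c1": 1, "c2": 1, "old": 0}` and only `c1` acknowledging, the sketch reports `c1` as in-sync, `c2` as evicted, and `old` as untargeted.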
Server Side Changes
An OBD_INCOMPAT_MAINTENANCE flag saved persistently in the target's last_rcvd file indicates "no client has cached backlog". On the next mount the target detects this state and bounds recovery to reconnect time: the deadline is held maintenance_recovery_idle seconds past the most recent reconnect, so recovery continues as long as clients keep arriving. This dynamically avoids premature eviction when a large number of clients are reconnecting to the target, or when MGS IR broadcast is busy with 10k+ clients x N targets/server. A client that is unresponsive past the grace time (a genuine no-show) is evicted. Recovery completes immediately once all expected clients are back. The grace time is an idle gap, not a total time limit that needs to be tuned based on the client count, so a small value is valid even on large clusters. The flag is cleared and synced as recovery completes (tgt_boot_epoch_update), before normal service resumes, so a later unplanned restart recovers normally.
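The idle-gap deadline can be illustrated with a small simulation. This is a sketch under assumptions, not the server code: run_recovery and its arguments are hypothetical, time is measured in seconds after mount, and the initial deadline is assumed to start one idle gap after mount.

```python
def run_recovery(reconnect_times, expected, idle):
    """Simulate reconnect-bounded recovery with an idle-gap deadline.

    reconnect_times: dict of client -> time (seconds after mount) it reconnects
    expected: set of clients recorded before the restart
    idle: idle gap in seconds (stands in for maintenance_recovery_idle)
    Returns (end_time, recovered, evicted).
    """
    deadline = idle                        # assumption: gap initially counts from mount
    recovered = set()
    arrivals = sorted((t, c) for c, t in reconnect_times.items() if c in expected)
    for t, client in arrivals:
        if t > deadline:
            break                          # the gap expired before this client arrived
        recovered.add(client)
        deadline = t + idle                # each reconnect pushes the deadline out
        if recovered == expected:
            return t, recovered, set()     # everyone is back: finish immediately
    return deadline, recovered, expected - recovered
```

With 100 clients arriving one second apart and an idle gap of 2 seconds, the sketch recovers all of them even though the total window is far longer than the gap, which is why the tunable does not have to scale with client count. A client that arrives long after the stream has gone quiet is evicted.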
Interfaces
- lctl maintenance start|stop TARGET, over a Netlink target genl family (TARGET_CMD_MAINTENANCE + status dump)
- per-target routine tgt_maintenance().
- Operator observability: obdfilter.*.maintenance_status state machine (idle -> preparing -> draining -> quiesced -> resuming) + client counts.
- Tunables: maintenance_recovery_idle (reconnect-idle recovery bound), maintenance_ack_grace (directive ACK wait before evicting a straggler).
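The maintenance_status state machine listed above can be sketched as a transition table. The state names come from the list; the table layout, the helper names, and the closing resuming -> idle transition are assumptions for illustration, and the sketch does not model which command or event triggers each step.

```python
# State order taken from the maintenance_status description.
TRANSITIONS = {
    "idle": "preparing",
    "preparing": "draining",
    "draining": "quiesced",
    "quiesced": "resuming",
    "resuming": "idle",      # assumption: the cycle closes back to idle
}


def advance(state):
    """Return the next state in the maintenance cycle."""
    return TRANSITIONS[state]


def path(start, goal):
    """List the states visited going from start to goal, inclusive."""
    states = [start]
    while states[-1] != goal:
        states.append(advance(states[-1]))
    return states
```

Under these assumptions, `path("idle", "quiesced")` walks through preparing and draining, and `path("quiesced", "idle")` passes through resuming.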
Tests in lustre/tests/replay-single.sh
- lctl maintenance start/stop drives the quiesced<->idle state and the target flag
- a quiesced restart bounds recovery and preserves the live client
- cancel-without-reboot resumes clients
- is related to
-
LU-19199 Ensure all fast recovery features are enabled and working
-
- Open
-