Details
- Type: Bug
- Resolution: Unresolved
- Priority: Medium
Description
There are two timeouts related to recovery: the soft timeout (default 5 min) and the hard timeout (default 15 min). The recovery timer starts at the soft timeout, but can be extended in five places:
In check_and_start_recovery_timer: The connect request includes an estimate of the server’s service timeout before recovery, which the server uses to update the recovery timeout.
In check_for_recovery_ready: Recovery always waits for the lod_sub_recovery_thread to prepare the llog and get updates from the MDTs before it starts replays, so it just extends the timer as necessary here.
In replay_request_or_update: Makes sure it has at least obd_timeout (100s) each time the server handles an update request during this stage.
In handle_recovery_req: While replaying requests/locks or sending the final ping, it extends the timer to allow enough time for the next replay request.
In target_recovery_overseer: When the timer does expire, the overseer (a thread which manages state changes and error handling during recovery) resets it after evicting stale exports. This gives some time to see if recovery can resume with the problematic exports gone.
The goal of all of these extensions is to make sure we don't time out recovery which is actually making progress. However, the first extension is problematic for several reasons:
1. The service_timeout as measured before recovery (reboot or failover) has absolutely nothing to do with how long things will take on the recovering server. Often, recovery is entered expressly because a server was having problems (many of which can be resolved by reboot or failover), so if service_timeout is reported as very high by clients, we will extend past the soft timeout for no good reason.
2. The service_timeout really doesn't have anything to do with the actual replays in the reconnection phase. Perhaps it will influence how long it takes to actually commit new transactions or grant new locks, but there are dedicated additional extend_recovery_timer calls in those stages. The reconnection phase simply involves reconstructing the already-granted lock state and ptlrpc queues via ptlrpc_replay_next and ldlm_replay_locks, which is very fast. The slow part of this stage is the clients realizing that the server is ready for replay (or that failover is needed) and joining recovery.
3. The client is allowed to extend the timeout all the way up to the hard timeout. In this case, if any clients fail to reconnect, we will hard timeout rather than soft timeout.
This is problematic if any of the clients do not join recovery (very common in containerized workloads where hosts are scaled up and down frequently). In these cases, if we have let clients extend the soft timeout, we wait, sometimes several extra minutes, for no reason before evicting the missing clients, leading to additional downtime.
The worst case is a client extending recovery all the way to the hard timeout. In this case, we go straight to the hard-timeout behavior without ever attempting a soft timeout. Because we never reach a VBR-enabled (Version Based Recovery) attempt, any client that has a later transaction in the replay queue is evicted even if it reconnected and was perfectly fine. With bad luck, nearly the entire healthy fleet can be evicted for no reason.
We have observed all of these behaviors, and by removing the extend_recovery_timer call in check_and_start_recovery_timer, they are resolved.
I think it would be worth re-evaluating all of this timer-extension logic, as it feels overengineered and error-prone. Other network filesystems use fixed timeouts for replay, which are much easier to reason about.
The main concern here is failover: if the client MUST wait for service_timeout before failing over, then we do need to give it at least that long. We should find a resolution that handles both cases. On servers without failover, this extension is a large liability.