Details
-
Bug
-
Resolution: Fixed
-
Blocker
-
Lustre 2.14.0, Lustre 2.12.6
-
3
Description
LU-13600 introduced rate limiting for lock replay, but it did not account for a disconnection during the REPLAY_LOCKS phase. When that happens, locks that have not yet been sent get stuck in the sending queue, so the replay locks thread hangs with imp_replay_inflight elevated above zero.
As a direct consequence, the recovery state machine never advances from REPLAY to REPLAY_LOCKS, because that transition requires imp_replay_inflight to be zero:
    if (imp->imp_state == LUSTRE_IMP_REPLAY) {
            CDEBUG(D_HA, "replay requested by %s\n",
                   obd2cli_tgt(imp->imp_obd));
            rc = ptlrpc_replay_next(imp, &inflight);
            if (inflight == 0 &&
                atomic_read(&imp->imp_replay_inflight) == 0) {
                    import_set_state(imp, LUSTRE_IMP_REPLAY_LOCKS);
                    rc = ldlm_replay_locks(imp);
                    if (rc)
                            GOTO(out, rc);
            }
            rc = 0;
    }
To break this deadlock, we need to either check the import state in the replay locks thread before attempting to send anything, or make sure replay_one_lock() prepares resend requests in such a state that they can never get stuck.
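The hang and the first option (checking the import state in the replay locks thread) can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not the Lustre code or the actual patch: `struct model_import`, the `IMP_*` states, `try_advance()`, `replay_thread_step()` and `simulate_disconnect()` are all hypothetical names, and the single `replay_inflight` counter merely stands in for imp_replay_inflight.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified import states; not the Lustre definitions. */
enum imp_state { IMP_DISCON, IMP_REPLAY, IMP_REPLAY_LOCKS };

struct model_import {
	enum imp_state state;
	int replay_inflight;	/* stands in for imp_replay_inflight */
	int queued_locks;	/* locks still waiting to be sent */
};

/* Gate from the quoted snippet: REPLAY -> REPLAY_LOCKS only when idle. */
bool try_advance(struct model_import *imp, int inflight)
{
	if (imp->state != IMP_REPLAY)
		return false;
	if (inflight == 0 && imp->replay_inflight == 0) {
		imp->state = IMP_REPLAY_LOCKS;
		return true;
	}
	return false;
}

/*
 * One step of the modelled replay locks thread. Returns true once the
 * thread has finished and released its reference on replay_inflight.
 */
bool replay_thread_step(struct model_import *imp, bool check_state)
{
	if (imp->state != IMP_REPLAY_LOCKS) {
		if (!check_state)
			return false;	/* stuck: counter stays elevated */
		/* Assumed fix: give up on unsent locks and drop the reference. */
		imp->queued_locks = 0;
		assert(imp->replay_inflight > 0);
		imp->replay_inflight--;
		return true;
	}
	if (imp->queued_locks > 0)
		imp->queued_locks--;	/* "send" one lock */
	if (imp->queued_locks == 0) {
		assert(imp->replay_inflight > 0);
		imp->replay_inflight--;
		return true;
	}
	return false;
}

/* Disconnect in the middle of lock replay, then reconnect into REPLAY. */
enum imp_state simulate_disconnect(bool check_state)
{
	struct model_import imp = { IMP_REPLAY_LOCKS, 1, 3 };

	replay_thread_step(&imp, check_state);	/* one lock sent, two queued */
	imp.state = IMP_DISCON;			/* connection drops */
	replay_thread_step(&imp, check_state);	/* hangs or bails out */
	imp.state = IMP_REPLAY;			/* reconnect restarts replay */
	try_advance(&imp, 0);
	return imp.state;
}
```

In this model, `simulate_disconnect(false)` leaves the import in `IMP_REPLAY` because the counter never drops, while `simulate_disconnect(true)` reaches `IMP_REPLAY_LOCKS`. The second option from the description (requests prepared so that they never get stuck) is not modelled here.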
Issue Links
- is duplicated by
  - LU-17262 replay-single test_135: FAIL: Unexpected sync success (Open)
- is related to
  - LU-18154 Client can permanently hang in lustre recovery on race condition in replay_locks (Open)
  - LU-16753 replay-single: test_135 timeout (Open)
  - LU-17792 replay-single: test 135 Error: 'import is not in REPLAY_LOCKS state' (Open)
  - LU-16943 replay-single test_135: error: set_param: param_path 'fail_val': No such file or directory (Resolved)
- is related to
  - LU-13600 limit number of RPCs in flight during recovery (Resolved)