Lustre / LU-14027

Client recovery state machine hangs in recovery if disconnected during lock replay


      LU-13600 introduced lock rate-limiting logic, but it did not account for a disconnection during the REPLAY_LOCKS phase: locks that have not yet been sent get stuck in the sending queue, so the replay locks thread hangs with imp_replay_inflight elevated above zero.

      As a direct consequence, the recovery state machine never advances from the REPLAY to the REPLAY_LOCKS state while imp_replay_inflight is nonzero:

              if (imp->imp_state == LUSTRE_IMP_REPLAY) {
                      CDEBUG(D_HA, "replay requested by %s\n",
                             obd2cli_tgt(imp->imp_obd));
                      rc = ptlrpc_replay_next(imp, &inflight);
                      if (inflight == 0 &&
                          atomic_read(&imp->imp_replay_inflight) == 0) {
                              import_set_state(imp, LUSTRE_IMP_REPLAY_LOCKS);
                              rc = ldlm_replay_locks(imp);
                              if (rc)
                                      GOTO(out, rc);
                      }
                      rc = 0;
              }
      

      To break this deadlock, we need to either check the import state in the replay locks thread before attempting to send anything, or make sure replay_one_lock() prepares resend requests in such a state that they can never get stuck.
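
      The first option can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not the actual Lustre patch: struct toy_import, toy_replay_locks() and toy_can_advance() are hypothetical stand-ins, replay_inflight models imp_replay_inflight, and the disconnect_at parameter merely simulates a disconnection partway through lock replay.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for illustration only; not the real Lustre definitions. */
enum toy_imp_state {
	TOY_IMP_DISCON,
	TOY_IMP_REPLAY,
	TOY_IMP_REPLAY_LOCKS,
};

struct toy_import {
	enum toy_imp_state state;
	int replay_inflight;	/* models imp_replay_inflight */
};

/*
 * Model of the replay locks thread. A disconnection is simulated just
 * before lock number disconnect_at is queued (negative means never).
 * With check_state set, the thread re-checks the import state before
 * queueing each lock. Returns the number of locks actually replayed.
 */
int toy_replay_locks(struct toy_import *imp, int nlocks,
		     int disconnect_at, bool check_state)
{
	int sent = 0;

	for (int i = 0; i < nlocks; i++) {
		if (i == disconnect_at)
			imp->state = TOY_IMP_DISCON;

		/* Proposed check: stop once the import left REPLAY_LOCKS. */
		if (check_state && imp->state != TOY_IMP_REPLAY_LOCKS)
			break;

		imp->replay_inflight++;		/* lock is queued */
		if (imp->state == TOY_IMP_REPLAY_LOCKS) {
			/* Request goes out and completes. */
			assert(imp->replay_inflight > 0);
			imp->replay_inflight--;
			sent++;
		}
		/* Otherwise the request stays queued and is never sent. */
	}
	return sent;
}

/* Mirrors the transition condition quoted in the issue. */
bool toy_can_advance(const struct toy_import *imp, int inflight)
{
	return imp->state == TOY_IMP_REPLAY && inflight == 0 &&
	       imp->replay_inflight == 0;
}
```

      In this model, replaying five locks with a disconnection before the third and check_state false leaves replay_inflight at 3, so toy_can_advance() stays false on the next pass through REPLAY. With check_state true the counter returns to zero and the transition is allowed again.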

            green Oleg Drokin