Lustre / LU-14027

Client recovery state machine hangs if disconnected during lock replay

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: Lustre 2.14.0, Lustre 2.12.6
    • Fix Version/s: Lustre 2.14.0, Lustre 2.12.7
    • Labels:
      None
    • Severity:
      3

      Description

      LU-13600 introduced lock replay rate limiting, but it did not account for a disconnection during the REPLAY_LOCKS phase. When that happens, locks that have not yet been sent get stuck in the sending queue, and the replay locks thread hangs with imp_replay_inflight held above zero.

      The direct consequence is that the recovery state machine can never advance from REPLAY to REPLAY_LOCKS while imp_replay_inflight is nonzero:

              if (imp->imp_state == LUSTRE_IMP_REPLAY) {
                      CDEBUG(D_HA, "replay requested by %s\n",
                             obd2cli_tgt(imp->imp_obd));
                      rc = ptlrpc_replay_next(imp, &inflight);
                      if (inflight == 0 &&
                          atomic_read(&imp->imp_replay_inflight) == 0) {
                              import_set_state(imp, LUSTRE_IMP_REPLAY_LOCKS);
                              rc = ldlm_replay_locks(imp);
                              if (rc)
                                      GOTO(out, rc);
                      }
                      rc = 0;
              }
      

      To break this deadlock, we need to either check the import state in the replay locks thread before attempting to send anything, or make sure replay_one_lock() prepares its resend requests in such a way that they can never get stuck.
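
      The first option can be illustrated with a small standalone model. This is only a sketch under stated assumptions, not the actual patch: every toy_* name is hypothetical, the two-state enum is a simplification of the real import states, and the only things taken from the description above are the gate condition (state is REPLAY and the inflight counter is zero) and the idea of checking the import state before sending.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-ins for the import states. */
enum toy_state { TOY_REPLAY, TOY_REPLAY_LOCKS };

/* Hypothetical model of the fields that matter for this hang. */
struct toy_import {
	enum toy_state state;
	int replay_inflight;	/* models imp_replay_inflight */
	int unsent_locks;	/* locks still waiting in the send queue */
};

/* Lock replay starts: the replay thread holds one inflight reference. */
static void toy_start_lock_replay(struct toy_import *imp, int nlocks)
{
	imp->state = TOY_REPLAY_LOCKS;
	imp->unsent_locks = nlocks;
	imp->replay_inflight++;
}

/* Disconnect and reconnect: recovery restarts from REPLAY. */
static void toy_reconnect(struct toy_import *imp)
{
	imp->state = TOY_REPLAY;
}

/*
 * One pass of the replay thread. Returns true once the thread has
 * finished and dropped its inflight reference.
 */
static bool toy_lock_replay_pass(struct toy_import *imp, bool check_state)
{
	/* Assumed fix: give up on unsent locks if the state changed. */
	if (check_state && imp->state != TOY_REPLAY_LOCKS)
		imp->unsent_locks = 0;

	/* In this model a send only succeeds while in REPLAY_LOCKS. */
	while (imp->unsent_locks > 0 && imp->state == TOY_REPLAY_LOCKS)
		imp->unsent_locks--;

	if (imp->unsent_locks > 0)
		return false;	/* stuck: reference is never dropped */

	imp->replay_inflight--;
	assert(imp->replay_inflight >= 0);
	return true;
}

/* The gate from the excerpt above, minus the request replay itself. */
static bool toy_can_advance(const struct toy_import *imp)
{
	return imp->state == TOY_REPLAY && imp->replay_inflight == 0;
}

/* Disconnect before any lock is sent; can recovery advance afterwards? */
static bool toy_scenario(bool check_state)
{
	struct toy_import imp = { TOY_REPLAY, 0, 0 };

	toy_start_lock_replay(&imp, 3);
	toy_reconnect(&imp);
	toy_lock_replay_pass(&imp, check_state);
	return toy_can_advance(&imp);
}
```

      In this model, toy_scenario(false) leaves the counter at one, so the gate stays closed forever, while toy_scenario(true) lets the thread exit and the next recovery round proceed. Dropping the unsent locks is safe here only because the model assumes the next REPLAY_LOCKS round queues them again.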

              People

              Assignee:
              green Oleg Drokin
              Reporter:
              green Oleg Drokin