Lustre / LU-14027

Client recovery state machine hangs when disconnected during lock replay

    Description

      LU-13600 introduced lock rate-limiting logic, but it did not take into account that if a disconnection occurs during the REPLAY_LOCKS phase, the locks that have not yet been sent get stuck in the sending queue, so the replay locks thread hangs with imp_replay_inflight elevated above zero.

      The direct consequence is that the recovery state machine never advances from REPLAY to REPLAY_LOCKS while imp_replay_inflight is nonzero:

              if (imp->imp_state == LUSTRE_IMP_REPLAY) {
                      CDEBUG(D_HA, "replay requested by %s\n",
                             obd2cli_tgt(imp->imp_obd));
                      rc = ptlrpc_replay_next(imp, &inflight);
                      if (inflight == 0 &&
                          atomic_read(&imp->imp_replay_inflight) == 0) {
                              import_set_state(imp, LUSTRE_IMP_REPLAY_LOCKS);
                              rc = ldlm_replay_locks(imp);
                              if (rc)
                                      GOTO(out, rc);
                      }
                      rc = 0;
              }
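
The gate in the excerpt above can be modelled in isolation. The following is a minimal sketch, not Lustre code: `model_import`, `model_state` and `model_try_advance()` are hypothetical names, and the struct holds only the two fields the condition reads. It shows that a counter left above zero keeps the import in the replay state no matter how often the check runs.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the import states named in the ticket. */
enum model_state { MODEL_REPLAY, MODEL_REPLAY_LOCKS };

/* Toy import: only the fields the gate in the excerpt looks at. */
struct model_import {
	enum model_state state;
	int replay_inflight;	/* models imp_replay_inflight */
};

/*
 * Mirrors the condition in the excerpt: advance only when the caller
 * reports nothing in flight AND the shared counter is zero.
 * Returns true if the state advanced.
 */
static bool model_try_advance(struct model_import *imp, int inflight)
{
	assert(imp->replay_inflight >= 0);

	if (imp->state == MODEL_REPLAY &&
	    inflight == 0 && imp->replay_inflight == 0) {
		imp->state = MODEL_REPLAY_LOCKS;
		return true;
	}
	return false;
}
```

In this model, calling `model_try_advance()` with `replay_inflight` left at 1 returns false every time; only after something resets the counter to zero does the state move on.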
      

      To break this, we need to either check the import state in the replay locks thread before attempting to send anything, or make sure replay_one_lock() prepares resend requests in such a state that they never get stuck.
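
The first option might look roughly like the sketch below. This is a toy model under stated assumptions, not the actual patch: `sketch_import`, `sketch_state` and `sketch_replay_locks()` are hypothetical, sends complete instantly, and the `disconnect_at` parameter exists only to simulate a disconnection at a chosen point. The idea it illustrates is that the loop re-checks the import state before each send and, on a state change, releases the references held for the locks it will no longer send.

```c
#include <assert.h>

/* Hypothetical import states for this sketch. */
enum sketch_state { SKETCH_DISCON, SKETCH_REPLAY_LOCKS };

struct sketch_import {
	enum sketch_state state;
	int replay_inflight;	/* models imp_replay_inflight */
};

/*
 * Replay nlocks locks. A disconnection is simulated just before lock
 * number disconnect_at is sent (pass -1 for "never").
 * Returns the number of locks actually sent.
 */
static int sketch_replay_locks(struct sketch_import *imp, int nlocks,
			       int disconnect_at)
{
	int sent = 0;
	int i;

	/* Every queued lock holds one reference on the counter. */
	imp->replay_inflight += nlocks;

	for (i = 0; i < nlocks; i++) {
		if (i == disconnect_at)
			imp->state = SKETCH_DISCON;

		/* Check the import state before attempting any send. */
		if (imp->state != SKETCH_REPLAY_LOCKS) {
			/* Drop references of the locks left unsent. */
			imp->replay_inflight -= nlocks - i;
			break;
		}

		sent++;
		imp->replay_inflight--;	/* reply handled in this model */
	}

	assert(imp->replay_inflight >= 0);
	return sent;
}
```

With five locks and a simulated disconnection before the third, the sketch sends two locks and leaves the counter at zero, so a later recovery attempt would not be blocked by stale references.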

            People

              green Oleg Drokin