[LU-14027] Client recovery state machine hangs in recovery when disconnected during lock replay

    Description

      LU-13600 introduced rate limiting for lock replay, but it did not account for a disconnection during the REPLAY_LOCKS phase. When that happens, locks that have not yet been sent get stuck in the sending queue, so the replay locks thread hangs with imp_replay_inflight held above zero.

      As a direct consequence, the recovery state machine never advances from REPLAY to REPLAY_LOCKS, because that transition requires imp_replay_inflight to be zero:

              if (imp->imp_state == LUSTRE_IMP_REPLAY) {
                      CDEBUG(D_HA, "replay requested by %s\n",
                             obd2cli_tgt(imp->imp_obd));
                      rc = ptlrpc_replay_next(imp, &inflight);
                      if (inflight == 0 &&
                          atomic_read(&imp->imp_replay_inflight) == 0) {
                              import_set_state(imp, LUSTRE_IMP_REPLAY_LOCKS);
                              rc = ldlm_replay_locks(imp);
                              if (rc)
                                      GOTO(out, rc);
                      }
                      rc = 0;
              }
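
To make the consequence concrete, here is a minimal user-space model of that gate. It is a sketch only: the `toy_` types and functions are hypothetical stand-ins rather than Lustre definitions, and `replay_inflight` merely plays the role of imp_replay_inflight. The scenario function walks through the sequence described above and reports where the import ends up.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for illustration, not the Lustre definitions. */
enum toy_state { TOY_DISCON, TOY_REPLAY, TOY_REPLAY_LOCKS };

struct toy_import {
	enum toy_state state;
	int replay_inflight;	/* plays the role of imp_replay_inflight */
};

/* Same gate as the excerpt: leave REPLAY only when nothing is in flight. */
static bool toy_advance(struct toy_import *imp, int inflight)
{
	if (imp->state != TOY_REPLAY)
		return false;
	if (inflight != 0 || imp->replay_inflight != 0)
		return false;
	imp->state = TOY_REPLAY_LOCKS;
	return true;
}

/* Walks through the failure sequence and returns the final import state. */
static enum toy_state toy_hang_scenario(void)
{
	struct toy_import imp = { TOY_REPLAY, 0 };
	bool advanced;

	/* First recovery pass: nothing in flight, so the gate opens. */
	advanced = toy_advance(&imp, 0);
	assert(advanced);

	/* The lock replay thread takes its reference, then the connection
	 * drops before the queued lock requests can be sent. */
	imp.replay_inflight++;
	imp.state = TOY_DISCON;

	/* Reconnecting restarts recovery in REPLAY. The stale reference is
	 * still held, so every later attempt to advance is refused. */
	imp.state = TOY_REPLAY;
	for (int i = 0; i < 3; i++)
		toy_advance(&imp, 0);

	return imp.state;
}
```

Calling `toy_hang_scenario()` leaves the model in `TOY_REPLAY`: no number of retries helps until something drops the leaked reference.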
      

      To break this hang, we need to either check the import state in the replay locks thread before attempting to send anything, or make sure replay_one_lock() prepares its resend requests in a state in which they can never get stuck.
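
The first option could look roughly like the following user-space sketch. It is not the patch that landed: every `sk_` name is hypothetical, `sk_send_lock()` is a made-up transport hook, and `drop_after` exists only to simulate a disconnection partway through. The point it illustrates is that the sender re-checks the import state before each send and releases its in-flight reference on every exit path.

```c
#include <assert.h>

/* Hypothetical stand-ins for illustration, not the Lustre definitions. */
enum sk_state { SK_DISCON, SK_REPLAY_LOCKS };

struct sk_import {
	enum sk_state state;
	int replay_inflight;	/* plays the role of imp_replay_inflight */
	int sent;		/* locks actually handed to the transport */
	int drop_after;		/* simulation knob: disconnect after N sends, 0 = never */
};

/* Made-up transport hook: sends one lock and may drop the connection. */
static void sk_send_lock(struct sk_import *imp)
{
	imp->sent++;
	if (imp->sent == imp->drop_after)
		imp->state = SK_DISCON;
}

/*
 * Replay nlocks locks. Returns 0 if all of them were sent, or -1 if
 * replay was abandoned because the import left REPLAY_LOCKS.
 */
static int sk_replay_locks(struct sk_import *imp, int nlocks)
{
	int rc = 0;

	imp->replay_inflight++;		/* reference held by the replay thread */
	for (int i = 0; i < nlocks; i++) {
		/* Check before every send, not just once up front. */
		if (imp->state != SK_REPLAY_LOCKS) {
			rc = -1;
			break;
		}
		sk_send_lock(imp);
	}
	imp->replay_inflight--;		/* dropped on success and on bail-out */
	return rc;
}
```

With `drop_after` set to 2 and five locks to replay, the sketch sends two locks, notices the disconnection before the third, and returns -1 with `replay_inflight` back at zero, so a restarted recovery is free to pass the REPLAY gate again.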

          Activity

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/41227/
            Subject: LU-14027 tests: Fix test_135 of replay-single
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: fab71963c2513ec8f4eff2c1636c767c47a46034

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/41224/
            Subject: LU-14027 ldlm: Do not hang if recovery restarted during lock replay
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set:
            Commit: 5fa7c8f24e71187a0c3ac70a04a8b566de5a76f3

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/41223/
            Subject: LU-14027 ldlm: Do not wait for lock replay sending if import dsconnected
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set:
            Commit: 2bcc166b0a660afab62d96ede496f42c31ada94b

            Etienne AUJAMES (eaujames@ddn.com) uploaded a new patch: https://review.whamcloud.com/41227
            Subject: LU-14027 tests: Fix test_135 of replay-single
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: a2d9b877521a0198f333228b117380c5c855e6e8

            Etienne Aujames (eaujames) added a comment - The patch above fixes https://review.whamcloud.com/39111/ ("LU-13600 ptlrpc: limit rate of lock replays") on the b2_12 branch.

            Etienne AUJAMES (eaujames@ddn.com) uploaded a new patch: https://review.whamcloud.com/41224
            Subject: LU-14027 ldlm: Do not hang if recovery restarted during lock replay
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: c54bb57e1687f0db23753eea0b100cc5071d916a

            Etienne AUJAMES (eaujames@ddn.com) uploaded a new patch: https://review.whamcloud.com/41223
            Subject: LU-14027 ldlm: Do not wait for lock replay sending if import dsconnected
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: d5f9742667dab11393c602807e918c9eb8793b2b

            Peter Jones (pjones) added a comment - Landed for 2.14

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40238/
            Subject: LU-14027 ldlm: Do not hang if recovery restarted during lock replay
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 7ca495ec67f474e10352077fc40123e4818b8e69

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40272/
            Subject: LU-14027 ldlm: Do not wait for lock replay sending if import dsconnected
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: f06a4efe13faca21ae2a6afcf5718d748bb6ac5d

            People

              green Oleg Drokin
              green Oleg Drokin