[LU-14027] Client recovery statemachine hangs in recovery disconnected during lock reply Created: 14/Oct/20  Updated: 25/Oct/23  Resolved: 19/Nov/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.14.0, Lustre 2.12.6
Fix Version/s: Lustre 2.14.0, Lustre 2.12.7

Type: Bug
Priority: Blocker
Reporter: Oleg Drokin
Assignee: Oleg Drokin
Resolution: Fixed
Votes: 0
Labels: None

Issue Links:
Blocker
Related
is related to LU-13600 limit number of RPCs in flight during... Resolved
is related to LU-16943 replay-single test_135: error: set_pa... Resolved
Severity: 3

 Description   

LU-13600 introduced lock replay rate limiting, but it did not account for a disconnection during the REPLAY_LOCKS phase. When that happens, locks that have not yet been sent get stuck in the sending queue, so the replay locks thread hangs with imp_replay_inflight elevated above zero.

As a direct consequence, the recovery state machine never advances from REPLAY to REPLAY_LOCKS while imp_replay_inflight is nonzero:

        if (imp->imp_state == LUSTRE_IMP_REPLAY) {
                CDEBUG(D_HA, "replay requested by %s\n",
                       obd2cli_tgt(imp->imp_obd));
                rc = ptlrpc_replay_next(imp, &inflight);
                if (inflight == 0 &&
                    atomic_read(&imp->imp_replay_inflight) == 0) {
                        import_set_state(imp, LUSTRE_IMP_REPLAY_LOCKS);
                        rc = ldlm_replay_locks(imp);
                        if (rc)
                                GOTO(out, rc);
                }
                rc = 0;
        }

To break this, we need to either check the import state in the replay locks thread before attempting to send anything, or make sure replay_one_lock() prepares resend requests in such a state that they never get stuck.
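
The first option can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Lustre code: every toy_* name, the disconnect_after parameter, and the check_state switch are hypothetical and exist purely for illustration, and the patches that actually landed may be structured differently. The model shows a replay loop that holds one reference on the inflight counter, a simulated disconnect partway through, and how checking the import state before each send lets the loop drop its reference so the REPLAY gate can pass again.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for import states; not the real LUSTRE_IMP_* values. */
enum toy_imp_state {
	TOY_IMP_DISCON,
	TOY_IMP_REPLAY,
	TOY_IMP_REPLAY_LOCKS,
};

struct toy_import {
	enum toy_imp_state state;
	int replay_inflight;	/* models imp_replay_inflight */
};

/*
 * Hypothetical lock replay loop. It holds one reference on
 * replay_inflight while it runs. A disconnect is simulated just
 * before lock number 'disconnect_after' would be sent.
 *
 * With check_state == true the loop notices the state change, stops
 * sending, and drops its reference. With check_state == false it
 * models the hang: a real thread would block forever here, so the
 * model returns with the reference still held.
 *
 * Returns the number of locks sent before stopping.
 */
int toy_replay_locks(struct toy_import *imp, int nlocks,
		     int disconnect_after, bool check_state)
{
	int sent = 0;

	assert(nlocks >= 0);
	imp->replay_inflight++;		/* the thread's own reference */

	for (int i = 0; i < nlocks; i++) {
		if (i == disconnect_after)
			imp->state = TOY_IMP_DISCON;

		if (imp->state != TOY_IMP_REPLAY_LOCKS) {
			if (check_state)
				break;		/* bail out, release below */
			return sent;		/* stuck: reference leaked */
		}
		sent++;
	}

	imp->replay_inflight--;
	return sent;
}

/* Mirrors the gate quoted above: advance only when nothing is in flight. */
bool toy_can_advance(const struct toy_import *imp, int inflight)
{
	return imp->state == TOY_IMP_REPLAY &&
	       inflight == 0 && imp->replay_inflight == 0;
}
```

In this model, calling toy_replay_locks() with check_state set to false leaves replay_inflight at 1 after the simulated disconnect, so toy_can_advance() stays false once recovery restarts and the import is back in the REPLAY state. With check_state set to true the counter returns to 0 and the gate opens.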



 Comments   
Comment by Gerrit Updater [ 14/Oct/20 ]

Oleg Drokin (green@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40238
Subject: LU-14027 ldlm: Do not hang if recovery restarted during lock replay
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 1ed765025e4a2b34ff992cb5c461557bc35ad154

Comment by Gerrit Updater [ 16/Oct/20 ]

Oleg Drokin (green@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40272
Subject: LU-14027 ldlm: Do not wait for lock replay sending if import dsconnected
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: bffa4ae3a3c38f7cd3bea2b7fbf8e09df98e46a0

Comment by Gerrit Updater [ 19/Nov/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40272/
Subject: LU-14027 ldlm: Do not wait for lock replay sending if import dsconnected
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: f06a4efe13faca21ae2a6afcf5718d748bb6ac5d

Comment by Gerrit Updater [ 19/Nov/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40238/
Subject: LU-14027 ldlm: Do not hang if recovery restarted during lock replay
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 7ca495ec67f474e10352077fc40123e4818b8e69

Comment by Peter Jones [ 19/Nov/20 ]

Landed for 2.14

Comment by Gerrit Updater [ 14/Jan/21 ]

Etienne AUJAMES (eaujames@ddn.com) uploaded a new patch: https://review.whamcloud.com/41223
Subject: LU-14027 ldlm: Do not wait for lock replay sending if import dsconnected
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: d5f9742667dab11393c602807e918c9eb8793b2b

Comment by Gerrit Updater [ 14/Jan/21 ]

Etienne AUJAMES (eaujames@ddn.com) uploaded a new patch: https://review.whamcloud.com/41224
Subject: LU-14027 ldlm: Do not hang if recovery restarted during lock replay
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: c54bb57e1687f0db23753eea0b100cc5071d916a

Comment by Etienne Aujames [ 14/Jan/21 ]

The patches above fix https://review.whamcloud.com/39111/ ("LU-13600 ptlrpc: limit rate of lock replays") on the b2_12 branch.

Comment by Gerrit Updater [ 14/Jan/21 ]

Etienne AUJAMES (eaujames@ddn.com) uploaded a new patch: https://review.whamcloud.com/41227
Subject: LU-14027 tests: Fix test_135 of replay-single
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a2d9b877521a0198f333228b117380c5c855e6e8

Comment by Gerrit Updater [ 04/Mar/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/41223/
Subject: LU-14027 ldlm: Do not wait for lock replay sending if import dsconnected
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 2bcc166b0a660afab62d96ede496f42c31ada94b

Comment by Gerrit Updater [ 04/Mar/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/41224/
Subject: LU-14027 ldlm: Do not hang if recovery restarted during lock replay
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 5fa7c8f24e71187a0c3ac70a04a8b566de5a76f3

Comment by Gerrit Updater [ 25/Oct/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/41227/
Subject: LU-14027 tests: Fix test_135 of replay-single
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: fab71963c2513ec8f4eff2c1636c767c47a46034
