Details
Bug
Resolution: Fixed
Minor
None
3
7300
Description
This is the second part of the issue described in LU-1239: LDLM_CANCEL is not resent after a reconnect because cancels are always marked as no_resend and no_delay. The precise case is: RPC timeout, reconnect, cancel RPC resend.
I believe it is enough to just drop the no_resend and no_delay flags on the LDLM_CANCEL RPC.
Regarding the target case (RPC timeout, reconnect, cancel resend): with the flags dropped, it works as expected.
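As a rough illustration of why the two flags matter, here is a minimal toy model in C. It is not Lustre code: struct toy_request, toy_cancel_request() and toy_resend_after_reconnect() are hypothetical names, and the flag semantics are simplified assumptions (no_resend means "never resend", no_delay means "fail at once instead of waiting for the connection to recover").

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for an RPC request; NOT the real struct ptlrpc_request. */
struct toy_request {
	bool no_resend; /* assumed meaning: never resend after a failure */
	bool no_delay;  /* assumed meaning: fail at once, do not wait for reconnect */
	bool replied;   /* a reply has already been received */
};

/* Hypothetical helper: build a cancel request with or without the fix. */
static struct toy_request toy_cancel_request(bool with_fix)
{
	struct toy_request req = {
		.no_resend = !with_fix, /* before the fix both flags are set */
		.no_delay  = !with_fix,
		.replied   = false,
	};
	return req;
}

/* After a timeout and a reconnect, would this request be sent again? */
static bool toy_resend_after_reconnect(const struct toy_request *req)
{
	assert(req != NULL);
	if (req->replied)
		return false; /* nothing left to resend */
	return !req->no_resend && !req->no_delay;
}
```

In this sketch, toy_cancel_request(false) models a cancel as it is marked today and is never resent, while toy_cancel_request(true) models the proposed change and is resent after the reconnect.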
Let's look through the recovery scenarios:
1. Cancel is sent and the reply is received, then recovery starts: neither the enqueue nor the cancel is replayed.
2. Cancel is sent, no reply is received, recovery starts; see replay_one_lock():
if (lock->l_flags & LDLM_FL_CANCELING)
Cancelling is already in progress, so neither the enqueue nor the cancel is replayed again.
3. Lock is enqueued, recovery starts, cancel is created after the lock enqueue is replayed,
but is not sent until recovery ends. Precisely:
BL AST is sent to client from the server
recovery starts
lock enqueue is replayed
BL AST comes to client, CANCEL rpc is created
recovery ends
CANCEL is sent
As CANCEL is not replayed, it does not break the recovery; it will be a new RPC.
However, as the lock handle has changed, ESTALE will be returned because there is no such
lock on the server side.
This case is not covered by the current fix; it will still result in a lock callback timeout
(waiting_locks_callback()) followed by client eviction. However, the race window
is very narrow compared with the main resend case being fixed here, so it is left
unfixed for now.
4. Lock is enqueued, recovery starts, the lock enqueue is replayed and the lock handle is updated,
then cancel is created: it already contains the new lock handle.
5. All the cases where cancel is created even later are the same as (4).
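Scenarios 2, 3 and 4 can be sketched with a small toy model. This is only an illustration under stated assumptions, not Lustre code: struct toy_lock, TOY_FL_CANCELING, toy_should_replay() and toy_server_cancel() are hypothetical names, and the model assumes that in scenario 3 the cancel carries the handle the client knew before the replay updated it.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_FL_CANCELING 0x1u /* stand-in for LDLM_FL_CANCELING */

/* Toy client-side lock; NOT the real struct ldlm_lock. */
struct toy_lock {
	unsigned int flags;
	uint64_t remote_handle; /* server-side handle as the client knows it */
};

/* Scenario 2: a lock whose cancel is already in progress is not replayed. */
static bool toy_should_replay(const struct toy_lock *lock)
{
	assert(lock != NULL);
	return (lock->flags & TOY_FL_CANCELING) == 0;
}

/*
 * Toy server: the cancel succeeds only if the handle packed into the RPC
 * matches the handle the server currently holds for the lock.
 */
static int toy_server_cancel(uint64_t server_handle, uint64_t handle_in_rpc)
{
	return handle_in_rpc == server_handle ? 0 : -ESTALE;
}
```

With a server handle that changed from 1 to 2 during replay, toy_server_cancel(2, 1) models scenario 3 and returns -ESTALE, while toy_server_cancel(2, 2) models scenarios 4 and 5 and returns 0.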