[LU-1565] lost LDLM_CANCEL RPCs - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Minor
Fix Version/s: Lustre 2.4.0
Affects Version/s: None
Labels:
- patch

Severity: 3
Rank (Obsolete): 7300

Description

this is the 2nd part of the issue described in LU-1239: LDLM_CANCEL will not be resent after reconnect because cancels are always marked as no_resend and no_delay. the case is precisely: rpc timeout, re-connect, cancel rpc resend.

I believe it is enough to just drop the no_resend and no_delay flags on the LDLM_CANCEL rpc.
regarding the target case (rpc timeout, re-connect, cancel resend): with the flags dropped, it works as expected.
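
The effect of the two flags on the target case can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Lustre code: `struct toy_request` and every `toy_*` function are hypothetical names, and the semantics assumed are that `no_resend` means "fail the request instead of resending it after a timeout and reconnect" and `no_delay` means "fail immediately instead of waiting for the connection to come back".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for an RPC request. Only the two
 * flags discussed in this ticket are modelled. */
struct toy_request {
	bool no_resend;   /* assumed: never resend after timeout + reconnect */
	bool no_delay;    /* assumed: fail at once if the connection is down */
	int  send_count;  /* how many times the request went on the wire */
};

/* Target case: the first send times out, the client reconnects, and the
 * resend policy is applied. Returns true if the request is resent. */
static bool toy_timeout_then_reconnect(struct toy_request *req)
{
	assert(req != NULL);
	req->send_count = 1;       /* initial send, the reply never arrives */
	if (req->no_resend)
		return false;      /* request is failed: the cancel is lost */
	req->send_count++;         /* resent over the new connection */
	return true;
}

/* A request created while the connection is down waits for it to come
 * back unless no_delay is set. */
static bool toy_waits_for_connection(const struct toy_request *req)
{
	assert(req != NULL);
	return !req->no_delay;
}

/* Cancel as described before the fix: both flags set. */
static struct toy_request toy_cancel_before_fix(void)
{
	struct toy_request req = { .no_resend = true, .no_delay = true,
				   .send_count = 0 };
	return req;
}

/* Cancel as proposed here: both flags dropped. */
static struct toy_request toy_cancel_after_fix(void)
{
	struct toy_request req = { .no_resend = false, .no_delay = false,
				   .send_count = 0 };
	return req;
}
```

In this model, `toy_timeout_then_reconnect()` returns false for a request built by `toy_cancel_before_fix()` (the cancel is sent once and then lost) and true for one built by `toy_cancel_after_fix()` (the cancel is sent a second time after reconnect).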

let's look through the recovery scenarios:

1. cancel is sent and the reply is received, then recovery starts: neither the enqueue nor the cancel is replayed.
2. cancel is sent, no reply is received, then recovery starts; see replay_one_lock():

        if (lock->l_flags & LDLM_FL_CANCELING) {
                LDLM_DEBUG(lock, "Not replaying canceled lock:");
                RETURN(0);
        }

cancelling is already in progress, so neither the enqueue nor the cancel is replayed again.
3. lock is enqueued, recovery starts, and the cancel is created after the lock enqueue is replayed
but is not sent until the recovery ends. precisely:

- BL AST is sent to the client from the server
- recovery starts
- lock enqueue is replayed
- BL AST comes to the client, CANCEL rpc is created
- recovery ends
- CANCEL is sent

as CANCEL is not replayed, it does not break the recovery; it is sent as a new rpc.
however, as the lock handle has changed, ESTALE will be returned because there is no such
lock on the server side.
this case is not covered by the current fix: it will still result in a lock callback timeout
(waiting_locks_callback()) followed by client eviction. however, the race window
is very narrow compared with the main resend case being fixed here, so it is
left unfixed for now.

4. lock is enqueued, recovery starts, lock enqueue is replayed and the lock handle is updated,
cancel is created - it already contains the new lock handle.

5. all the cases when cancel is created even later are the same as (4).
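
The difference between scenarios (2), (3) and (4) can be sketched with a toy server lock table. This is a hedged illustration, not Lustre code: all `toy_*` names are hypothetical, and the model assumes that a replayed enqueue gives the lock a fresh server handle and that a cancel carries whichever handle the client held when the cancel was created.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_MAX_LOCKS 8

/* Hypothetical server state: the lock handles the server currently knows. */
struct toy_server {
	uint64_t handles[TOY_MAX_LOCKS];
	int      count;
	uint64_t next_handle;
};

/* Hypothetical client-side view of one lock. */
struct toy_client_lock {
	uint64_t remote_handle; /* server handle the client believes is current */
	bool     canceling;     /* stand-in for LDLM_FL_CANCELING */
};

static void toy_server_init(struct toy_server *srv)
{
	srv->count = 0;
	srv->next_handle = 100;
}

/* Enqueue or replayed enqueue: the server creates a lock with a new handle. */
static uint64_t toy_server_enqueue(struct toy_server *srv)
{
	assert(srv->count < TOY_MAX_LOCKS);
	uint64_t handle = srv->next_handle++;
	srv->handles[srv->count++] = handle;
	return handle;
}

/* Server restart: all lock state is lost, handles are never reused. */
static void toy_server_restart(struct toy_server *srv)
{
	srv->count = 0;
}

/* Cancel by handle: 0 on success, -ESTALE if the server has no such lock. */
static int toy_server_cancel(struct toy_server *srv, uint64_t handle)
{
	for (int i = 0; i < srv->count; i++) {
		if (srv->handles[i] == handle) {
			srv->handles[i] = srv->handles[--srv->count];
			return 0;
		}
	}
	return -ESTALE;
}

/* Replay one lock during recovery. A lock that is already being cancelled
 * is skipped (scenario 2); otherwise it is re-enqueued and the client's
 * copy of the handle is updated. Returns true if the lock was replayed. */
static bool toy_replay_lock(struct toy_server *srv, struct toy_client_lock *lk)
{
	if (lk->canceling)
		return false;
	lk->remote_handle = toy_server_enqueue(srv);
	return true;
}
```

In this model, a cancel that captured the handle before `toy_replay_lock()` updated it gets `-ESTALE` from `toy_server_cancel()` (scenario 3), while a cancel built from `lk.remote_handle` after the update succeeds (scenario 4). A lock with `canceling` set is never re-enqueued, which mirrors the check quoted from replay_one_lock().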

Issue Links

is related to

LU-18072 Lock cancel resending overwhelms ldlm canceld thread

Resolved

LU-7 Reconnect server->client connection

Resolved

People

Assignee: Keith Mannthey (Inactive)

Reporter: Vitaly Fertman

Votes: 0

Watchers: 9

Dates

Created: 26/Jun/12 9:50 AM

Updated: 29/Jul/24 5:21 PM

Resolved: 20/Mar/13 6:38 PM