Lustre / LU-1565

lost LDLM_CANCEL RPCs

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.4.0
    • None
    • 3
    • 7300

    Description

      This is the second part of the issue described in LU-1239. An LDLM_CANCEL RPC is not resent after a reconnect, because cancels are always marked no_resend and no_delay. The failing sequence is precisely: RPC timeout, reconnect, cancel RPC resend.

      I believe it is enough to drop the no_resend and no_delay flags on the LDLM_CANCEL RPC.
      With the flags dropped, the target case (RPC timeout, reconnect, cancel resend) works as expected.
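
The effect of the two flags on the target case can be illustrated with a small standalone model. This is only a sketch under stated assumptions: `struct model_req`, `enum model_outcome` and `model_after_reconnect()` are hypothetical names invented for illustration and are not the ptlrpc API. Only the flag names no_resend and no_delay come from the description above, and the model assumes that either flag is enough to keep a timed-out request from being sent again after the reconnect.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a request; not the real struct ptlrpc_request. */
struct model_req {
        bool no_resend; /* assumed: never resend after a timeout or reconnect */
        bool no_delay;  /* assumed: fail at once instead of waiting for the import */
};

enum model_outcome {
        MODEL_RESENT,  /* resent once the connection is back */
        MODEL_DROPPED, /* given up: the server never sees the request */
};

/*
 * Decide what happens to a request that timed out and whose connection
 * was then re-established (RPC timeout, reconnect, resend).
 */
enum model_outcome model_after_reconnect(const struct model_req *req)
{
        assert(req != NULL);

        if (req->no_resend || req->no_delay)
                return MODEL_DROPPED;

        return MODEL_RESENT;
}
```

In this model a cancel with both flags set is dropped after the reconnect, which corresponds to the lost LDLM_CANCEL, while a cancel with both flags cleared is resent.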

      Let's look through the recovery scenarios:

      1. The cancel is sent and the reply is received, then recovery starts: neither the enqueue nor the cancel is replayed.
      2. The cancel is sent, no reply is received, and recovery starts. See replay_one_lock():

      if (lock->l_flags & LDLM_FL_CANCELING) {
              LDLM_DEBUG(lock, "Not replaying canceled lock:");
              RETURN(0);
      }

      Cancelling is already in progress, so neither the enqueue nor the cancel is replayed.
      3. The lock is enqueued and recovery starts. The cancel is created after the lock enqueue is replayed,
      but is not sent until recovery ends. Precisely:

      BL AST is sent to the client from the server
      recovery starts
      lock enqueue is replayed
      BL AST arrives at the client, CANCEL RPC is created
      recovery ends
      CANCEL is sent

      As the CANCEL is not replayed, it does not break recovery; it is sent as a new RPC.
      However, because the lock handle has changed, ESTALE is returned, as there is no such
      lock on the server side.
      This case is not covered by the current fix. It still results in a lock callback timeout
      (waiting_locks_callback()) followed by client eviction. However, the race window
      is very narrow compared with the main resend case being fixed here, so it is left
      unfixed for now.

      4. The lock is enqueued, recovery starts, the lock enqueue is replayed and the lock handle is updated,
      then the cancel is created: it already contains the new lock handle.

      5. All cases in which the cancel is created even later are the same as (4).
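
Scenarios 2 to 4 can be sketched as a standalone model. All names here (`struct model_lock`, `model_should_replay()`, `model_replay_enqueue()`, `model_server_cancel()`) are hypothetical and exist only for illustration; they are not LDLM functions. The sketch assumes the server accepts a cancel only when it names a handle the server currently knows, and it does not try to explain why the handle in scenario 3 is stale. It only shows the consequence.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical client-side lock; not the real struct ldlm_lock. */
struct model_lock {
        uint64_t remote_handle; /* server-side handle known to the client */
        bool canceling;         /* stands in for LDLM_FL_CANCELING */
};

/* Scenario 2: a lock that is already being cancelled is not replayed. */
bool model_should_replay(const struct model_lock *lock)
{
        assert(lock != NULL);
        return !lock->canceling;
}

/* Replaying the enqueue gives the lock a new server-side handle. */
void model_replay_enqueue(struct model_lock *lock, uint64_t new_handle)
{
        assert(lock != NULL);
        lock->remote_handle = new_handle;
}

/*
 * Server side (assumed behaviour): a cancel succeeds only if it names
 * the handle the server currently has; otherwise -ESTALE is returned.
 */
int model_server_cancel(uint64_t current_handle, uint64_t cancel_handle)
{
        return cancel_handle == current_handle ? 0 : -ESTALE;
}
```

In this model a cancel built with the handle taken after `model_replay_enqueue()` succeeds, as in scenarios 4 and 5, while a cancel that still carries a handle from before the replay gets -ESTALE, which is the outcome described for scenario 3.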

            People

              keith Keith Mannthey (Inactive)
              vitaly_fertman Vitaly Fertman