Lustre / LU-19400

Rework the HSM cancellation process

Details

    • Improvement
    • Resolution: Unresolved
    • Medium

    Description

      Current implementation

      • HSM cancel requests are received by a ptlrpc thread, which finds the action to cancel directly in the llog using the file FID.
      • A cancel record is added to the llog with the HSM cookie of the action to cancel.
      • The coordinator first processes the request to be canceled and sends it to the copytool. If the cancel and the action are processed at the same time, the action is canceled directly.
      • Then the cancel is processed. It marks the record to be canceled as CANCELED, then looks up the active request hashtable to retrieve the client UUID of the active request to cancel. The whole HAL request will be sent to that client.
      • The coordinator waits for the copytool to acknowledge the cancel request with an hsm_progress carrying ECANCELED (and the cookie of the action to be canceled).
      • If the copytool does not support cancel requests, it ignores the request, and the action to be canceled completes successfully.
      • When an hsm_progress with ECANCELED is received, the coordinator marks the cancel record as successful (ARS_SUCCEED) and removes the active request (the record to be canceled stays in the ARS_CANCELED state).
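
      The flow above can be modeled roughly as follows. This is a simplified Python sketch for illustration only, not Lustre code: the ARS_* names come from this ticket, while the Record and Coordinator classes, their methods, and the action labels are hypothetical stand-ins for the llog records and the active request hashtable.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    # Names mirror the ARS_* states mentioned in the ticket.
    ARS_WAITING = "waiting"
    ARS_STARTED = "started"
    ARS_CANCELED = "canceled"
    ARS_SUCCEED = "succeed"


@dataclass
class Record:
    cookie: int
    action: str                          # e.g. "ARCHIVE" or "CANCEL" (illustrative labels)
    status: Status = Status.ARS_WAITING
    target_cookie: Optional[int] = None  # only set on cancel records


class Coordinator:
    def __init__(self):
        self.llog = {}    # cookie -> Record, stands in for the llog
        self.active = {}  # action cookie -> client UUID, stands in for the hashtable

    def add(self, record):
        self.llog[record.cookie] = record

    def start(self, cookie, client_uuid):
        # The action is sent to a copytool and becomes an active request.
        self.llog[cookie].status = Status.ARS_STARTED
        self.active[cookie] = client_uuid

    def process_cancel(self, cancel_cookie):
        # Mark the target record CANCELED and return the client UUID
        # that should receive the HAL carrying the cancel.
        cancel = self.llog[cancel_cookie]
        target = self.llog[cancel.target_cookie]
        target.status = Status.ARS_CANCELED
        return self.active.get(target.cookie)

    def on_progress_ecanceled(self, action_cookie):
        # The copytool acknowledged: the cancel record succeeds, the active
        # request is removed, and the target record stays ARS_CANCELED.
        for record in self.llog.values():
            if record.action == "CANCEL" and record.target_cookie == action_cookie:
                record.status = Status.ARS_SUCCEED
        self.active.pop(action_cookie, None)


# Usage: an archive is started on one client, then canceled.
coordinator = Coordinator()
coordinator.add(Record(cookie=1, action="ARCHIVE"))
coordinator.start(1, "client-A")
coordinator.add(Record(cookie=2, action="CANCEL", target_cookie=1))
client = coordinator.process_cancel(2)  # client that should receive the HAL
coordinator.on_progress_ecanceled(1)    # copytool acknowledged the cancel
```

      In this model, a copytool that never reports ECANCELED leaves the cancel record pending, which is the gap the issues below describe.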

      Current implementation issues

      • HSM cancel requests cannot directly cancel an action that has not yet started (the request to be canceled must be started first).
      • HSM cancels are processed like any other request: if many requests are pending in the queue, the cancel is stuck behind them, and the action to be canceled can succeed before the cancel is sent to the copytool.
      • HSM cancels are sent in batches with other actions. A cancel needs to be sent to the copytool handling the action to be canceled. So if there are several cancels in the same HAL, the first will succeed, but the others may target actions started on different copytools.
      • If the copytool does not support cancel, the action will succeed (the coordinator could instead ignore copytool requests for a canceled action).
      • The current implementation requires handling many exceptional cases, which makes the code difficult to understand and maintain.

      These issues make "lfs hsm_cancel" unusable/unstable.

      Requirements

      • Cancels should always be sent to the right copytool (the one executing the action).
      • Cancels should be able to cancel actions that have not started.
      • Cancels should be processed directly.
      • If the copytool does not support cancel (timeout, errors) or if the coordinator fails to send the cancel to the copytool, the record should be marked as ARS_CANCELED. In that case, the active request is removed and future hsm_progress updates for that request are ignored by the coordinator.
      • A changelog record needs to be emitted when the action is successfully canceled.
      • Cancels should allow the root user to cancel "remove" requests for files that do not exist in Lustre.
      • If the file exists, the cancel request should be accepted or denied according to the user's credentials and the file permissions.
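
      One possible reading of the state handling in the first four requirements is sketched below. It is a minimal Python model under stated assumptions, not a proposed patch: the Coordinator class, its methods, and the send_cancel callback are hypothetical, and the permission checks and changelog emission are left out.

```python
from enum import Enum


class Status(Enum):
    # Names mirror the ARS_* states mentioned in the ticket.
    ARS_WAITING = "waiting"
    ARS_STARTED = "started"
    ARS_CANCELED = "canceled"
    ARS_SUCCEED = "succeed"


class Coordinator:
    def __init__(self):
        self.status = {}  # action cookie -> Status
        self.active = {}  # action cookie -> UUID of the copytool running it

    def submit(self, cookie):
        self.status[cookie] = Status.ARS_WAITING

    def start(self, cookie, copytool_uuid):
        self.status[cookie] = Status.ARS_STARTED
        self.active[cookie] = copytool_uuid

    def cancel(self, cookie, send_cancel):
        """Cancel right away; return True if the record became ARS_CANCELED.

        send_cancel(uuid, cookie) is a hypothetical transport hook that
        returns True when the copytool acknowledges and may raise OSError.
        """
        if self.status.get(cookie) not in (Status.ARS_WAITING, Status.ARS_STARTED):
            return False  # unknown or already finished: nothing to cancel
        uuid = self.active.pop(cookie, None)  # the active request is removed
        if uuid is not None:
            try:
                # Only the copytool executing the action is contacted.
                send_cancel(uuid, cookie)
            except OSError:
                pass  # a send failure does not block the cancel
        # Not started, acknowledged, ignored or failed: same final state.
        self.status[cookie] = Status.ARS_CANCELED
        return True

    def hsm_progress(self, cookie, succeeded):
        """Return True if the update was applied, False if it was ignored."""
        if self.status.get(cookie) is Status.ARS_CANCELED:
            return False  # late progress for a canceled action is dropped
        if succeeded:
            self.status[cookie] = Status.ARS_SUCCEED
            self.active.pop(cookie, None)
        return True
```

      With this shape, a copytool that ignores the cancel and later reports success through hsm_progress can no longer turn a canceled action into a successful one.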

          People

            eaujames Etienne Aujames