  1. Lustre
  2. LU-19829

Files lost due to the removal of request deduplication in HSM


Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.18.0
    • Lustre 2.15.7, Lustre 2.15.8
    • 1

    Description

      At CEA, we have lost a lot of files due to the removal of HSM request deduplication in 2.15 (LU-13651). Our copytools failed to archive files when the same file was archived multiple times in parallel. This led to the copy in the HSM backend being removed, but since at least one of the archives succeeded, Lustre marked the file as archived. Some of these files were then released by our policy engine. Those files have therefore been lost...
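The sequence above can be illustrated with a toy Python model. This is only a sketch under stated assumptions, not Lustre or copytool code: `replay` is a hypothetical helper, the flag values mirror `HS_EXISTS` and `HS_ARCHIVED` from Lustre's `enum hsm_states`, and the ordering (one archive succeeds, then a parallel one fails and cleans up the backend copy) is assumed for illustration.

```python
# Flag values mirror HS_EXISTS / HS_ARCHIVED from Lustre's enum hsm_states.
HS_EXISTS = 0x1
HS_ARCHIVED = 0x8


def replay(outcomes):
    """Replay archive completions for one file, in completion order.

    outcomes: list of "ok" / "fail" strings (hypothetical event format).
    Returns (lustre_flags, backend_has_copy).
    """
    flags = 0
    backend_has_copy = False
    for outcome in outcomes:
        if outcome == "ok":
            # A successful archive stores the copy and Lustre records it.
            backend_has_copy = True
            flags |= HS_EXISTS | HS_ARCHIVED
        elif outcome == "fail":
            # Assumed backend behavior: the failing copytool removes the
            # shared copy. Lustre ignores the failure, so flags stay as-is.
            backend_has_copy = False
        else:
            raise ValueError(f"unknown outcome: {outcome!r}")
    return flags, backend_has_copy


flags, backend_has_copy = replay(["ok", "fail"])
looks_archived = bool(flags & HS_ARCHIVED)
print(looks_archived, backend_has_copy)  # prints: True False
```

In this model the file still looks archived to a policy engine although the backend no longer holds a copy, so releasing it loses the data.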

      The real issue comes from the lack of support for parallel archives in our copytool (which could be hard to implement depending on the backend). That being said, I think Lustre could have avoided these issues if it did not ignore failed archives as it does currently. It would be nice if Lustre was more resilient to HSM backends that don't behave well in the face of parallel archives. I see a few ways to improve things:

      1. mark a file dirty at the end of a failed archive iff the file is in the state "exists archived" (a file can only be in this state at the end of a failed archive if multiple archive requests were sent)
      2. reintroduce request deduplication: this could be made optional so as to avoid the performance issues that originally led to its removal
      3. have a more scalable way of detecting request deduplication
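Options 1 and 2 can be sketched as a toy Python model. This is an illustration under assumptions, not the Lustre coordinator code: `on_archive_end`, `can_release` and `ToyCoordinator` are hypothetical names, the flag values mirror `HS_EXISTS`, `HS_DIRTY` and `HS_ARCHIVED` from Lustre's `enum hsm_states`, and the handling of a successful archive is simplified.

```python
# Flag values mirror Lustre's enum hsm_states.
HS_EXISTS = 0x1
HS_DIRTY = 0x2
HS_ARCHIVED = 0x8


def on_archive_end(flags, success):
    """Option 1: return the new HSM flags at the end of an archive request."""
    if success:
        # Simplified: a successful archive yields a clean archived copy.
        return (flags | HS_EXISTS | HS_ARCHIVED) & ~HS_DIRTY
    if (flags & HS_EXISTS) and (flags & HS_ARCHIVED):
        # A failed archive on an "exists archived" file means another
        # request ran in parallel: stop trusting the backend copy.
        return flags | HS_DIRTY
    return flags


def can_release(flags):
    """A file is only safe to release if archived and not dirty."""
    return bool(flags & HS_ARCHIVED) and not (flags & HS_DIRTY)


class ToyCoordinator:
    """Option 2: optional deduplication of (fid, action) requests."""

    def __init__(self, dedup=True):
        self.dedup = dedup
        self.pending = set()  # keys of queued or running requests
        self.queue = []

    def submit(self, fid, action="ARCHIVE"):
        key = (fid, action)
        if self.dedup and key in self.pending:
            return False  # duplicate dropped
        self.pending.add(key)
        self.queue.append(key)
        return True

    def complete(self, fid, action="ARCHIVE"):
        self.pending.discard((fid, action))


flags = on_archive_end(0, success=True)       # first archive succeeds
flags = on_archive_end(flags, success=False)  # parallel archive fails
print(can_release(flags))  # prints: False
```

With the dirty flag set, a policy engine that checks the HSM state would not release the file, and a later archive can restore a valid copy. The set lookup in `submit` is constant time on average, which hints at what a more scalable duplicate check (option 3) could look like, though the real cost depends on how the coordinator stores its requests.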

      For the time being, we will reintroduce the old deduplication mechanism to stabilize our production systems. I think marking files as dirty when an archive fails is a good approach to prevent this kind of issue regardless of the support for deduplication.


            People

              courrier Guillaume Courrier
              courrier Guillaume Courrier