[LU-12031] DoM/HSM: hsm_release fails after hsm_restore

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.16.0
    • Affects Version/s: Lustre 2.12.0
    • Severity: 3

    Description

       There is an issue when releasing a file striped with DoM after an hsm_restore.

      To reproduce:

      1) create a file with a 1st component on MDT:

      lfs setstripe -E 1M -L mdt -E -1 -S 4M -c -1 /mnt/lustre/domfile

      2) archive and release the file (requires HSM set up)
       

      lfs hsm_archive /mnt/lustre/domfile
      # (wait for archive to complete)
      lfs hsm_release /mnt/lustre/domfile

      3) restore the file

      lfs hsm_restore /mnt/lustre/domfile
      # or cat /mnt/lustre/domfile

      4) release the file => FAILS 

      lfs hsm_release /mnt/lustre/domfile
      
      Cannot send HSM request (use of /mnt/lustre/domfile): Device or resource busy
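
The four steps above can be collected into one script. This is only a sketch: the `LFS` and `POLL_INTERVAL` variables and the helper names (`hsm_has_flag`, `hsm_lacks_flag`, `wait_until`, `reproduce`) are illustrative and not part of Lustre. It assumes that `lfs hsm_state` prints flag words such as `archived` and `released` for the file, and that a working copytool is running.

```shell
# Sketch only: LFS, POLL_INTERVAL and all helper names are illustrative.
LFS="${LFS:-lfs}"

# True if `lfs hsm_state` lists the given flag word for the file.
hsm_has_flag() {
    "$LFS" hsm_state "$1" | grep -qw "$2"
}

# True if the flag word is absent from the `lfs hsm_state` output.
hsm_lacks_flag() {
    ! hsm_has_flag "$1" "$2"
}

# Retry a command up to $1 times, sleeping between attempts.
wait_until() {
    tries=$1
    shift
    while [ "$tries" -gt 0 ]; do
        "$@" && return 0
        tries=$((tries - 1))
        sleep "${POLL_INTERVAL:-1}"
    done
    return 1
}

# Run steps 1-4; the exit status is that of the second hsm_release.
reproduce() {
    file=$1
    "$LFS" setstripe -E 1M -L mdt -E -1 -S 4M -c -1 "$file" || return 1
    "$LFS" hsm_archive "$file" || return 1
    wait_until 60 hsm_has_flag "$file" archived || return 1
    "$LFS" hsm_release "$file" || return 1
    "$LFS" hsm_restore "$file" || return 1
    wait_until 60 hsm_lacks_flag "$file" released || return 1
    "$LFS" hsm_release "$file"
}
```

Calling `reproduce /mnt/lustre/domfile` and checking `$?` should show a non-zero status from the last step on an affected system, matching the "Device or resource busy" error reported here.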

       
      The problem may lie in the data version stored in the HSM EA (extended attribute).
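
One way to probe this suspicion is to compare the output of `lfs data_version` before and after an operation. The helper below is a hedged sketch: `data_version_across` and the `LFS` variable are illustrative names, and it is an assumption that the client-visible data version is the value that matters for the release check.

```shell
# Sketch only: LFS and data_version_across are illustrative names.
LFS="${LFS:-lfs}"

# Run a command and report whether `lfs data_version` of the file
# differs before and after it.
data_version_across() {
    file=$1
    shift
    before=$("$LFS" data_version "$file") || return 1
    "$@" >/dev/null || return 1
    after=$("$LFS" data_version "$file") || return 1
    if [ "$before" = "$after" ]; then
        echo "same ($before)"
    else
        echo "changed ($before -> $after)"
    fi
}
```

For example, `data_version_across /mnt/lustre/domfile cat /mnt/lustre/domfile` run on a released file would trigger an implicit restore through `cat` and print whether the data version moved.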

          Activity

            [LU-12031] DoM/HSM: hsm_release fails after hsm_restore

            nangelinas Nikitas Angelinas added a comment - Peter, is there any remaining work for this issue or can it be closed?

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/47139/
            Subject: LU-12031 mdt: explicit data version of DoM files
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: aae3289adb2bbc192870f195b78044484f717e16


            "Sergey Cheremencev <sergey.cheremencev@hpe.com>" uploaded a new patch: https://review.whamcloud.com/47497
            Subject: LU-12031 mdt: proof of concept
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: b05ecdbcc682c1d2ef290110bd052bc3e0f2e61a


            adilger Andreas Dilger added a comment - Rather than storing yet another xattr on the DoM inode in this case (which might have issues with backup/restore, etc.), what about just not updating i_version on setxattr from HSM restore (or resetting it to the pre-update i_version)?

            "Mike Pershin <mpershin@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/47139
            Subject: LU-12031 mdt: explicit data version of DoM files
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 5dbeabb034654f258162f05722477645e5f2fffe


            adilger Andreas Dilger added a comment - Ben - see LU-9961

            beevans Ben Evans (Inactive) added a comment - It would be nice if instead of all the playing around with temp files we could just restore to a stripe, and once completed mark it as primary.  We should also be able to restore all the other layout information as well and mark them as secondary.

            tappro Mikhail Pershin added a comment - What is actually wrong with DoM release/restore is that the DoM stripe is not really archived and restored, although it looks that way. On archive, the DoM data is read and stored in the archive, but unlike an OST object, the data in the inode is not truncated and stays untouched. On restore, the DoM data is read from the archive and written to the inode of a volatile file. On layout swap, however, that data is discarded along with the volatile file, and the original data in the original inode simply becomes visible again because the layout says it exists. So the DoM data stays in the inode the whole time, and its copy from the archive is lost along with the volatile file. There is therefore no point in archiving data that is always kept in the inode on disk. I am inclined to go back to the first solution, in which the DoM stripe is either not released, or is removed in favor of the first OST stripe if one exists.

            tappro Mikhail Pershin added a comment - Ben, unlike inode_version, the data_version is not changed by setting an xattr, which is why I was trying to introduce it. As on an OST, it would change only when the data changes: on write, truncate, and fallocate. That solves the problem of metadata operations affecting the data_version, though it requires a separate xattr to store it.
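
The distinction drawn in this comment can be illustrated with a toy model. This is not Lustre code: the counters and the `op_*` function names are hypothetical and only mirror the rule stated above, that data operations bump both versions while an xattr set bumps only the inode version.

```shell
# Toy model, not Lustre code: two counters on one MDT inode.
inode_version=1   # changes on every inode modification
data_version=1    # changes only when file data changes

bump_inode() { inode_version=$((inode_version + 1)); }
bump_data()  { data_version=$((data_version + 1)); }

# Data operations change both versions.
op_write()     { bump_inode; bump_data; }
op_truncate()  { bump_inode; bump_data; }
op_fallocate() { bump_inode; bump_data; }

# A metadata-only operation leaves data_version alone.
op_setxattr()  { bump_inode; }
```

Running `op_write; op_setxattr; echo "$inode_version $data_version"` shows the two counters drifting apart after the xattr update, which is the property an explicit data_version would rely on.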

            beevans Ben Evans (Inactive) added a comment - I think it's much more sinister than that. In the non-DoM case, the data_version is calculated on each portion of the file (all on OSTs), then combined into a single data version and written to an XATTR on the MDT. For DoM, the act of writing the HSM data_version to an XATTR causes the data_version on the MDT to change, unless we can predict what the "next" DoM data_version will be, so that the HSM XATTR agrees with the calculated data_version after the XATTR is written. So for a restore->release case it will always be wrong.
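
The ordering problem described in this comment can be shown with a toy model. This is not Lustre code: `mdt_version`, `hsm_arch_ver`, `record_hsm_version` and `release_allowed` are hypothetical names, and the equality check stands in for whatever comparison the release path actually performs.

```shell
# Toy model, not Lustre code.
mdt_version=41    # a version that changes on any inode update, xattr writes included
hsm_arch_ver=""   # the version recorded in the HSM xattr

# Sample the current version and store it in the xattr.
# The store is itself an inode update, so it bumps the version.
record_hsm_version() {
    hsm_arch_ver=$mdt_version
    mdt_version=$((mdt_version + 1))
}

# Release is permitted only if the recorded and current versions agree.
release_allowed() {
    [ "$hsm_arch_ver" -eq "$mdt_version" ]
}
```

After `record_hsm_version`, the recorded value is always one behind the current one in this model, so `release_allowed` fails unless the writer can predict the next version.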

            adilger Andreas Dilger added a comment - Mike, in case it is helpful to you, newer ext4 code has a "swap data" operation that is meant to allow swapping a "volatile" file into the boot loader inode. This could be used to swap data between two DoM files if needed.

            That said, your recent comments indicate that it isn't the DoM data swap that is the main obstacle, but the ordering problem of the data version. IMHO, a content-based hash is probably still too expensive if the data version is used regularly. That would make inode operations that need 1KB/inode into data operations that need (possibly) 1MB/inode, or at least 64KB/inode. There was some discussion recently on whether the data version should be used for NFS file modification tracking, so doing a DoM checksum on every file access would be punishing. Storing a separate xattr would be much more efficient.

            Maybe I'm missing something, but is it not possible to store the "original" object version in the swapped MDT inode? This might mess with recovery, but if the volatile file is gone it would be pretty clear that the layout swap could not be replayed in any case. We could also special-case the replay operation for layout swap to take this into consideration.

            People

              Assignee: tappro Mikhail Pershin
              Reporter: cealustre CEA
              Votes: 0
              Watchers: 18
