HSM info as part of LOV layout xattr

Details

    • New Feature
    • Resolution: Unresolved
    • Minor

    Description

      (HPE LUS-5984)

      Motivation

      As mentioned in LU-10092 and discussed in a concall, it seems that treating Lustre's HSM information as a first-class layout type can bring an alignment of common code paths:

      • Conceptually, mirroring or migrating between Lustre pools is very similar to mirroring or migrating to an HSM
      • lfs migrate, FLR mirroring, and HSM data movement are all done through a userspace copytool. The same layout-aware copytool might be usable for all these cases.
      • LU-6081 provides an option to pipe lfs migrate requests through the MDS HSM coordinator queue. lfs mirror resync could be treated similarly.
      • There may be policies involved with mirroring, migrating, and archiving
      • Some policies are internal to Lustre, some may be external (e.g. hsm restore is internal, lfs mirror resync delayed is external, lfs mirror resync immediate is internal), blurring the lines of where policy is managed.
      • It may be desirable to expand the idea of striping hints to also include 
      • There may be a desire to keep partial file components in the HSM, for limiting restore extents or for PFL layouts.
      • There may be other types of layouts beyond HSM where clients may not be able to access that layout's format/layout type directly (e.g. RAID6 parity), and would request an HSM-style restore to a more common layout type. 

      From LU-10092:

       

      This could potentially also integrate nicely with composite files and FLR if we enhanced the Lustre layout to include an "HSM layout" component (equivalent to LOV_MAGIC_V1).  The "LOV_MAGIC_HSM" component describes a file in an HSM archive, storing the HSM archive number, "UUID" of the file within the archive, and other parameters (e.g. archive timestamp) needed to identify the file.  The archive timestamp could be useful for storing multiple replicas of the file in HSM and using it for file versioning, along with the FLR mirror_io equivalent to open up a specific component to access an older version of the file.

       

       

      Implementation

      Every layout should get a set of common parameters

      • stored extent range, offset
      • layout generation
      • timestamp
      • read priority (8b)
      • write priority (8b)
      • policy type (16b, see below)
      • flags:
        • writable (turn off to make immutable)
        • readable (maybe never want to read very slow devices)
        • data missing (dead OST, or missing HSM file; unreadable)
        • delay_sync (delayed resync only, not immediate)
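
One way to picture the common parameters listed above is as a fixed header carried by every layout component. This is only a sketch: the struct name, field names, flag values, the 64-bit widths for the extent, offset and timestamp, and the rule that a component with missing data is neither readable nor writable are all assumptions made for illustration, not anything defined in Lustre. Only the 8-bit priorities and the 16-bit policy type come from the list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits; names follow the list above, values are arbitrary. */
enum lcp_flags {
	LCP_FL_WRITABLE     = 0x0001, /* clear to make the component immutable */
	LCP_FL_READABLE     = 0x0002, /* clear to keep reads off very slow devices */
	LCP_FL_DATA_MISSING = 0x0004, /* dead OST or missing HSM file */
	LCP_FL_DELAY_SYNC   = 0x0008, /* delayed resync only, not immediate */
};

/* Hypothetical common header shared by every layout type. */
struct layout_common_params {
	uint64_t lcp_ext_start;  /* stored extent range: first byte */
	uint64_t lcp_ext_end;    /* stored extent range: end (assumed exclusive) */
	uint64_t lcp_offset;     /* offset (meaning not spelled out in the list) */
	uint64_t lcp_timestamp;  /* timestamp (width assumed) */
	uint32_t lcp_layout_gen; /* layout generation */
	uint16_t lcp_policy;     /* policy type (16b) */
	uint8_t  lcp_rd_prio;    /* read priority (8b) */
	uint8_t  lcp_wr_prio;    /* write priority (8b) */
	uint32_t lcp_flags;      /* enum lcp_flags bits */
	uint32_t lcp_padding;    /* keeps the struct a multiple of 8 bytes */
};

/* The fields are laid out so that no implicit padding is needed. */
static_assert(sizeof(struct layout_common_params) == 48,
	      "unexpected padding in layout_common_params");

/* Assumed rule: readable only if flagged readable and data is present. */
static inline bool lcp_can_read(const struct layout_common_params *p)
{
	return (p->lcp_flags & LCP_FL_READABLE) &&
	       !(p->lcp_flags & LCP_FL_DATA_MISSING);
}

/* Assumed rule: writable only if flagged writable and data is present. */
static inline bool lcp_can_write(const struct layout_common_params *p)
{
	return (p->lcp_flags & LCP_FL_WRITABLE) &&
	       !(p->lcp_flags & LCP_FL_DATA_MISSING);
}
```

Packing both priorities and the policy type next to the layout generation keeps the header free of holes, which matters if such a structure were ever stored on disk or sent over the wire.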

      The HSM layout would roughly mirror the contents of today's HSM EA:

      • archive number (32b)
      • archive type (if the archive number might be the client ID for PCC, we might want another classifier for different types of archives)
      • archive file key?

      Adding an archive file key might be helpful where an HSM backend can't easily reference files by the Lustre FID. Problematically, this might be large - 1024 char string?
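
To make the size concern concrete, here is a rough sketch of an HSM layout component carrying a backend file key. Everything named here is hypothetical: the struct, its fields, the 16-bit key length, the 1024-byte limit (taken from the tentative figure above), and the choice of -ENAMETOOLONG for an oversized key.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Upper bound floated in the text above; not a settled value. */
#define HSM_KEY_MAX 1024

/* Hypothetical HSM layout component, loosely following today's HSM EA. */
struct hsm_layout_sketch {
	uint32_t hl_archive_id;       /* archive number (32b) */
	uint32_t hl_archive_type;     /* classifier for the kind of archive */
	uint16_t hl_key_len;          /* bytes used in hl_key */
	char     hl_key[HSM_KEY_MAX]; /* backend's own file key, not NUL-terminated */
};

/*
 * Store a backend key in the component.
 * Returns 0 on success, or -ENAMETOOLONG (leaving the component
 * untouched) if the key does not fit.
 */
static inline int hsm_layout_set_key(struct hsm_layout_sketch *hl,
				     const char *key)
{
	size_t len;

	assert(hl != NULL && key != NULL);
	len = strlen(key);
	if (len > HSM_KEY_MAX)
		return -ENAMETOOLONG;

	memcpy(hl->hl_key, key, len);
	hl->hl_key_len = (uint16_t)len;
	return 0;
}
```

A fixed 1024-byte array makes every HSM component large even when the key is short; a variable-length trailing key sized by hl_key_len would be the obvious alternative if that overhead proves to be a problem.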

      Layout-as-policy

      In general with FLR we are starting to have "implied policies" in the layout: the presence of an FLR layout implies that the file will be copied to the mirror. It also specifies a timeframe (delayed or immediate) and the number of mirrors requested. It might be good to embrace this a little bit and think about adding some more explicit policy details to the layouts:

      • Schedule delayed resync on close-after-write
      • Evacuate "primary" mirror after completing resync (for e.g. SSD to HDD tiering)
      • Redundancy goal
      • Restore target striping hint (lov_user_md?)

      Since it is difficult to predict all the use cases here, it may make sense to leave such a policy in a YAML or JSON extensible format.
      I understand that this opens a big can of worms; I think for starters we can just add a small integer "policy number" and leave further definition for the future.

          Activity

            [LU-10606] HSM info as part of LOV layout xattr
            rread Robert Read added a comment -

            From the design doc:

            enum hsm_states {
                    HS_NONE             = 0x00000000,
                    HS_EXISTS           = 0x00000001,
                    HS_DIRTY            = 0x00000002,
                    HS_RELEASED         = 0x00000004,
                    HS_ARCHIVED         = 0x00000008,
                    HS_NORELEASE        = 0x00000010,
                    HS_NOARCHIVE        = 0x00000020,
                    HS_LOST             = 0x00000040,
                    HS_PCCRW            = 0x00000080,
                    HS_PCCRO            = 0x00000100,
            };
            

            We may also need an HSM state to indicate that the remote copy has been modified and the current file in Lustre is stale. This is needed for cache-style use cases where Lustre is not the primary location of the file.

            Instead of adding new states for PCCRW and PCCRO, why not integrate with existing states? PCCRW could be HS_EXISTS (IIRC), and we could have a new state HS_EXISTS_RO that could be used for both HSM and PCCRO.

            enum lu_hsm_types {
                    LU_HSM_TYPE_NONE    = 0,
                    LU_HSM_TYPE_POSIX   = 1,  /* Copytool lhsm_posix */
                    LU_HSM_TYPE_PCCRW   = 2,  /* Used for PCC-RW */
                    LU_HSM_TYPE_PCCRO   = 3,  /* Used for PCC-RO */
                    LU_HSM_TYPE_S3      = 4,  /* Used for S3 */
                    LU_HSM_TYPE_UNKNOWN = 0xffffffff,
            };
            

            Why include the specific HSM backend type here, such as POSIX and S3? There are many possible backends, and it is the copytool's problem to work out which one to use. Instead, it would make sense to have archive attributes here such as RO, RW, and perhaps authoritative or non-authoritative.
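
As a rough illustration of the alternative suggested above, the backend-specific type enum could give way to a small set of attribute bits that describe how the archive behaves rather than what it is. The names, values, and validity rules below are invented for this sketch and do not appear in the design doc or in Lustre.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical archive attribute bits (names and values are illustrative). */
enum lu_archive_attrs {
	LU_ARCHIVE_ATTR_RO            = 0x1, /* archive copy is read-only */
	LU_ARCHIVE_ATTR_RW            = 0x2, /* archive copy may be written */
	LU_ARCHIVE_ATTR_AUTHORITATIVE = 0x4, /* archive, not Lustre, is primary */
};

#define LU_ARCHIVE_ATTR_MASK \
	(LU_ARCHIVE_ATTR_RO | LU_ARCHIVE_ATTR_RW | LU_ARCHIVE_ATTR_AUTHORITATIVE)

/* Assumed rule: no unknown bits, and exactly one of RO / RW is set. */
static inline bool lu_archive_attrs_valid(uint32_t attrs)
{
	bool ro = (attrs & LU_ARCHIVE_ATTR_RO) != 0;
	bool rw = (attrs & LU_ARCHIVE_ATTR_RW) != 0;

	if (attrs & ~(uint32_t)LU_ARCHIVE_ATTR_MASK)
		return false;
	return ro != rw;
}

/*
 * Assumed rule: when the archive is authoritative, the copy held in
 * Lustre is only a cache and may be stale, so it would need a state
 * such as the "remote copy modified" one proposed above.
 */
static inline bool lu_archive_lustre_may_be_stale(uint32_t attrs)
{
	assert(lu_archive_attrs_valid(attrs));
	return (attrs & LU_ARCHIVE_ATTR_AUTHORITATIVE) != 0;
}
```

With attributes like these, the choice between POSIX, S3, or any other backend stays entirely inside the copytool, keyed off the archive number.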


            nrutman Nathan Rutman added a comment -

            Seems like some work was started here and then abandoned? After the LUG22 HSM presentation today, I'd like to nudge this again with maybe a clearer list of benefits:

            • partial file restore. HSM layout as one element of a composite layout, maybe even using something like SEPFL.
            • keep file head on disk, archive long tail. Some file types have useful info embedded at the front, the rest isn't regularly used (e.g. file icons, or hdf5 info)
            • multiple archives per file. Current code only lists a single archive, no redundancy.
            • mirror to hsm. FLR layout with HSM as second mirror, allows for tracking archive info even when file is restored in Lustre. Allows for immediate punch of primary mirror on low space (if hsm mirror is synced).

            gerrit Gerrit Updater added a comment -

            Yingjin Qian (qian@ddn.com) uploaded a new patch: https://review.whamcloud.com/39599
            Subject: LU-10606 hsm: convert old HSM xattr into HSM layout
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 0bf3098f389f03620e915a9e765378d83ac74b36

            gerrit Gerrit Updater added a comment -

            Yingjin Qian (qian@ddn.com) uploaded a new patch: https://review.whamcloud.com/39387
            Subject: LU-10606 hsm: store HSM xattr as a basic layout
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: bc21703ce6a3fe3eb541bf3800cc2efaeed0af76

            nrutman Nathan Rutman added a comment -

            LU-11376 adds "foreign" layout type. "HSM" can just be a different type of foreign layout.

            nrutman Nathan Rutman added a comment -

            @john.hammond any input/opinion on this? I think I'm going to start pushing it at Cray.

            nrutman Nathan Rutman added a comment -

            The HSM layout should address current shortcomings as well, so:

            • archive number (32b)
            • archive type (32b)
            • archive flags (32b)
            • archive file key len (16b)
            • archive file key (max 1024B) (size of S3 keys)

            Common layout flags (for all layout types):

            • unavailable (dead OST, or missing HSM file; unreadable)
            • immutable (this mirror is write once)
            • purge_delay (data can be removed after replicating elsewhere)
            • purge_immed (remove data from this layout immediately after replicating)

            write_priority would be used to determine an implied preferred layout. E.g. if mirror A and mirror B are both at wr_prio 1, then clients write to them both simultaneously. Mirror C at prio 2 is written only if A or B are unavailable. (The number of simultaneous mirrors to write should be determined by the count of prio 1 items.)
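
The write_priority rule above can be sketched as a small selection routine. This is one possible reading, not an agreed design: it assumes a lower number means a higher priority, that the number of mirrors to write equals the count of mirrors at the best priority (whether or not they are currently available), and that unavailable mirrors are replaced by the next available ones in ascending priority order. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-mirror view used only for this sketch. */
struct mirror_sketch {
	const char *name;        /* label for the mirror */
	uint8_t     wr_prio;     /* write priority; lower is preferred (assumed) */
	bool        unavailable; /* dead OST or missing HSM file */
};

/*
 * Pick the mirrors a client should write to.
 * chosen[] must have room for n entries; it receives indexes into m[].
 * Returns the number of mirrors selected.
 */
static inline size_t pick_write_mirrors(const struct mirror_sketch *m,
					size_t n, size_t *chosen)
{
	uint8_t best = UINT8_MAX;
	size_t want = 0, got = 0;
	size_t i;

	assert(m != NULL && chosen != NULL);

	/* Find the best (lowest) priority value present. */
	for (i = 0; i < n; i++)
		if (m[i].wr_prio < best)
			best = m[i].wr_prio;

	/* The target count is the number of mirrors at that priority. */
	for (i = 0; i < n; i++)
		if (m[i].wr_prio == best)
			want++;

	/* Fill the target from available mirrors, best priority first. */
	for (unsigned int prio = best; prio <= UINT8_MAX && got < want; prio++)
		for (i = 0; i < n && got < want; i++)
			if (m[i].wr_prio == prio && !m[i].unavailable)
				chosen[got++] = i;

	return got;
}
```

With mirrors A and B at priority 1 and C at priority 2, this selects A and B when all are available, and B and C when A is marked unavailable. If fewer mirrors are available than the target count, the return value reports the shortfall and the caller decides whether a degraded write is acceptable.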

            People

              qian_wc Qian Yingjin
              nrutman Nathan Rutman