
HSM: Add Archive UUID to delete changelog records

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor

    Description

      HSM tools currently store an external identifier (such as a UUID) in an EA (extended attribute) when a file is archived. The identifier is used to locate the file in the backend archive, and there may be more than one identifier if the file has been archived to multiple backends. Currently, different tools do this independently and do not coordinate their EA names or formats.

      When a file is deleted, the EA is no longer available, so it would be helpful to include the identifier(s) in the Delete changelog record. I suggest we define a standard name and format for the HSM archive EA, and include this data as-is in the delete changelog record.

      One possible format would be to use JSON to encode a list of endpoints and archive IDs. Here is a strawman example to begin the discussion:

      {
        "replicas": [
          {
            "endpoint": "s3://my-bucket/archive",
            "id": "UUID"
          },
          {
            "endpoint": "wos://address",
            "id": "OID"
          }
        ]
      }

      Alternatively, to save space the endpoint could just be an index that refers to a specific endpoint in the local configuration.
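
      For illustration only, a minimal sketch of how a copytool might read and write such an EA through the standard xattr interface is shown below. The EA name "trusted.hsm.archive" is just a placeholder; agreeing on the actual name and format is exactly what this ticket proposes.

      #include <string.h>
      #include <sys/types.h>
      #include <sys/xattr.h>

      /* Placeholder EA name; the standard name/format is still to be agreed. */
      #define HSM_ARCHIVE_EA "trusted.hsm.archive"

      /* Store the JSON replica list verbatim in the EA. */
      static int hsm_archive_ea_set(const char *path, const char *json)
      {
              return setxattr(path, HSM_ARCHIVE_EA, json, strlen(json), 0);
      }

      /* Read the JSON replica list back; returns its length, or -1 on error. */
      static ssize_t hsm_archive_ea_get(const char *path, char *buf, size_t len)
      {
              return getxattr(path, HSM_ARCHIVE_EA, buf, len);
      }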

          Activity

            [LU-7207] HSM: Add Archive UUID to delete changelog records

            fzago Frank Zago (Inactive) added a comment -

            It's configurable in Robinhood. For instance Cray's copytool stores it in "trusted.tascon.uuid", and the format is ascii.

            adilger Andreas Dilger added a comment -

            Henri, what is the xattr name used by the RBH copytool, and in what format does it store the archive identifier in the xattr (ASCII/binary, any archive prefix, etc.)?

            nrutman Nathan Rutman added a comment -

            +1 for the UnlinkedArchived directory. The only files that live there will be ones that were archived, so presumably not your thousands of scratch files, but rather only ones you wanted to keep. (This also seems a yummy way to implement undelete, if one were to track the path somehow.) Mainly it means that you don't have to know ahead of time what format (or even what EA) the backend may be using to track its IDs.
            I'll throw in another thought - it would be nice to send a tombstone request to the coordinator queue at every unlink. This would allow the copytool to do its thing without depending on Robinhood. E.g. the copytool could delete the archive copy, or could put it on a delayed-delete list, etc. This has all the same problems (still need to know the backend id mapping), except that presumably the reaction time will be fast, with no "pending for a week" issues. It also starts moving away from RBH dependence, which IMHO is a good thing.

            rread Robert Read added a comment -

            I agree adding it to the changelog is more compact, but the advantage of this approach is that it decouples how the external HSM metadata is stored from Lustre internals, and provides more flexibility for the HSM tools.

            A week was just a suggestion. Obviously the TTL should be tunable and default to not retaining them at all. If the system is working properly then files should only be retained long enough to process the queue, and if things are not working properly then the directory could be flushed.

            adilger Andreas Dilger added a comment -

            If the timeout for these records is a week, then I don't think it is practical to keep this in unlinked inodes in the PENDING directory. Otherwise, there may be far too many inodes created and deleted in that period and PENDING may get too large. In that case I think it is more practical to store the UUID into the ChangeLog record.

            In newer releases it is possible to add extensible fields to ChangeLog records as needed, and the lifetime of those records will be exactly as needed. They will only consume a few bytes in a block in the log, and not an inode or an increase in the size of the PENDING directory.
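
            As a sketch only (nothing like this exists in Lustre today): an optional, length-prefixed field carried with the CL_UNLINK record, in the spirit of the extensible changelog fields mentioned above, might look roughly as follows. Both the flag and the struct are hypothetical.

            #include <stdint.h>

            /* Hypothetical cr_flags bit and extension record that would carry
             * the opaque archive identifier(s) with a CL_UNLINK changelog
             * record.  Neither CLF_ARCHIVE_ID nor this struct exists in Lustre. */
            #define CLF_ARCHIVE_ID  0x0080

            struct changelog_ext_archive_id {
                    uint32_t cr_archive_id;   /* HSM archive number */
                    uint32_t cr_key_len;      /* length of the opaque key below */
                    char     cr_key[0];       /* copytool-defined blob, e.g. the
                                               * JSON replica list proposed above */
            };
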
            rread Robert Read added a comment -

            I agree, the external key should be opaque data, interpreted by the data mover associated with the archive ID for that file.

            Getting back to the original intention of this ticket, no matter how or where the key is stored, we still need to ensure the data is available after a file has been deleted. The original proposal here was to add the key to the changelog. Another option is to retain deleted inodes with the archived flag set in a pending directory (much like what is currently done for open-unlinked and migrated files). The data mover would be able to access the extended attributes directly using the FID, and since the remove operation already clears the archive flag, a periodic garbage collector could detect which inodes can be removed safely. There could also be a timeout (a week?) to clean up old files regardless of the archive flag, just to ensure they don't collect indefinitely.
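
            A rough sketch of the garbage-collection pass described above, assuming a hypothetical pending directory: the path, the one-week TTL, and the policy are placeholders; it uses llapi_hsm_state_get() to check whether the archived flag has been cleared before unlinking.

            #include <dirent.h>
            #include <limits.h>
            #include <stdio.h>
            #include <time.h>
            #include <unistd.h>
            #include <sys/stat.h>
            #include <lustre/lustreapi.h>

            /* Both values are placeholders for this sketch. */
            #define PENDING_DIR "/mnt/lustre/.lustre/UnlinkedArchived"
            #define GC_TTL      (7 * 24 * 3600)

            static void gc_pending(void)
            {
                    char path[PATH_MAX];
                    struct hsm_user_state hus;
                    struct stat st;
                    struct dirent *d;
                    DIR *dir = opendir(PENDING_DIR);

                    if (dir == NULL)
                            return;

                    while ((d = readdir(dir)) != NULL) {
                            if (d->d_name[0] == '.')
                                    continue;
                            snprintf(path, sizeof(path), "%s/%s",
                                     PENDING_DIR, d->d_name);

                            /* Safe to remove once the copytool has cleared the
                             * archived flag, or after the TTL has expired. */
                            if (llapi_hsm_state_get(path, &hus) == 0 &&
                                !(hus.hus_states & HS_ARCHIVED))
                                    unlink(path);
                            else if (stat(path, &st) == 0 &&
                                     time(NULL) - st.st_mtime > GC_TTL)
                                    unlink(path);
                    }
                    closedir(dir);
            }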

            fzago Frank Zago (Inactive) added a comment -

            IMO the UUID should be stored as an opaque binary array. If it is ASCII, then it limits its format (or length), and the tools have to do back-and-forth conversions like is currently done with FIDs.

            YAML output is nice, but there isn't a decent library to read/extract it in C. I'd prefer JSON (which one can still parse as YAML) or XML. Due to YAML's complexity, the Python YAML parser is also slower than the JSON parser. Or we could have an --xml / --json / --yaml option to lfs, allowing the user to choose.
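
            To make the comparison concrete, the strawman replica list from the description can be walked in a few lines of C with a common JSON library (json-c in this sketch); the JSON layout is the proposal above, not an agreed format.

            #include <stdio.h>
            #include <json-c/json.h>

            /* Print each endpoint/id pair from the proposed "replicas" list. */
            static void print_replicas(const char *ea_value)
            {
                    struct json_object *root, *replicas, *rep, *endpoint, *id;
                    size_t i;

                    root = json_tokener_parse(ea_value);
                    if (root == NULL)
                            return;

                    if (json_object_object_get_ex(root, "replicas", &replicas)) {
                            for (i = 0; i < json_object_array_length(replicas); i++) {
                                    rep = json_object_array_get_idx(replicas, i);
                                    if (json_object_object_get_ex(rep, "endpoint", &endpoint) &&
                                        json_object_object_get_ex(rep, "id", &id))
                                            printf("%s -> %s\n",
                                                   json_object_get_string(endpoint),
                                                   json_object_get_string(id));
                            }
                    }
                    json_object_put(root);
            }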

            adilger Andreas Dilger added a comment -

            To continue my previous comments, it would be possible to store multiple UUIDs (or whatever we want to call them) in a single struct lov_hsm_attr_v2, like objects in a lov_mds_md, if the count of such entries is also stored. They would have to be in the same archive.

            I would hope that if we are adding a new xattr format we would just use the same lov_hsm_attr_v2 struct (whatever we decide it to look like) for the existing "hsm" xattr until composite layouts are ready, to avoid having three different ways to store this data. Depending on release schedules it may be that they are ready in the same release and we don't need to handle both access styles.
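
            Purely as an illustration of the counted-entries idea above: the struct name comes from the comment, but the field layout is a guess, not an agreed on-disk format.

            #include <stdint.h>

            struct lov_hsm_entry {
                    uint32_t lhe_key_len;             /* bytes used in lhe_key */
                    char     lhe_key[64];             /* opaque backend identifier */
            };

            struct lov_hsm_attr_v2 {                  /* hypothetical layout */
                    uint32_t lha_magic;
                    uint32_t lha_flags;               /* exists, archived, ... */
                    uint32_t lha_archive_id;          /* shared by all entries */
                    uint32_t lha_entry_count;         /* number of entries below */
                    struct lov_hsm_entry lha_entries[0];
            };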

            adilger Andreas Dilger added a comment -

            My initial thought wouldn't be to store multiple UUIDs per component, but rather to store each archive copy in a separate component, possibly expanding the lov_hsm_attrs_v2 to store an "archive date" so that this could be used for storing multiple versions of the file (in-filesystem versions would store the timestamps on the OST objects as they do now). That makes archive copies and in-filesystem copies more alike.

            The main difference, besides performance, would be that we can't randomly update the archive data copy, though we could do clever things like create new components for parts of the file being written, so long as they are block aligned.

            rread Robert Read added a comment -

            Will it be possible to support multiple hsm sub_layouts per component?

            UUID has a specific meaning and not all the identifiers will be UUIDs, so the field should be a bit more generic, such as hsm_identifier or hsm_data. (I know we've excessively abused "UUID" in Lustre since forever, but there's no reason to continue doing that.)

            YAML output is great, but I'd expect copytools would be using the API to retrieve the layout data and update the identifiers.

            Note, we'll still need to use user xattrs to store UUIDs until this work is completed, so the original idea here would still be a useful interim solution.

            adilger Andreas Dilger added a comment -

            For composite layout access by userspace, "lfs getstripe" will be updated as part of the PFL project to output composite layouts in YAML format, so this can be consumed directly by user tools if desired, something like below (still open to suggestions on this):

            $ lfs getstripe -v /mnt/lustre/file
            "/mnt/lustre/file":
              fid: "[0x200000400:0x2c3:0x0]"
              composite_header:
                composite_magic: 0x0BDC0BD0
                composite_size:  536
                composite_gen:   6
                composite_flags: 0
                component_count: 3
              components:
                - component_id:     2
                  component_flags:  stale, version
                  component_start:  0
                  component_end:    18446744073709551615
                  component_offset: 152
                  component_size:   48
                  sub_layout:
                    hsm_magic:      0x45320BD0
                    hsm_flags:      [ exists, archived ] 
                    hsm_arch_id:    1
                    hsm_arch_ver:   0xabcd1234
                    hsm_uuid_len:   16
                    hsm_uuid:      e60649ac-b4e3-453f-88c7-611e78c38d5a
                - component_id:     3
                  component_flags:  0
                  component_start:  20971520
                  component_end:    216777216
                  component_offset: 208
                  component_size:   144
                  sub_layout:
                    lmm_magic:        0x0BD30BD0
                    lmm_pattern:      1
                    lmm_stripe_size:  1048576
                    lmm_stripe_count: 4
                    lmm_stripe_index: 0
                    lmm_layout_gen:   0
                    lmm_layout_pool: flash
                    lmm_obj:
                      - 0: { lmm_ost: 0, lmm_fid: "[0x100000000:0x2:0x0]" }
                      - 1: { lmm_ost: 1, lmm_fid: "[0x100010000:0x3:0x0]" }
                      - 2: { lmm_ost: 2, lmm_fid: "[0x100020000:0x4:0x0]" }
                      - 3: { lmm_ost: 3, lmm_fid: "[0x100030000:0x4:0x0]" }
                - component_id:     4
                  component_flags:  0
                  component_start:  3355443200
                  component_end:    3367108864
                  component_offset: 352
                  component_size:   144
                  sub_layout:
                    lmm_magic:        0x0BD30BD0
                    lmm_pattern:      1
                    lmm_stripe_size:  4194304
                    lmm_stripe_count: 4
                    lmm_stripe_index: 5
                    lmm_pool:         flash
                    lmm_layout_gen:   0
                    lmm_obj:
                      - 0: { lmm_ost: 5, lmm_fid: "[0x100050000:0x2:0x0]" }
                      - 1: { lmm_ost: 6, lmm_fid: "[0x100060000:0x2:0x0]" }
                      - 2: { lmm_ost: 7, lmm_fid: "[0x100070000:0x3:0x0]" }
                      - 3: { lmm_ost: 0, lmm_fid: "[0x100000000:0x3:0x0]" }
            

            This describes a file that was originally written (as a normal RAID-0 file), then archived (creating component_id #2 on the same file), and then two disjoint parts of the file (offsets at 21MB and 3.3GB) were read back in from tape to create component_id's #3 and #4. The actual policy decisions of when to read in partial files is up to the policy engine and copytool, and outside the scope of the on-disk format.


            People

              Assignee: wc-triage WC Triage
              Reporter: rread Robert Read
              Votes: 0
              Watchers: 19
