Details
-
Bug
-
Resolution: Unresolved
-
Minor
Description
HSM tools currently store an external identifier (such as a UUID) in an extended attribute (EA) when a file is archived. The identifier is used to identify the file in the backend archive, and there may be more than one identifier if the file has been archived to multiple backends. Currently, different tools do this independently and do not coordinate their EA names or formats.
When a file is deleted, the EA is no longer available, so it would be helpful to include the identifier(s) in the Delete changelog record. I suggest we define a standard name and format for the HSM archive EA, and that this data be included as-is in the delete changelog record.
One possible format would be to use JSON to encode a list of endpoint and archive ID pairs. Here is a strawman example to begin the discussion:
{ "replicas": [ { "endpoint" : "s3://my-bucket/archve", "id": "UUID" }, { "endpoint" : "wos://address", "id": "OID" } ] }
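To make the strawman concrete, here is a minimal Python sketch of how a tool might serialize and parse such an EA value. The function names are hypothetical, and it assumes archive IDs are treated as opaque strings and that the value is stored as compact UTF-8 JSON; none of this is an agreed format.

```python
import json

def encode_archive_ea(replicas):
    """Serialize (endpoint, archive_id) pairs into the strawman JSON layout.

    Assumption: IDs are stored as opaque strings, whatever the backend uses.
    """
    doc = {"replicas": [{"endpoint": ep, "id": str(aid)} for ep, aid in replicas]}
    # Compact separators keep the EA value small.
    return json.dumps(doc, separators=(",", ":")).encode("utf-8")

def decode_archive_ea(raw):
    """Parse an EA value back into a list of (endpoint, archive_id) pairs."""
    doc = json.loads(raw.decode("utf-8"))
    return [(r["endpoint"], r["id"]) for r in doc["replicas"]]

# Endpoints copied verbatim from the strawman; "UUID" and "OID" are placeholders.
ea_value = encode_archive_ea([
    ("s3://my-bucket/archve", "UUID"),
    ("wos://address", "OID"),
])
replicas = decode_archive_ea(ea_value)
```

Because the same bytes would be copied as-is into the delete changelog record, a changelog consumer could reuse the decoder unchanged.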
Alternatively, to save space the endpoint could just be an index that refers to a specific endpoint in the local configuration.
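A rough Python sketch of the index-based variant follows. The `ENDPOINTS` list stands in for a local configuration and is purely hypothetical, as are the function names; the point is only to show that the EA then carries a small integer instead of the full endpoint string.

```python
import json

# Hypothetical local configuration: the list position is the endpoint index.
ENDPOINTS = ["s3://my-bucket/archve", "wos://address"]

def encode_compact(replicas, endpoints=ENDPOINTS):
    """Encode (endpoint, archive_id) pairs, replacing each endpoint by its index."""
    doc = {"replicas": [{"endpoint": endpoints.index(ep), "id": aid}
                        for ep, aid in replicas]}
    return json.dumps(doc, separators=(",", ":"))

def resolve_compact(raw, endpoints=ENDPOINTS):
    """Decode a compact value and map each index back to its endpoint string."""
    doc = json.loads(raw)
    return [(endpoints[r["endpoint"]], r["id"]) for r in doc["replicas"]]

compact = encode_compact([("s3://my-bucket/archve", "UUID"), ("wos://address", "OID")])
resolved = resolve_compact(compact)
```

One trade-off with this variant is that the index is only meaningful alongside the configuration that defined it, so anything consuming the delete changelog record would need access to the same (stable) endpoint list.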
My initial thought wouldn't be to store multiple UUIDs per component, but rather to store each archive copy in a separate component, possibly expanding the lov_hsm_attrs_v2 to store an "archive date" so that this could be used for storing multiple versions of the file (in-filesystem versions would store the timestamps on the OST objects as they do now). That makes archive copies and in-filesystem copies more alike.
The main difference, besides performance, would be that we can't randomly update the archive data copy, though we could do clever things like create new components for parts of the file being written, so long as they are block aligned.
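As an illustration of the block-alignment constraint, the sketch below rounds a written byte range outward to block boundaries, which is the extent a new component would have to cover. The function name and the 4096-byte block size in the usage line are assumptions for illustration only, not anything defined by the layout code.

```python
def aligned_extent(offset, length, block_size):
    """Return the half-open range [start, end) that covers the written bytes
    [offset, offset + length), rounded outward to multiples of block_size."""
    if offset < 0 or length <= 0 or block_size <= 0:
        raise ValueError("offset must be >= 0; length and block_size must be > 0")
    start = (offset // block_size) * block_size
    # Ceiling division without floats: round the end up to the next boundary.
    end = -(-(offset + length) // block_size) * block_size
    return start, end

# A 100-byte write at offset 5000 with a hypothetical 4096-byte block size.
start, end = aligned_extent(5000, 100, 4096)
```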