[LU-9540] cache HSM actions in memory by FID Created: 19/May/17  Updated: 08/Jun/20

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: John Hammond Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
is related to LU-13651 Conditionally skip finding compatible... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The time required to submit new HSM requests grows linearly with the size of the HSM actions log. This is because we first run hsm_find_compatible() on the HAL to see if there are already existing actions running or waiting on the files in the HAL. We should avoid this by keeping an in-memory such cache of actions indexed by FID.



 Comments   
Comment by Nathan Rutman [ 23/May/20 ]

This seems like vast overkill to a dubious function -
hsm_find_compatible can simply be skipped for everything except CANCEL.
Nikitas is working on a patch.

Comment by John Hammond [ 27/May/20 ]

> hsm_find_compatible can simply be skipped for everything except CANCEL.

hsm_find_compatible() is used to deduplicate restore requests (from multiple clients, say) for a given file.

Comment by Nikitas Angelinas [ 27/May/20 ]

jhammond, another developer within HPE created a patch to skip calling hsm_find_compatible_cb() when all requests in a hal are archives or restores; we are trying to determine if this is safe to do. The CDT will set HS_EXISTS before sending an archive request to a copytool in mdt_hsm_agent_send ()->mdt_hsm_add_hal(), so an archive request for the same file but for a different archive backend will not be sent as mdt_hsm_agent_send()->mdt_hsm_is_action_compat() will return false; the duplicate request will be failed in the llog and should be removed by the CDT thread later due to timeout. I think by changing the checks in the latter function to also return false if the archive ids are the same unless HS_DIRTY is set should serialize archive requests in all cases? Restore requests seem to take the layout lock on the file in mdt_hsm_register_hal()->cdt_restore_handle_add() before being added to the llog so I think this should serialize them as well, although it blocks the caller, e.g. lfs, so I am not sure if it's ideal.

It already seems possible to have duplicate archive requests added to the llog or duplicate restores racing for the layout lock without applying the aforementioned patch as multiple RPCs can race in mdt_hsm_register_hal(), so I think the patch wouldn't be introducing a new race but making the existing one more likely. Implementing the cache in this ticket sounds like a more elegant solution, but do you think skipping the call to hsm_find_compatible_cb() in the way described would be safe? We can submit the patch to Gerrit if needed, of course.

Comment by Nikitas Angelinas [ 08/Jun/20 ]

jhammond, I have submitted the aforementioned patch at https://review.whamcloud.com/#/c/38867/.

Generated at Sat Feb 10 02:27:05 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.