[LU-13651] Conditionally skip finding compatible HSM requests Created: 08/Jun/20  Updated: 21/Jan/22  Resolved: 03/Nov/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.14.0

Type: Improvement Priority: Major
Reporter: Nikitas Angelinas Assignee: Nikitas Angelinas
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-9540 cache HSM actions in memory by FID Open
is related to LU-13920 HSM: hsm_actions are not processed af... Resolved
is related to LU-14793 HSM: record the index when scan HSM a... Resolved

 Description   

The HSM action queue is scanned linearly in hsm_find_compatible_cb() for existing requests on the same file, so that duplicate or conflicting requests are not added and cancel requests are assigned the correct cookie. However, access to the queue is locked for the duration of the search, so this can cause a large delay in adding new requests when the action queue is very large.

Scanning the queue also does not guarantee that duplicate or conflicting requests are not added. Scanning (in hsm_find_compatible_cb()) and adding requests (in mdt_agent_record_add()) are distinct operations that are not serialized by a lock, so there is a race window between these two calls in which duplicate or conflicting requests can still be added. This is hopefully not a big problem, though. The CDT thread will not send duplicate archive requests to a copytool serving a different HSM backend (and we could probably also prevent it from sending duplicate archive requests to a copytool serving the same backend with a small change in mdt_hsm_is_action_compat()). Duplicate restore requests are, as far as I know, effectively serialized because the layout lock on the file is taken before they are added to the action queue (although this blocks the caller, e.g. lfs, so I am not sure it is ideal).

Since calling hsm_find_compatible_cb() does not fully protect against this issue and can cause large delays in adding new requests, we skip calling it for all requests except cancel requests that do not specify a cookie (which should be all cancel requests in the current code). This is hopefully safe.
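A minimal C sketch of the rule described above. Every type and function name here is hypothetical (simplified stand-ins, not the actual Lustre structures or the code in the patch), and a cookie value of 0 is assumed to mean "not specified by the caller". Only a cancel request without a cookie pays for the linear scan, which picks up the cookie of a pending action on the same file. Every other request skips the scan. The comment at the append point marks where the check-then-add race window sits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified action types; not the real Lustre enums. */
enum sketch_action {
	SKETCH_ARCHIVE,
	SKETCH_RESTORE,
	SKETCH_REMOVE,
	SKETCH_CANCEL,
};

/* Hypothetical queue entry: a file id, the queued action and its cookie. */
struct sketch_record {
	uint64_t fid;
	enum sketch_action action;
	uint64_t cookie;
};

/* Assumption: a cookie of 0 means "not specified by the caller". */
static bool sketch_needs_scan(enum sketch_action action, uint64_t cookie)
{
	return action == SKETCH_CANCEL && cookie == 0;
}

/* Linear scan standing in for hsm_find_compatible_cb(): its cost grows with
 * the queue length, and in the real code the queue stays locked meanwhile.
 * Returns the cookie of the first non-cancel record for the same file, or 0
 * if none is found. */
static uint64_t sketch_find_cookie(const struct sketch_record *queue,
				   size_t len, uint64_t fid)
{
	assert(queue != NULL || len == 0);
	for (size_t i = 0; i < len; i++) {
		if (queue[i].fid == fid && queue[i].action != SKETCH_CANCEL)
			return queue[i].cookie;
	}
	return 0;
}

/* Hypothetical add path: returns the cookie the new request ends up with.
 * Only cancel requests without a cookie scan the queue; every other
 * request keeps the cookie it came with. */
static uint64_t sketch_add_request(const struct sketch_record *queue,
				   size_t len, uint64_t fid,
				   enum sketch_action action, uint64_t cookie)
{
	if (sketch_needs_scan(action, cookie))
		cookie = sketch_find_cookie(queue, len, fid);

	/* The record would be appended here, in a separate step from the scan
	 * above, so another request for the same file could slip in between
	 * (the race window described in this ticket). */
	return cookie;
}
```

For example, with a queue holding an archive of file 1 (cookie 100) and a restore of file 2 (cookie 101), sketch_add_request(q, 2, 2, SKETCH_CANCEL, 0) returns 101, while an archive request for file 1 returns whatever cookie it was given without touching the queue.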



 Comments   
Comment by Gerrit Updater [ 08/Jun/20 ]

Nikitas Angelinas (nikitas.angelinas@hpe.com) uploaded a new patch: https://review.whamcloud.com/38867
Subject: LU-13651 hsm: call hsm_find_compatible_cb() only for cancel
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 03d3665c1b872df2181c37638071b64381b4a14b

Comment by Richard Mansfield [ 13/Oct/20 ]

Are these patches ready to land?

Comment by Gerrit Updater [ 03/Nov/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38867/
Subject: LU-13651 hsm: call hsm_find_compatible_cb() only for cancel
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 9f1ef86ac3518dca6e567364e9a3b47fef3fada5

Comment by Peter Jones [ 03/Nov/20 ]

Landed for 2.14

Comment by Gerrit Updater [ 21/Jan/22 ]

"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/46255
Subject: LU-13651 hsm: call hsm_find_compatible_cb() only for cancel
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: c6142f8c133dfe17f3f8b45167b6022b5630f9e8

Generated at Sat Feb 10 03:03:03 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.