jhammond, another developer within HPE created a patch that skips the call to hsm_find_compatible_cb() when all requests in a hal are archives or restores, and we are trying to determine whether this is safe. The CDT sets HS_EXISTS before sending an archive request to a copytool in mdt_hsm_agent_send()->mdt_hsm_add_hal(), so an archive request for the same file but a different archive backend will not be sent, because mdt_hsm_agent_send()->mdt_hsm_is_action_compat() will return false. The duplicate request will be marked as failed in the llog and should later be removed by the CDT thread on timeout. I think that changing the checks in mdt_hsm_is_action_compat() to also return false when the archive IDs are the same, unless HS_DIRTY is set, should serialize archive requests in all cases, but I would appreciate confirmation. Restore requests seem to take the layout lock on the file in mdt_hsm_register_hal()->cdt_restore_handle_add() before being added to the llog, so I think this should serialize them as well. It does block the caller (e.g. lfs), though, so I am not sure it is ideal.
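To make the proposed archive check concrete, here is a rough standalone sketch in C. It is not the actual mdt_hsm_is_action_compat() code: the function name, the SK_ flag names, and their values are placeholders I made up for illustration, and it only models the HS_EXISTS / HS_DIRTY / archive ID part of the decision described above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder flag bits for this sketch only; they stand in for
 * HS_EXISTS and HS_DIRTY and are not taken from the Lustre headers. */
#define SK_HS_EXISTS 0x1u
#define SK_HS_DIRTY  0x2u

/*
 * Hypothetical model of the proposed rule for an archive request:
 *  - no HS_EXISTS: no archive is in flight or present, so allow it;
 *  - HS_EXISTS and a different archive ID: reject (behaviour described
 *    above for a request targeting another backend);
 *  - HS_EXISTS and the same archive ID: reject unless HS_DIRTY is set
 *    (the proposed additional check).
 */
bool sketch_archive_is_compat(uint32_t hsm_flags,
                              uint32_t file_archive_id,
                              uint32_t req_archive_id)
{
	if (!(hsm_flags & SK_HS_EXISTS))
		return true;

	if (req_archive_id != file_archive_id)
		return false;

	return (hsm_flags & SK_HS_DIRTY) != 0;
}
```

With this rule, a second archive request for a file that already has HS_EXISTS set is only accepted when it targets the same archive ID and the file is dirty, which is the serialization I am hoping for.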
It already seems possible to have duplicate archive requests added to the llog, or duplicate restores racing for the layout lock, even without the aforementioned patch, because multiple RPCs can race in mdt_hsm_register_hal(). So I think the patch would not introduce a new race, but would make the existing one more likely. Implementing the cache described in this ticket sounds like a more elegant solution, but do you think skipping the call to hsm_find_compatible_cb() in the way described would be safe? We can submit the patch to Gerrit if needed, of course.
jhammond, I have submitted the aforementioned patch at https://review.whamcloud.com/#/c/38867/.