Details
-
Improvement
-
Resolution: Unresolved
-
Major
Description
We have seen a few occurrences in the field where high HSM load caused a significant drop in HSM performance, to the point that the HSM subsystem became almost unusable. These incidents were accompanied by MDT thread hangs and the MDT becoming unresponsive. We can usually restore partial functionality and continue handling the HSM load at a reduced rate by tuning various parameters of the overall HSM stack, but this is a workaround rather than a fix.
This happens because the HSM coordinator thread holds cdt_llog_lock in write mode while accessing the HSM actions llog. If the work performed while holding the lock takes too long, other MDT threads that also need cdt_llog_lock to access the actions llog (e.g. to mark an HSM action as complete, add a new action, or list current actions) cannot make progress and become unresponsive.
In the longer term, a good solution might be to replace the use of the llog in the HSM subsystem with a different mechanism (e.g. LU-10699), or otherwise to optimize the relevant paths. As a short-term solution, the HSM coordinator thread could simply release cdt_llog_lock after handling a certain number of HSM requests and then reacquire it, giving other MDT threads that are handling HSM requests an opportunity to make progress.
Issue Links
- is related to LU-18498 change HSM locking for llog operations (Open)