Lustre / LU-18201

HSM coordinator causes cdt_llog_lock starvation

Details

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Major

    Description

      We have seen a few occurrences in the field where a high HSM load caused HSM performance to drop so far that the HSM subsystem became almost unusable, accompanied by MDT thread hangs and an unresponsive MDT. We can usually restore partial functionality and keep handling the HSM load at reduced rates by tuning various parameters of the overall HSM stack, but this is a workaround rather than a fix.

      This happens because the HSM coordinator thread holds cdt_llog_lock in write mode while accessing the HSM actions llog. If the work performed under the lock takes too long, other MDT threads that also need cdt_llog_lock to access the actions llog (e.g. to mark an HSM action as complete, add a new action, or list the current actions) cannot make progress and become unresponsive.

      In the longer term, a good solution might be to replace the llog in the HSM subsystem with a different mechanism (e.g. LU-10699), or otherwise to optimize the relevant paths. As a short-term solution, the HSM coordinator thread could simply release cdt_llog_lock once a certain number of HSM requests have been handled and then reacquire it, giving other MDT threads that are handling HSM requests the opportunity to make progress.

            People

              nangelinas Nikitas Angelinas