
HSM coordinator causes cdt_llog_lock starvation

Details

    • Improvement
    • Resolution: Unresolved
    • Major

    Description

      We have seen a few occurrences in the field where a high HSM load caused a significant drop in HSM performance, to the point that the HSM subsystem became almost unusable; this was accompanied by MDT thread hangs and the MDT becoming unresponsive. Changing various parameters of the overall HSM stack usually restores somewhat usable functionality and lets us keep handling the HSM load at reduced rates, but this is a workaround rather than a fix.

      This happens because the HSM coordinator thread holds cdt_llog_lock in write mode while accessing the HSM actions llog. If the work performed under the lock takes too long, other MDT threads that also need cdt_llog_lock to access the actions llog (e.g. to mark an HSM action as complete, add a new action, or get a list of current actions) cannot make progress and become unresponsive.

      In the longer term, a good solution might be to replace the use of the llog in the HSM subsystem with a different mechanism (e.g. LU-10699), or otherwise optimize the relevant paths. As a short-term solution, the HSM coordinator thread could simply release cdt_llog_lock and then reacquire it each time a certain number of HSM requests has been handled, giving other MDT threads that are handling HSM requests an opportunity to make progress.

          Activity

            [LU-18201] HSM coordinator causes cdt_llog_lock starvation

            nangelinas Nikitas Angelinas added a comment - We decided to address the HSM queue scalability issues through a different set of patches; I am not closing this ticket yet, in case we want to use it to track that work as well.

            nangelinas Nikitas Angelinas added a comment - This patch implements the change in the description; as mentioned, it is only intended as a short-term solution. We have deployed it at at least one customer site, which reported that the patch helped alleviate the issues mentioned in the description that were caused by the high HSM load.

            "Nikitas Angelinas <nikitas.angelinas@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/56240
            Subject: LU-18201 hsm: yield cdt_llog_lock in coordinator thread
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 1b7f176c4aa107a64d534d73b21bdbe24a0f9674


            People

              nangelinas Nikitas Angelinas
              nangelinas Nikitas Angelinas