  1. Lustre
  2. LU-20205

canceld cpubound in hpreq_check with many locks

Details

    • Improvement
    • Resolution: Unresolved
    • Medium
    • Lustre 2.18.0

    Description

      Currently, when sending cancel requests, we pack as many lock handles as possible into each request with ELC (early lock cancel); for cancel requests, for example, that is over 600 handles.

      Combine that with LU-20204 (an extremely large number of locks on the server) and the fact that cancel also has an "hpreq" handler, which iterates over all the provided cancel handles to see whether any of them qualifies the request for high-priority status. We easily get into a situation where, before any useful processing can even start, we need to iterate over 600000 linked-list entries.

      That is of course not sustainable, but fixing the hash tables might take time, as we absolutely need to make sure we do it right. In the meantime we need to consider other ways to relieve the problem.

      For example, a knob to limit how many lock handles we pack into a single ELC request would help limit the amount of CPU work, and it might come in handy for other things in the future.
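
      The effect of such a knob can be sketched as follows. This is only an illustration under stated assumptions, not Lustre code: handle_t, send_fn, send_cancels_capped, record_batch and the max_per_req parameter are all hypothetical names invented here, and the real packing logic and tunable interface may look quite different.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a lock handle cookie; not the Lustre type. */
typedef uint64_t handle_t;

/* Hypothetical callback that "sends" one cancel request with `count` handles. */
typedef void (*send_fn)(const handle_t *handles, size_t count, void *arg);

/*
 * Split `total` handles into requests that carry at most `max_per_req`
 * handles each (the assumed knob). Returns the number of requests sent.
 */
static size_t send_cancels_capped(const handle_t *handles, size_t total,
                                  size_t max_per_req, send_fn send, void *arg)
{
    size_t sent = 0;

    assert(max_per_req > 0);
    for (size_t off = 0; off < total; off += max_per_req) {
        size_t n = total - off;

        if (n > max_per_req)
            n = max_per_req;
        send(&handles[off], n, arg);
        sent++;
    }
    return sent;
}

/* Example callback: records batch sizes instead of sending anything. */
struct batch_stats {
    size_t requests;   /* number of requests "sent" */
    size_t handles;    /* total handles across all requests */
    size_t largest;    /* largest single request */
};

static void record_batch(const handle_t *handles, size_t count, void *arg)
{
    struct batch_stats *st = arg;

    (void)handles;
    st->requests++;
    st->handles += count;
    if (count > st->largest)
        st->largest = count;
}
```

      With a cap in place, the number of handles that a per-request pre-check has to walk is bounded by the cap rather than by whatever fits in the request buffer, at the cost of sending more requests.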

      Other types of requests that pack many lock handles could also be affected, but so far that has not been observed. Still, we should consider that case too.

            People

              wc-triage WC Triage
              green Oleg Drokin