  1. Lustre
  2. LU-17296

NRS TBF default rules

Details

    • Improvement
    • Resolution: Unresolved
    • Major
    • Lustre 2.17.0
    • Lustre 2.16.0

    Description

      One issue with Lustre in highly-contended multi-application workloads is the "noisy neighbor" problem, where one application's IO can be negatively affected by another application's IO that is submitted at the same time.

      This particularly affects users on login nodes who are running interactive workloads (e.g. "ls -l" or compiling applications). These workloads submit only a few RPCs at a time, which can be blocked behind thousands of RPCs from a large job running on hundreds of nodes.

      It is possible to create NRS TBF (Token Bucket Filter) rules that limit the RPC processing rate on the servers, but most users do not configure any TBF rules, and so suffer from this issue.

      It would be useful to create a set of default TBF rules that could be applied to all systems, either at installation time or afterward, that provide a "best practice" result for a wide variety of use cases. It should be possible to override the default rules, but for many situations the default should avoid the majority of imbalance between jobs.

      For example, applying a very high token limit (e.g. 1M) to jobs by GID, UID, or JobID should not constrain a job's RPC processing rate when there is no contention. When there is IO contention on a server, however, it should balance the RPC rate fairly between jobs, instead of using FIFO order, which often blocks the processing of small (e.g. interactive user) RPCs.
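
      The intended effect can be illustrated with a toy model. The sketch below is not Lustre's scheduler: it assumes a single server thread with a fixed 1 ms service time per RPC, one token bucket per job, and a server that always serves the backlogged job whose next token is due first. The names fifo_schedule, tbf_schedule, and last_position are hypothetical.

```python
import heapq

NS_PER_SEC = 1_000_000_000


def fifo_schedule(backlog):
    """Serve RPCs strictly in submission order, one job's backlog after another."""
    return [job for job, count in backlog for _ in range(count)]


def tbf_schedule(backlog, rate, service_ns=1_000_000):
    """Toy per-job token bucket (an assumption, not Lustre's implementation).

    backlog: list of (job, rpc_count) pairs, one entry per job.
    rate: tokens per second granted to every job.
    Returns (service order, finish time in nanoseconds).
    """
    interval_ns = NS_PER_SEC // rate          # time between tokens for one job
    remaining = dict(backlog)
    # Heap entries are (next token deadline, tie-breaker, job).
    heap = [(0, seq, job) for seq, (job, count) in enumerate(backlog) if count > 0]
    heapq.heapify(heap)
    now_ns, order = 0, []
    while heap:
        deadline_ns, seq, job = heapq.heappop(heap)
        # Idle until the token is available, then spend service_ns on the RPC.
        now_ns = max(now_ns, deadline_ns) + service_ns
        order.append(job)
        remaining[job] -= 1
        if remaining[job]:
            heapq.heappush(heap, (deadline_ns + interval_ns, seq, job))
    return order, now_ns


def last_position(order, job):
    """1-based position at which the last RPC of `job` is served."""
    return len(order) - order[::-1].index(job)


if __name__ == "__main__":
    backlog = [("batch", 1000), ("login", 3)]   # batch job submitted first
    print(last_position(fifo_schedule(backlog), "login"))   # 1003
    order, _ = tbf_schedule(backlog, rate=1_000_000)
    print(last_position(order, "login"))                     # 6
```

      In this model the three login RPCs are all served by position 6 instead of position 1003. A lone job of 1000 RPCs finishes in the same one second with a 1M limit as it would with no limit, whereas a limit of 100 would stretch it to about ten seconds.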

      Adding a default JobID rule with a higher TBF allocation for jobs on login nodes (e.g. with "*login*" in the job name) would also help boost interactive performance, and would at least have a reasonable chance of working out of the box. If it does not, the documentation could make clear that this rule needs to be customized on deployment.
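
      As a rough illustration of what such a login rule might look like, the sketch below builds the corresponding lctl commands as strings. The rule syntax follows my reading of the NRS TBF section of the Lustre Operations Manual and should be checked against the installed release. The helper name login_boost_commands, the rule name login_boost, the rate of 2000000, and the choice of the ost.OSS.ost_io service are all assumptions made for illustration. It also assumes that the JobIDs reported by login-node clients actually contain "login".

```python
def login_boost_commands(pattern="*login*", rate=2_000_000,
                         service="ost.OSS.ost_io", rule="login_boost"):
    """Return lctl commands that enable TBF by JobID and add one boosted rule.

    All defaults are hypothetical values, not recommendations from the ticket.
    """
    return [
        # Switch the service's NRS policy to TBF, classifying RPCs by JobID.
        f'lctl set_param {service}.nrs_policies="tbf jobid"',
        # Start a rule matching any JobID that fits the glob pattern.
        f'lctl set_param {service}.nrs_tbf_rule='
        f'"start {rule} jobid={{{pattern}}} rate={rate}"',
    ]


if __name__ == "__main__":
    for command in login_boost_commands():
        print(command)
```

      Generating the commands rather than hard-coding them keeps the pattern and rate easy to customize per site, which matches the expectation that the default may need adjusting on deployment.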

            People

              wc-triage WC Triage
              adilger Andreas Dilger
