  1. Lustre
  2. LU-11192

console warning for TBF on client

Details

    • New Feature
    • Resolution: Unresolved
    • Minor

    Description

      The I/O rates of users can be throttled by the TBF NRS policy.
      However, normal users have no way to know whether low performance
      is caused by congestion of the system itself or by a TBF rule
      added by the administrator. They have no information about how
      the TBF NRS policy is affecting their I/O rate, and thus cannot
      react to the QoS situation.

      To solve the problem, this patch prints warning messages to the
      user's console, just as the quota mechanism does. When the I/O of
      an application matches a TBF rule on the server side, a message is
      printed to the console of the process on the client side. The
      message includes the rule name, the classification of the RPC, the
      start/end time of the QoS period, and whether further limitation
      will be enforced later if more thresholds are reached.
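
      As an illustration of the fields listed above, the following is a
      minimal Python sketch of how such a notice could be assembled. The
      QosNotice type, the format_notice function and the message wording
      are all hypothetical; the actual message text printed by the patch
      is not specified in this description.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class QosNotice:
    """Hypothetical container for the fields the description lists."""
    rule_name: str                      # name of the matching TBF rule
    rpc_class: str                      # classification of the RPC
    start_epoch: Optional[int] = None   # QoS period start, seconds since epoch
    end_epoch: Optional[int] = None     # QoS period end, seconds since epoch
    further_throttle: bool = False      # more limitation may follow


def _fmt(epoch: int) -> str:
    """Render an epoch timestamp as a UTC date string."""
    dt = datetime.fromtimestamp(epoch, tz=timezone.utc)
    return dt.strftime("%Y-%m-%d %H:%M:%S UTC")


def format_notice(n: QosNotice) -> str:
    """Build one console line; the wording is illustrative only."""
    parts = [f"I/O throttled by TBF rule '{n.rule_name}' (class: {n.rpc_class})"]
    if n.start_epoch is not None:
        parts.append(f"QoS period started {_fmt(n.start_epoch)}")
    if n.end_epoch is not None:
        parts.append(f"QoS period ends {_fmt(n.end_epoch)}")
    if n.further_throttle:
        parts.append("further limitation may be enforced if I/O is not reduced")
    return "; ".join(parts)
```

      The time fields are optional in this sketch because the description
      only promises them when the corresponding rule options are set.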

      This new feature assumes that a centralized tool will be used to
      monitor the I/O performance of the whole file system and manage the
      global configuration of TBF rules on all Lustre services. That tool
      will aggregate the I/O throughput (or metadata operations) for each
      user (or job/client/group etc.) during a given time period. When a
      user has reached a global I/O throughput threshold since the start
      of the QoS period, TBF limitations will be enforced on the whole
      Lustre file system. The tool will configure TBF rules with enough
      information about this QoS decision. Two new options have been
      added to the TBF rule for this purpose: "start_epoch" and
      "end_epoch". If these two options are configured in a TBF rule, the
      printed message will notify the users of the start (or end) time of
      the QoS period.
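
      The aggregation step of such a centralized tool could look roughly
      like the Python sketch below. The tool itself is outside the scope
      of this ticket, so the sample format, the function names and the
      dictionary layout are assumptions; only the "start_epoch" and
      "end_epoch" option names come from the description. The sketch
      deliberately stops short of the actual rule syntax, which is not
      given here.

```python
from collections import defaultdict


def users_over_threshold(samples, period_start, period_end, threshold_bytes):
    """Sum bytes per user inside [period_start, period_end) and return
    the users whose total reached the threshold.

    samples is an iterable of (timestamp, user, nbytes) tuples; this
    input format is an assumption made for the example.
    """
    totals = defaultdict(int)
    for ts, user, nbytes in samples:
        if period_start <= ts < period_end:
            totals[user] += nbytes
    return {u: t for u, t in totals.items() if t >= threshold_bytes}


def rule_for(user, period_start, period_end, rate):
    """Describe a TBF rule for one user as a plain dict (hypothetical
    layout), carrying the QoS period in start_epoch and end_epoch."""
    return {
        "name": f"qos_{user}",
        "match": user,
        "rate": rate,
        "start_epoch": period_start,
        "end_epoch": period_end,
    }
```

      For example, with a one-hour period and a 1000-byte threshold, a
      user who wrote 600 and then 500 bytes inside the period is
      selected, while I/O recorded outside the period is ignored.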

      There could be multiple QoS thresholds, e.g. one soft threshold and
      one hard threshold. When a user reaches the soft threshold, the RPC
      rate of the user will be reduced to a slightly lower value. If the
      user keeps doing a lot of I/O and finally reaches the hard
      threshold, the global management tool might decide to enforce a
      very strict RPC limitation. Thus, another option has been added to
      the TBF rule: "further_throttle". If this option is configured in a
      TBF rule, the printed message will notify the users that they need
      to slow down their I/O rate until the end of this QoS period;
      otherwise, further limitation will be enforced as a penalty.
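
      The soft/hard decision can be sketched in Python as follows. This
      is an assumed policy for a hypothetical management tool, not code
      from the patch: the function name, its parameters and the choice to
      set "further_throttle" only between the two thresholds are
      illustrative.

```python
def choose_limit(used, soft, hard, soft_rate, hard_rate):
    """Return (rate, further_throttle) for a user's usage in the
    current QoS period, or (None, False) when no rule is needed.

    used, soft and hard share one unit (e.g. bytes); the rates are RPC
    rates for the TBF rule. All names here are hypothetical.
    """
    if soft > hard:
        raise ValueError("soft threshold must not exceed hard threshold")
    if used >= hard:
        # Hard threshold reached: strict limit, nothing further to warn about.
        return hard_rate, False
    if used >= soft:
        # Soft threshold reached: mild limit, warn that more may follow.
        return soft_rate, True
    return None, False
```

      With soft=100 and hard=200, a usage of 150 yields the soft rate
      with the warning flag set, and a usage of 200 yields the hard rate
      without it.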

      To avoid flooding the user's console with messages, the same
      message can only be printed to a console again after a time
      interval. The interval can be tuned through the
      "qos_message_interval" parameter of the ptlrpc module.

      In order to know whether a message has already been printed to a
      console, a history of messages is kept in a hash table. The size of
      the hash table can be configured through the
      "qos_message_history_size" parameter of the ptlrpc module.
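
      The interval check and the bounded history can be illustrated
      together with a small Python sketch. It is not the ptlrpc
      implementation: the MessageHistory class, the (console, message)
      key and the evict-oldest policy are assumptions, and interval_s and
      max_entries merely stand in for "qos_message_interval" and
      "qos_message_history_size".

```python
import time
from collections import OrderedDict


class MessageHistory:
    """Remember when each (console, message) pair was last printed."""

    def __init__(self, interval_s, max_entries, clock=time.monotonic):
        self.interval_s = interval_s    # minimum gap between repeats
        self.max_entries = max_entries  # upper bound on remembered pairs
        self._clock = clock             # injectable for testing
        self._last = OrderedDict()      # key -> time of last print

    def should_print(self, console_id, message_key):
        """Return True and record the print if the pair is not being
        suppressed; return False if it was printed too recently."""
        key = (console_id, message_key)
        now = self._clock()
        last = self._last.get(key)
        if last is not None and now - last < self.interval_s:
            return False
        self._last[key] = now
        self._last.move_to_end(key)
        # Assumed policy: drop the oldest entries once the table is full.
        while len(self._last) > self.max_entries:
            self._last.popitem(last=False)
        return True
```

      One consequence of bounding the history this way is that an evicted
      entry is forgotten, so its message may be printed again before the
      interval has elapsed; a larger history size reduces that effect.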

            People

              lixi_wc Li Xi