Details
-
New Feature
-
Resolution: Unresolved
-
Minor
Description
The I/O rates of users can be throttled by the TBF NRS policy. However,
normal users have no way to know whether low performance is caused by
congestion in the system itself or by a TBF rule added by the
administrator. They have no information about how the TBF NRS policy
is affecting their I/O rate, and thus cannot react to the QoS
situation.
To solve the problem, this patch prints warning messages to the user's
console, just as the quota mechanism does. When the I/O of an
application matches a TBF rule on the server side, a message is
printed to the console of the process on the client side. The message
includes the rule name, the classification of the RPC, the start/end
time of the QoS period, and whether further limitation will be
enforced later if more thresholds are reached.
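The fields listed above can be illustrated with a small sketch. This is not the patch's code or its actual message text: the class name, field names, wording, and the assumption that the epochs are seconds since the Unix epoch are all hypothetical, chosen only to show what such a notice might carry.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class QosNotice:
    """Fields the description says the message carries (names are hypothetical)."""
    rule_name: str
    classification: str          # e.g. "uid=500"; the real format is not specified here
    start_epoch: Optional[int]   # assumed: seconds since the Unix epoch, None if unset
    end_epoch: Optional[int]
    further_throttle: bool


def _fmt(epoch: int) -> str:
    # Render in UTC so the output does not depend on the local time zone.
    return datetime.fromtimestamp(epoch, tz=timezone.utc).strftime(
        "%Y-%m-%d %H:%M:%S UTC")


def format_notice(n: QosNotice) -> str:
    """Build one console line from the notice (wording is illustrative only)."""
    parts = [f"I/O throttled by TBF rule '{n.rule_name}' ({n.classification})"]
    if n.start_epoch is not None:
        parts.append(f"QoS period started {_fmt(n.start_epoch)}")
    if n.end_epoch is not None:
        parts.append(f"ends {_fmt(n.end_epoch)}")
    if n.further_throttle:
        parts.append("further limitation may follow if the I/O rate is not reduced")
    return "; ".join(parts)
```

Calling `format_notice(QosNotice("soft_u500", "uid=500", 0, 86400, True))` yields a single line that names the rule, the classification, both period boundaries, and the further-throttle warning.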
This new feature assumes that a centralized tool is used to monitor
the I/O performance of the whole file system and manage the global
configuration of TBF rules on all Lustre services. That tool
aggregates the I/O throughput (or metadata operations) for each user
(or job, client, group, etc.) during a given time period. When a
user's I/O throughput since the start of the QoS period reaches a
global threshold, TBF limitations are enforced on the whole Lustre
file system. The tool configures TBF rules with enough information
about this QoS decision. Two new options have been added to the TBF
rule for this purpose: "start_epoch" and "end_epoch". If these two
options are configured in a TBF rule, the printed message notifies
the user of the start (or end) time of the QoS period.
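As a rough illustration of the aggregation step performed by the assumed centralized tool, the sketch below sums per-user I/O inside one QoS period and reports who has reached a global threshold. The sample layout, the function name, and the use of bytes and epoch seconds are assumptions made for the example; the ticket does not describe the tool's implementation.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (user, timestamp in epoch seconds, bytes transferred): a hypothetical sample format
Sample = Tuple[str, int, int]


def users_over_threshold(samples: Iterable[Sample],
                         start_epoch: int,
                         end_epoch: int,
                         threshold_bytes: int) -> List[str]:
    """Return users whose aggregate I/O within [start_epoch, end_epoch)
    has reached threshold_bytes."""
    totals: Dict[str, int] = defaultdict(int)
    for user, ts, nbytes in samples:
        # Only I/O that falls inside the QoS period counts toward the threshold.
        if start_epoch <= ts < end_epoch:
            totals[user] += nbytes
    return sorted(u for u, total in totals.items() if total >= threshold_bytes)
```

A real tool would presumably collect these samples from every Lustre service and then install a TBF rule, with "start_epoch" and "end_epoch" set to the period boundaries, for each user returned.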
There can be multiple QoS thresholds, e.g. one soft threshold and
one hard threshold. When a user reaches the soft threshold, the RPC
rate of the user is reduced to a slightly smaller value. If the user
keeps doing a lot of I/O and finally reaches the hard threshold,
the global management tool might decide to enforce a very strict RPC
limitation. Thus, another option has been added to the TBF rule:
"further_throttle". If this option is configured in a TBF rule, the
printed message notifies the user that they need to slow down their
I/O rate until the end of this QoS period; otherwise, further
limitation will be enforced as a penalty.
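The soft/hard threshold decision might look like the following sketch. The function, its parameters, and the returned dictionary are hypothetical; only the "further_throttle" option name comes from the description, and setting it solely at the soft stage is an assumption based on the text above.

```python
from typing import Dict, Optional, Union


def choose_limit(used: int, soft: int, hard: int,
                 soft_rate: int, hard_rate: int
                 ) -> Optional[Dict[str, Union[int, bool]]]:
    """Pick the RPC rate limit for a user given the usage in this QoS period.

    Returns None when no threshold has been reached.
    """
    if used >= hard:
        # Strictest stage: nothing further to warn about (assumption).
        return {"rate": hard_rate, "further_throttle": False}
    if used >= soft:
        # Mild reduction; the user is warned that a stricter limit may follow.
        return {"rate": soft_rate, "further_throttle": True}
    return None
```

For example, with `soft=100` and `hard=200`, a user at 150 gets the soft rate with `further_throttle` set, while a user at 250 gets the hard rate.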
To avoid flooding the user's console with messages, the same message
can only be printed to a console again after a time interval. This
interval can be tuned with the "qos_message_interval" parameter of
the ptlrpc module. In order to know whether a message has already
been printed to a console, a history of messages is kept in a hash
table. The size of the hash table can be configured with the
"qos_message_history_size" parameter of the ptlrpc module.
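The interval and history mechanism can be sketched in user-space Python as below. This is not the ptlrpc implementation: the class name, the (console, message) key, the use of seconds, and the evict-oldest policy when the table is full are all assumptions; the ticket only states that a bounded hash table of printed messages exists.

```python
from collections import OrderedDict
from typing import Hashable


class MessageHistory:
    """Remember when each (console, message) key was last printed."""

    def __init__(self, interval: float, max_entries: int) -> None:
        self.interval = interval          # plays the role of qos_message_interval
        self.max_entries = max_entries    # plays the role of qos_message_history_size
        self._last: "OrderedDict[Hashable, float]" = OrderedDict()

    def should_print(self, key: Hashable, now: float) -> bool:
        """Return True, and record the print, if the key is not being suppressed."""
        last = self._last.get(key)
        if last is not None and now - last < self.interval:
            return False                  # same message printed too recently
        self._last[key] = now
        self._last.move_to_end(key)       # mark as most recently printed
        while len(self._last) > self.max_entries:
            self._last.popitem(last=False)  # evict the oldest entry (assumed policy)
        return True
```

One consequence of bounding the table this way is that a message whose entry has been evicted can be printed again before its interval expires, so a larger history size gives stricter suppression at the cost of memory.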
Issue Links
- is blocking LU-11431 Global QoS management based on TBF (Closed)