Type: New Feature
Affects Version/s: None
Fix Version/s: None
Lustre features such as HSM and SSD-based OST pools have enabled many
new use cases, which makes data management of Lustre file systems a routine
task. The Robinhood Policy Engine can perform various kinds of data
management based on pre-configured rules and has proven to be a versatile
tool for managing large Lustre file systems. However, Robinhood requires an
external machine with strong CPU, memory, and storage resources, and setting
up and configuring Robinhood properly takes extra effort from users. That is
why we (DDN) are proposing a policy engine that is implemented completely
inside Lustre. To avoid performance regressions and complexity, this policy
engine is implemented in a very lightweight way. As a result, it supports only
a limited set of use cases, possibly far fewer than Robinhood does. However,
the new policy engine could still be useful for many use cases, especially the
relatively simple ones.
The core component of the policy engine is an arithmetic unit that calculates
the value of a rule, which users can configure at run time. The rule is an
arithmetic expression. An expression is either 1) a number, or 2) a constant
name, or 3) a system attribute name, or 4) an object attribute name, or 5) two
expressions combined by an operator.
The arithmetic values of all expressions are calculated as unsigned 64-bit
numbers, so any unsigned 64-bit number can be used in an expression.
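
To illustrate what unsigned 64-bit evaluation implies, here is a minimal Python sketch. It assumes C-style wraparound modulo 2^64 (the text above does not spell out overflow behavior), and the helper names u64 and u64_sub are hypothetical.

```python
U64_MASK = (1 << 64) - 1  # all 64 bits set


def u64(value: int) -> int:
    """Truncate a Python integer to an unsigned 64-bit value."""
    return value & U64_MASK


def u64_sub(a: int, b: int) -> int:
    """Subtract with wraparound, as C does for uint64_t (assumed behavior)."""
    return u64(a - b)


if __name__ == "__main__":
    print(hex(u64_sub(0, 1)))  # 0xffffffffffffffff
    print(u64_sub(1, 1))       # 0
```

Under this assumption, "0 - 1" yields a value with all bits set, which is what makes it usable as a bit mask in a rule.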
A constant name is simply an alias for a 64-bit number that is predefined in
the Lustre code, so the value of the constant is fixed in advance. A typical
rule for HSM might use the constants
hsma_bit_[archive|restore|remove|cancel] to indicate the HSM actions that
should be taken after evaluating the rule.
A system attribute is a system-wide attribute of Lustre or the kernel. The
free space on an OST is a typical example of system state, and the number of
free inodes on an MDT is another. The date and time is an example of a system
attribute that is independent of Lustre. When an expression is evaluated,
the current value of the system attribute is used. Since arithmetic values
are 64-bit, all attribute values are 64-bit numbers.
An object attribute name can be any attribute that is available from
the corresponding Lustre objects, usually MDT objects and OST objects.
Available object attribute names include, but are not limited to, the
attributes that can be read by the getattr() syscall, such as atime, mtime,
ctime, size, mode, uid, gid, blocks, type, flags, nlink, rdev, blksize, etc.
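
As a userspace analogy only (not the in-kernel implementation), the following Python sketch gathers system and object attributes into name-to-value tables of 64-bit integers. The names free_space and free_inodes, and both function names, are hypothetical; sys_time is taken from the example rule in this ticket. It relies on os.stat() and os.statvfs(), so it assumes a Unix-like system.

```python
import os
import stat
import time

U64_MASK = (1 << 64) - 1


def object_attributes(path: str) -> dict:
    """Collect per-object attributes, each truncated to 64 bits."""
    st = os.stat(path)
    attrs = {
        "atime": int(st.st_atime),
        "mtime": int(st.st_mtime),
        "ctime": int(st.st_ctime),
        "size": st.st_size,
        "mode": stat.S_IMODE(st.st_mode),  # permission bits
        "type": stat.S_IFMT(st.st_mode),   # file type bits
        "uid": st.st_uid,
        "gid": st.st_gid,
        "nlink": st.st_nlink,
    }
    return {name: value & U64_MASK for name, value in attrs.items()}


def system_attributes(path: str) -> dict:
    """Collect system-wide attributes for the file system holding path."""
    vfs = os.statvfs(path)
    attrs = {
        "sys_time": int(time.time()),
        "free_space": vfs.f_bavail * vfs.f_frsize,  # bytes, hypothetical name
        "free_inodes": vfs.f_favail,                # hypothetical name
    }
    return {name: value & U64_MASK for name, value in attrs.items()}
```

Merging the two dictionaries gives the lookup table that a rule evaluator would consult when it encounters an attribute name.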
An operator can be almost any integer operation available in the C
language, including arithmetic operators (+, -, *, /, %), relational and
logical operators (==, !=, >, >=, <, <=, &&, ||), and bitwise operators
(&, |, ^, <<, >>).
To simplify parsing of the expression in Lustre, the rule expression must be
written in Polish (prefix) notation
(https://en.wikipedia.org/wiki/Polish_notation). A rule that triggers the
HSM archive action if the modification timestamp of the file is more than
1 minute earlier than the system time can be set with the following command:
echo -n "& - >= mtime - sys_time 60 1 hsma_bit_archive" > /proc/fs/lustre/mdt/lustre-MDT0000/hsm_policy_rule
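
To show how such a rule evaluates, here is a minimal Python sketch of a Polish-notation evaluator. It is not the Lustre implementation. It assumes whitespace-separated tokens, binary operators only, wraparound modulo 2^64, and a result of 0 for shift counts of 64 or more. The value used for hsma_bit_archive is a placeholder.

```python
import operator

U64_MASK = (1 << 64) - 1

# Binary operators; relational and logical ones yield 1 or 0.
OPERATORS = {
    "+": operator.add, "-": operator.sub, "*": operator.mul,
    "/": operator.floordiv, "%": operator.mod,
    "==": lambda a, b: int(a == b), "!=": lambda a, b: int(a != b),
    ">": lambda a, b: int(a > b), ">=": lambda a, b: int(a >= b),
    "<": lambda a, b: int(a < b), "<=": lambda a, b: int(a <= b),
    "&&": lambda a, b: int(bool(a) and bool(b)),
    "||": lambda a, b: int(bool(a) or bool(b)),
    "&": operator.and_, "|": operator.or_, "^": operator.xor,
    # Assumption: oversized shift counts give 0 (undefined in C).
    "<<": lambda a, b: a << b if b < 64 else 0,
    ">>": lambda a, b: a >> b if b < 64 else 0,
}


def evaluate(rule: str, names: dict) -> int:
    """Evaluate a prefix expression; names maps constants/attributes to ints."""
    tokens = rule.split()
    pos = 0

    def parse() -> int:
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("expression ended early")
        token = tokens[pos]
        pos += 1
        if token in OPERATORS:
            left = parse()   # first operand follows the operator
            right = parse()  # second operand follows the first
            return OPERATORS[token](left, right) & U64_MASK
        if token.isdecimal():
            return int(token) & U64_MASK
        return names[token] & U64_MASK  # KeyError for unknown names

    value = parse()
    if pos != len(tokens):
        raise ValueError("trailing tokens after expression")
    return value


RULE = "& - >= mtime - sys_time 60 1 hsma_bit_archive"
HSMA_BIT_ARCHIVE = 0x1  # placeholder value, not taken from the Lustre code

old_file = {"mtime": 1000, "sys_time": 2000, "hsma_bit_archive": HSMA_BIT_ARCHIVE}
new_file = {"mtime": 1990, "sys_time": 2000, "hsma_bit_archive": HSMA_BIT_ARCHIVE}
print(evaluate(RULE, old_file))  # 1: archive bit set
print(evaluate(RULE, new_file))  # 0: no action
```

The rule reads as ((mtime >= sys_time - 60) - 1) & hsma_bit_archive. For an old file the comparison yields 0, subtracting 1 wraps to all ones, and the mask selects the archive bit. For a recently modified file the comparison yields 1, the subtraction gives 0, and no bit is set.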
Currently, Lustre does not optimize the expression when it is set, even when
the optimization is obvious. For example, "&& 0 expression1" is always equal
to "0", but the value of "expression1" is still evaluated when computing the
value of the entire expression. This means the rule expression should be
optimized before it is set, either manually or with an external tool. An
external userspace tool that converts conventional infix notation with
parentheses to Polish notation, and optimizes the expression at the same
time, would be very helpful for users.
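
A minimal Python sketch of such a userspace tool follows. No such tool is described in this ticket, so all function names are hypothetical. It assumes C operator precedence with left associativity, and it only folds a few obviously constant cases ("&&", "&" or "*" with a literal 0, and "||" with a nonzero literal).

```python
import re

TOKEN_RE = re.compile(
    r"\s*(\d+|[A-Za-z_]\w*|<<|>>|<=|>=|==|!=|&&|\|\||[-+*/%<>&|^()])")

# C precedence levels; a higher number binds tighter.
PRECEDENCE = {
    "||": 1, "&&": 2, "|": 3, "^": 4, "&": 5,
    "==": 6, "!=": 6, "<": 7, "<=": 7, ">": 7, ">=": 7,
    "<<": 8, ">>": 8, "+": 9, "-": 9, "*": 10, "/": 10, "%": 10,
}


def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if not match:
            raise ValueError(f"unexpected character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse(tokens):
    """Precedence climbing; returns a leaf string or an (op, left, right) tuple."""
    pos = 0

    def primary():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        token = tokens[pos]
        pos += 1
        if token == "(":
            node = binary(1)
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing ')'")
            pos += 1
            return node
        if token in PRECEDENCE or token == ")":
            raise ValueError(f"unexpected token {token!r}")
        return token

    def binary(min_prec):
        nonlocal pos
        left = primary()
        while pos < len(tokens) and PRECEDENCE.get(tokens[pos], 0) >= min_prec:
            op = tokens[pos]
            pos += 1
            left = (op, left, binary(PRECEDENCE[op] + 1))
        return left

    tree = binary(1)
    if pos != len(tokens):
        raise ValueError(f"unexpected token {tokens[pos]!r}")
    return tree


def is_literal(node, want_zero):
    return (isinstance(node, str) and node.isdecimal()
            and (int(node) == 0) == want_zero)


def optimize(node):
    """Fold sub-expressions whose value is fixed by a literal operand."""
    if isinstance(node, str):
        return node
    op, left, right = node
    left, right = optimize(left), optimize(right)
    if op in ("&&", "&", "*") and (is_literal(left, True) or is_literal(right, True)):
        return "0"
    if op == "||" and (is_literal(left, False) or is_literal(right, False)):
        return "1"
    return (op, left, right)


def to_polish(node):
    if isinstance(node, str):
        return node
    op, left, right = node
    return f"{op} {to_polish(left)} {to_polish(right)}"


def convert(infix):
    return to_polish(optimize(parse(tokenize(infix))))


print(convert("((mtime >= sys_time - 60) - 1) & hsma_bit_archive"))
# & - >= mtime - sys_time 60 1 hsma_bit_archive
print(convert("0 && (size > 100)"))
# 0
```

Dropping an operand is safe here only because attribute lookups have no side effects. A fuller tool could also fold sub-expressions made entirely of literals.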
The configured rule can be evaluated either synchronously or asynchronously.
Synchronous evaluation means evaluating the expression in the context of a
service thread. For example, when a file is being accessed, the expression of
the rule is calculated in the service thread. Corresponding actions are
triggered by the policy engine if the value of the expression matches a
predefined pattern. To avoid performance regressions, the speed of
synchronous evaluation is critical, which is why synchronous evaluation
supports only one rule.
However, after synchronous evaluation triggers an action job, asynchronous
evaluation can be performed on multiple pre-configured rules when handling
the action job. Asynchronous evaluation is done in a dedicated thread pool
of the policy engine, so it causes no performance regression. The service
thread pool could also scan the whole OST/MDT from time to time to find the
objects that match the rules.
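
The split between the two evaluation modes can be sketched in Python as follows. This is a userspace analogy under several assumptions, not the Lustre design: rules are plain Python callables rather than parsed expressions, a ThreadPoolExecutor stands in for the dedicated thread pool, and the names make_engine, on_access, and the action strings are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def make_engine(sync_rule, async_rules, pool):
    """sync_rule: attrs -> int; async_rules: list of (condition, action) pairs."""

    def run_job(attrs):
        # Asynchronous path: every configured rule is checked in a pool thread.
        return [action(attrs) for condition, action in async_rules
                if condition(attrs)]

    def on_access(attrs):
        # Synchronous path: a single cheap check in the calling thread,
        # which plays the role of the service thread.
        if sync_rule(attrs):
            return pool.submit(run_job, attrs)
        return None

    return on_access


if __name__ == "__main__":
    rules = [
        (lambda a: a["size"] > 100, lambda a: "archive"),
        (lambda a: a["uid"] == 0, lambda a: "notify"),
    ]
    with ThreadPoolExecutor(max_workers=2) as pool:
        on_access = make_engine(
            lambda a: a["size"] > 100 or a["uid"] == 0, rules, pool)
        job = on_access({"size": 500, "uid": 1000})
        print(job.result())  # ['archive']
```

The caller pays only for one rule evaluation. The per-rule work happens later in the pool, and only for objects that passed the synchronous check.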
A set of rules such as "condition1 -> action1", "condition2 -> action2",
and "condition3 -> action3" can be configured for asynchronous evaluation
by the policy engine. To trigger jobs properly from synchronous
evaluation, a rule that is equivalent to, but more optimized than,
"|| condition1 || condition2 condition3" should be set.
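
Building the combined synchronous rule from the individual conditions is mechanical. A minimal Python sketch, with the hypothetical helper name combine_conditions and assuming each condition is already a complete Polish-notation expression:

```python
def combine_conditions(conditions):
    """Fold prefix-notation conditions into one right-nested '||' chain."""
    if not conditions:
        raise ValueError("at least one condition is required")
    combined = conditions[-1]
    for condition in reversed(conditions[:-1]):
        combined = f"|| {condition} {combined}"
    return combined


print(combine_conditions(["condition1", "condition2", "condition3"]))
# || condition1 || condition2 condition3
```

This produces only the unoptimized form. Simplifying the result, for example by merging shared sub-expressions, is left to the user or to a separate tool.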
Obviously, this policy engine has some limitations, and everything that
this policy engine can do for HSM should also be achievable by using
Even though the current code only supports HSM, this policy engine could
potentially be used for other features that need configurable policies.
The following is a list of such features:
- Data migration between an SSD OST pool and a normal OST pool. The policy
engine could use a new feature named file heat to decide which data to move
to SSD
- RPC classification in the NRS TBF policy. Currently, the NRS TBF policy
classifies RPCs based on NID/JobID. By using the expressions of this policy
engine, the TBF policy could classify RPCs based on a user-configurable
expression of RPC attributes. This could enable many more use cases than existing
- Inotify is a useful feature for monitoring file system events, but
Lustre itself doesn't support system-wide inotify. By using the lightweight
policy engine, a notification mechanism that might be more powerful and
efficient than inotify could be implemented for Lustre. In order to act like
inotify, when the pre-configured rule is matched, instead of applying
background actions, this policy engine could send a notification to the
watching application. Because an expression could be used to filter the
desired events from the original source, the extra overhead such as RPCs
caused by notification could be minimized.
- Cache management tuning on different levels. The policy engine could be used
in cache management systems in order to make the decision of data prefetching
or cache eviction.