Lustre / LU-8674

A lightweight internal policy engine of Lustre for HSM, OST pool migration, file heat, inotify and so on


Details

    • New Feature
    • Resolution: Fixed
    • Minor

    Description

      Lustre features such as HSM and SSD-based OST pools have enabled many new
      use cases, which makes data management of a Lustre file system a daily
      task. The Robinhood Policy Engine can perform various kinds of data
      management based on pre-configured rules and has proven to be a versatile
      tool for managing large Lustre file systems. However, Robinhood requires an
      external machine with strong CPU, memory and storage, and setting it up and
      configuring it properly requires extra effort from users. That is why
      we (DDN) are proposing a policy engine that is implemented completely inside
      Lustre. To avoid performance regressions and complexity, this policy
      engine is implemented in a very lightweight way. That means it can support
      only a limited set of use cases, possibly far fewer than Robinhood can.
      However, this new policy engine could still be useful for many use cases,
      especially the relatively simple ones.

      The core component of the policy engine is an arithmetic unit that calculates
      the value of a rule, which users can configure at run time. The rule is an
      arithmetic expression. An expression is either 1) a number, or 2) a constant
      name, or 3) a system attribute name, or 4) an object attribute name, or 5) two
      expressions combined by an operator.

      The arithmetic values of all expressions are calculated as unsigned 64-bit
      numbers, so any unsigned 64-bit number can be used in an expression.

      A constant name is simply an alias for a 64-bit number that is predefined
      in the Lustre code, so the value of the constant is fixed in advance. A
      typical rule for HSM might use the constants
      hsma_bit_[archive|restore|remove|cancel] to indicate the HSM actions that
      should be taken after evaluating the rule.

      A system attribute is a system-wide attribute of Lustre or the kernel. The
      free space on an OST is a typical example of system state, and the number of
      free inodes on an MDT is another. The date and time is an example of a system
      attribute that is independent of Lustre. When an expression is evaluated,
      the current value of the system attribute is used. Since arithmetic values
      are 64 bits wide, all attribute values are 64-bit numbers.

      An object attribute name can be any attribute name that is available from
      the corresponding Lustre objects, usually MDT objects and OST objects.
      Available object attribute names include, but are not limited to, the
      attributes that can be read by the getattr() syscall, such as atime, mtime,
      ctime, size, mode, uid, gid, blocks, type, flags, nlink, rdev, blksize, etc.

      An operator can be almost any integer operation available in the C
      language, including arithmetic operators (+, -, *, /, %), relational and
      logical operators (==, !=, >, >=, <, <=, &&, ||), and bitwise operators
      (&, |, ^, <<, >>).

      To simplify parsing of the expression in Lustre, the rule expression
      must be configured in Polish notation
      (https://en.wikipedia.org/wiki/Polish_notation). A rule that triggers the
      HSM archive action if the modification timestamp of the file is more than
      1 minute earlier than the system time can be set with the following command:

      echo -n "& - >= mtime - sys_time 60 1 hsma_bit_archive" > /proc/fs/lustre/mdt/lustre-MDT0000/hsm_policy_rule
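
      To make the arithmetic behind this rule easier to follow, here is a minimal
      userspace sketch of a Polish-notation evaluator over unsigned 64-bit values.
      It is not the Lustre implementation. The function name evaluate(), the value
      0x4 for hsma_bit_archive, the timestamps, and the treatment of shift counts
      are all assumptions made for illustration. The rule works because a false
      comparison yields 0, "0 - 1" wraps around to all ones in unsigned 64-bit
      arithmetic, and the final "&" then lets the archive bit through.

```python
MASK64 = (1 << 64) - 1  # every value is an unsigned 64-bit number

# Binary operators named in the description. Reducing shift counts modulo 64
# is an assumption of this sketch, not documented Lustre behaviour.
OPERATORS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a // b,  # raises ZeroDivisionError when b == 0
    "%": lambda a, b: a % b,
    "==": lambda a, b: int(a == b),
    "!=": lambda a, b: int(a != b),
    ">": lambda a, b: int(a > b),
    ">=": lambda a, b: int(a >= b),
    "<": lambda a, b: int(a < b),
    "<=": lambda a, b: int(a <= b),
    "&&": lambda a, b: int(a != 0 and b != 0),
    "||": lambda a, b: int(a != 0 or b != 0),
    "&": lambda a, b: a & b,
    "|": lambda a, b: a | b,
    "^": lambda a, b: a ^ b,
    "<<": lambda a, b: a << (b & 63),
    ">>": lambda a, b: a >> (b & 63),
}


def evaluate(rule, names):
    """Evaluate a Polish-notation rule; names maps constants/attributes to ints."""
    tokens = rule.split()
    pos = 0

    def parse():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("expression ends too early")
        token = tokens[pos]
        pos += 1
        if token in OPERATORS:
            # Both operands are always evaluated; nothing is short-circuited.
            left = parse()
            right = parse()
            return OPERATORS[token](left, right) & MASK64
        if token.isascii() and token.isdigit():
            return int(token) & MASK64
        if token in names:
            return names[token] & MASK64
        raise ValueError(f"unknown name: {token}")

    value = parse()
    if pos != len(tokens):
        raise ValueError("unexpected tokens after the expression")
    return value


RULE = "& - >= mtime - sys_time 60 1 hsma_bit_archive"
HSMA_BIT_ARCHIVE = 0x4  # hypothetical value, chosen only for this example

# File last modified 100 seconds ago: the archive bit is set in the result.
old_file = {"mtime": 900, "sys_time": 1000, "hsma_bit_archive": HSMA_BIT_ARCHIVE}
# File last modified 10 seconds ago: the result is 0, so nothing is triggered.
new_file = {"mtime": 990, "sys_time": 1000, "hsma_bit_archive": HSMA_BIT_ARCHIVE}

print(evaluate(RULE, old_file))  # 4
print(evaluate(RULE, new_file))  # 0
```

      Masking every intermediate result with MASK64 is what reproduces the
      unsigned wrap-around that the rule relies on.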

      Currently, Lustre does not optimize the expression when it is set, even
      when the optimization is obvious. For example, "&& 0 expression1" is
      essentially equal to "0"; however, "expression1" will still be evaluated
      when computing the value of the entire expression. That means the expression
      should be optimized before it is set, either manually or with an external
      tool. An external userspace tool that converts ordinary infix notation
      with parentheses to Polish notation and optimizes the expression at the
      same time could be really helpful for users.
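
      As a rough idea of what such an external tool might look like, the sketch
      below converts infix notation to Polish notation and applies two simple
      optimizations: folding of constant subexpressions and elimination of "&&"
      and "||" terms whose result is already fixed by a numeric operand. The
      function name infix_to_polish(), the use of C operator precedence, and the
      choice of optimizations are assumptions of this example. No such tool is
      described in this ticket.

```python
import re

MASK64 = (1 << 64) - 1
TOKEN_RE = re.compile(r"\d+|[A-Za-z_]\w*|&&|\|\||[=!<>]=|<<|>>|[-+*/%<>&|^()]")

# Binary operator precedence, following C (higher binds tighter).
PRECEDENCE = {"||": 1, "&&": 2, "|": 3, "^": 4, "&": 5, "==": 6, "!=": 6,
              "<": 7, "<=": 7, ">": 7, ">=": 7, "<<": 8, ">>": 8,
              "+": 9, "-": 9, "*": 10, "/": 10, "%": 10}

APPLY = {
    "+": lambda a, b: a + b, "-": lambda a, b: a - b, "*": lambda a, b: a * b,
    "/": lambda a, b: a // b, "%": lambda a, b: a % b,
    "==": lambda a, b: int(a == b), "!=": lambda a, b: int(a != b),
    "<": lambda a, b: int(a < b), "<=": lambda a, b: int(a <= b),
    ">": lambda a, b: int(a > b), ">=": lambda a, b: int(a >= b),
    "&&": lambda a, b: int(a != 0 and b != 0),
    "||": lambda a, b: int(a != 0 or b != 0),
    "&": lambda a, b: a & b, "|": lambda a, b: a | b, "^": lambda a, b: a ^ b,
    "<<": lambda a, b: a << b, ">>": lambda a, b: a >> b,
}


def tokenize(text):
    tokens = TOKEN_RE.findall(text)
    if "".join(tokens) != "".join(text.split()):
        raise ValueError("unrecognized characters in expression")
    return tokens


def parse_operand(tokens):
    if not tokens:
        raise ValueError("expression ends too early")
    token = tokens.pop(0)
    if token == "(":
        node = parse_expr(tokens)
        if not tokens or tokens.pop(0) != ")":
            raise ValueError("missing ')'")
        return node
    if token in PRECEDENCE or token == ")":
        raise ValueError(f"operand expected, got {token}")
    return token


def parse_expr(tokens, min_prec=1):
    """Precedence climbing; every operator is treated as left-associative."""
    left = parse_operand(tokens)
    while tokens and PRECEDENCE.get(tokens[0], 0) >= min_prec:
        op = tokens.pop(0)
        right = parse_expr(tokens, PRECEDENCE[op] + 1)
        left = (op, left, right)
    return left


def is_number(node):
    return isinstance(node, str) and node.isascii() and node.isdigit()


def fold(node):
    """Fold constants; drop '&&'/'||' operands that cannot change the result."""
    if isinstance(node, str):
        return node
    op, left, right = node
    left, right = fold(left), fold(right)
    if is_number(left) and is_number(right):
        a, b = int(left) & MASK64, int(right) & MASK64
        # Leave division by zero and oversized shifts for the evaluator.
        unsafe = (op in ("/", "%") and b == 0) or (op in ("<<", ">>") and b >= 64)
        if not unsafe:
            return str(APPLY[op](a, b) & MASK64)
    numbers = [int(n) for n in (left, right) if is_number(n)]
    if op == "&&" and 0 in numbers:
        return "0"  # assumes the dropped operand has no side effects
    if op == "||" and any(n != 0 for n in numbers):
        return "1"
    return (op, left, right)


def to_polish(node):
    if isinstance(node, str):
        return node
    op, left, right = node
    return f"{op} {to_polish(left)} {to_polish(right)}"


def infix_to_polish(text):
    tokens = tokenize(text)
    tree = parse_expr(tokens)
    if tokens:
        raise ValueError(f"unexpected token: {tokens[0]}")
    return to_polish(fold(tree))


print(infix_to_polish("((mtime >= sys_time - 60) - 1) & hsma_bit_archive"))
print(infix_to_polish("0 && (size > 1024)"))
```

      The first call reproduces the HSM archive rule shown earlier in Polish
      notation, and the second collapses to "0" as in the "&& 0 expression1" case.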

      The configured rule could be evaluated either synchronously or asynchronously.
      Synchronous evaluation means to evaluate the expression in the context of a
      service thread. For example, when a file is being accessed, the expression of
      the rule will be calculated in the service thread. Corresponding actions will
      be triggered by the policy engine if the value of the expression matches a
      predefined pattern. To avoid performance regressions, the speed of
      synchronous evaluation is critical, which is why synchronous evaluation
      supports only one rule.

      However, after synchronous evaluation triggers an action job, asynchronous
      evaluation can be done on multiple pre-configured rules when handling
      the action job. Asynchronous evaluation is done in a dedicated thread pool
      of the policy engine, so it causes no performance regression. The service
      thread pool could also scan the whole OST/MDT from time to time to find
      the objects that match the rules.

      A set of rules like "condition1 -> action1", "condition2 -> action2",
      and "condition3 -> action3" can be configured for asynchronous evaluation
      by the policy engine. For synchronous evaluation to trigger jobs properly,
      a rule that is equivalent to, but more optimized than,
      "|| condition1 || condition2 condition3" should be set.
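
      The unoptimized form of that synchronous rule can be derived mechanically
      from the asynchronous conditions. The helper below is a small sketch of
      that step only. The name combine_conditions() and the sample conditions
      are hypothetical, and the result would still need to be optimized before
      being set.

```python
def combine_conditions(conditions):
    """Join Polish-notation conditions into one right-nested '||' chain."""
    if not conditions:
        raise ValueError("at least one condition is required")
    rule = conditions[-1]
    # Wrap from the right so the output reads "|| c1 || c2 c3".
    for condition in reversed(conditions[:-1]):
        rule = f"|| {condition} {rule}"
    return rule


# Hypothetical conditions built from attribute names listed in the description.
async_conditions = ["> size 1048576", "== uid 0", "< atime 1000"]
print(combine_conditions(async_conditions))
# || > size 1048576 || == uid 0 < atime 1000
```

      Because each condition is a complete prefix expression, simple
      concatenation is enough and no parentheses are needed.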

      Obviously, this policy engine has some limitations, and everything that
      it can do for HSM should also be achievable with Robinhood.

      Even though the current code only supports HSM, this policy engine could
      potentially be used for other features that need configurable policies.
      The following is a list of such features:

      • Data migration between SSD OST pool and normal OST pool. The policy engine
        could use a new feature named file heat to decide which data to move to SSD
        pool.
      • RPC classification in the NRS TBF policy. Currently, the NRS TBF policy
        classifies RPCs based on NID/JobID. By using the expression of this policy
        engine, the TBF policy could classify RPCs based on a user-configurable
        expression of RPC attributes. This could enable many more use cases than
        the existing classification.
      • Inotify is a useful feature for monitoring file system events, but
        Lustre itself doesn't support system-wide inotify. By using the lightweight
        policy engine, a notification mechanism that might be more powerful and
        efficient than inotify could be implemented for Lustre. In order to act like
        inotify, when the pre-configured rule is matched, instead of applying
        background actions, this policy engine could send a notification to the
        watching application. Because an expression could be used to filter the
        desired events from the original source, the extra overhead such as RPCs
        caused by notification could be minimized.
      • Cache management tuning on different levels. The policy engine could be used
        in cache management systems to decide on data prefetching or cache
        eviction.

          People

            lixi Li Xi (Inactive)