  1. Lustre
  2. LU-8434

Rewrite test framework using Python


    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: Lustre 2.13.0, Lustre 2.12.2


      The Lustre test framework is very useful for checking that a patch works
      and at the same time doesn't break anything. Currently, there are close to
      1500 test suites in the Lustre test framework, and most of the test code is
      written as Bash shell scripts. Writing shell scripts for tests is quite
      straightforward, but it has a number of disadvantages that could be solved
      by rewriting the tests in Python:
      Lack of log level options:
          Logs with different levels are really helpful for debugging, especially
          when the tests can't be run repeatedly. The "sh -x" or "verbose" option
          doesn't help much, since the log messages are either too many or too few.
          The powerful logging facility of Python should be able to satisfy
          the requirements of debugging. Different logging levels could be
          redirected to different files for different use cases, e.g. error level
          for fatal problems including unexpected bugs in the scripts, warning
          level for problems that should be taken care of but are not necessarily
          fatal, info level for showing the current progress, and debug level for
          everything that could be useful for debugging.
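      A minimal sketch of the per-level log files described above, using only
      the standard logging module. The function name setup_test_logger, the
      logger name, the file names, and the sample messages are assumptions made
      for illustration; they are not part of any existing Lustre code.

```python
import logging
import os
import tempfile


def setup_test_logger(log_dir, name="lustre_test"):
    """Return a logger that writes each level threshold to its own file."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)  # accept everything; handlers filter
    logger.propagate = False
    formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    for level_name in ("ERROR", "WARNING", "INFO", "DEBUG"):
        # error.log gets only errors, debug.log gets every message
        path = os.path.join(log_dir, level_name.lower() + ".log")
        handler = logging.FileHandler(path)
        handler.setLevel(getattr(logging, level_name))
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger


log_dir = tempfile.mkdtemp()
logger = setup_test_logger(log_dir)
logger.debug("running: touch /mnt/lustre/f0a")   # hypothetical path
logger.info("starting test 0a")
logger.warning("an OST is nearly full")
logger.error("test 0a failed: not a file")
```

      Each handler applies its own threshold, so a reader can start from
      error.log and move to debug.log only when more detail is needed.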
      Too many pitfalls:
          Bash has so many pitfalls that it is very easy to write a script
          with defects.
          For a list of common pitfalls:
          The following are real examples from Lustre:
          # The following line won't drop the cache on all OSTs as expected
          do_nodes $(comma_list $(osts_nodes)) echo 3 > /proc/sys/vm/drop_caches
          # The following patch fixed the problem in some scripts, but still left the
          # same problem unsolved in other scripts
          LU-6205 tests: fix bash expansion of fid
          # A dozen defects were found in the current test scripts
          LU-7529 test: fix tiny problems of tests
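      In the drop_caches example, the local shell consumes the ">" redirection
      before do_nodes ever runs, so the file is written on the local node only.
      A rough sketch of how a Python helper could avoid this class of mistake:
      the whole command is quoted and handed to ssh as one argument, so the
      redirection is interpreted on the remote side. The helper names and the
      host names oss1/oss2 are hypothetical.

```python
import shlex


def remote_argv(host, command):
    """Build the argv for running `command` through a shell on `host`.

    The whole command, including any redirection, is passed to ssh as a
    single quoted argument, so nothing on the local side interprets ">".
    """
    return ["ssh", host, "sh -c " + shlex.quote(command)]


def drop_caches_argvs(hosts):
    """Return one ssh argv per host for dropping the page cache there."""
    command = "echo 3 > /proc/sys/vm/drop_caches"
    return [remote_argv(host, command) for host in hosts]


# The argv lists could be passed to subprocess.run() without shell=True.
for argv in drop_caches_argvs(["oss1", "oss2"]):
    print(argv)
```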
      A lot of defects:
          There are a lot of defects in the Lustre test scripts, and because of
          them the test results are sometimes not reproducible or consistent.
          That means a test suite could pass in the first run loop but then fail
          in the next loop. These inconsistent results happen more frequently
          when the test environment changes. There is no doubt that defects can
          never be entirely eliminated; however, rewriting the test scripts could
          be a good chance to clean up the existing error-prone code.
      Not able to skip test suites efficiently:
          By using the "--start-at" and "--stop-at" options, a subset of the test
          suites can be selected to run while the other test suites are skipped.
          However, even with the "--start-at" option, skipping the test suites
          costs significant time. For example, it takes 142 seconds to skip all
          the test suites from sanity 0a to sanity 102a.
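      If the test suites were registered in an ordered list, selecting a range
      could be a plain list slice, with no per-suite work for the skipped
      entries. The sketch below assumes such a registry; select_tests and the
      sample suite names are illustrative only.

```python
def select_tests(names, start_at=None, stop_at=None):
    """Return the slice of `names` from start_at to stop_at, inclusive.

    Raises ValueError if start_at or stop_at is not a registered name.
    """
    start = names.index(start_at) if start_at is not None else 0
    stop = names.index(stop_at) + 1 if stop_at is not None else len(names)
    return names[start:stop]


# Sample registry in run order (names chosen for illustration)
all_tests = ["0a", "0b", "1", "2", "102a", "102b", "103a"]
print(select_tests(all_tests, start_at="102a"))
# prints ['102a', '102b', '103a']
```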
      Not able to be run in parallel on multiple clusters:
          A Lustre test script assumes that it will be run on only one cluster.
          However, the tests take too much time if run on only one cluster (about
          15 hours to pass all regression tests). That is why we (DDN) implemented
          a system named LATEST (Lustre Automatic TEST), which can finish all the
          tests in less than 50 minutes with 240 hosts by running the tests in
          parallel on multiple clusters. However, if the test scripts were written
          in a way that is suitable for running in parallel, the time cost could
          be reduced further.
          For example, some of the test suites in the same script (e.g. sanityn.sh)
          depend on each other, which means those test suites can't be split
          across several clusters.
          Some of the test suites take too long to run, e.g. conf-sanity/32a
          takes 827 seconds and conf-sanity/32d takes 832 seconds. That means that,
          no matter how many hosts are used, the tests can't be finished in less
          than 10 minutes. That is another example of test suites that are not
          friendly to parallel runs. Test scripts that are friendly to parallel
          runs should separate a big test suite into small independent test suites.
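      The effect of a long test suite on the total run time can be illustrated
      with a simple greedy scheduler (longest test first, always onto the
      least-loaded cluster). This is only a sketch of the idea, not how LATEST
      works; the function name and the three short durations are made up, while
      the two conf-sanity durations come from the text above.

```python
import heapq


def schedule(durations, n_clusters):
    """Greedy longest-first assignment of independent tests to clusters.

    durations: dict mapping test name -> run time in seconds.
    Returns (makespan, assignment); assignment[i] lists cluster i's tests.
    """
    heap = [(0, i) for i in range(n_clusters)]  # (busy seconds, cluster index)
    assignment = [[] for _ in range(n_clusters)]
    for name in sorted(durations, key=durations.get, reverse=True):
        busy, idx = heapq.heappop(heap)         # least-loaded cluster so far
        assignment[idx].append(name)
        heapq.heappush(heap, (busy + durations[name], idx))
    return max(busy for busy, _ in heap), assignment


durations = {
    "conf-sanity/32a": 827,  # from the description
    "conf-sanity/32d": 832,  # from the description
    "sanity/0a": 5,          # made-up value
    "sanity/0b": 5,          # made-up value
    "sanity/1": 10,          # made-up value
}
for n in (1, 2, 4):
    makespan, _ = schedule(durations, n)
    print(n, makespan)
```

      With these numbers the total time never drops below 832 seconds, however
      many clusters are added, because the longest suite cannot be split.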
      These are the reasons why we are proposing to rewrite the test scripts in
      Python. That is a large amount of work, so separating it into several
      steps might be more realistic:
      1) Write a new test framework under the lustre/new_tests directory in Python.
      This framework should be able to support the existing functions of the old
      test framework, and at the same time be more extensible and powerful for further
      2) Add the existing test suites to the lustre/new_tests directory. There are
      many test suites, and moving all of them to the new framework is a lot of
      work, so it can be done neither in a single patch nor in a short time. This
      step is therefore a long process that might last for quite a few months.
      3) At the same time, all new test suites added in new patches should be based
      on the new framework; that is, no patch will be allowed to add new test
      suites to lustre/tests.
      4) After step 2) is completely done, remove lustre/tests and replace it with
      The following is an example of what the new test script sanity.py could
      look like in the new framework:
      # Import the private code of the test framework
      from test_framework import add_test_suite, init_environment


      # env: the global environment for running the test
      # cluster: the cluster that is running this test suite
      def test_0a(env, cluster):
          # env.le_dir: the directory on the Lustre client for running the test
          # env.le_fname: the file name used for running the test
          file_path = "%s/%s" % (env.le_dir, env.le_fname)
          # cluster.lc_client: the object of the client host; hosts like
          # cluster.lc_oss[X], cluster.lc_mgs, cluster.lc_mds[X] could be used too.
          # host.lh_run(): run a command on the host over an SSH connection
          # critical=True: the command is critical, so if it fails, the
          # test should exit with an error.
          cluster.lc_client.lh_run("touch %s" % file_path, critical=True)
          # env.le_checkstat: $CHECKSTAT in sanity.sh; "-t file" checks the type
          ret = cluster.lc_client.lh_run("%s -t file %s" %
                                         (env.le_checkstat, file_path))
          # ret.cr_exit_status: the return code of the command
          if ret.cr_exit_status:
              env.le_error("%s is not a file" % file_path)
          # ...


      # Add the test suite to a dict of test suites so that it can be scheduled
      # to run later
      # always_except: the test suite should always be skipped
      # slow_except: the test suite should be skipped when the slow option is not
      # enabled
      add_test_suite(test_0a, always_except=True, slow_except=False)
      # Init the test environment
      env = init_environment()
      # Run the test suites, possibly on multiple clusters in parallel.




            • Assignee:
              lixi_wc Li Xi
              lixi Li Xi (Inactive)