[LU-8434] Rewrite test framework using Python - Whamcloud Community JIRA

Details

Type: Improvement
Resolution: Fixed
Priority: Minor
Fix Version/s: Lustre 2.13.0, Lustre 2.12.2
Affects Version/s: None
Labels: None

Description

Rewrite test framework using Python

The Lustre test framework is very useful for checking that a patch works
well and does not break anything else. Currently, there are close to 1500
test suites in the Lustre test framework, and most of the test code is written
as Bash shell scripts. Writing shell scripts for tests is quite
straightforward, but it has a number of disadvantages that could be addressed
by rewriting the tests in Python:

Lack of log level options:
    Logs with different levels are really helpful for debugging, especially when
    the tests can't be run repeatedly. The "sh -x" or "verbose" options don't
    help much, since they produce either too many or too few log messages.

    The powerful logging facility of Python should be able to satisfy
    the requirements of debugging. Messages of different logging levels could
    be redirected to different files for different use cases, e.g. error level
    for fatal problems including unexpected bugs in the scripts, warning level
    for problems that should be taken care of but are not necessarily fatal,
    info level for showing the current progress, and debug level for everything
    that could be useful for debugging.
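
A minimal sketch of the per-level log files described above, using only the
standard logging module. The function name, logger name, and file names are
assumptions made for illustration, not part of any existing Lustre code.

```python
import logging
import os
import tempfile


def setup_logging(log_dir, name="lustre_test"):
    """Return a logger that writes one file per severity threshold."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)  # let the handlers do the filtering
    logger.propagate = False
    fmt = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    # Hypothetical file names; each file receives its level and above.
    for fname, level in (("error.log", logging.ERROR),
                         ("warning.log", logging.WARNING),
                         ("info.log", logging.INFO),
                         ("debug.log", logging.DEBUG)):
        handler = logging.FileHandler(os.path.join(log_dir, fname))
        handler.setLevel(level)
        handler.setFormatter(fmt)
        logger.addHandler(handler)
    return logger


log_dir = tempfile.mkdtemp()
logger = setup_logging(log_dir)
logger.debug("running: touch /mnt/lustre/f0a")
logger.info("starting test 0a")
logger.warning("slow response from the client")
logger.error("test 0a failed")
```

With this layout, error.log stays short enough to read at a glance, while
debug.log keeps everything for later analysis of a run that cannot be repeated.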

Too many pitfalls:
    Bash has so many pitfalls that it is very easy to write a script with
    defects.

    For a list of common pitfalls, see:
    http://bash.cumulonim.biz/BashPitfalls.html

    The following are real examples from Lustre:

    # The following line won't drop the cache on all OSTs as expected, because
    # the unquoted redirection is applied by the local shell, not on the
    # remote nodes
    do_nodes $(comma_list $(osts_nodes)) echo 3 > /proc/sys/vm/drop_caches

    # The following patch fixed the problem in some scripts, but still left the
    # same problem unsolved in other scripts
    LU-6205 tests: fix bash expansion of fid

    # About a dozen defects were found in the current test scripts
    LU-7529 test: fix tiny problems of tests
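
One way a Python framework could avoid the redirection pitfall shown above is
to pass the whole remote command as a single quoted string, so that no local
shell ever interprets it. The sketch below only builds the ssh argument lists
and does not execute anything; the helper name and the host names are
hypothetical.

```python
import shlex


def remote_argv(host, command):
    """Build the argv for running `command` in a shell on `host` via ssh.

    The command is wrapped as sh -c '<command>' and quoted, so any
    redirection inside it is interpreted on the remote host.
    """
    return ["ssh", host, "sh -c " + shlex.quote(command)]


# Hypothetical OSS host names, for illustration only.
oss_hosts = ["oss1", "oss2"]
command = "echo 3 > /proc/sys/vm/drop_caches"
argvs = [remote_argv(host, command) for host in oss_hosts]
for argv in argvs:
    print(argv)
```

Each list could then be handed to subprocess.run() without shell=True, which
keeps the ">" out of reach of the local shell.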

A lot of defects:
    There are a lot of defects in the Lustre test scripts. Because of these
    defects, the test results are sometimes not consistent or reproducible:
    a test suite could pass in the first run loop but then fail in the next
    loop. Such inconsistent results happen more frequently when the test
    environment changes. There is no doubt that defects can never be entirely
    eliminated; however, rewriting the test scripts would be a good chance to
    clean up the existing error-prone code.

Not able to skip test suites efficiently:
    By using the "--start-at" and "--stop-at" options, a subset of the test
    suites can be selected to run while skipping the others. However, even with
    the "--start-at" option, skipping test suites costs significant time. For
    example, it costs 142 seconds to skip all the test suites from sanity 0a to
    sanity 102a.
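
If the test suites were registered in an ordered table before anything is run,
selecting a range would be a simple list slice and skipped suites would cost
nothing. The following is a sketch under that assumption; the registry, the
decorator, and the function names are hypothetical.

```python
# Hypothetical registry mapping test names to callables, in run order
# (dicts preserve insertion order in Python 3.7+).
TESTS = {}


def register(name):
    """Decorator that records a test function under `name`."""
    def decorator(func):
        TESTS[name] = func
        return func
    return decorator


def select_tests(names, start_at=None, stop_at=None):
    """Return the names from start_at to stop_at, both inclusive."""
    names = list(names)
    first = names.index(start_at) if start_at is not None else 0
    last = names.index(stop_at) if stop_at is not None else len(names) - 1
    return names[first:last + 1]


@register("0a")
def test_0a():
    return "ran 0a"


@register("0b")
def test_0b():
    return "ran 0b"


@register("1")
def test_1():
    return "ran 1"


@register("102a")
def test_102a():
    return "ran 102a"


# Only the selected tests are ever called; the others are never touched.
selected = select_tests(TESTS, start_at="1")
results = [TESTS[name]() for name in selected]
print(selected, results)
```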

Not able to be run in parallel on multiple clusters:
    A Lustre test script assumes that it will be run on only one cluster.
    However, the tests take too much time if run on only one cluster (about 15
    hours to pass all regression tests). That is why we (DDN) implemented a
    system named LATEST (Lustre Automatic TEST), which can finish all the
    tests in less than 50 minutes with 240 hosts by running the tests in
    parallel on multiple clusters. However, if the test scripts were written
    in a way that is suitable for running in parallel, the time cost could be
    reduced further.

    For example, some of the test suites in the same script (e.g. sanityn.sh)
    depend on each other, which means those test suites can't be distributed
    across several clusters.

    Some of the test suites take too long to run, e.g. conf-sanity/32a
    costs 827 seconds and conf-sanity/32d costs 832 seconds. That means, no
    matter how many hosts are used, the tests can't be finished in less than
    832 seconds (about 14 minutes). That is another example of test suites that
    are not friendly to parallel runs. Test scripts that are friendly to
    parallel runs should separate big test suites into small, independent ones.
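
The effect of long-running suites can be illustrated with a simple
longest-first greedy scheduler. This is only a sketch, not how LATEST works:
the two conf-sanity durations come from the text above, while the other
durations and the function name are made up for illustration.

```python
import heapq


def schedule(durations, n_clusters):
    """Greedy longest-first assignment of independent tests to clusters.

    Returns (makespan, assignment), where assignment[i] lists the tests
    given to cluster i and makespan is the busiest cluster's total time.
    """
    heap = [(0, i) for i in range(n_clusters)]  # (busy seconds, cluster)
    heapq.heapify(heap)
    assignment = [[] for _ in range(n_clusters)]
    ordered = sorted(durations.items(), key=lambda kv: (-kv[1], kv[0]))
    for name, secs in ordered:
        busy, idx = heapq.heappop(heap)  # least loaded cluster so far
        assignment[idx].append(name)
        heapq.heappush(heap, (busy + secs, idx))
    return max(busy for busy, _ in heap), assignment


durations = {
    "conf-sanity/32a": 827,  # from the description above
    "conf-sanity/32d": 832,  # from the description above
    "sanity/0a": 5,          # hypothetical
    "sanity/0b": 7,          # hypothetical
    "sanity/1": 12,          # hypothetical
}

for n in (1, 2, 240):
    makespan, _ = schedule(durations, n)
    print(n, makespan)
```

However many clusters are added, the total time never drops below the
duration of the longest single suite, which is why splitting big suites
matters.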

These are the reasons why we are proposing to rewrite the test scripts in
Python. That is a large amount of work, so separating it into several
steps might be more realistic:

1) Write a new test framework under the lustre/new_tests directory in Python.
This framework should support the existing functions of the old test
framework, and at the same time be more extensible and powerful for further
development.

2) Add the existing test suites to the lustre/new_tests directory. There are
many test suites, and moving all of them to the new framework is a lot of
work, so it can be done neither in a single patch nor in a short time. This
step is a long process which might last for quite a few months.

3) At the same time, all new test suites added in new patches should be based
on the new framework; that is, no patch will be allowed to add new test suites
to lustre/tests.

4) After step 2) is completely done, remove lustre/tests and replace it with
lustre/new_tests.

The following is an example of what a new test script, sanity.py, could look
like in the new framework:

#!/usr/bin/python
# Import the private code of the test framework. A Python module name cannot
# contain "-", so the module is assumed to be named test_framework.
from test_framework import add_test_suite, init_environment, run_tests


# env: the global environment for running the test
#
# cluster: the cluster that is running this test suite
def test_0a(env, cluster):
    # env.le_dir: the directory on the Lustre client for running the test
    #
    # env.le_fname: the file name used for running the test
    file_path = "%s/%s" % (env.le_dir, env.le_fname)

    # cluster.lc_client: the object of the client host; hosts like
    # cluster.lc_oss[X], cluster.lc_mgs, cluster.lc_mds[X] could be used too.
    #
    # host.lh_run(): run a command on the host over an SSH connection
    #
    # critical=True: the command is critical, thus if it fails, the
    # test should exit with an error.
    cluster.lc_client.lh_run("touch %s" % (file_path),
                             critical=True)
    # env.le_checkstat: $CHECKSTAT in sanity.sh
    ret = cluster.lc_client.lh_run("%s -t file %s" %
                                   (env.le_checkstat, file_path),
                                   critical=False)
    # ret.cr_exit_status: the return code of the command
    if ret.cr_exit_status:
        env.le_error("%s is not a file" % (file_path))
    # ...


# Add the test suite to a dict of test suites so that it can be scheduled
# to run later
#
# always_except: the test suite should always be skipped
#
# slow_except: the test suite should be skipped when the slow option is not
# enabled
add_test_suite(test_0a, always_except=True, slow_except=False)

# Initialize the test environment
env = init_environment()
# Run the test suites, possibly on multiple clusters in parallel.
run_tests(env)
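
To make the critical flag of lh_run() more concrete, here is a small sketch of
how such a method might behave. It is an assumption-laden stand-in: it runs
commands locally through the shell instead of over SSH, it raises an exception
where the proposal says the test should exit with an error, and everything
except the names lh_run, critical, and cr_exit_status is hypothetical.

```python
import subprocess
from collections import namedtuple

# cr_exit_status mirrors the field name used in the example above; the
# other two fields are assumptions.
CommandResult = namedtuple("CommandResult",
                           ["cr_exit_status", "cr_stdout", "cr_stderr"])


class CriticalCommandError(Exception):
    """Raised when a command marked critical=True exits non-zero."""


class LocalHost(object):
    """Stand-in for a host object; runs commands locally, not over SSH."""

    def lh_run(self, command, critical=False):
        proc = subprocess.run(command, shell=True,
                              stdout=subprocess.PIPE,
                              stderr=subprocess.PIPE,
                              universal_newlines=True)
        result = CommandResult(proc.returncode, proc.stdout, proc.stderr)
        if critical and result.cr_exit_status != 0:
            raise CriticalCommandError(
                "command [%s] failed with status %d" %
                (command, result.cr_exit_status))
        return result


host = LocalHost()
ok = host.lh_run("exit 0", critical=True)
failed = host.lh_run("exit 3", critical=False)
print(ok.cr_exit_status, failed.cr_exit_status)
```

A non-critical failure is returned to the caller for inspection, as in the
checkstat call of test_0a, while a critical failure stops the test at once.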
