Lustre / LU-7284

test results marked failed although all sub-tests pass

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Minor
    • None
    • None
    • shaodow
    • 0.5
    • 3

    Description

      The overall status of the test script was marked failed although all sub-tests passed:
      https://testing.hpdd.intel.com/test_sets/bfdb5482-508d-11e5-95a9-5254006e85c2

      Attachments

        Activity

          [LU-7284] test results marked failed although all sub-tests pass
          emoly.liu Emoly Liu added a comment - +1 on master https://testing.hpdd.intel.com/test_sets/e897ff26-0c90-11e8-a6ad-52540065bddc
          heckes Frank Heckes (Inactive) added a comment - edited

          The problem is that 'lnet-selftest' (in the case of https://testing.hpdd.intel.com/test_sets/0273cf04-6d4e-11e5-bf10-5254006e85c2) and
          'ost-pools' (for session https://testing.hpdd.intel.com/test_sets/bfdb5482-508d-11e5-95a9-5254006e85c2) contain clean-up code that will
          cause auster to fail (i.e. exit with a status != 0) if an operation fails. These clean-up commands should be bundled in a test ('cleanup' would
          be a good name) that could FAIL|TIMEOUT|PASS|SKIP. This would make the failure immediately transparent to any user. Otherwise
          the clean-up acts like a 'hidden' test for the suite.
          We think this is a design flaw of the Lustre test suites, especially those mentioned explicitly in the ticket, and it should be solved within
          the Lustre (internal) test framework.
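
          A minimal sketch of what the proposed 'cleanup' sub-test could look like in a suite such as ost-pools.sh is shown below; run_test, error and check_and_cleanup_lustre are existing test-framework.sh helpers, while test_cleanup and its run_test line are hypothetical additions, not current Lustre code.

              # Hypothetical: report suite clean-up as an ordinary sub-test so that a
              # clean-up failure shows up in results.yml instead of only flipping the
              # suite-level status behind the user's back.
              test_cleanup() {
                  check_and_cleanup_lustre || error "suite clean-up failed"
              }
              run_test cleanup "remove sub-test dirs and verify the filesystem is clean"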

          heckes Frank Heckes (Inactive) added a comment - edited

          For the ppc test and the example Nathaniel presented, auster returns 0 for all sub-tests, but clean-up fails at the end of the test suite.
          Although all the information is there, it's cumbersome to dig into the suite log/stdout files for these events.
          It's not clear whether the situation has to (or should) be handled in autotest. I think it's a problem in the Lustre test suite, which should add a 'test' (name == 'cleanup') that could pass, fail or time out. I'll discuss the problem with Charlie.

          utopiabound Nathaniel Clark added a comment - I've just run into this here: https://testing.hpdd.intel.com/test_sets/0273cf04-6d4e-11e5-bf10-5254006e85c2

          heckes Frank Heckes (Inactive) added a comment -

          Okay. I'll check the script and error and will eventually convert the ticket into an LU or LDEV ticket.

          colmstea Charlie Olmstead added a comment -

          The Lustre test framework is responsible for the results.yml file. It might be a correct failure, as there was a failure in cleaning up the ost-pools test:

          06:55:47:Starting test test_complete in suite ost-pools timeout at 1441094147
          06:55:47:== ost-pools test complete, duration 2160 sec == 06:55:40 (1441090540)
          06:55:47:CMD: shadow-35vm3 lctl pool_list lustre
          06:55:47:Pools from lustre:
          06:55:47:rm: cannot remove `/mnt/lustre/d405.sanity-hsm/striped_dir': Directory not empty
          06:55:47:rm: cannot remove `/mnt/lustre/d500.sanity-hsm': Directory not empty
          06:55:47: ost-pools : @@@@@@ FAIL: remove sub-test dirs failed
          06:55:47: Trace dump:
          06:55:47: = /usr/lib64/lustre/tests/test-framework.sh:4748:error_noexit()
          06:55:47: = /usr/lib64/lustre/tests/test-framework.sh:4779:error()
          06:55:47: = /usr/lib64/lustre/tests/test-framework.sh:4293:check_and_cleanup_lustre()
          06:55:47: = /usr/lib64/lustre/tests/ost-pools.sh:1574:main()
          06:55:47:Dumping lctl log to /logdir/test_logs/2015-09-01/lustre-ppc-el6_6-x86_64-vs-lustre-ppc-el6_6-x86_64_ppc64-review-dne-part-2-1_2_1_191_-70079184457860-005147/ost-pools..*.1441090543.log
          06:55:47:CMD: shadow-31.shadow.whamcloud.com,shadow-35vm2,shadow-35vm3,shadow-35vm4,shadow-35vm5,shadow-35vm6,shadow-35vm7 /usr/sbin/lctl dk > /logdir/test_logs/2015-09-01/lustre-ppc-el6_6-x86_64-vs-lustre-ppc-el6_6-x86_64_ppc64-review-dne-part-2-1_2_1_191_-70079184457860-005147/ost-pools..debug_log.\$(hostname -s).1441090543.log;
          06:55:47: dmesg > /logdir/test_logs/2015-09-01/lustre-ppc-el6_6-x86_64-vs-lustre-ppc-el6_6-x86_64_ppc64-review-dne-part-2-1_2_1_191_-70079184457860-005147/ost-pools..dmesg.\$(hostname -s).1441090543.log
          06:55:59:ost-pools returned 0
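
          For reference, a rough sketch (not the actual test-framework.sh source) of the clean-up path the trace above goes through: error() fires only after the last sub-test has already been reported, so it flips the suite-level status in results.yml to FAIL while every per-sub-test entry stays PASS.

              # Illustrative only; the real check_and_cleanup_lustre() in
              # test-framework.sh does considerably more than this.
              check_and_cleanup_lustre() {
                  if is_mounted $MOUNT; then
                      # leftover d405.sanity-hsm/d500.sanity-hsm dirs make this rm fail
                      rm -rf $DIR/[Rdfs][0-9]* ||
                          error "remove sub-test dirs failed"
                  fi
              }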


          jlevi Jodi Levi (Inactive) added a comment -

          We will pull this ticket into the next sprint to complete the analysis, determine how often this is happening, and prioritize the effort to fix it.

          leonel8a Lee Ochoa (Inactive) added a comment - edited

          Frank, I looked at the results.yml file and found:

                  name: ost-pools
                  description: auster ost-pools
                  submission: Tue Sep  1 06:19:40 UTC 2015
                  report_version: 2
                  SubTests:
                  -
                      name: test_1a
                      status: PASS
                      duration: 22
                      return_code: 0
                      error:
          .
          .
          .
                  -
                      name: test_26
                      status: PASS
                      duration: 54
                      return_code: 0
                      error:
                  duration: 2173
                  status: FAIL
          

          So despite all the sub-tests passing, the test set was marked FAIL in the results file; I'm not sure why. I've updated the ticket's component to autotest, though I'm not sure it's AT's issue either; I just know it's further upstream than Maloo.
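
          One way to gauge how often this pattern occurs would be a small script over collected results.yml files. The sketch below is a hypothetical triage helper, not part of any Lustre or Maloo tooling; it assumes the report_version 2 layout quoted above, where the suite-level status line comes after all of the SubTests entries.

              #!/bin/bash
              # Flag results.yml files whose overall status is FAIL although no
              # sub-test reports FAIL or TIMEOUT (the "hidden clean-up failure" case).
              # Usage: ./hidden-failures.sh .../results.yml [more results.yml ...]
              for f in "$@"; do
                  # the last "status:" line in the file is the suite-level status
                  overall=$(awk '/^[[:space:]]*status:/ { s = $2 } END { print s }' "$f")
                  # count FAIL/TIMEOUT among all status lines except that last one
                  failing=$(awk '/^[[:space:]]*status:/ { print $2 }' "$f" |
                            head -n -1 | grep -cE 'FAIL|TIMEOUT')
                  if [ "$overall" = "FAIL" ] && [ "$failing" -eq 0 ]; then
                      echo "$f: suite marked FAIL but no failing sub-test"
                  fi
              done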


          People

            Assignee: wc-triage WC Triage
            Reporter: heckes Frank Heckes (Inactive)
            Votes: 0
            Watchers: 7
