Lustre / LU-4344

Test marked FAIL with "No sub tests failed in this test set"

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major
    • None
    • None
    • patches pushed to git
    • 3
    • 1792

    Description

      Context can be found at LU-764, "test marked 'No sub tests failed in this test set'":

       
      Keith Mannthey added a comment - 11/Mar/13 12:44 PM 
      I have seen a bit of this over the last week or so on master. This is a good example.
      
      https://maloo.whamcloud.com/test_sessions/bf361f32-8919-11e2-b643-52540035b04c
      
      There was one real error at the very end of this test. Other than that, all subtests "passed" even though 3 whole sections are marked FAILED. What logs do you look at to see the cleanup issues previously mentioned? How can we tell if it is a problem with the patch or some autotest anomaly?
      

      Minh Diep added a comment - 21/Mar/13 9:37 AM

      I looked at the log above; sanity actually ran much longer, but Maloo only shows a few hundred seconds:
      == sanity test complete, duration 2584 sec == 14:03:39 (1362866619)

      I don't think this is the same problem. Please file a TT ticket.
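
      For anyone double-checking the same discrepancy, the actual duration is recorded in the suite log itself; a minimal check, assuming suite_log is the log file downloaded from the Maloo session (the filename is a placeholder):

          # Print the end-of-run duration line from a downloaded suite log.
          grep 'test complete, duration' suite_log
          # e.g. == sanity test complete, duration 2584 sec == 14:03:39 (1362866619)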

      
      

      This has happened a bit here and there. It would be nice to always know what failed when something "fails".

      Please update the LU if this is seen as an LU issue.

    Attachments

    Issue Links

    Activity

    [LU-4344] Test marked FAIL with "No sub tests failed in this test set"

            adilger Andreas Dilger added a comment -

            It looks like some tests that are timed out (e.g. sanity-sec test_13 or test_14 due to LU-5847) are being marked FAIL, but checking the set-wide logs (suite_log and/or suite_stdout) shows that the test was still running, just taking too long to complete. The suite_log shows that the output of the test continues even after it is marked for timeout:

            08:50:15:== sanity-sec test 13: test nids == 08:49:23 (1409672963)
            :
            :
            08:50:17:CMD: onyx-35vm3 /usr/sbin/lctl nodemap_test_nid 30.0.0.1
            08:50:17:CMD: onyx-35vm3 /usr/sbin/lctl nodemap_test_nid 30.0.0.2
            :
            :
            09:51:01:CMD: onyx-35vm3 /usr/sbin/lctl nodemap_test_nid 30.7.0.190
            09:51:02:CMD: onyx-35vm3 /usr/sbin/lctl nodemap_test_nid 30.7.0.191
            09:51:02:********** Timeout by autotest system **********
            09:51:03:CMD: onyx-35vm3 /usr/sbin/lctl nodemap_test_nid 30.7.0.192
            09:51:03:CMD: onyx-35vm3 /usr/sbin/lctl nodemap_test_nid 30.7.0.193
            09:51:04:CMD: onyx-35vm3 /usr/sbin/lctl nodemap_test_nid 30.7.0.194
            :
            :
            09:55:56:CMD: onyx-35vm3 /usr/sbin/lctl nodemap_test_nid 30.7.2.27
            09:55:56:CMD: onyx-35vm3 /usr/sbin/lctl nodemap_test_nid 30.7.2.28
            

            In a few cases, I've even seen the test complete and be marked PASS in the logs after it was "timed out". In the case of sanity-sec test_13 and test_14, the tests have been shortened to run in a couple of minutes, but I think the more important issues are:

            • the test should have been marked as TIMEOUT instead of FAIL if a subtest is timed out (see the sketch after this list)
            • if a subtest has started, it should produce an entry for that subtest in the logs, even if it did not complete (TEI-2815)
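
            A minimal sketch of that TIMEOUT/FAIL distinction, assuming each subtest can be launched as a standalone helper; run_one.sh and the status strings here are placeholders, not the actual autotest or test-framework.sh code:

                #!/bin/bash
                # Hypothetical wrapper: run a subtest under a deadline and
                # report TIMEOUT rather than FAIL when the deadline expires.
                run_subtest() {
                    local name=$1 deadline=$2
                    timeout "$deadline" ./run_one.sh "$name"
                    local rc=$?
                    if [ "$rc" -eq 124 ]; then
                        # coreutils timeout(1) exits with 124 on expiry
                        echo "$name: TIMEOUT"
                    elif [ "$rc" -ne 0 ]; then
                        echo "$name: FAIL (rc=$rc)"
                    else
                        echo "$name: PASS"
                    fi
                }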

            adilger Andreas Dilger added a comment -

            Another failure: https://testing.hpdd.intel.com/test_sets/b42e4da6-501b-11e4-8734-5254006e85c2

            It seems that many of the failures I can find happen while some server is being unmounted. Unfortunately, it isn't possible to diagnose this remotely, because these failures do not leave any server console/dmesg/debug logs with the test results. I've filed TEI-2815 to get our test framework to capture more logs in this kind of failure situation; one possible shape for that is sketched below.
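
            A minimal sketch of such a failure hook, assuming the servers are reachable over ssh; collect_server_logs and SERVER_NODES are invented names, and this is not the TEI-2815 implementation:

                # Hypothetical hook: before teardown, pull console/debug logs
                # from each server node so remote diagnosis is possible.
                collect_server_logs() {
                    local dest=$1; shift
                    mkdir -p "$dest"
                    for node in "$@"; do
                        ssh "$node" dmesg > "$dest/$node.dmesg" 2>&1
                        # lctl dk dumps the Lustre kernel debug log
                        ssh "$node" lctl dk > "$dest/$node.debug" 2>&1
                    done
                }
                # SERVER_NODES is assumed to be a space-separated host list
                trap 'collect_server_logs /tmp/failure-logs $SERVER_NODES' ERR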

            dmiter Dmitry Eremin (Inactive) added a comment -

            One more fail: https://testing.hpdd.intel.com/test_sets/7fc6919c-3863-11e4-b7d4-5254006e85c2

            jamesanunez James Nunez (Inactive) added a comment -

            Matt,

            I think that one failed due to TEI-1403.

            ezell Matt Ezell added a comment -

            I don't see why this failed; is this another instance?
            https://maloo.whamcloud.com/test_sets/9bea8fb6-b77d-11e3-98de-52540035b04c

            mjmac Michael MacDonald (Inactive) added a comment -

            More instances?
            https://maloo.whamcloud.com/test_sets/392d7044-a794-11e3-ba84-52540035b04c
            https://maloo.whamcloud.com/test_sets/bb8b2514-a80b-11e3-9505-52540035b04c
            dmiter Dmitry Eremin (Inactive) added a comment -

            One more time it happens: https://maloo.whamcloud.com/test_sets/9618ff94-81e2-11e3-94d9-52540035b04c

            keith Keith Mannthey (Inactive) added a comment -

            Perhaps LU-764 should be reopened?


            chris Chris Gearing (Inactive) added a comment -

            This needs to be investigated in the Lustre code and so needs to be fixed under an LU ticket.


            keith Keith Mannthey (Inactive) added a comment -

            I agree with Chris. Most of the time there is some issue, normally in the unmounting of filesystems or some other cleanup phase of the test.

            Perhaps a setup and cleanup phase for each subtest could catch all these extra issues in a less confusing way; a sketch of the idea follows.
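
            A minimal sketch of that suggestion, assuming hypothetical subtest_setup, run_one, and subtest_cleanup helpers (none of these are current test-framework.sh functions):

                # Hypothetical per-subtest phases: a failure during setup or
                # cleanup is attributed to the subtest instead of surfacing
                # as a bare set-level FAIL with no failed subtests.
                run_with_phases() {
                    local name=$1
                    subtest_setup "$name"   || { echo "$name: SETUP FAIL";   return 1; }
                    run_one "$name"         || { echo "$name: FAIL";         return 1; }
                    subtest_cleanup "$name" || { echo "$name: CLEANUP FAIL"; return 1; }
                    echo "$name: PASS"
                }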


    People

      Assignee: wc-triage WC Triage
      Reporter: keith Keith Mannthey (Inactive)
      Votes: 0
      Watchers: 8

    Dates

      Created:
      Updated:
      Resolved: