[LU-4344] Test marked FAIL with "No sub tests failed in this test set" Created: 21/Mar/13  Updated: 16/Sep/21  Resolved: 16/Sep/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Keith Mannthey (Inactive) Assignee: WC Triage
Resolution: Cannot Reproduce Votes: 0
Labels: support
Environment:

patches pushed to git


Issue Links:
Related
is related to LU-5847 sanity-sec: lctl nodemap_test_nid on ... Resolved
is related to LU-5738 No sub tests failed even though test ... Resolved
Severity: 3
Rank (Obsolete): 1792

 Description   

Context for this LU can be found at LU-764, "test marked 'No sub tests failed in this test set.'":

 
Keith Mannthey added a comment - 11/Mar/13 12:44 PM 
I have seen a bit of this over the last week or so on master. This is a good example.

https://maloo.whamcloud.com/test_sessions/bf361f32-8919-11e2-b643-52540035b04c

There was one real error at the very end of this test. Other than that, all subtests "passed" even though 3 whole sections are marked FAILED. What logs do you look at to see the cleanup issues previously mentioned? How can we tell if it is a problem with the patch or some autotest anomaly?

Minh Diep added a comment - 21/Mar/13 9:37 AM

I looked at the log above; sanity actually ran much longer, but Maloo only shows a few hundred seconds:
== sanity test complete, duration 2584 sec == 14:03:39 (1362866619)

I don't think this is the same problem. Please file a TT ticket.


This has happened a bit here and there. It would be nice to always know what failed when something "fails".

Please update the LU if this is seen as an LU issue.



 Comments   
Comment by Nathaniel Clark [ 25/Apr/13 ]

Some review-zfs runs where all subtests pass, but the test set is marked as failed:
https://maloo.whamcloud.com/test_sets/669e7926-ad38-11e2-bd7c-52540035b04c
https://maloo.whamcloud.com/test_sets/b6c9e15c-ad19-11e2-bd7c-52540035b04c
https://maloo.whamcloud.com/test_sets/424c1ea6-ad2f-11e2-bd7c-52540035b04c
https://maloo.whamcloud.com/test_sets/65d3f9be-ad1d-11e2-bd7c-52540035b04c
https://maloo.whamcloud.com/test_sets/1a56168c-ad38-11e2-9002-52540035b04c

Comment by Nathaniel Clark [ 29/May/13 ]

Another review-zfs run (sanity-quota):
https://maloo.whamcloud.com/test_sets/4704e34c-c83c-11e2-b8c5-52540035b04c

Comment by Keith Mannthey (Inactive) [ 29/May/13 ]

This is a totally broken ZFS run:
https://maloo.whamcloud.com/test_sessions/6fcdc280-c837-11e2-8dd9-52540035b04c

For example on lnetself test:
https://maloo.whamcloud.com/test_sets/9c5751b0-c83a-11e2-8dd9-52540035b04c

Duration is 376 seconds, it is marked FAIL, and 0/0 subtests have run.

There are no lustre-initialization logs to look at, either.

https://maloo.whamcloud.com/test_sessions/9917ca1e-c832-11e2-8dd9-52540035b04c is similar.

Comment by Chris Gearing (Inactive) [ 30/Jul/13 ]

Right, so the problem here is in the Lustre test framework.

The results.yaml file created by the results framework looks like this (abbreviated).

Tests:
-
        name: sanity
        description: auster sanity
        submission: Thu Jul 25 06:26:09 PDT 2013
        report_version: 2
        SubTests:
        -
            name: test_0a
            status: PASS
            duration: 0
            return_code: 0
            error:
        -
            name: test_0b
            status: PASS
            duration: 3
            return_code: 0
            error:
        status: PASS

but sometimes it looks like this

Tests:
-
        name: sanity
        description: auster sanity
        submission: Thu Jul 25 06:26:09 PDT 2013
        report_version: 2
        SubTests:
        -
            name: test_0a
            status: PASS
            duration: 0
            return_code: 0
            error:
        -
            name: test_0b
            status: PASS
            duration: 3
            return_code: 0
            error:
        status: FAIL

Which autotest/maloo faithfully reports.

So the framework needs to be fixed up.

Now, I think we could remove the final status from the yaml and let Maloo decide if the suite passed, but while the result is there, Maloo should report the result it receives.
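
As a rough illustration of deriving the suite status from the SubTests entries rather than trusting the trailing status field, here is a minimal sketch (assuming PyYAML and the abbreviated results.yaml layout above; the file path is hypothetical):

import yaml

# Cross-check the suite-level "status" against its own SubTests entries,
# using the abbreviated results.yaml layout shown above.
with open("results.yaml") as f:                     # hypothetical path
    results = yaml.safe_load(f)

for test in results.get("Tests", []):
    subtests = test.get("SubTests") or []
    derived = "FAIL" if any(s.get("status") != "PASS" for s in subtests) else "PASS"
    if test.get("status") != derived:
        print(f"{test.get('name')}: reported {test.get('status')}, "
              f"but subtests imply {derived}")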

My guess is that it is showing a real issue.

This needs to be fixed in the test framework, so I will change this to a Lustre ticket.

Comment by Keith Mannthey (Inactive) [ 31/Jul/13 ]

I agree with Chris. Most of the time there is some issue, normally in the unmounting of filesystems or some other cleanup phase of the test.

Perhaps a setup and cleanup phase for each subtest could catch all these extra issues in a less confusing way.
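
Roughly, the idea would be to attribute cleanup problems to the subtest that just ran instead of letting them surface only as a suite-level FAIL with no failing subtest. A minimal sketch of that structure (illustrative only; the real Lustre test framework is shell, and the names here are made up):

def run_suite(subtests, setup, cleanup):
    # subtests: list of (name, callable); setup/cleanup: callables.
    results = {}
    for name, body in subtests:
        try:
            setup()
            body()
            status = "PASS"
        except Exception as exc:            # setup or test failure
            status = f"FAIL: {exc}"
        finally:
            try:
                cleanup()
            except Exception as exc:        # cleanup failure now has an owner
                status = f"FAIL (cleanup): {exc}"
        results[name] = status              # no "suite failed, no subtest failed"
    return results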

Comment by Chris Gearing (Inactive) [ 04/Dec/13 ]

This needs to be investigated in the Lustre code and so needs to be fixed under an LU ticket.

Comment by Keith Mannthey (Inactive) [ 04/Dec/13 ]

Perhaps LU-764 should be reopened?

Comment by Dmitry Eremin (Inactive) [ 20/Jan/14 ]

It happened one more time: https://maloo.whamcloud.com/test_sets/9618ff94-81e2-11e3-94d9-52540035b04c

Comment by Michael MacDonald (Inactive) [ 10/Mar/14 ]

More instances?

https://maloo.whamcloud.com/test_sets/392d7044-a794-11e3-ba84-52540035b04c
https://maloo.whamcloud.com/test_sets/bb8b2514-a80b-11e3-9505-52540035b04c

Comment by Matt Ezell [ 29/Mar/14 ]

I don't see why this failed; is this another instance?
https://maloo.whamcloud.com/test_sets/9bea8fb6-b77d-11e3-98de-52540035b04c

Comment by James Nunez (Inactive) [ 31/Mar/14 ]

Matt,

I think that one failed due to TEI-1403.

Comment by Dmitry Eremin (Inactive) [ 10/Sep/14 ]

One more failure: https://testing.hpdd.intel.com/test_sets/7fc6919c-3863-11e4-b7d4-5254006e85c2

Comment by Andreas Dilger [ 14/Oct/14 ]

Another failure https://testing.hpdd.intel.com/test_sets/b42e4da6-501b-11e4-8734-5254006e85c2

It seems like many of these failures that I can find happen when some server is being unmounted. Unfortunately, it isn't possible to diagnose this remotely, because these failures do not leave any server console/dmesg/debug logs with the test results. I've filed TEI-2815 to try to get our test framework to capture more logs in this kind of failure situation.

Comment by Andreas Dilger [ 03/Nov/14 ]

It looks like some tests that time out (e.g. sanity-sec test_13 or test_14 due to LU-5847) are being marked FAIL, but checking the set-wide logs (suite_log and/or suite_stdout) shows that the test was still running, just taking too long to complete. The suite_log shows that the output of the test continues even after it is marked for timeout:

08:50:15:== sanity-sec test 13: test nids == 08:49:23 (1409672963)
:
:
08:50:17:CMD: onyx-35vm3 /usr/sbin/lctl nodemap_test_nid 30.0.0.1
08:50:17:CMD: onyx-35vm3 /usr/sbin/lctl nodemap_test_nid 30.0.0.2
:
:
09:51:01:CMD: onyx-35vm3 /usr/sbin/lctl nodemap_test_nid 30.7.0.190
09:51:02:CMD: onyx-35vm3 /usr/sbin/lctl nodemap_test_nid 30.7.0.191
09:51:02:********** Timeout by autotest system **********
09:51:03:CMD: onyx-35vm3 /usr/sbin/lctl nodemap_test_nid 30.7.0.192
09:51:03:CMD: onyx-35vm3 /usr/sbin/lctl nodemap_test_nid 30.7.0.193
09:51:04:CMD: onyx-35vm3 /usr/sbin/lctl nodemap_test_nid 30.7.0.194
:
:
09:55:56:CMD: onyx-35vm3 /usr/sbin/lctl nodemap_test_nid 30.7.2.27
09:55:56:CMD: onyx-35vm3 /usr/sbin/lctl nodemap_test_nid 30.7.2.28

In a few cases, I've even seen the test complete and be marked PASS in the logs after it was "timed out". In the case of sanity-sec test_13 and test_14, the tests have been shortened to run in a couple of minutes, but I think the more important issues are:

  • the test should have been marked as TIMEOUT instead of FAIL if a subtest is timed out
  • if a subtest has started it should produce an entry for that subtest in the logs, even if it did not complete (TEI-2815)
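
For what it's worth, the kind of check described above (figuring out which subtest was running when autotest injected the timeout) can be done against a downloaded suite_log. A minimal sketch, assuming the log layout in the excerpt above; the marker strings are taken from that excerpt:

import re
import sys

# Track "== <suite> test <n>: ..." start markers and report which subtest
# was in progress when the "Timeout by autotest system" banner appears.
start_re = re.compile(r"== (\S+) test (\S+):")
timeout_re = re.compile(r"Timeout by autotest system")

current = None
with open(sys.argv[1]) as log:                     # path to a saved suite_log
    for line in log:
        m = start_re.search(line)
        if m:
            current = f"{m.group(1)} test_{m.group(2)}"
        elif timeout_re.search(line):
            print(f"autotest timeout hit while {current or 'no subtest'} was running")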