[LU-4344] Test marked FAIL with "No sub tests failed in this test set" Created: 21/Mar/13 Updated: 16/Sep/21 Resolved: 16/Sep/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Keith Mannthey (Inactive) | Assignee: | WC Triage |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | support |
| Environment: | patches pushed to git |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 1792 |
| Description |
|
LU context can be found at:

Keith Mannthey added a comment - 11/Mar/13 12:44 PM
I have seen a bit of this over the last week or so on master. This is a good example: https://maloo.whamcloud.com/test_sessions/bf361f32-8919-11e2-b643-52540035b04c
There was one real error at the very end of this test. Other than that, all subtests "passed" even though 3 whole sections are marked FAILED. What logs do you look at to see the cleanup issues previously mentioned? How can we tell if it is a problem with the patch or some autotest anomaly?

Minh Diep added a comment - 21/Mar/13 9:37 AM
I looked at the log above; sanity actually ran much longer, but Maloo only shows a few hundred seconds. I don't think this is the same problem. Please file a TT ticket.

This has happened a bit here and there. It would be nice to always know what failed when something "fails". Please update the LU if this is seen as an LU issue. |
| Comments |
| Comment by Nathaniel Clark [ 25/Apr/13 ] |
|
Some review-zfs runs where all subtests pass, but the test set is marked as failed: |
| Comment by Nathaniel Clark [ 29/May/13 ] |
|
Another review-zfs run (sanity-quota): |
| Comment by Keith Mannthey (Inactive) [ 29/May/13 ] |
|
This is a totally broken ZFS run. For example, in the lnet-selftest test set: the duration is 376 seconds, it is marked FAIL, and 0/0 subtests have run. There are no lustre-initialization logs to look at either. https://maloo.whamcloud.com/test_sessions/9917ca1e-c832-11e2-8dd9-52540035b04c is similar. |
| Comment by Chris Gearing (Inactive) [ 30/Jul/13 ] |
|
Right, so the problem here is in the Lustre test framework. The results.yaml file created by the results framework normally looks like this (abbreviated):

Tests:
-
    name: sanity
    description: auster sanity
    submission: Thu Jul 25 06:26:09 PDT 2013
    report_version: 2
    SubTests:
    -
        name: test_0a
        status: PASS
        duration: 0
        return_code: 0
        error:
    -
        name: test_0b
        status: PASS
        duration: 3
        return_code: 0
        error:
    status: PASS
but sometimes it looks like this:

Tests:
-
    name: sanity
    description: auster sanity
    submission: Thu Jul 25 06:26:09 PDT 2013
    report_version: 2
    SubTests:
    -
        name: test_0a
        status: PASS
        duration: 0
        return_code: 0
        error:
    -
        name: test_0b
        status: PASS
        duration: 3
        return_code: 0
        error:
    status: FAIL
Which autotest/Maloo faithfully reports. So the framework needs to be fixed up. Now, I think we could remove the final status from the yaml and let Maloo decide whether the suite passed, but while the result is there Maloo should report the result it receives. My guess is that it is showing a real issue. This needs to be fixed in the test framework, so I will change this to a Lustre ticket. |
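As a rough illustration of the idea of deriving the suite result from the subtests rather than trusting the top-level status field, here is a minimal sketch; it is not part of Maloo or the Lustre test framework, and the script name, file path and function names are hypothetical:

# check_results.py - minimal sketch: derive a suite status from the
# per-subtest entries in a results.yaml-style file and compare it with
# the recorded top-level status, flagging mismatches like the one above.
# Hypothetical helper, not part of Maloo or the test framework.
import sys
import yaml  # assumes PyYAML is installed

def derived_status(test):
    """A suite passes only if every reported subtest passed."""
    subtests = test.get("SubTests") or []
    if not subtests:
        return "FAIL"  # nothing ran: treat like the "0/0 subtests" case
    return "PASS" if all(s.get("status") == "PASS" for s in subtests) else "FAIL"

def main(path):
    with open(path) as f:
        results = yaml.safe_load(f)
    for test in results.get("Tests", []):
        recorded = test.get("status")
        derived = derived_status(test)
        flag = "" if recorded == derived else "  <-- mismatch"
        print(f"{test.get('name')}: recorded={recorded} derived={derived}{flag}")

if __name__ == "__main__":
    main(sys.argv[1] if len(sys.argv) > 1 else "results.yml")

Run against the second yaml example above, this would report sanity as recorded=FAIL, derived=PASS, which is exactly the confusing combination this ticket describes.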
| Comment by Keith Mannthey (Inactive) [ 31/Jul/13 ] |
|
I agree with Chris. Most of the time there is some issue, normally in the unmounting of filesystems or some other cleanup phase of the test. Perhaps a setup and cleanup phase for each subtest could catch all these extra issues in a less confusing way. |
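To make the suggestion concrete, here is a small sketch of a per-subtest setup/cleanup wrapper. It is purely illustrative: the function names and structure are hypothetical and are not tied to the actual bash test-framework; the point is only that a cleanup check run after each subtest would attribute leftover-mount or other cleanup problems to a specific subtest instead of surfacing them as an unexplained suite-level FAIL.

# Hypothetical per-subtest wrapper, for illustration only.
def run_with_cleanup_check(subtests, setup, cleanup_check):
    """Run each (name, fn) subtest between a setup step and a cleanup check."""
    results = []
    for name, fn in subtests:
        setup()
        try:
            fn()
            status = "PASS"
        except Exception:
            status = "FAIL"
        # Attribute cleanup problems (e.g. an unmount that fails or leaves
        # state behind) to this subtest rather than to the whole suite.
        if status == "PASS" and not cleanup_check():
            status = "FAIL (cleanup)"
        results.append((name, status))
    return results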
| Comment by Chris Gearing (Inactive) [ 04/Dec/13 ] |
|
This needs to be investigated in the Lustre code and so needs to be fixed under an LU ticket. |
| Comment by Keith Mannthey (Inactive) [ 04/Dec/13 ] |
|
Perhaps |
| Comment by Dmitry Eremin (Inactive) [ 20/Jan/14 ] |
|
One more time it happens: https://maloo.whamcloud.com/test_sets/9618ff94-81e2-11e3-94d9-52540035b04c |
| Comment by Michael MacDonald (Inactive) [ 10/Mar/14 ] |
|
More instances? https://maloo.whamcloud.com/test_sets/392d7044-a794-11e3-ba84-52540035b04c |
| Comment by Matt Ezell [ 29/Mar/14 ] |
|
I don't see why this failed. Is this another instance? |
| Comment by James Nunez (Inactive) [ 31/Mar/14 ] |
|
Matt, I think that one failed due to TEI-1403. |
| Comment by Dmitry Eremin (Inactive) [ 10/Sep/14 ] |
|
One more failure: https://testing.hpdd.intel.com/test_sets/7fc6919c-3863-11e4-b7d4-5254006e85c2 |
| Comment by Andreas Dilger [ 14/Oct/14 ] |
|
Another failure: https://testing.hpdd.intel.com/test_sets/b42e4da6-501b-11e4-8734-5254006e85c2

It seems like many of these failures that I can find happen when some server is being unmounted. Unfortunately, it isn't possible to diagnose this remotely because these failures do not leave any server console/dmesg/debug logs with the test results. I've filed TEI-2815 to try to get our test framework to capture more logs in this kind of failure situation. |
| Comment by Andreas Dilger [ 03/Nov/14 ] |
|
It looks like some tests are timed out (e.g. sanity-sec test_13 or test_14) due to slowness, as in:

08:50:15:== sanity-sec test 13: test nids == 08:49:23 (1409672963)
:
:
08:50:17:CMD: onyx-35vm3 /usr/sbin/lctl nodemap_test_nid 30.0.0.1
08:50:17:CMD: onyx-35vm3 /usr/sbin/lctl nodemap_test_nid 30.0.0.2
:
:
09:51:01:CMD: onyx-35vm3 /usr/sbin/lctl nodemap_test_nid 30.7.0.190
09:51:02:CMD: onyx-35vm3 /usr/sbin/lctl nodemap_test_nid 30.7.0.191
09:51:02:********** Timeout by autotest system **********
09:51:03:CMD: onyx-35vm3 /usr/sbin/lctl nodemap_test_nid 30.7.0.192
09:51:03:CMD: onyx-35vm3 /usr/sbin/lctl nodemap_test_nid 30.7.0.193
09:51:04:CMD: onyx-35vm3 /usr/sbin/lctl nodemap_test_nid 30.7.0.194
:
:
09:55:56:CMD: onyx-35vm3 /usr/sbin/lctl nodemap_test_nid 30.7.2.27
09:55:56:CMD: onyx-35vm3 /usr/sbin/lctl nodemap_test_nid 30.7.2.28

In a few cases, I've even seen the test complete and be marked PASS in the logs after it was "timed out". In the case of sanity-sec test_13 and test_14, the tests have been shortened to run in a couple of minutes, but I think the more important issues are:
|