[LU-1897] Failure on test suite replay-single, test_70b: dbench not found on some of the test nodes Created: 11/Sep/12 Updated: 24/Dec/15 Resolved: 15/Dec/15 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0, Lustre 2.1.6, Lustre 2.8.0 |
| Fix Version/s: | Lustre 2.4.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Maloo | Assignee: | Keith Mannthey (Inactive) |
| Resolution: | Cannot Reproduce | Votes: | 1 |
| Labels: | MB, mn1 | ||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 5743 | ||||||||
| Description |
|
This issue was created by maloo for jay <jay@whamcloud.com>. This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/c4620b6c-fc31-11e1-a4a6-52540035b04c. The sub-test test_70b failed with the following error:
Info required for matching: replay-single 70b |
| Comments |
| Comment by Ian Colle (Inactive) [ 05/Oct/12 ] |
|
https://maloo.whamcloud.com/test_sets/8702cb68-0eeb-11e2-8f7c-52540035b04c |
| Comment by nasf (Inactive) [ 04/Nov/12 ] |
|
Another failure instance: https://maloo.whamcloud.com/test_sets/2e20ce44-26da-11e2-b938-52540035b04c |
| Comment by Sarah Liu [ 07/Nov/12 ] |
|
Another failure: https://maloo.whamcloud.com/test_sets/c2d37650-2819-11e2-aa14-52540035b04c
rundbench load on fat-intel-1vm2,fat-intel-1vm6.lab.whamcloud.com failed! |
| Comment by Bob Glossman (Inactive) [ 19/Nov/12 ] |
|
also seen in: client-10vm2: dbench: no process found |
| Comment by Peter Jones [ 03/Dec/12 ] |
|
This is happening quite often https://maloo.whamcloud.com/test_sets/d11835ae-3c71-11e2-82b5-52540035b04c |
| Comment by Peter Jones [ 03/Dec/12 ] |
|
https://maloo.whamcloud.com/test_sets/query?utf8=✓&test_set%5Btest_set_script_id%5D=f6a12204-32c3-11e0-a61c-52540025f9ae&test_set%5Bstatus%5D=FAIL&test_set%5Bquery_bugs%5D=&test_session%5Btest_host%5D=&test_session%5Btest_group%5D=&test_session%5Buser_id%5D=&test_session%5Bquery_date%5D=&test_session%5Bquery_recent_period%5D=&test_node%5Bos_type_id%5D=&test_node%5Bdistribution_type_id%5D=&test_node%5Barchitecture_type_id%5D=&test_node%5Bfile_system_type_id%5D=&test_node%5Blustre_branch_id%5D=24a6947e-04a9-11e1-bb5f-52540025f9af&test_node_network%5Bnetwork_type_id%5D=&commit=Update+results |
| Comment by Peter Jones [ 03/Dec/12 ] |
|
Hi Chris Is this a TT issue? Peter |
| Comment by Bob Glossman (Inactive) [ 07/Dec/12 ] |
|
https://maloo.whamcloud.com/test_sets/c2c1a95c-405b-11e2-a16b-52540035b04c
The test log looks very odd. Although it reports
replay-single test_70b: @@@@@@ FAIL: dbench not found on some of client-10vm1.lab.whamcloud.com,client-10vm2!
the log has output of dbench actually running just a few lines before the error report. |
| Comment by Mikhail Pershin [ 08/Dec/12 ] |
|
It looks like dbench ends right before the test checks that it is running. |
| Comment by nasf (Inactive) [ 28/Dec/12 ] |
|
Another failure instance: https://maloo.whamcloud.com/test_sets/19e6b2fe-513c-11e2-b56e-52540035b04c |
| Comment by Bob Glossman (Inactive) [ 07/Jan/13 ] |
|
another instance: |
| Comment by Bob Glossman (Inactive) [ 08/Jan/13 ] |
|
another instance: |
| Comment by Keith Mannthey (Inactive) [ 08/Jan/13 ] |
|
So Bob and I looked into the problem a bit today. In general, dbench is just a background load for the test and, as mentioned, dbench is ending before the test finishes. I observed dbench finishing at 62 s, while the test finished and checked in on it at a little under 90 s. I have submitted a patch to extend the dbench duration a bit to see if it helps the issue. The test could stand to be rewritten to make the background load a non-issue. http://review.whamcloud.com/4973 is a simple patch that adds a comment and increases a duration, which should allow dbench to run longer. |
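The timing problem described in this comment can be reproduced in miniature. The following is only a sketch under stated assumptions: `sleep` stands in for dbench, `run_with_background_load` is a hypothetical name that does not come from replay-single.sh, and the durations are scaled down from the 62 s and ~90 s observed above.

```shell
# Sketch only: `sleep` plays the role of the dbench background load.
# run_with_background_load is a hypothetical helper, not part of the
# Lustre test framework.
run_with_background_load() {
    local load_duration=$1   # how long the background load runs (seconds)
    local test_duration=$2   # how long the foreground test takes (seconds)

    sleep "$load_duration" &
    local load_pid=$!

    sleep "$test_duration"   # the foreground test work would happen here

    # kill -0 delivers no signal; it only checks that the process exists.
    if kill -0 "$load_pid" 2>/dev/null; then
        kill "$load_pid" 2>/dev/null
        wait "$load_pid" 2>/dev/null || true
        echo "load still running"
        return 0
    fi
    echo "load ended before the check"
    return 1
}
```

With a load shorter than the test (`run_with_background_load 1 2`) the final check fails, which mirrors the failure here; with a longer load (`run_with_background_load 5 1`) it passes, which is the effect the duration increase aims for.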
| Comment by Prakash Surya (Inactive) [ 14/Jan/13 ] |
|
Looks like I suffered from this over the weekend: https://maloo.whamcloud.com/test_sessions/b2e3bc58-5c73-11e2-ab3b-52540035b04c |
| Comment by Andreas Dilger [ 17/Jan/13 ] |
|
It might be nice as part of this fix to change the error message to "dbench is no longer running on some of the test nodes", which would not give the false impression that it is not installed on the test nodes... |
| Comment by Keith Mannthey (Inactive) [ 17/Jan/13 ] |
|
I will change the comment and resubmit the patch. |
| Comment by Keith Mannthey (Inactive) [ 31/Jan/13 ] |
|
http://review.whamcloud.com/4973 has been merged. Please reopen if the problem still occurs. |
| Comment by Nathaniel Clark [ 18/Mar/13 ] |
|
I think this failed here: |
| Comment by Keith Mannthey (Inactive) [ 18/Mar/13 ] |
|
It looks like dbench didn't start on client-26vm5 fast enough for the first check? 12 seconds isn't long enough to start dbench on a remote client... Fun. I will submit a patch. This is the console from the main client.
15:52:52:Lustre: DEBUG MARKER: /usr/sbin/lctl mark Started rundbench load pid=1931 ...
15:52:52:Lustre: DEBUG MARKER: Started rundbench load pid=1931 ...
15:53:03:Lustre: DEBUG MARKER: killall -0 dbench
15:53:04:Lustre: DEBUG MARKER: /usr/sbin/lctl mark replay-single test_70b: @@@@@@ FAIL: dbench not running on some of client-26vm5,client-26vm6.lab.whamcloud.com!
In the autotest log you see dbench start some time after this on the other client. It does run, just not at the right time. |
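One way to avoid checking before the load has started is to poll for it with a timeout instead of checking once after a fixed delay. This is a minimal sketch, not the change made in the patch: `wait_for_load` is a hypothetical helper name, and the 60-second timeout in the usage line is an arbitrary assumption.

```shell
# Sketch only: poll until the given command succeeds or the timeout expires.
# wait_for_load is a hypothetical helper, not part of the Lustre test framework.
wait_for_load() {
    local timeout=$1   # maximum number of seconds to wait
    shift              # the remaining arguments are the check command
    local elapsed=0

    while ! "$@" >/dev/null 2>&1; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1   # the load never showed up within the timeout
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}

# Possible usage on a single node (the timeout value is an assumption):
# wait_for_load 60 killall -0 dbench || echo "dbench did not start in time"
```

Polling makes the first check independent of how long a remote client takes to launch dbench, at the cost of a slower failure when the load never starts.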
| Comment by Keith Mannthey (Inactive) [ 19/Mar/13 ] |
|
http://review.whamcloud.com/5761 version 1 of the patch has been submitted. |
| Comment by Peter Jones [ 26/Mar/13 ] |
|
Landed for 2.4 |
| Comment by Jian Yu [ 18/Apr/14 ] |
|
The issue also exists on Lustre b2_1 branch: |
| Comment by Saurabh Tandan (Inactive) [ 10/Dec/15 ] |
|
master, build# 3264, 2.7.64 tag |
| Comment by Saurabh Tandan (Inactive) [ 10/Dec/15 ] |
|
Instance found in recent tags 2.7.63, 2.7.64 |
| Comment by Saurabh Tandan (Inactive) [ 11/Dec/15 ] |
|
master, build# 3264, 2.7.64 tag |
| Comment by Saurabh Tandan (Inactive) [ 15/Dec/15 ] |
|
master, build# 3266, 2.7.64 tag |
| Comment by Sarah Liu [ 15/Dec/15 ] |
|
Closing this ticket, as the recent instances on master (tag-2.7.63 and tag-2.7.64) are caused by "No space left on device":
shadow-52vm6: [769] open ./clients/client0/~dmtmp/PWRPNT/PPTC112.TMP failed for handle 10033 (No space left on device)
shadow-52vm6: (770) ERROR: handle 10033 was not found
shadow-52vm6: Child failed with status 1
shadow-52vm6: touch: missing file operand
shadow-52vm6: Try 'touch --help' for more information.
shadow-52vm6: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 21 sec
shadow-52vm1: 1 733 0.17 MB/sec warmup 124 sec latency 53560.001 ms
shadow-52vm5: 1 791 0.17 MB/sec warmup 124 sec latency 54097.359 ms
shadow-52vm1: 1 733 0.17 MB/sec warmup 125 sec latency 1340.522 ms
shadow-52vm5: 1 794 0.17 MB/sec warmup 125 sec latency 54380.844 ms
CMD: shadow-52vm1.shadow.whamcloud.com,shadow-52vm5,shadow-52vm6 killall -0 dbench
shadow-52vm6: dbench: no process found
replay-single test_70b: @@@@@@ FAIL: dbench stopped on some of shadow-52vm1.shadow.whamcloud.com,shadow-52vm5,shadow-52vm6!
shadow-52vm1: 1 733 0.17 MB/sec warmup 126 sec latency 2340.713 ms
shadow-52vm5: 1 794 0.17 MB/sec warmup 126 sec latency 1715.056 ms |
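Because the same "dbench stopped" failure can come from different causes, a quick log triage helps separate the out-of-space cases from the timing cases earlier in this ticket. The sketch below is illustrative only: `classify_dbench_failure` and its output labels are hypothetical, and it matches only the two message strings that appear in the log above.

```shell
# Sketch only: classify a test log read from stdin by the messages it contains.
# classify_dbench_failure and its labels are hypothetical names.
classify_dbench_failure() {
    local log
    log=$(cat)

    if printf '%s\n' "$log" | grep -q 'No space left on device'; then
        echo "enospc"          # dbench died because the filesystem filled up
    elif printf '%s\n' "$log" | grep -q 'dbench: no process found'; then
        echo "dbench-exited"   # dbench was gone for some other reason
    else
        echo "unknown"
    fi
}

# Possible usage (the log file name is an assumption):
# classify_dbench_failure < test_log.txt
```

The out-of-space check comes first because, as in the log above, a run that hits "No space left on device" also goes on to report "dbench: no process found".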
| Comment by Andreas Dilger [ 24/Dec/15 ] |
|
Sarah, |
| Comment by Jian Yu [ 24/Dec/15 ] |
|
Hi Andreas,
Yes, it's LU-4846. And the original issue in this ticket is not related to "No space left on device", so we do not need to link this ticket to LU-4846. |