[LU-1897] Failure on test suite replay-single, test_70b: dbench not found on some of the test nodes Created: 11/Sep/12  Updated: 24/Dec/15  Resolved: 15/Dec/15

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0, Lustre 2.1.6, Lustre 2.8.0
Fix Version/s: Lustre 2.4.0

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: Keith Mannthey (Inactive)
Resolution: Cannot Reproduce Votes: 1
Labels: MB, mn1

Issue Links:
Duplicate
is duplicated by LU-2120 Test failure on test suite replay-sin... Resolved
Severity: 3
Rank (Obsolete): 5743

 Description   

This issue was created by maloo for jay <jay@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/c4620b6c-fc31-11e1-a4a6-52540035b04c.

The sub-test test_70b failed with the following error:

dbench not found on some of client-28vm1,client-28vm2.lab.whamcloud.com !

Info required for matching: replay-single 70b



 Comments   
Comment by Ian Colle (Inactive) [ 05/Oct/12 ]

https://maloo.whamcloud.com/test_sets/8702cb68-0eeb-11e2-8f7c-52540035b04c

Comment by nasf (Inactive) [ 04/Nov/12 ]

Another failure instance:

https://maloo.whamcloud.com/test_sets/2e20ce44-26da-11e2-b938-52540035b04c
https://maloo.whamcloud.com/sub_tests/51c6d29e-26da-11e2-b938-52540035b04c

Comment by Sarah Liu [ 07/Nov/12 ]

another failure: https://maloo.whamcloud.com/test_sets/c2d37650-2819-11e2-aa14-52540035b04c

rundbench load on fat-intel-1vm2,fat-intel-1vm6.lab.whamcloud.com failed!

Comment by Bob Glossman (Inactive) [ 19/Nov/12 ]

also seen in:
https://maloo.whamcloud.com/sub_tests/e0d319ee-3093-11e2-9075-52540035b04c

client-10vm2: dbench: no process found
replay-single test_70b: @@@@@@ FAIL: dbench not found on some of client-10vm1,client-10vm2!

Comment by Peter Jones [ 03/Dec/12 ]

This is happening quite often https://maloo.whamcloud.com/test_sets/d11835ae-3c71-11e2-82b5-52540035b04c

Comment by Peter Jones [ 03/Dec/12 ]

https://maloo.whamcloud.com/test_sets/query?utf8=✓&test_set%5Btest_set_script_id%5D=f6a12204-32c3-11e0-a61c-52540025f9ae&test_set%5Bstatus%5D=FAIL&test_set%5Bquery_bugs%5D=&test_session%5Btest_host%5D=&test_session%5Btest_group%5D=&test_session%5Buser_id%5D=&test_session%5Bquery_date%5D=&test_session%5Bquery_recent_period%5D=&test_node%5Bos_type_id%5D=&test_node%5Bdistribution_type_id%5D=&test_node%5Barchitecture_type_id%5D=&test_node%5Bfile_system_type_id%5D=&test_node%5Blustre_branch_id%5D=24a6947e-04a9-11e1-bb5f-52540025f9af&test_node_network%5Bnetwork_type_id%5D=&commit=Update+results

Comment by Peter Jones [ 03/Dec/12 ]

Hi Chris

Is this a TT issue?

Peter

Comment by Bob Glossman (Inactive) [ 07/Dec/12 ]

https://maloo.whamcloud.com/test_sets/c2c1a95c-405b-11e2-a16b-52540035b04c

test log looks very odd. Although it reports

replay-single test_70b: @@@@@@ FAIL: dbench not found on some of client-10vm1.lab.whamcloud.com,client-10vm2!

the log has output of dbench actually running just a few lines before the error report.

Comment by Mikhail Pershin [ 08/Dec/12 ]

It looks like dbench ends right before test checks that it is running.

Comment by nasf (Inactive) [ 28/Dec/12 ]

Another failure instance:

https://maloo.whamcloud.com/test_sets/19e6b2fe-513c-11e2-b56e-52540035b04c

Comment by Bob Glossman (Inactive) [ 07/Jan/13 ]

another instance:
https://maloo.whamcloud.com/test_sets/927c0d00-5717-11e2-8b17-52540035b04c

Comment by Bob Glossman (Inactive) [ 08/Jan/13 ]

another instance:
https://maloo.whamcloud.com/test_sets/72d446e2-5943-11e2-9b54-52540035b04c

Comment by Keith Mannthey (Inactive) [ 08/Jan/13 ]

So Bob and I looked into the problem a bit today.

In general dbench is just a background load for the test and, as mentioned, dbench is ending before the test finishes.

I observed dbench finishing at 62s while the test was still running and checked in on it at a little under 90s.

I have submitted a patch that moves the dbench duration out a bit to see if it helps the issue. The test could stand to be rewritten to make the background load a non-issue.

http://review.whamcloud.com/4973 is a simple patch that adds a comment and increases a duration, which should allow dbench to run longer.
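A minimal sketch of the idea behind the patch (not the landed change itself; the variable names and the 2x margin below are illustrative): size the dbench run so it outlives every liveness check the test performs, rather than ending mid-test. dbench's `-t` option sets its runtime in seconds.

```shell
# Illustrative only -- not the actual test-framework.sh change.
# Observed: the test was still probing the load at a little under 90s,
# while dbench finished at 62s. Sizing the load run with a margin over
# the expected test window avoids that race.

test_duration=90          # roughly how long test_70b keeps probing
margin=2                  # hypothetical safety factor
dbench_time=$((test_duration * margin))

echo "dbench would be started with -t $dbench_time"
```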

Comment by Prakash Surya (Inactive) [ 14/Jan/13 ]

Looks like I suffered from this over the weekend: https://maloo.whamcloud.com/test_sessions/b2e3bc58-5c73-11e2-ab3b-52540035b04c

Comment by Andreas Dilger [ 17/Jan/13 ]

It might be nice as part of this fix to change the error message to "dbench is no longer running on some of the test nodes", which would not give the false impression that it is not installed on the test nodes...

Comment by Keith Mannthey (Inactive) [ 17/Jan/13 ]

I will change the comment and resubmit the patch.

Comment by Keith Mannthey (Inactive) [ 31/Jan/13 ]

http://review.whamcloud.com/4973 has been merged. Please reopen if the problem still occurs.

Comment by Nathaniel Clark [ 18/Mar/13 ]

I think this failed here:
https://maloo.whamcloud.com/test_sets/43757b76-8eb0-11e2-81eb-52540035b04c

Comment by Keith Mannthey (Inactive) [ 18/Mar/13 ]

It looks like dbench didn't start on client-26vm5 fast enough for the first check?
https://maloo.whamcloud.com/test_logs/6cc6de7a-8eb0-11e2-81eb-52540035b04c

12 seconds isn't long enough to start dbench on a remote client... Fun. I will submit a patch.

This is the console from the main client.

15:52:52:Lustre: DEBUG MARKER: /usr/sbin/lctl mark Started rundbench load pid=1931 ...
15:52:52:Lustre: DEBUG MARKER: Started rundbench load pid=1931 ...
15:53:03:Lustre: DEBUG MARKER: killall -0 dbench
15:53:04:Lustre: DEBUG MARKER: /usr/sbin/lctl mark  replay-single test_70b: @@@@@@ FAIL: dbench not running on some of client-26vm5,client-26vm6.lab.whamcloud.com! 

In the autotest log you can see dbench start on the other client some time after this. It does run, just not at the right time.
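The startup race described above can be sketched as a poll-until-started loop: instead of firing a single `killall -0 dbench` a fixed 12 seconds after launch, retry the liveness probe until a deadline. This is a hedged sketch, not the actual test-framework.sh code; `probe` here is a local stand-in for the real per-node check (something like `do_nodes $clients "killall -0 dbench"`).

```shell
# Retry a liveness probe until it succeeds or a timeout (seconds) expires.
wait_for_load_started() {
    local probe=$1 timeout=${2:-60} waited=0
    until $probe; do
        [ "$waited" -ge "$timeout" ] && return 1   # gave up: load never started
        sleep 1
        waited=$((waited + 1))
    done
    return 0                                        # load confirmed running
}

# Demo: simulate a slow dbench startup with a marker file.
marker=$(mktemp -u)
probe() { [ -f "$marker" ]; }

( sleep 2; touch "$marker" ) &                      # "dbench" appears after 2s
if wait_for_load_started probe 10; then
    result="load running"
else
    result="load missing"
fi
rm -f "$marker"
echo "$result"
```

With the single fixed-delay check, the 2-second "startup" above would have been fine, but a slower one would fail exactly as seen on client-26vm5.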

Comment by Keith Mannthey (Inactive) [ 19/Mar/13 ]

http://review.whamcloud.com/5761 version 1 of the patch has been submitted.

Comment by Peter Jones [ 26/Mar/13 ]

Landed for 2.4

Comment by Jian Yu [ 18/Apr/14 ]

The issue also exists on Lustre b2_1 branch:
https://maloo.whamcloud.com/test_sets/a0806aa0-c5ce-11e3-9255-52540035b04c

Comment by Saurabh Tandan (Inactive) [ 10/Dec/15 ]

master, build# 3264, 2.7.64 tag
Hard Failover: EL6.7 Server/Client
https://testing.hpdd.intel.com/test_sets/80a20678-9edd-11e5-87a9-5254006e85c2

Comment by Saurabh Tandan (Inactive) [ 10/Dec/15 ]

Instance found in recent tags 2.7.63, 2.7.64

Comment by Saurabh Tandan (Inactive) [ 11/Dec/15 ]

master, build# 3264, 2.7.64 tag
Hard Failover: EL7 Server/Client - ZFS
https://testing.hpdd.intel.com/test_sets/5ba6d7bc-9e20-11e5-91b0-5254006e85c2

Comment by Saurabh Tandan (Inactive) [ 15/Dec/15 ]

master, build# 3266, 2.7.64 tag
Hard Failover:EL7 Server/SLES11 SP3 Client
https://testing.hpdd.intel.com/test_sets/a8b3fb9e-a077-11e5-8d69-5254006e85c2

Comment by Sarah Liu [ 15/Dec/15 ]

Closing this ticket, as the recent instances on master (tag 2.7.63 and tag 2.7.64) are caused by "no space left on device" errors:

shadow-52vm6: [769] open ./clients/client0/~dmtmp/PWRPNT/PPTC112.TMP failed for handle 10033 (No space left on device)
shadow-52vm6: (770) ERROR: handle 10033 was not found
shadow-52vm6: Child failed with status 1
shadow-52vm6: touch: missing file operand
shadow-52vm6: Try 'touch --help' for more information.
shadow-52vm6: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 21 sec
shadow-52vm1:    1       733     0.17 MB/sec  warmup 124 sec  latency 53560.001 ms
shadow-52vm5:    1       791     0.17 MB/sec  warmup 124 sec  latency 54097.359 ms
shadow-52vm1:    1       733     0.17 MB/sec  warmup 125 sec  latency 1340.522 ms
shadow-52vm5:    1       794     0.17 MB/sec  warmup 125 sec  latency 54380.844 ms
CMD: shadow-52vm1.shadow.whamcloud.com,shadow-52vm5,shadow-52vm6 killall -0 dbench
shadow-52vm6: dbench: no process found
 replay-single test_70b: @@@@@@ FAIL: dbench stopped on some of shadow-52vm1.shadow.whamcloud.com,shadow-52vm5,shadow-52vm6! 
shadow-52vm1:    1       733     0.17 MB/sec  warmup 126 sec  latency 2340.713 ms
shadow-52vm5:    1       794     0.17 MB/sec  warmup 126 sec  latency 1715.056 ms

Comment by Andreas Dilger [ 24/Dec/15 ]

Sarah,
Is there a different ticket tracking the "no space left on device" failure? If this problem is still happening, but this bug is closed, then either this bug needs to be linked to an existing ticket or a new ticket opened up to track these failures.

Comment by Jian Yu [ 24/Dec/15 ]

Hi Andreas,

Is there a different ticket tracking the "no space left on device" failure?

Yes, it's LU-4846. And the original issue in this ticket is not related to "No space left on device", so we do not need to link this ticket to LU-4846.

Generated at Sat Feb 10 01:20:39 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.