[LU-1018] Test failure on test suite parallel-scale, subtest test_compilebench Created: 12/Oct/11  Updated: 14/Feb/13  Resolved: 12/Feb/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0, Lustre 2.1.5, Lustre 1.8.7
Fix Version/s: Lustre 2.4.0, Lustre 2.1.5, Lustre 1.8.9

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: Minh Diep
Resolution: Fixed Votes: 0
Labels: LB

Severity: 3
Rank (Obsolete): 2196

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/feb7fdd4-f4d6-11e0-908b-52540025f9af.

The sub-test test_compilebench failed with the following error:

test failed to respond and timed out



 Comments   
Comment by Peter Jones [ 13/Oct/11 ]

Bobi

Could you please look into this possible 1.8.7 blocker as your top priority?

Thanks

Peter

Comment by Sarah Liu [ 13/Oct/11 ]

parallel-scale passed in a manual test: https://maloo.whamcloud.com/test_sets/c4b5014a-f5fd-11e0-9b90-52540025f9af

Comment by Sarah Liu [ 17/Oct/11 ]

Sorry, the above report is invalid. I reran the test and it passed: https://maloo.whamcloud.com/test_sets/d59ed148-f90e-11e0-a451-52540025f9af

Comment by Zhenyu Xu [ 18/Oct/11 ]

Observed that Sarah's latest successful manual test ran for 4090s, while autotest times out at 3600s.

Comment by Johann Lombardi (Inactive) [ 20/Oct/11 ]

Wow, more than 1h to run compilebench sounds like a lot. Have we changed the compilebench parameters since 1.8.6-wc1?
How long did this test take with 1.8.6-wc1?

Comment by Zhenyu Xu [ 21/Oct/11 ]

I searched Maloo for passed compilebench tests. The parameters and durations (in seconds) of the runs are listed here:

-i 2 -r 2: 3053 629 2465 2489
-i 4 -r 4: 1794 2799 1282 2494 1258 1261 3431
-i 10 -r 10: 3104 2276
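
As a rough illustration of how these passed-run durations compare with the 3600s autotest timeout mentioned earlier in this ticket, here is a minimal sketch. The durations are copied from the list above; the `summarize` helper and the `headroom` figure are hypothetical names introduced only for this example.

```python
from statistics import mean

TIMEOUT_S = 3600  # autotest limit quoted earlier in this ticket

# Durations (seconds) of passed runs, copied from the comment above.
durations = {
    "-i 2 -r 2": [3053, 629, 2465, 2489],
    "-i 4 -r 4": [1794, 2799, 1282, 2494, 1258, 1261, 3431],
    "-i 10 -r 10": [3104, 2276],
}


def summarize(runs, timeout=TIMEOUT_S):
    """Return mean, max and remaining headroom for a list of durations."""
    slowest = max(runs)
    return {"mean": mean(runs), "max": slowest, "headroom": timeout - slowest}


for params, runs in durations.items():
    s = summarize(runs)
    print(f"{params}: mean={s['mean']:.0f}s max={s['max']}s "
          f"headroom={s['headroom']}s")
```

The spread within each parameter set is wide, so these numbers alone probably say more about the test hardware than about the parameters.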

Comment by Johann Lombardi (Inactive) [ 21/Oct/11 ]

Are those durations for 1.8.6-wc1 or 1.8.7-wc1?
What parameters were used for the run that took 4090s?

Comment by Zhenyu Xu [ 21/Oct/11 ]

I can't tell which Lustre version was used from the Maloo reports.

The 4090s run used -i 10 -r 10.

Comment by Jian Yu [ 23/Oct/11 ]

v1_8_6_RC2: https://maloo.whamcloud.com/test_sets/2918beba-969a-11e0-9a27-52540025f9af
compilebench: 4427s (-i 10 -r 10)

v1_8_6_RC4: https://maloo.whamcloud.com/test_sets/7e4bec38-a269-11e0-aee5-52540025f9af
compilebench: 4751s (-i 10 -r 10)

Comment by Sarah Liu [ 24/Oct/11 ]

I tested compilebench on 1.8.5; it took 3254s (-i 10 -r 10) to complete.

Comment by Johann Lombardi (Inactive) [ 24/Nov/11 ]

Has this problem been hit with 2.2 (it is in the blocker list)?

Comment by Sarah Liu [ 27/Nov/11 ]

Hi Johann,

Actually, we got timeouts on the SLES client and the RHEL6-i686 client, while the RHEL6-x86_64 client passed. Here are the Maloo reports.

Pass: https://maloo.whamcloud.com/test_sets/c4dbaebe-11ad-11e1-9936-52540025f9af
Fail: https://maloo.whamcloud.com/test_sets/fefd8fac-0f89-11e1-aad7-52540025f9af
https://maloo.whamcloud.com/test_sets/188f0b10-114e-11e1-ad46-52540025f9af

Comment by Johann Lombardi (Inactive) [ 28/Nov/11 ]

The one that passed was with real nodes (and it passed in 2789s), while the failed ones were with VMs (the timeout is 3600s).
We should either increase the test timeout or decrease the number of iterations.

Comment by Peter Jones [ 04/Jan/12 ]

Bobi

Do you have enough information to provide a fix for this one?

Peter

Comment by Zhenyu Xu [ 04/Jan/12 ]

I think the autotest script should be changed to either increase the test timeout or decrease the test iteration count.

Comment by Zhenyu Xu [ 04/Jan/12 ]

Please check whether we can adjust the test timeout or reduce the test iteration argument for this test.

Comment by Andreas Dilger [ 16/Jan/12 ]

I think compilebench is only useful to run for a long time if we are actually using the performance to compare whether the Lustre code is getting slower. Running under a VM doesn't allow meaningful performance comparisons, and spending an hour to run this test is probably not useful for a VM.

For this reason, I think it is better to change the test to be shorter (e.g. -i 4 -r 4), rather than increase the timeout. This can be re-examined in the future once Maloo has some way to report and compare performance results.

Comment by Andreas Dilger [ 19/Jan/12 ]

Bumping this to critical, to get the test run time shorter. The current cbench_IDIRS and cbench_RUNS are apparently being set to 10 by the environment, since the defaults in the lustre/tests/parallel-scale.sh script are both 4. Oleg thinks this should still be run close to an hour, so changing them to 8 and 8 (64 builds, down from 100) should get us under the 1h mark.
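
The arithmetic behind the "64 builds from 100" figure can be sketched as follows. This follows the comment's own count of cbench_IDIRS times cbench_RUNS, and it assumes (purely for illustration) that run time scales linearly with that count; compilebench's real cost model may differ. The 4090s baseline is the manual -i 10 -r 10 run reported earlier in this ticket, and the function names are hypothetical.

```python
TIMEOUT_S = 3600  # autotest limit quoted earlier in this ticket


def builds(idirs, runs):
    """Build count as reckoned in the comment above: IDIRS * RUNS."""
    return idirs * runs


def scaled_estimate(observed_s, old, new):
    """Scale an observed duration by the ratio of build counts.

    Assumes linear scaling, which is an illustrative simplification.
    """
    return observed_s * builds(*new) / builds(*old)


estimate = scaled_estimate(4090, old=(10, 10), new=(8, 8))
print(f"{builds(10, 10)} -> {builds(8, 8)} builds, "
      f"estimated {estimate:.0f}s, under timeout: {estimate < TIMEOUT_S}")
```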

Comment by Chris Gearing (Inactive) [ 20/Jan/12 ]

The environment does not set these; they are set to 4 in master but 10 in b1_8. Since all the references point to b1_8, I would suggest we update b1_8 before changing the test code.

I'll switch this back to an LU issue, then if we want the environment to override the default we can create a new ticket.

In fact someone should make it consistent across all branches.

Comment by Peter Jones [ 20/Jan/12 ]

Assigning to Minh to look into. This is a priority if it will affect either 2.1.1 or 2.2 testing (even if just interop with 1.8.x)

Comment by Minh Diep [ 23/Jan/12 ]

ok, I will change it to 4

Comment by Minh Diep [ 24/Jan/12 ]

patch http://review.whamcloud.com/#change,2005

Comment by Build Master (Inactive) [ 21/Mar/12 ]

Integrated in lustre-b1_8 » x86_64,client,el6,inkernel #181
LU-1018 test: Reduce compilebench run time (Revision ee16d209e20fbbaddfa46e9aa86b7236bb350a44)

Result = SUCCESS
Johann Lombardi : ee16d209e20fbbaddfa46e9aa86b7236bb350a44
Files :

  • lustre/tests/parallel-scale.sh
Comment by Build Master (Inactive) [ 21/Mar/12 ]

Integrated in lustre-b1_8 » i686,client,el6,inkernel #181
LU-1018 test: Reduce compilebench run time (Revision ee16d209e20fbbaddfa46e9aa86b7236bb350a44)

Result = SUCCESS
Johann Lombardi : ee16d209e20fbbaddfa46e9aa86b7236bb350a44
Files :

  • lustre/tests/parallel-scale.sh
Comment by Build Master (Inactive) [ 21/Mar/12 ]

Integrated in lustre-b1_8 » i686,client,el5,ofa #181
LU-1018 test: Reduce compilebench run time (Revision ee16d209e20fbbaddfa46e9aa86b7236bb350a44)

Result = SUCCESS
Johann Lombardi : ee16d209e20fbbaddfa46e9aa86b7236bb350a44
Files :

  • lustre/tests/parallel-scale.sh
Comment by Build Master (Inactive) [ 21/Mar/12 ]

Integrated in lustre-b1_8 » i686,client,el5,inkernel #181
LU-1018 test: Reduce compilebench run time (Revision ee16d209e20fbbaddfa46e9aa86b7236bb350a44)

Result = SUCCESS
Johann Lombardi : ee16d209e20fbbaddfa46e9aa86b7236bb350a44
Files :

  • lustre/tests/parallel-scale.sh
Comment by Build Master (Inactive) [ 21/Mar/12 ]

Integrated in lustre-b1_8 » x86_64,client,ubuntu1004,inkernel #181
LU-1018 test: Reduce compilebench run time (Revision ee16d209e20fbbaddfa46e9aa86b7236bb350a44)

Result = SUCCESS
Johann Lombardi : ee16d209e20fbbaddfa46e9aa86b7236bb350a44
Files :

  • lustre/tests/parallel-scale.sh
Comment by Build Master (Inactive) [ 21/Mar/12 ]

Integrated in lustre-b1_8 » i686,server,el5,ofa #181
LU-1018 test: Reduce compilebench run time (Revision ee16d209e20fbbaddfa46e9aa86b7236bb350a44)

Result = SUCCESS
Johann Lombardi : ee16d209e20fbbaddfa46e9aa86b7236bb350a44
Files :

  • lustre/tests/parallel-scale.sh
Comment by Build Master (Inactive) [ 21/Mar/12 ]

Integrated in lustre-b1_8 » i686,server,el5,inkernel #181
LU-1018 test: Reduce compilebench run time (Revision ee16d209e20fbbaddfa46e9aa86b7236bb350a44)

Result = SUCCESS
Johann Lombardi : ee16d209e20fbbaddfa46e9aa86b7236bb350a44
Files :

  • lustre/tests/parallel-scale.sh
Comment by Build Master (Inactive) [ 21/Mar/12 ]

Integrated in lustre-b1_8 » x86_64,server,el5,ofa #181
LU-1018 test: Reduce compilebench run time (Revision ee16d209e20fbbaddfa46e9aa86b7236bb350a44)

Result = SUCCESS
Johann Lombardi : ee16d209e20fbbaddfa46e9aa86b7236bb350a44
Files :

  • lustre/tests/parallel-scale.sh
Comment by Build Master (Inactive) [ 21/Mar/12 ]

Integrated in lustre-b1_8 » x86_64,client,el5,inkernel #181
LU-1018 test: Reduce compilebench run time (Revision ee16d209e20fbbaddfa46e9aa86b7236bb350a44)

Result = SUCCESS
Johann Lombardi : ee16d209e20fbbaddfa46e9aa86b7236bb350a44
Files :

  • lustre/tests/parallel-scale.sh
Comment by Build Master (Inactive) [ 21/Mar/12 ]

Integrated in lustre-b1_8 » x86_64,client,el5,ofa #181
LU-1018 test: Reduce compilebench run time (Revision ee16d209e20fbbaddfa46e9aa86b7236bb350a44)

Result = SUCCESS
Johann Lombardi : ee16d209e20fbbaddfa46e9aa86b7236bb350a44
Files :

  • lustre/tests/parallel-scale.sh
Comment by Build Master (Inactive) [ 21/Mar/12 ]

Integrated in lustre-b1_8 » x86_64,server,el5,inkernel #181
LU-1018 test: Reduce compilebench run time (Revision ee16d209e20fbbaddfa46e9aa86b7236bb350a44)

Result = SUCCESS
Johann Lombardi : ee16d209e20fbbaddfa46e9aa86b7236bb350a44
Files :

  • lustre/tests/parallel-scale.sh
Comment by Build Master (Inactive) [ 21/Mar/12 ]

Integrated in lustre-b1_8 » x86_64,client,el6,inkernel #172
LU-1018 test: Reduce compilebench run time (Revision ee16d209e20fbbaddfa46e9aa86b7236bb350a44)

Result = SUCCESS
Johann Lombardi : ee16d209e20fbbaddfa46e9aa86b7236bb350a44
Files :

  • lustre/tests/parallel-scale.sh
Comment by Build Master (Inactive) [ 21/Mar/12 ]

Integrated in lustre-b1_8 » i686,client,el6,inkernel #172
LU-1018 test: Reduce compilebench run time (Revision ee16d209e20fbbaddfa46e9aa86b7236bb350a44)

Result = SUCCESS
Johann Lombardi : ee16d209e20fbbaddfa46e9aa86b7236bb350a44
Files :

  • lustre/tests/parallel-scale.sh
Comment by Build Master (Inactive) [ 21/Mar/12 ]

Integrated in lustre-b1_8 » x86_64,client,el5,ofa #172
LU-1018 test: Reduce compilebench run time (Revision ee16d209e20fbbaddfa46e9aa86b7236bb350a44)

Result = SUCCESS
Johann Lombardi : ee16d209e20fbbaddfa46e9aa86b7236bb350a44
Files :

  • lustre/tests/parallel-scale.sh
Comment by Build Master (Inactive) [ 21/Mar/12 ]

Integrated in lustre-b1_8 » i686,server,el5,ofa #172
LU-1018 test: Reduce compilebench run time (Revision ee16d209e20fbbaddfa46e9aa86b7236bb350a44)

Result = SUCCESS
Johann Lombardi : ee16d209e20fbbaddfa46e9aa86b7236bb350a44
Files :

  • lustre/tests/parallel-scale.sh
Comment by Build Master (Inactive) [ 21/Mar/12 ]

Integrated in lustre-b1_8 » x86_64,server,el5,ofa #172
LU-1018 test: Reduce compilebench run time (Revision ee16d209e20fbbaddfa46e9aa86b7236bb350a44)

Result = SUCCESS
Johann Lombardi : ee16d209e20fbbaddfa46e9aa86b7236bb350a44
Files :

  • lustre/tests/parallel-scale.sh
Comment by Build Master (Inactive) [ 21/Mar/12 ]

Integrated in lustre-b1_8 » i686,client,el5,ofa #172
LU-1018 test: Reduce compilebench run time (Revision ee16d209e20fbbaddfa46e9aa86b7236bb350a44)

Result = SUCCESS
Johann Lombardi : ee16d209e20fbbaddfa46e9aa86b7236bb350a44
Files :

  • lustre/tests/parallel-scale.sh
Comment by Build Master (Inactive) [ 21/Mar/12 ]

Integrated in lustre-b1_8 » x86_64,server,el5,inkernel #172
LU-1018 test: Reduce compilebench run time (Revision ee16d209e20fbbaddfa46e9aa86b7236bb350a44)

Result = SUCCESS
Johann Lombardi : ee16d209e20fbbaddfa46e9aa86b7236bb350a44
Files :

  • lustre/tests/parallel-scale.sh
Comment by Build Master (Inactive) [ 21/Mar/12 ]

Integrated in lustre-b1_8 » x86_64,client,el5,inkernel #172
LU-1018 test: Reduce compilebench run time (Revision ee16d209e20fbbaddfa46e9aa86b7236bb350a44)

Result = SUCCESS
Johann Lombardi : ee16d209e20fbbaddfa46e9aa86b7236bb350a44
Files :

  • lustre/tests/parallel-scale.sh
Comment by Build Master (Inactive) [ 21/Mar/12 ]

Integrated in lustre-b1_8 » i686,server,el5,inkernel #172
LU-1018 test: Reduce compilebench run time (Revision ee16d209e20fbbaddfa46e9aa86b7236bb350a44)

Result = SUCCESS
Johann Lombardi : ee16d209e20fbbaddfa46e9aa86b7236bb350a44
Files :

  • lustre/tests/parallel-scale.sh
Comment by Build Master (Inactive) [ 21/Mar/12 ]

Integrated in lustre-b1_8 » i686,client,el5,inkernel #172
LU-1018 test: Reduce compilebench run time (Revision ee16d209e20fbbaddfa46e9aa86b7236bb350a44)

Result = SUCCESS
Johann Lombardi : ee16d209e20fbbaddfa46e9aa86b7236bb350a44
Files :

  • lustre/tests/parallel-scale.sh
Comment by Jian Yu [ 06/Jan/13 ]

With cbench_IDIRS=4 and cbench_RUNS=4, I found that the parallel-scale compilebench test still timed out on Lustre b1_8 build #236 on RHEL5.8/x86_64 distro/arch:
https://maloo.whamcloud.com/test_sets/37f71c04-57cf-11e2-9cc9-52540035b04c

== parallel-scale test compilebench: compilebench ==================================================== 14:10:11 (1357423811)
OPTIONS:
cbench_DIR=/usr/bin
cbench_IDIRS=4
cbench_RUNS=4
client-26vm5
client-26vm6.lab.whamcloud.com
./compilebench -D /mnt/lustre/d0.compilebench -i 4         -r 4 --makej
using working directory /mnt/lustre/d0.compilebench, 4 intial dirs 4 runs
native unpatched native-0 222MB in 415.31 seconds (0.54 MB/s)
native patched native-0 109MB in 331.86 seconds (0.33 MB/s)
native patched compiled native-0 691MB in 135.53 seconds (5.10 MB/s)
create dir kernel-0 222MB in 546.25 seconds (0.41 MB/s)
create dir kernel-1 222MB in 378.33 seconds (0.59 MB/s)
create dir kernel-2 222MB in 449.13 seconds (0.50 MB/s)
create dir kernel-3 222MB in 366.71 seconds (0.61 MB/s)
compile dir kernel-2 680MB in 116.23 seconds (5.86 MB/s)
compile dir kernel-0 680MB in 140.25 seconds (4.85 MB/s)
compile dir kernel-3 680MB in 144.27 seconds (4.72 MB/s)
compile dir kernel-1 680MB in 133.51 seconds (5.10 MB/s)

Should we reduce the values of cbench_IDIRS and cbench_RUNS to 2, as we did in LU-1625?
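
To see how much of the 3600s budget the phases in the log excerpt above had already consumed, the per-phase timings can be summed. This is a minimal sketch: the log lines are copied from the excerpt, the regular expression is an assumption based only on the line format shown there, and `parse_phases` is a hypothetical helper name.

```python
import re

TIMEOUT_S = 3600  # autotest limit mentioned earlier in this ticket

# Phase lines copied from the log excerpt above.
LOG = """\
native unpatched native-0 222MB in 415.31 seconds (0.54 MB/s)
native patched native-0 109MB in 331.86 seconds (0.33 MB/s)
native patched compiled native-0 691MB in 135.53 seconds (5.10 MB/s)
create dir kernel-0 222MB in 546.25 seconds (0.41 MB/s)
create dir kernel-1 222MB in 378.33 seconds (0.59 MB/s)
create dir kernel-2 222MB in 449.13 seconds (0.50 MB/s)
create dir kernel-3 222MB in 366.71 seconds (0.61 MB/s)
compile dir kernel-2 680MB in 116.23 seconds (5.86 MB/s)
compile dir kernel-0 680MB in 140.25 seconds (4.85 MB/s)
compile dir kernel-3 680MB in 144.27 seconds (4.72 MB/s)
compile dir kernel-1 680MB in 133.51 seconds (5.10 MB/s)
"""

# Matches "<operation> <size>MB in <seconds> seconds (<rate> MB/s)".
LINE_RE = re.compile(
    r"^(?P<op>.+) (?P<mb>\d+)MB in (?P<secs>\d+(?:\.\d+)?) seconds "
    r"\((?P<rate>\d+(?:\.\d+)?) MB/s\)$"
)


def parse_phases(log):
    """Return (operation, seconds) pairs for every phase line in the log."""
    phases = []
    for line in log.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            phases.append((m["op"], float(m["secs"])))
    return phases


phases = parse_phases(LOG)
elapsed = sum(secs for _, secs in phases)
print(f"{len(phases)} phases, {elapsed:.2f}s logged, "
      f"{TIMEOUT_S - elapsed:.2f}s of the timeout left")
```

The logged phases alone account for most of the timeout, which is consistent with the run being cut off shortly after the excerpt ends.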

Comment by Minh Diep [ 08/Jan/13 ]

yes, I think we should. I will provide a patch

Comment by Minh Diep [ 08/Jan/13 ]

Actually, after researching this, I found that the above failure is due to compilebench running very slowly. There are logs showing this test passing in 2400 sec or less. I haven't found any related issue in Lustre. The case where we changed cbench_RUNS to 2 was over NFS.

Comment by Minh Diep [ 10/Jan/13 ]

I cannot reproduce this problem in the lab. I ran it and it passed: https://maloo.whamcloud.com/test_sets/db834dca-5af4-11e2-b205-52540035b04c

Comment by Jian Yu [ 10/Jan/13 ]

Another instance on Lustre b1_8:
https://maloo.whamcloud.com/test_sets/bfba4ddc-5aef-11e2-8985-52540035b04c

Comment by Emoly Liu [ 10/Jan/13 ]

Hit on b1_8:
https://maloo.whamcloud.com/test_sets/6afb8d4a-5aa1-11e2-84d3-52540035b04c

Comment by Jian Yu [ 13/Jan/13 ]

Another instance on Lustre b1_8:
https://maloo.whamcloud.com/test_sets/fba2415e-5d7b-11e2-8199-52540035b04c

Comment by Zhenyu Xu [ 14/Jan/13 ]

Found no error in Lustre. Note that in all failed cases the file operations are slower than in the passed cases, which makes the test durations differ considerably.

fail cases
create dir kernel-0 222MB in 570.28 seconds (0.39 MB/s)
create dir kernel-1 222MB in 359.16 seconds (0.62 MB/s)
create dir kernel-2 222MB in 393.61 seconds (0.56 MB/s)
create dir kernel-3 222MB in 443.75 seconds (0.50 MB/s)
compile dir kernel-2 680MB in 117.00 seconds (5.82 MB/s)
compile dir kernel-0 680MB in 147.07 seconds (4.63 MB/s)
compile dir kernel-3 680MB in 135.61 seconds (5.02 MB/s)
compile dir kernel-1 680MB in 243.61 seconds (2.79 MB/s)
passed cases
create dir kernel-0 222MB in 131.53 seconds (1.69 MB/s)
create dir kernel-1 222MB in 134.71 seconds (1.65 MB/s)
create dir kernel-2 222MB in 135.66 seconds (1.64 MB/s)
create dir kernel-3 222MB in 132.69 seconds (1.68 MB/s)
compile dir kernel-2 680MB in 49.19 seconds (13.84 MB/s)
compile dir kernel-0 680MB in 53.67 seconds (12.68 MB/s)
compile dir kernel-3 680MB in 53.41 seconds (12.74 MB/s)
compile dir kernel-1 680MB in 50.37 seconds (13.51 MB/s)

Most of the failed cases happened on a single client node, with all participants running in multiple VMs on that node. I suspect our client nodes are not powerful enough to host all the participants and run test_compilebench fast enough.
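
The size of the slowdown can be estimated from the timings quoted above. This is a minimal sketch using only the numbers in this comment; the `slowdown` helper is a hypothetical name, and comparing simple means is an illustrative choice rather than a rigorous benchmark method.

```python
from statistics import mean

# Seconds per phase, copied from the failed and passed cases above.
failed = {
    "create dir": [570.28, 359.16, 393.61, 443.75],
    "compile dir": [117.00, 147.07, 135.61, 243.61],
}
passed = {
    "create dir": [131.53, 134.71, 135.66, 132.69],
    "compile dir": [49.19, 53.67, 53.41, 50.37],
}


def slowdown(phase):
    """Ratio of mean failed-case time to mean passed-case time."""
    return mean(failed[phase]) / mean(passed[phase])


for phase in failed:
    print(f"{phase}: failed runs took about {slowdown(phase):.1f}x as long")
```

Both phases come out roughly three times slower in the failed cases, which fits the suspicion of an overloaded host rather than a problem in one particular operation.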

Comment by Jian Yu [ 15/Jan/13 ]

We need to reduce the values of cbench_IDIRS and cbench_RUNS to 2 for parallel-scale.sh.

Comment by Minh Diep [ 15/Jan/13 ]

patch http://review.whamcloud.com/5032

Comment by Jian Yu [ 16/Jan/13 ]

The fix is also needed on b2_1 and master branches.

Some instances:
https://maloo.whamcloud.com/test_sessions/cc4e845e-5f19-11e2-b507-52540035b04c
https://maloo.whamcloud.com/test_sessions/598d6ec6-47b3-11e2-876e-52540035b04c
https://maloo.whamcloud.com/test_sessions/2054afa2-4484-11e2-8b5c-52540035b04c
https://maloo.whamcloud.com/test_sessions/3f300a3e-3ec6-11e2-856e-52540035b04c
https://maloo.whamcloud.com/test_sessions/aae373be-fe99-11e1-b4cd-52540035b04c

Comment by Minh Diep [ 17/Jan/13 ]

patch for master: http://review.whamcloud.com/#change,5052
patch for b2_1: http://review.whamcloud.com/#change,5053

Comment by Peter Jones [ 21/Jan/13 ]

Landed for 1.8.9 and 2.4

Comment by Keith Mannthey (Inactive) [ 07/Feb/13 ]

LU-2767 Interop 2.1.4<->2.4 failure on test suite parallel-scale test_compilebench

Seems to be hitting this same issue. Are we not going to land the fix for 2.1?

Comment by Minh Diep [ 12/Feb/13 ]

landed on 2.1.5
