[LU-1625] Test failure on test suite parallel-scale-nfsv4, subtest test_metabench Created: 12/Jul/12  Updated: 22/Feb/13  Resolved: 19/Aug/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.3.0, Lustre 2.1.3, Lustre 1.8.8
Fix Version/s: Lustre 2.3.0, Lustre 2.1.3, Lustre 1.8.9

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: Keith Mannthey (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 4488

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/4a115426-cba8-11e1-8847-52540035b04c.

The sub-test test_metabench failed with the following error:

test failed to respond and timed out

From the log, this test ran for more than 35 minutes before it was ended. I checked several passing runs; it usually takes less than 1800s, so the test may just have been killed by the system.



 Comments   
Comment by Peter Jones [ 13/Jul/12 ]

Yangsheng

Could you please look into this one?

Thanks

Peter

Comment by Yang Sheng [ 20/Jul/12 ]

Looks like this is just a test timeout. There is about 1 hour between compilebench & metabench.

Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test compilebench: compilebench == 12:49:59 (1342036199)
Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test metabench: metabench == 13:53:31 (1342040011)
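
(From the marker timestamps: 1342040011 - 1342036199 = 3812 s, i.e. roughly 64 minutes elapsed between the two markers.)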

Was SLOW=yes set?

Comment by Sarah Liu [ 20/Jul/12 ]

Yes, I think so; the full test suite run by Autotest should set SLOW=yes as usual.
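
For reference, when running the suite by hand the same behaviour can be checked by exporting SLOW before invoking the script. The invocation below is only a sketch assuming the usual Lustre test-framework conventions (the script path and the ONLY filter are illustrative, not taken from this ticket):

# run just the metabench subtest with the long ("slow") variants enabled;
# path and ONLY filter are illustrative
SLOW=yes ONLY=metabench sh lustre/tests/parallel-scale-nfsv4.sh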

Comment by Yang Sheng [ 24/Jul/12 ]

So it looks like compilebench works normally; only metabench was killed by the timeout. But there is little information in the logs. I'll try to find other failed instances to investigate.

Comment by Yang Sheng [ 25/Jul/12 ]

I suspect this issue relates to some NFS problem. From the stack trace, the nfsv4-svc thread is always running at the same location. I'll do more checking on that.
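
(One way to confirm where the NFS service threads are spinning is to dump the kernel stacks on the NFS server; the commands below are a generic Linux sketch, not taken from this ticket's logs.)

# dump all task stacks to the kernel log (requires sysrq to be enabled)
echo 1 > /proc/sys/kernel/sysrq
echo t > /proc/sysrq-trigger
dmesg | grep -A20 nfsd        # inspect the nfsd thread stacks

# or look at a single nfsd kernel thread directly
cat /proc/$(pgrep nfsd | head -1)/stack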

Comment by Peter Jones [ 08/Aug/12 ]

Bobijam will help with this one

Comment by Zhenyu Xu [ 08/Aug/12 ]

Sarah,

What is the timeout rule for autotest? In https://maloo.whamcloud.com/test_sessions/3b113b66-e157-11e1-b541-52540035b04c, I saw parallel-scale-nfsv3 run for 9499 seconds, while parallel-scale-nfsv4 timed out after 3600 seconds.

Comment by Sarah Liu [ 09/Aug/12 ]

Bobi, I think the timeout is set to 3600s for each test; the 9499s was the total across 5 tests.
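
(If that is right, the 9499 s nfsv3 session averages roughly 1900 s per subtest, so each subtest stayed within a 3600 s per-test limit even though the session as a whole ran longer than 3600 s.)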

Comment by Andreas Dilger [ 10/Aug/12 ]

Minh, could you please just change the default compilebench numbers to "2" and "2" for parallel-scale-nfs.sh? This is a trivial change that reduces the testing time rather than making it take longer, and I don't think the benefit of testing NFS for such a long time is matched by the number of users who use NFS.

I think this is simply the following:

test_compilebench() {
    export cbench_IDIRS=${cbench_IDIRS:-2}
    export cbench_RUNS=${cbench_RUNS:-2}

    run_compilebench
}       
run_test compilebench "compilebench"

If there are other sub-parts of parallel-scale-nfsv4 that are taking a long time, I think they can also be shortened in a similar manner.
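
For example, if metabench itself also needs trimming, the same pattern should apply. The mbench_* variable names and defaults below are only an assumption modelled on the cbench_* convention above, not something confirmed in this ticket:

test_metabench() {
    # hypothetical knobs mirroring the cbench_* pattern above; the real
    # variable names/defaults would need to be checked in functions.sh
    export mbench_NFILES=${mbench_NFILES:-10000}
    export mbench_THREADS=${mbench_THREADS:-4}

    run_metabench
}
run_test metabench "metabench"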

Comment by Minh Diep [ 10/Aug/12 ]

patch to reduce parallel-scale for nfs
http://review.whamcloud.com/#change,3596

Comment by Jian Yu [ 13/Aug/12 ]

RHEL6.3/x86_64 (2.1.3 Server + 1.8.8-wc1 Client):
https://maloo.whamcloud.com/test_sets/7422aff4-e42a-11e1-b6d3-52540035b04c

Comment by Peter Jones [ 15/Aug/12 ]

Patch landed for 2.1.3 and 2.3. If there are still issues once Minh's changes are in place, then please reopen.

Comment by Keith Mannthey (Inactive) [ 16/Aug/12 ]

One extra change is needed; I missed part of the required patch. http://review.whamcloud.com/3701 has been pushed to fix this issue.

Comment by Peter Jones [ 19/Aug/12 ]

Extra tweak landed too

Comment by Emoly Liu [ 03/Jan/13 ]

Patch for b1_8 is at http://review.whamcloud.com/4949
