[LU-1792] conf-sanity.sh test_53a/test_53b take too long to run Created: 27/Aug/12  Updated: 10/Mar/18  Resolved: 10/Mar/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Andreas Dilger Assignee: WC Triage
Resolution: Cannot Reproduce Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 10440

 Description   

Looking at recent review test runs to see where the time is being spent, I see that conf-sanity.sh test_53a and test_53b are sometimes taking far too long to run - over 800s and 1000s respectively, sometimes twice that. They should be able to complete in a few seconds, but the remount in the middle of the test seems to take the longest time. Given that these tests run in our test environment about 20x per day, this could be wasting 10h or more of testing time each day.

Some investigation needs to be done to see why these tests are taking so long to run. Is it that mount and/or unmount is very slow? If so, why? Simply skipping these tests for SLOW=no is not a valid solution, since slow mounting/unmounting affects all of our users and wastes even more time for every test that is run, but it is less visible when done at the start of a test run instead of in the middle.

See https://maloo.whamcloud.com/sub_tests/c14c1bec-ef9e-11e1-bdf7-52540035b04c and https://maloo.whamcloud.com/sub_tests/c1559e7e-ef9e-11e1-bdf7-52540035b04c for logs.
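
A minimal timing sketch along these lines could narrow down which phase is eating the time. It assumes conf-sanity.sh style helpers; the setup()/cleanup() names below are stand-ins for whatever test_53a/test_53b actually call, not the real test bodies:

    # Hypothetical instrumentation sketch; setup()/cleanup() stand in for the
    # helpers that format/mount and unmount the filesystem in the real tests.
    test_53_timing() {
        local t0=$SECONDS
        setup                           # initial format/mount of MGS/MDS/OSTs/client
        local t1=$SECONDS
        echo "initial mount: $((t1 - t0))s"

        cleanup                         # unmount everything
        local t2=$SECONDS
        echo "unmount: $((t2 - t1))s"

        setup                           # the remount suspected of being slow
        local t3=$SECONDS
        echo "remount: $((t3 - t2))s"

        cleanup
    }

If the remount step dominates, the next step would be collecting MGS/MDS/OST debug logs around that phase rather than trimming the test itself.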



 Comments   
Comment by Brian Murrell (Inactive) [ 27/Aug/12 ]

As much as I hate to beat a (way) dead (and buried) horse, ltest used to time every test, keep a history of each test's run time, and use those gathered timings to apply a timeout to future test runs. The result was that any individual test's run time was bounded by its historical run times, to prevent excessive wasting of time. Of course, any test that overran its historically calculated limit was considered a failure.

Just food for thought.
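
For illustration, a rough sketch of that kind of historical-timeout watchdog. The history file path and format, the 15% margin, and the way the test is invoked are all assumptions made up for this example:

    # Hypothetical watchdog sketch: bound a test's run time by its historical
    # maximum plus a safety margin, and treat an overrun as a failure.
    HISTORY=${HISTORY:-/var/lib/autotest/history/conf-sanity.test_53a}
    MARGIN=${MARGIN:-15}                  # percent headroom over the historical maximum

    max_hist=$(sort -n "$HISTORY" 2>/dev/null | tail -1)
    max_hist=${max_hist:-600}             # generous default when no history exists yet
    limit=$(( max_hist + max_hist * MARGIN / 100 ))

    start=$SECONDS
    ONLY=53a timeout "$limit" bash conf-sanity.sh
    rc=$?
    elapsed=$(( SECONDS - start ))

    echo "$elapsed" >> "$HISTORY"         # feed this run back into the history
    if [ "$rc" -eq 124 ]; then
        echo "test_53a overran its ${limit}s historical limit -- marking as failed"
    fi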

Comment by Andreas Dilger [ 28/Aug/12 ]

Sure, I remember. I recall there was a 10-15% margin for variability in the test. However, looking at the test results, the variability is huge. This may be related to running in a VM and potentially contending for CPU or disk bandwidth?

The shortest passing run took 115s and the longest took 1541s, so any attempt to do this in the current virtual test environment wouldn't be practical. I haven't made any attempt to correlate this to branch/arch/cluster.

Comment by Brian Murrell (Inactive) [ 28/Aug/12 ]

Hrm. Yeah. I suppose VMing all of these test clusters could indeed introduce a lot of variability. Perhaps too much to apply any sort of "expected run time" watchdog. Pity.

I do remember that feature being a boon in preventing test clusters from spinning for many, many hours on a test that had failed in an unpredictable manner.

I wonder if it would be worth the effort for somebody to actually do the branch/arch/cluster correlation of the wildly swinging test times to see just how unpredictable it really is.
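
As a starting point, something like the sketch below could do that grouping, assuming the autotest durations can be exported as CSV; the test,branch,arch,cluster,seconds column layout and the durations.csv file name are invented for illustration:

    # Hypothetical mining sketch: summarise test_53a durations per cluster to
    # see how much of the spread is explained by where the test ran.
    awk -F, '$1 == "conf-sanity.test_53a" {
        n[$4]++; sum[$4] += $5
        if ($5 > max[$4]) max[$4] = $5
        if (min[$4] == "" || $5 < min[$4]) min[$4] = $5
    }
    END {
        for (c in n)
            printf "%-20s runs=%d mean=%ds min=%ds max=%ds\n",
                   c, n[c], sum[c] / n[c], min[c], max[c]
    }' durations.csv

The same grouping could be repeated on the branch or arch column to see which factor correlates best with the wild swings.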

Comment by Chris Gearing (Inactive) [ 30/Aug/12 ]

The real issue here is why Lustre is taking so long to mount/unmount/remount the filesystem, and that is where the focus should be for this topic.

We do have statistics for the duration of every test ever run on autotest, so if people want to mine the data they are welcome to. The problem of a 'correct time' is not a VM vs. non-VM issue, because physical hardware can vary just as much between systems, and network bandwidth in particular is very dependent on what else is happening on the system.

The VMs, client #24 and below, probably provide very consistent times because they are a completely closed system during testing, with no outside influence.

Comment by Andreas Dilger [ 10/Mar/18 ]

Current subtest runs are in the 125s to 250s range.
