[LU-1068] obdfilter-survey produces no useful results Created: 03/Feb/12  Updated: 01/Jun/16

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.2.0
Fix Version/s: None

Type: Improvement Priority: Major
Reporter: Andreas Dilger Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Rank (Obsolete): 10243

 Description   

I saw that obdfilter-survey is taking 1-2 hours to run in the current test configurations. However, no useful test results are being generated from these runs, so this is a complete waste of testing time.

It is still useful to run obdfilter-survey with a minimal set of parameters just to ensure that this script and the corresponding kernel code continue to function.

The current code should run a minimal set of tests with SLOW=no using { nobjhi=1; thrhi=4; }. However, since 2012-01-25 obdfilter-survey has been running for all review tests, which never happened previously, and taking 3000-5000s per run. It isn't clear why this is happening, but it may be due to SLOW tests being enabled on all review builds, combined with obdfilter-survey having been added to the review test list. Before changing anything in this test, I'd like to first understand why these tests started running, and under what conditions they are run (e.g. SLOW=no or SLOW=yes).
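For illustration only, a minimal sketch of how the SLOW=no case could gate the survey parameters in a test wrapper (this is an assumption for discussion, not the actual test script):

# Hypothetical sketch: limit obdfilter-survey to a quick sanity run
# when SLOW=no; the variable handling here is illustrative only.
if [ "${SLOW:-no}" = "no" ]; then
    nobjhi=1    # cap the number of objects per OST
    thrhi=4     # cap the number of threads per OST
fi
export nobjhi thrhi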



 Comments   
Comment by Andreas Dilger [ 03/Feb/12 ]

Chris, could you please comment on why the obdfilter-survey test has started running in VM environments since 2012-01-25? I don't mind running this script to ensure basic test script functionality, but it is taking 1-2h to complete, and according to Cliff the test output provides no discernible value for the time it takes to run.

Is there some mechanism by which the script can know if it is running inside a VM or not?

Comment by Andreas Dilger [ 03/Feb/12 ]

FYI, the maloo query that I used was:

https://maloo.whamcloud.com/test_sets/query?utf8=%E2%9C%93&test_set[test_set_script_id]=11a69f28-4a54-11e0-a7f6-52540025f9af&test_set[status]=&test_set[query_bugs]=&test_session[test_host]=&test_session[test_group]=review&test_session[user_id]=&test_session[query_date]=&test_session[query_recent_period]=2419200&test_node[os_type_id]=&test_node[distribution_type_id]=&test_node[architecture_type_id]=&test_node[file_system_type_id]=&test_node[lustre_branch_id]=&test_node_network[network_type_id]=&commit=Update+results

Comment by Brian Murrell (Inactive) [ 03/Feb/12 ]

This may be obvious to everyone, but to be explicit: it seems to me that running obdfilter-survey would be useful in conjunction with whatever performance testing we run (i.e. to compare our Lustre performance against the best obdfilter-survey results), but presumably that is done on real hardware and not VMs. In the VM case, I agree that running it just long enough to ensure it's not broken is sufficient.

Where I think running obdfilter-survey is useful is in determining whether Lustre can achieve the same performance as the "best" obdfilter-survey run without having to be tuned. If it cannot, then our automatic tuning algorithms need some work; but that obdfilter-survey run would also give us the information needed to tune for the best performance in the performance testing (until the algorithms are fixed).

As a bit of an aside, I also think it's useful in performance test runs to start with an sgpdd_survey run, so that we can see how close Lustre can get to raw hardware performance. That way we are tracking not only inter-release performance but also performance as a function of hardware speed on a release-by-release basis. Having these numbers would be useful for those times when people/customers/potential customers ask "how much of my hardware's performance can I expect?".

Comment by Andreas Dilger [ 04/Feb/12 ]

Brian, I definitely agree that we should have some performance tests, but until then there should be some way for the test to detect whether we are running in a VM... Is it possible to check /proc/cpuinfo or similar to determine that the test is running on virtual hardware? Alternatively, would it be possible to set something in the Lustre test configuration that indicates whether this is real or virtual hardware?
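As a purely illustrative sketch of the configuration-flag option (the VIRTUAL_HW name is an assumption, not an existing test-framework variable), the per-node configuration could carry a flag that the test wrapper consults:

# Hypothetical: set in the per-node test configuration
VIRTUAL_HW=yes

# Hypothetical: consulted in the test wrapper to shrink the survey
if [ "${VIRTUAL_HW:-no}" = "yes" ]; then
    nobjhi=1 thrhi=4
fi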

I totally agree that having performance testing that is compared against a known baseline for that hardware would also be valuable. We would need to somehow manage the performance data separately for each branch (e.g. a baseline is produced for the first test run of each branch, and should only ever get faster). In Buffalo we used to track the runtime of all tests for each node, mark a test slow when it took more than 10% longer than the average, and mark it failed if it took more than twice as long. It would of course be even better if the test results provided real performance numbers (creates, stats, and unlinks per second; MB/s read and write; etc.) and compared those against the baseline numbers.
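A minimal sketch of that runtime heuristic, with the baseline value and output purely illustrative:

# Hypothetical check: flag a run as slow if it exceeds the baseline
# average by more than 10%, and as failed if it takes more than twice
# as long. 'baseline' and 'duration' are placeholder values.
baseline=600          # average runtime for this test/node, in seconds
duration=$1           # runtime of the current run, in seconds

if [ "$duration" -gt $((baseline * 2)) ]; then
    echo "FAIL: ${duration}s is more than 2x the ${baseline}s baseline"
elif [ "$duration" -gt $((baseline * 110 / 100)) ]; then
    echo "SLOW: ${duration}s exceeds the ${baseline}s baseline by >10%"
else
    echo "OK: ${duration}s"
fi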

I don't expect this in the short term, however, so just reducing the useless time taken by obdfilter-survey is a good start.

Comment by Brian Murrell (Inactive) [ 04/Feb/12 ]

but until then there should be some way to detect in the test whether we are running in a VM

Well, that is of course going to depend on the VM, but I think we are using KVM/libvirt in the lab, and here on my KVM:

# grep model\ name /proc/cpuinfo 
model name	: QEMU Virtual CPU version 0.14.0

So yeah, quite easily I'd think.
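Building on that, a hedged sketch of what such a check could look like (the function name and the pattern list are assumptions, not part of the existing test framework):

# Hypothetical helper: return success if /proc/cpuinfo looks virtual.
is_virtual_machine() {
    grep -qiE 'qemu|kvm|vmware|virtualbox|xen' /proc/cpuinfo && return 0
    # many hypervisors also set the "hypervisor" CPU flag
    grep -qw hypervisor /proc/cpuinfo && return 0
    return 1
}

if is_virtual_machine; then
    echo "virtual hardware detected - running minimal obdfilter-survey only"
fi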
