[LU-160] Test failure on test suite sanity, subtest test_155a Created: 24/Mar/11 Updated: 26/May/11 Resolved: 25/May/11 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.1.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | Jian Yu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 5020 |
| Description |
|
This issue was created by maloo for Prakash Surya <surya1@llnl.gov> This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/114847c8-5645-11e0-bb3d-52540025f9af. The sub-test test_155a failed with the following error:
If the loopback OST device is not large enough to contain a file as large as $((big * 2)) this test will fail. The large file is then not cleaned up, leaving the OST completely full and unusable by later tests. Perhaps this test should be skipped if the OST is not big enough to hold the large file? |
| Comments |
| Comment by Peter Jones [ 25/Mar/11 ] |
|
Yu Jian, could you please look into this LLNL issue? Thanks, Peter |
| Comment by Jian Yu [ 27/Mar/11 ] |
Yes, agreed. Here is the description of test_155_load(): I'll make a patch to test_155_load() to check whether the OST size is bigger than the large file size and whether there is enough space free for creating the large file. If not, then let's skip test 155 {a,b,c,d}. |
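The check proposed above can be sketched as follows. This is only a minimal illustration: `large_file_fits` is a hypothetical helper, not a function from test-framework.sh, and the sizes are made-up numbers rather than values read from a real OST.

```shell
#!/bin/bash
# Hypothetical helper (not part of the Lustre test framework).
# $1: minimum available space across the OSTs, in KB
# $2: size of the large file the test wants to create, in KB
# Returns 0 when the large file fits, 1 otherwise.
large_file_fits() {
    local min_avail_kb=$1
    local large_file_kb=$2
    [ "$min_avail_kb" -gt "$large_file_kb" ]
}

# Made-up example values: 200000 KB available, 262144 KB needed.
if large_file_fits 200000 262144; then
    echo "run the large-file part"
else
    echo "do not create the large file: OST too small"
fi
```

Doing the comparison before dd runs means the OST is never filled by a file that cannot fit, which is the cleanup problem described in the report.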
| Comment by Build Master (Inactive) [ 28/Mar/11 ] |
|
Integrated in Yu Jian : 2f63aca1935bd7a3aca18d39a32550a4d9140650
|
| Comment by Chris Gearing (Inactive) [ 28/Mar/11 ] |
|
My only problem here is that if the OST is not big enough, the test has failed and so should fail, indicating that the test failed because the OST was not big enough. This is correct and informative; the tester can then ask for the test to be positively skipped. If the user asked for this test and it did not pass, it should indicate fail. |
| Comment by Oleg Drokin [ 28/Mar/11 ] |
|
On the other hand if the user did not ask for this test specifically and it was on by default (like it is now), it means there is one more retesting iteration with this test disabled now which sounds like a waste of time? |
| Comment by Jian Yu [ 29/Mar/11 ] |
How about moving the size check and skip code to the position right before the large file is created by dd? In this case, if the OST is not big enough, the large file test would be skipped; without the check, the test would definitely fail with ENOSPC, which is not the purpose of test 155*. When creating other test cases that have requirements on OST size or free space, we usually add check and skip code to tell the user the prerequisites for running those tests, e.g., sanity test 24v, 78, 101d, 116, and tests in parallel-scale.sh, etc. |
| Comment by Jian Yu [ 29/Mar/11 ] |
Oleg, do I understand correctly that you and Chris suggested that I just let test 155* fail with ENOSPC when the minimum available size of the OST is smaller than the large file? |
| Comment by Oleg Drokin [ 04/Apr/11 ] |
|
I think only Chris suggested this. |
| Comment by Jian Yu [ 05/Apr/11 ] |
Right. The OSS Read Cache sanity tests (155* and 156) are on by default now. So, if a user runs the sanity tests without adding them to the except list, those tests are performed by default. We cannot know whether those tests were run because they were on by default or because the user specifically requested them. IMHO, since the feature is enabled in Lustre, sanity tests against it should be on by default to catch regressions. In the current acc-sm test suite, many tests have prerequisites and might be skipped because the configuration does not match. So, after running the test suite, the user needs to check the test summary to see whether any tests were skipped due to unmet prerequisites. If tests the user cares about are in the skipped list, the user can configure Lustre to meet the preconditions and run those tests again. So, for sanity tests 155*, I think we could just handle them in the same way as other tests that also have requirements on OST size or free space. Is this ok, Oleg and Chris? |
| Comment by Oleg Drokin [ 06/Apr/11 ] |
|
So I am not sure what you propose. |
| Comment by Chris Gearing (Inactive) [ 06/Apr/11 ] |
|
If the size is insufficient then we must fail; we cannot skip. We do not know if the test was wanted by the user or not and so we must presume the worst case and fail. The user should either provide enough disk space or skip the test. In a simple view a test has three results. SKIPPED: This must be because the user asked for it to be skipped / it is skipped by default [The user expects it to be skipped]. We must always err on the side of failure; if there is any doubt we must fail. False failures do not produce bugs in the field, false passes do. |
| Comment by Jian Yu [ 06/Apr/11 ] |
|
OK, agreed. So, besides sanity test 155*, do we need to change other tests which use skip()/skip_env() instead of error() to handle prerequisite mismatches? I just did a grep under the lustre/tests dir and found that almost all of the tests are affected. |
| Comment by Chris Gearing (Inactive) [ 07/Apr/11 ] |
|
Well, I think it should be raised as a todo; perhaps the first thing to do is identify them and then process them one at a time. We can't fix everything today, but we can raise a plan to sort this stuff. For the sake of this thread, can you give an example of this behaviour? Paste a couple of tests into this thread. |
| Comment by Jian Yu [ 07/Apr/11 ] |
|
I just ran egrep '(skip "|skip_env ")' * under lustre/tests directory and got a list. Chris, could you please take a look at the list to make sure that we would process them in the future? May I just fix sanity test 155* in this thread? |
| Comment by Prakash Surya (Inactive) [ 07/Apr/11 ] |
|
I agree with Chris, it is best to play it safe and err on the side of failure. With that being said, I also think the test environment should be set up in a way as to allow it to pass with the default settings. Is there perhaps a better way to determine the default OSTSIZE to allow for this? I see it is being determined in $TESTDIR/cfg/local.sh to be: OSTSIZE=${OSTSIZE:-200000} Could it be set up such that if this variable isn't already set in the environment, it is calculated such that it is large enough for tests such as 155 to pass? Maybe the code to determine the "big" file size in test 155 could be moved to test-framework.sh and used in that test and in local.sh, to determine the default OSTSIZE? That way it should pass by default, unless the user changed some settings manually of course. |
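One way to read the suggestion above, as a rough sketch only: keep the existing default as a floor and raise it when a test needs more. `default_ostsize` is a hypothetical helper, and the 524288 KB requirement is an invented stand-in for whatever the "big" file calculation in test 155 would return.

```shell
#!/bin/bash
# Hypothetical helper (not part of test-framework.sh or cfg/local.sh).
# $1: space required by the most demanding test, in KB
# $2: floor value in KB (defaults to the 200000 quoted from local.sh)
# Prints the larger of the two.
default_ostsize() {
    local required_kb=$1
    local floor_kb=${2:-200000}
    if [ "$required_kb" -gt "$floor_kb" ]; then
        echo "$required_kb"
    else
        echo "$floor_kb"
    fi
}

# An OSTSIZE already set in the environment still wins; the computed
# value is used only as the fallback.
OSTSIZE=${OSTSIZE:-$(default_ostsize 524288)}
echo "OSTSIZE=$OSTSIZE"
```

A user who exports OSTSIZE by hand keeps full control, which matches the "unless the user changed some settings manually" caveat.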
| Comment by Jian Yu [ 08/Apr/11 ] |
|
Thanks for your suggestion, Prakash. I just looked through the test suite and found the following tests which have requirements on filesystem free space or the minimum available size of OST:
sanity test 27m: requires the free space <= $((200000 * $OSTCOUNT))KB
sanity test 64b: requires the free space <= $((400000 * $STRIPECOUNT))KB
sanityn test 15: requires the free space <= $((400000 * $STRIPECOUNT))KB
sanity test 116: requires the minimum available size of OST <= 960000KB
sanity test 101d: requires > 500MB free space
sanity test 78: requires the minimum available size of OST >= 10240KB
sanity test 155*: requires the minimum available size of OST >= $((cache_size * 2))KB
sanity-benchmark test pios: requires the minimum available size of OST >= 9020KB
parallel-scale test compilebench: requires > 680MB free space
parallel-scale test IOR: requires > $(($num_clients * 2))GB free space
So, without changing the test scripts, it's hard to figure out a default OSTSIZE which could meet all of the requirements. From the above list, the IOR test requires the largest free space by default, and sanity test 27m, 64b and sanityn test 15 require the smallest free space by default (OSTCOUNT=2, STRIPECOUNT=1). It seems OSTSIZE=400000 is a reasonable default value to compare with $((cache_size * 2)) in cfg/local.sh, and we need to change IOR, sanity test 27m, 64b and sanityn test 15 to work with this default value. In the meantime, the skip commands in those tests would be changed to error commands. What do you think of these changes, Chris? |
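As a worked check of the arithmetic behind the 400000 figure, the expressions quoted in the list can be evaluated with the default values mentioned in the comment (OSTCOUNT=2, STRIPECOUNT=1). The function names below are hypothetical and exist only for this illustration; the multipliers are the ones quoted above.

```shell
#!/bin/bash
# Hypothetical helpers that only reproduce the quoted expressions.
threshold_27m_kb()  { echo $((200000 * $1)); }   # $1: OSTCOUNT
threshold_64b_kb()  { echo $((400000 * $1)); }   # $1: STRIPECOUNT
ior_free_space_gb() { echo $(($1 * 2)); }        # $1: num_clients

echo "sanity 27m:              $(threshold_27m_kb 2) KB"   # 400000 KB
echo "sanity 64b / sanityn 15: $(threshold_64b_kb 1) KB"   # 400000 KB
```

With those defaults both thresholds come out to 400000 KB, which is where the proposed default comes from. The IOR figure depends on the client count, so it has no single default.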
| Comment by Prakash Surya (Inactive) [ 02/May/11 ] |
|
Updated Yu Jian's patch to error on a small OST size: http://review.whamcloud.com/#change,481 Was there any consensus formed regarding the default OSTSIZE value? |
| Comment by Jian Yu [ 04/May/11 ] |
|
Chris, what's your opinion on the default OSTSIZE value? |
| Comment by Chris Gearing (Inactive) [ 04/May/11 ] |
|
Hi, Jira is currently unavailable so perhaps you can update it for me. To be Andreas can you look at the Jira and help answer Yu Jian's question, thanks. Chris |
| Comment by Andreas Dilger [ 05/May/11 ] |
|
I unfortunately disagree with Chris on this. Per Yu Jian's comment, it is entirely possible that there is no OST size that will satisfy all of the constraints for the different tests. The ENOSPC tests are currently skipped if the OST size is too large, because they would otherwise take many hours to fill the OST, and can be tested much more efficiently with a small OST. The test in question (155) will fail if the OST size is smaller than the amount of RAM on all of the OSS nodes, which is growing increasingly large.
The real danger about making these tests always FAIL on certain configurations is that users will put the test in ALWAYS_EXCEPT to avoid the failure, and then it will never be run, even on systems that could potentially run it and exercise the desired functionality. This would have the net effect of reducing the testing coverage instead of increasing it.
I agree that the test shouldn't be marked "PASS" if it wasn't run, but skip_env() exists precisely for this reason: to notify about tests that couldn't be run because of the particular environment (too few OSTs, too large, too small, etc). The reporting of skip_env could be improved to distinguish it from a normal "SKIP" due to some test being in ALWAYS_EXCEPT.
Chris is also concerned about tests that the user explicitly asked to be run, but are skipped anyway. The code in run_test() will always run a test in ONLY= even if it is in EXCEPT=, but the skip_env() logic needs to be improved to FAIL if the test is in ONLY instead of marking it SKIP. Otherwise, I think returning "SKIP_ENV reason" is a valid test result, as long as this is presented in the test summary where the tester will see it.
In the case of this particular test, it looks like the OST size requirements could be reduced somewhat by running with only a single-striped file on a particular OST, and only checking the RAM on that OST's OSS node. That might allow the test to actually be run, because it isn't creating $big to be larger than 7 OSS nodes' RAM while the file is only striped over a single OST. It makes sense to report the results of running the "small file" parts of the test separately from the $big file, by splitting each test into two parts. This would have the added benefit of always running the small file tests even if the $big test is skipped. |
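The skip_env() behaviour described in this comment could look roughly like the sketch below. It is an assumption-laden illustration, not the real skip_env() or run_test() code: `env_mismatch_result` is a hypothetical function, and it matches test names exactly, whereas the real framework's ONLY matching may be looser.

```shell
#!/bin/bash
# Hypothetical decision helper (not the real skip_env()).
# $1: name of the test whose environment prerequisite is not met
# $2: space-separated list of explicitly requested tests (like ONLY)
# Prints FAIL if the test was explicitly requested, SKIP_ENV otherwise.
env_mismatch_result() {
    local testname=$1
    local only_list=$2
    local t
    for t in $only_list; do
        if [ "$t" = "$testname" ]; then
            echo "FAIL"
            return 0
        fi
    done
    echo "SKIP_ENV"
}

env_mismatch_result 155a "24v 155a"   # explicitly requested
env_mismatch_result 155a ""           # on by default only
```

The first call prints FAIL and the second prints SKIP_ENV, so a default run is not blocked by an undersized OST while an explicit request still surfaces the problem as a failure.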
| Comment by Chris Gearing (Inactive) [ 06/May/11 ] |
|
I think there is huge ambiguity in the whole test environment about what's a pass and what's a fail. There will need to be a project in the future to tighten up the whole thing so we know what PASS means. At present PASS means [probably means] didn't FAIL, but if it was an aeroplane you'd not like this, and so we shouldn't like it for Lustre. So the method used for this test fits the Lustre process that's in use today; we can work on tightening that in the future. |
| Comment by Prakash Surya (Inactive) [ 10/May/11 ] |
Made this patch here: http://review.whamcloud.com/#change,528 |
| Comment by Prakash Surya (Inactive) [ 10/May/11 ] |
|
Maloo results for test 155 with patches 368 and 528: "small" OSTSIZE: https://maloo.whamcloud.com/test_sets/2a7c14d0-7b4e-11e0-b5bf-52540025f9af |
| Comment by Prakash Surya (Inactive) [ 11/May/11 ] |
Patch created: http://review.whamcloud.com/#change,534 |
| Comment by Build Master (Inactive) [ 11/May/11 ] |
|
Integrated in Oleg Drokin : dc0317c7150f068cff7343051092236b5c9c29eb
|
| Comment by Build Master (Inactive) [ 18/May/11 ] |
|
Integrated in Oleg Drokin : f93d0f6a82844f079b5301620301d97bf1ceb1d2
|
| Comment by Peter Jones [ 24/May/11 ] |
|
Patch landed for 2.1 so resolving. Please reopen if any further work is necessary |
| Comment by Prakash Surya (Inactive) [ 24/May/11 ] |
|
There is still one more patch related to this bug waiting to be merged: http://review.whamcloud.com/#change,534 I don't think this bug needs to be reopened, but I just wanted to bring some attention to that awaiting changeset. |
| Comment by Peter Jones [ 24/May/11 ] |
|
Ah ok Prakash. In that case I think that we should reopen the ticket until Oleg has completed the inspection on the remaining patch and it has been landed |
| Comment by Build Master (Inactive) [ 25/May/11 ] |
|
Integrated in Oleg Drokin : 08ac163a04f0a9c5e4348174aa835796e2190e28
|
| Comment by Prakash Surya (Inactive) [ 25/May/11 ] |
|
Peter, the last patch I was waiting on has been merged. This can be marked as resolved now. Thanks! |
| Comment by Jian Yu [ 25/May/11 ] |
|
All of the patches have been pushed to the master branch in fs/lustre-release repo. The issue was resolved. |
| Comment by Build Master (Inactive) [ 26/May/11 ] |
|
Integrated in Oleg Drokin : 08ac163a04f0a9c5e4348174aa835796e2190e28
|