[LU-797] Test failure on test suite ost-pools, subtest test_14, test_18, test_23 Created: 27/Oct/11 Updated: 22/Feb/13 Resolved: 19/Dec/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.2.0, Lustre 2.1.2, Lustre 2.1.3, Lustre 2.1.4, Lustre 1.8.8 |
| Fix Version/s: | Lustre 2.4.0, Lustre 2.1.4, Lustre 1.8.9 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Maloo | Assignee: | Niu Yawei (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 4734 |
| Description |
|
This issue was created by maloo for sarah <sarah@whamcloud.com>.

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/f6c809ee-00ec-11e1-bb4f-52540025f9af.

The sub-test test_14 failed with the following error:

Info required for matching: ost-pools 14 |
| Comments |
| Comment by Peter Jones [ 27/Oct/11 ] |
|
Niu will look into this one |
| Comment by Niu Yawei (Inactive) [ 29/Oct/11 ] |
|
Looking at the client test log:

Filling up OST0
14188+0 records in
14188+0 records out
14877196288 bytes (15 GB) copied, 836.591 seconds, 17.8 MB/s
14188+0 records in
14188+0 records out
14877196288 bytes (15 GB) copied, 908.419 seconds, 16.4 MB/s
14188+0 records in
14188+0 records out
14877196288 bytes (15 GB) copied, 922.504 seconds, 16.1 MB/s

The test is filling up OST0 (which is ~150G) and has already spent (836 + 908 + 922) = 2666 seconds writing only 45G, while the timeout of this test is 3600 seconds, so it simply does not have enough time to fill OST0. I think this is a test issue; should we adjust the timeout according to the OST size, or just skip the test when the OST is too big? |
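A minimal sketch of the two options raised above, assuming the usual test-framework conventions of OSTSIZE in KB and a skip_env helper; the 100G threshold and the per-GB timeout scaling are illustrative choices, not values from this ticket.

    # Sketch only: scale the test timeout with the OST size, or skip on OSTs
    # too large to fill within a sane budget.  OSTSIZE (KB) and skip_env are
    # assumed from the test framework; the numbers below are illustrative.
    ostsize_gb=$((OSTSIZE / 1024 / 1024))

    if [ "$ostsize_gb" -gt 100 ]; then
        skip_env "OST too large (${ostsize_gb}G) to fill before the timeout"
        return 0
    fi

    # Give roughly one minute of budget per GB that must be written,
    # keeping the current 3600s value as a floor.
    TIMEOUT=$((ostsize_gb * 60))
    [ "$TIMEOUT" -lt 3600 ] && TIMEOUT=3600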
| Comment by Mikhail Pershin [ 31/Oct/11 ] |
|
This is similar to the test_220 issue in sanity.sh. |
| Comment by Jian Yu [ 01/Nov/11 ] |
|
ost-pools test 14 passed on Lustre v1_8_7_WC1_RC1 in 1850s. The test output showed:

Filling up OST0
2756+0 records in
2756+0 records out
2889875456 bytes (2.9 GB) copied, 200.484 seconds, 14.4 MB/s
2756+0 records in
2756+0 records out
2889875456 bytes (2.9 GB) copied, 179.428 seconds, 16.1 MB/s
2756+0 records in
2756+0 records out
2889875456 bytes (2.9 GB) copied, 160.244 seconds, 18.0 MB/s
2756+0 records in
2756+0 records out
2889875456 bytes (2.9 GB) copied, 166.279 seconds, 17.4 MB/s
2756+0 records in
2756+0 records out
2889875456 bytes (2.9 GB) copied, 159.178 seconds, 18.2 MB/s
2756+0 records in
2756+0 records out
2889875456 bytes (2.9 GB) copied, 160.844 seconds, 18.0 MB/s
2756+0 records in
2756+0 records out
2889875456 bytes (2.9 GB) copied, 177.44 seconds, 16.3 MB/s
2756+0 records in
2756+0 records out
2889875456 bytes (2.9 GB) copied, 187.56 seconds, 15.4 MB/s
2756+0 records in
2756+0 records out
2889875456 bytes (2.9 GB) copied, 170.918 seconds, 16.9 MB/s

UUID                   1K-blocks      Used  Available Use% Mounted on
lustre-MDT0000_UUID      7498624    443048    6555576   6% /mnt/lustre[MDT:0]
lustre-OST0000_UUID     30192956  25817228    2846080  90% /mnt/lustre[OST:0]
lustre-OST0001_UUID     30192956    454760   28222820   2% /mnt/lustre[OST:1]
lustre-OST0002_UUID     30192956    454760   28222820   2% /mnt/lustre[OST:2]
lustre-OST0003_UUID     30192956    454760   28222820   2% /mnt/lustre[OST:3]
lustre-OST0004_UUID     30192956    454760   28222820   2% /mnt/lustre[OST:4]
lustre-OST0005_UUID     30192956    454760   28222820   2% /mnt/lustre[OST:5]
lustre-OST0006_UUID     30192956    454760   28222820   2% /mnt/lustre[OST:6]
filesystem summary:    211350692  28545788  172183000  14% /mnt/lustre

So the OST size was about 30G. Chris, does the "OSTSIZE" variable have different values on different VM clusters? Could you please adjust the size so that the test neither times out nor has to be skipped?

FYI, due to the timeout of test 14, the remaining ost-pools subtests were not run. |
| Comment by Johann Lombardi (Inactive) [ 03/Nov/11 ] |
|
Has anybody rerun the missing tests? |
| Comment by Sarah Liu [ 03/Nov/11 ] |
|
I will rerun the missing tests for both LU-797 and LU-798 and keep you updated. |
| Comment by Sarah Liu [ 03/Nov/11 ] |
|
The remaining ost-pools tests passed: https://maloo.whamcloud.com/test_sets/caa784fe-0653-11e1-b578-52540025f9af |
| Comment by Andreas Dilger [ 12/Dec/11 ] |
|
It looks like the failing tests on master are creating 10x 14GB files (consistent with the 150GB OST size), while the passing 1.8.7 tests successfully create 10x 2.7GB files in 1850s. That means the tests on master would need at least 9500s to pass, which is completely unreasonable for a subtest like this; it should complete in a minute or less, given its low importance compared to other tests.

ost-pools.sh test_14 is testing QOS space balancing for OST pools, similar to what sanity.sh test_116 does for QOS space balancing without pools. It seems that test_116 is simply skipped when the OST is too large.

Probably the best solution is to have an OBD_FAIL_CHECK() that can change the os_bfree number at the fsfilt or OSD level for the OST index given in fail_val. Even better might be fail_val = (free_fraction << 16) | ost_idx, where free_fraction is in units of 1/256 of the free space (which gives about 0.5% granularity), so that the fail_loc can be useful in other tests. That would avoid all of the IO in these tests and make them MUCH faster.

It would be worthwhile to investigate other tests that need to fill up the filesystem to see if they can similarly be sped up. |
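A rough sketch of how a test could drive the proposed fail_loc from the shell, assuming the OSD-side check described above existed; the 0x240 fail_loc number is purely a placeholder, do_facet and lctl set_param are the standard test-framework and lctl calls, and fail_val packs (free_fraction << 16) | ost_idx as suggested in the comment.

    # Sketch only: pretend OST index 0 reports 2/256 (~0.8%) of its space free.
    # 0x240 is a made-up fail_loc for the hypothetical OSD-side check.
    ost_idx=0
    free_fraction=2                          # units of 1/256 of total space
    fail_val=$(( (free_fraction << 16) | ost_idx ))

    do_facet ost1 "lctl set_param fail_val=$fail_val fail_loc=0x240"
    # ... run the QOS space-balancing checks without any bulk IO ...
    do_facet ost1 "lctl set_param fail_loc=0 fail_val=0"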
| Comment by nasf (Inactive) [ 26/Dec/11 ] |
|
Another failure: https://maloo.whamcloud.com/test_sets/bdf73f6c-2dc5-11e1-819b-5254004bbbd3 |
| Comment by Li Wei (Inactive) [ 18/Jan/12 ] |
|
https://maloo.whamcloud.com/test_sets/ecef3f7a-41db-11e1-9650-5254004bbbd3 (Orion LDiskFS) |
| Comment by Andreas Dilger [ 27/Jan/12 ] |
|
I submitted change http://review.whamcloud.com/2028 to speed up the ost-pools script.

For test_14, since the test is already reformatting the filesystem, it now simply formats with a smaller OST size. I would prefer that we not reformat the filesystem at all for these tests, and instead implement an OBD_FAIL_CHECK to allow them to run on any configuration, but this is a simple workaround for now that will reduce the time to run ost-pools by 1.5h on large systems.

Also, test_18 was taking a long time because it was trying to create and unlink 3 * 3 * 30000 files, which is very slow on slow VM systems. Instead, it now uses createmany to run for a fixed time rather than for a specific number of files, so that it can properly assess performance without running too long or too short on different hardware configurations. This should reduce the test time by another 0.5h or so. |
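A minimal sketch of the fixed-time idea, deliberately assuming nothing about createmany's actual options; a plain loop with bash's SECONDS counter illustrates the approach, and DIR/tfile are the usual test-framework variables.

    # Sketch only: create files for a fixed wall-clock budget rather than a
    # fixed count, then report what was achieved.  The 60s budget is an
    # arbitrary example value.
    budget=60
    created=0
    SECONDS=0
    while [ "$SECONDS" -lt "$budget" ]; do
        touch "$DIR/$tfile.$created" || break
        created=$((created + 1))
    done
    echo "created $created files in ${SECONDS}s"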
| Comment by Build Master (Inactive) [ 02/Feb/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Peter Jones [ 02/Feb/12 ] |
|
Landed for 2.2 |
| Comment by Niu Yawei (Inactive) [ 08/Feb/12 ] |
|
It looks like test_23b still times out: https://maloo.whamcloud.com/test_sets/8bc6156c-4fef-11e1-9ebc-5254004bbbd3

This test (test_23b) adds 3 OSTs to a pool and checks that the available space of the pool is less than $MAXFREE; however, it writes files over all OSTs at the end. Did we miss creating the file in the pool? |
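A sketch of how the fill-up file could be pinned to the pool before writing, using the standard lfs setstripe --pool option; the directory layout and dd invocation are illustrative.

    # Sketch only: make everything created under the test directory stripe
    # onto the pool's OSTs, so the out-of-space check exercises the pool
    # rather than the whole filesystem.
    mkdir -p "$DIR/$tdir"
    lfs setstripe --pool testpool -c -1 "$DIR/$tdir"

    # Writes now inherit the pool layout and can only consume the pool's
    # OSTs, so they should hit ENOSPC near the pool's available space.
    dd if=/dev/zero of="$DIR/$tdir/$tfile" bs=1M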
| Comment by Niu Yawei (Inactive) [ 08/Feb/12 ] |
|
Andreas, could you take a look? |
| Comment by Andreas Dilger [ 09/Feb/12 ] |
|
You are correct, it appears the output files are not being created in the pool. However, looking back at the original test_23, that was also true at the time, so I'm not sure what the original intent of this test was. Looking further back in Git/CVS to determine whether the test ran differently when it was introduced, I didn't see any significant changes to test_23 since 1.8.5.

Note also that I have a second patch which skips test_23b() when SLOW=no, because I don't think this is a very useful test (http://review.whamcloud.com/2091). |
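For reference, a sketch of the usual test-framework idiom for gating a subtest on SLOW=no; the skip message wording is illustrative.

    test_23b() {
        # Sketch only: skip this slow out-of-space test unless SLOW=yes.
        [ "$SLOW" = no ] && skip "skipping OOS test for SLOW=no" && return 0

        # ... the rest of the OOS checks would follow here ...
    }
    run_test 23b "OST pools and OOS"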
| Comment by Build Master (Inactive) [ 15/Feb/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Build Master (Inactive) [ 17/Feb/12 ] |
|
Integrated in Result = FAILURE
Oleg Drokin : f7b4054cfc1d30fbbfd56acfe4b5a7a334de8212
|
| Comment by Build Master (Inactive) [ 17/Feb/12 ] |
|
Integrated in Result = ABORTED
Oleg Drokin : f7b4054cfc1d30fbbfd56acfe4b5a7a334de8212
|
| Comment by Jian Yu [ 20/May/12 ] |
|
Lustre Tag: v1_8_8_WC1_RC2

ost-pools test 14 timed out: https://maloo.whamcloud.com/test_sets/23fdf678-a258-11e1-a6cd-52540035b04c |
| Comment by Jian Yu [ 31/May/12 ] |
|
Lustre Tag: v2_1_2_RC2

ost-pools test 23b timed out: https://maloo.whamcloud.com/test_sets/7d305f1e-aa33-11e1-971d-52540035b04c |
| Comment by Jian Yu [ 10/Jun/12 ] |
During the build/release testing cycle, all of the auster tests are run with SLOW=yes, so the ost-pools test 23b timeout issue still exists with large OSTs: |
| Comment by Jian Yu [ 13/Aug/12 ] |
|
More instances occurred during Lustre 2.1.3 RC1 testing: |
| Comment by Jian Yu [ 21/Aug/12 ] |
|
More instances occurred during Lustre 2.1.3 RC2 testing: |
| Comment by Niu Yawei (Inactive) [ 05/Nov/12 ] |
|
Looking at the recent failure:

== ost-pools test 23b: OST pools and OOS ============================================================= 21:49:32 (1345438172)
running as uid/gid/euid/egid 500/500/500/500, groups: [touch] [/mnt/lustre/d0_runas_test/f13763]
CMD: client-30vm3 lctl pool_new lustre.testpool
client-30vm3: Pool lustre.testpool created
CMD: client-30vm6.lab.whamcloud.com lctl get_param -n lov.lustre-*.pools.testpool 2>/dev/null || echo foo
CMD: client-30vm3 lctl pool_add lustre.testpool lustre-OST[0000-0006/3]
client-30vm3: OST lustre-OST0000_UUID added to pool lustre.testpool
client-30vm3: OST lustre-OST0003_UUID added to pool lustre.testpool
client-30vm3: OST lustre-OST0006_UUID added to pool lustre.testpool
CMD: client-30vm6.lab.whamcloud.com lctl get_param -n lov.lustre-*.pools.testpool | sort -u | tr '\n' ' '
1: 32768+0 records in 32768+0 records out 34359738368 bytes (34 GB) copied, 1656.82 seconds, 20.7 MB/s
2: 32768+0 records in 32768+0 records out 34359738368 bytes (34 GB) copied, 1761.27 seconds, 19.5 MB/s
3: 32768+0 records in 32768+0 records out 34359738368 bytes (34 GB) copied, 1921.64 seconds, 17.9 MB/s
4: 32768+0 records in 32768+0 records out 34359738368 bytes (34 GB) copied, 1941.35 seconds, 17.7 MB/s
5: 32768+0 records in 32768+0 records out 34359738368 bytes (34 GB) copied, 2098.63 seconds, 16.4 MB/s
6: 32768+0 records in 32768+0 records out 34359738368 bytes (34 GB) copied, 2030.07 seconds, 16.9 MB/s

MAXFREE should be (2000000 KB * 7 OSTs) / (1024 * 1024) = ~13GB, which means the available space in this pool should be less than 13GB; however, the test writes much more than that and never fails with ENOSPC. I think we should add more debug information to the test, and also break out of the write loop whenever the bytes written exceed the available space. |
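A sketch of the proposed safety check: stop the fill-up loop once the bytes written exceed the pool's available space, summed here from plain lfs df over the pool's OSTs 0, 3 and 6 to match the log above. Variable names and the 1GB chunk size are illustrative.

    # Sketch only: stop writing once we have exceeded the pool's available
    # space (plus one chunk of slack) instead of looping until the harness
    # times out.  Sizes are in KB; Available is column 4 of lfs df output.
    pool_avail_kb=$(lfs df $MOUNT |
        awk '/OST000[036]_UUID/ { sum += $4 } END { print sum }')

    chunk_kb=$((1024 * 1024))                # write 1GB per file
    written_kb=0
    i=0
    while [ "$written_kb" -le $((pool_avail_kb + chunk_kb)) ]; do
        dd if=/dev/zero of="$DIR/$tdir/$tfile.$i" bs=1k count=$chunk_kb || break
        written_kb=$((written_kb + chunk_kb))
        i=$((i + 1))
    done
    echo "wrote ${written_kb}KB with ${pool_avail_kb}KB available in the pool"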
| Comment by Niu Yawei (Inactive) [ 05/Nov/12 ] |
| Comment by Niu Yawei (Inactive) [ 29/Nov/12 ] |
|
Remove the space from the "lfs df" output internally: http://review.whamcloud.com/4705 |
| Comment by Emoly Liu [ 10/Dec/12 ] |
b2_1 port is at http://review.whamcloud.com/#change,4784 |
| Comment by Jian Yu [ 10/Dec/12 ] |
|
Lustre Branch: b2_1

The same issue occurred: |
| Comment by Emoly Liu [ 14/Dec/12 ] |
|
b2_1 port is at http://review.whamcloud.com/4831 |
| Comment by Peter Jones [ 19/Dec/12 ] |
|
Fix landed for 2.1.4 and 2.4 |
| Comment by Emoly Liu [ 28/Dec/12 ] |
|
The related patches for b1_8 are merged into http://review.whamcloud.com/4898 |
| Comment by Jian Yu [ 30/Dec/12 ] |
|
Lustre Branch: b1_8

The ost-pools test 14 timed out: https://maloo.whamcloud.com/test_sets/f8ded3ce-5187-11e2-bbc3-52540035b04c |