[LU-798] Test failure on test suite performance-sanity,subtest test_5 Created: 27/Oct/11  Updated: 13/Jul/16  Resolved: 13/Jul/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.2, Lustre 2.1.3, Lustre 2.1.4, Lustre 1.8.8
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: Zhenyu Xu
Resolution: Cannot Reproduce Votes: 0
Labels: None

Issue Links:
Related
Severity: 3
Rank (Obsolete): 5794

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/6924c914-00ed-11e1-bb4f-52540025f9af.
I am not sure if this is a duplicate of LU-758.



 Comments   
Comment by Peter Jones [ 27/Oct/11 ]

Bobijam will look into this one

Comment by Zhenyu Xu [ 27/Oct/11 ]

From the dmesg log, I don't see any errors occurring during the test, except that test_5 took almost an hour; I suspect the 3600-second test time limit interrupted the test.

client dmesg
Lustre: DEBUG MARKER: == performance-sanity test 5: lookup rate 10M file dir =============================================== 09:45:59 (1319733959)
Lustre: DEBUG MARKER: ===== mdsrate-lookup-1dir.sh ======
Lustre: DEBUG MARKER: Using TIMEOUT=20
Lustre: DEBUG MARKER: ===== mdsrate-lookup-1dir.sh Test preparation: creating 1000000 files.
Lustre: DEBUG MARKER: ===== mdsrate-lookup-1dir.sh
Lustre: DEBUG MARKER: ===== mdsrate-lookup-1dir.sh
Lustre: DEBUG MARKER: == mdsrate-lookup-1dir mdsrate-lookup-1dir.sh: test complete, cleaning up ============================ 10:33:31 (1319736811)
SysRq : Show State

                                                       sibling
  task                 PC          pid father child younger older
init          S ffff81007e3567e0     0     1      0     2               (NOTLB)
 ffff81007ff95a28 0000000000000082 00000000000280d2 ffff810000017c10
 ffff81007ff857a0 000000000000000a ffff81007ff857a0 ffff81007e3567e0
 00000aabaa90b8db 0000000000006e20 ffff81007ff85988 0000000088031cb7
Comment by Zhenyu Xu [ 28/Oct/11 ]

I checked Maloo for some passed performance-sanity test_5 runs to get a rough idea of how much time a successful test needs.

This test run https://maloo.whamcloud.com/sub_tests/9e94a664-f710-11e0-a451-52540025f9af took 2638s, but it only tested with 433075 files (judging from the test script, this is set by the NUM_FILES environment variable).

This test run https://maloo.whamcloud.com/sub_tests/116bdd9a-e467-11e0-9909-52540025f9af took 2493s with 452393 files.

Test runs with 1M files all timed out during the cleanup phase.

Comment by Jian Yu [ 01/Nov/11 ]

performance-sanity test 5 passed on v1_8_7_WC1_RC1 in 2692s with 438805 files:
https://maloo.whamcloud.com/test_sets/eb31d5ac-f4cf-11e0-908b-52540025f9af

UUID                      Inodes       IUsed       IFree IUse% Mounted on
lustre-MDT0000_UUID      5000040          52     4999988   0% /mnt/lustre[MDT:0]
lustre-OST0000_UUID       445440        2202      443238   0% /mnt/lustre[OST:0]
lustre-OST0001_UUID       445440          88      445352   0% /mnt/lustre[OST:1]
lustre-OST0002_UUID       445440          89      445351   0% /mnt/lustre[OST:2]
lustre-OST0003_UUID       445440          89      445351   0% /mnt/lustre[OST:3]
lustre-OST0004_UUID       445440          89      445351   0% /mnt/lustre[OST:4]
lustre-OST0005_UUID       445440          89      445351   0% /mnt/lustre[OST:5]
lustre-OST0006_UUID       445440          89      445351   0% /mnt/lustre[OST:6]

filesystem summary:      5000040          52     4999988   0% /mnt/lustre

===== mdsrate-lookup-1dir.sh Test preparation: creating 438805 files.

The number of files was calculated as follows:

mdsrate-lookup-1dir.sh:

NUM_FILES=${NUM_FILES:-1000000}

IFree=$(mdsrate_inodes_available)
if [ $IFree -lt $NUM_FILES ]; then
    NUM_FILES=$IFree
fi

test-framework.sh:

inodes_available () {
    local IFree=$($LFS df -i $MOUNT | grep ^$FSNAME | awk '{print $4}' | sort -un | head -1) || return 1
    echo $IFree
}

mdsrate_inodes_available () {
    local min_inodes=$(inodes_available)
    echo $((min_inodes * 99 / 100))
}

So, it depends on the minimum number of free inodes across the MDT and OSTs. For RC1, the OSTSIZE was about 30G and the minimum IFree was about 443238; however, for RC2, the OSTSIZE on the VM cluster was about 150G and the minimum IFree was about 2246430:

UUID                      Inodes       IUsed       IFree IUse% Mounted on
lustre-MDT0000_UUID      5000040          52     4999988   0% /mnt/lustre[MDT:0]
lustre-OST0000_UUID      2257920       11490     2246430   1% /mnt/lustre[OST:0]
lustre-OST0001_UUID      2257920          89     2257831   0% /mnt/lustre[OST:1]
lustre-OST0002_UUID      2257920          89     2257831   0% /mnt/lustre[OST:2]
lustre-OST0003_UUID      2257920          88     2257832   0% /mnt/lustre[OST:3]
lustre-OST0004_UUID      2257920          89     2257831   0% /mnt/lustre[OST:4]
lustre-OST0005_UUID      2257920          89     2257831   0% /mnt/lustre[OST:5]
lustre-OST0006_UUID      2257920          89     2257831   0% /mnt/lustre[OST:6]

filesystem summary:      5000040          52     4999988   0% /mnt/lustre

===== mdsrate-lookup-1dir.sh Test preparation: creating 1000000 files.
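
As a quick sanity check of the formula above (plain shell arithmetic added here for illustration, using the minimum IFree values from the two df listings):

# RC1: the minimum IFree across all targets was 443238 (lustre-OST0000); 99% of
# that is below the 1000000 default, so NUM_FILES gets capped:
echo $((443238 * 99 / 100))     # -> 438805, matching "creating 438805 files"

# RC2: the minimum IFree was 2246430; 99% of that exceeds the default, so the
# test keeps NUM_FILES=1000000:
echo $((2246430 * 99 / 100))    # -> 2223965, which is greater than 1000000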

So, this is a similar issue to LU-797, caused by a large OSTSIZE combined with a limited timeout value.

Chris, could you please take a look at this issue and make some adjustments to the OSTSIZE or the timeout value? Thanks!

Comment by Peter Jones [ 03/Nov/11 ]

> Chris, could you please take a look at this issue and make some adjustments to the OSTSIZE or the timeout value? Thanks!

Bobijam, it would be better to make such requests via a TT JIRA ticket.

Comment by Sarah Liu [ 03/Nov/11 ]

The remaining performance-sanity tests pass: https://maloo.whamcloud.com/test_sets/7d4cb962-0668-11e1-9433-52540025f9af

Comment by Jian Yu [ 11/May/12 ]

Lustre Build: http://build.whamcloud.com/job/lustre-b1_8/194/
Distro/Arch: RHEL5.8/x86_64 (kernel 2.6.18-308.4.1.el5)
Network: TCP (1GigE)
ENABLE_QUOTA=yes

The same issue occurred: https://maloo.whamcloud.com/test_sets/b8d82eb2-9b50-11e1-a0a0-52540035b04c

Comment by Jian Yu [ 31/May/12 ]

Lustre Tag: v2_1_2_RC2
Lustre Build: http://build.whamcloud.com/job/lustre-b2_1/86/
Distro/Arch: RHEL6.2/x86_64(server), RHEL6.2/i686(client)
Network: TCP (1GigE)
ENABLE_QUOTA=yes

performance-sanity test 6 timed out: https://maloo.whamcloud.com/test_sets/9b7b720a-aa34-11e1-971d-52540035b04c

Comment by Jian Yu [ 21/Aug/12 ]

More instances occurred while testing Lustre 2.1.3 RC2:
https://maloo.whamcloud.com/test_sets/b5722254-eb33-11e1-ba73-52540035b04c
https://maloo.whamcloud.com/test_sets/9c1e8436-eb3e-11e1-ba73-52540035b04c

Comment by Jian Yu [ 10/Dec/12 ]

Lustre Branch: b2_1
Lustre Build: http://build.whamcloud.com/job/lustre-b2_1/148
Distro/Arch: RHEL6.3/x86_64 (kernel version: 2.6.32-279.14.1.el6)
Network: TCP (1GigE)

The same issue occurred:
https://maloo.whamcloud.com/test_sets/eba241ba-41dc-11e2-adcf-52540035b04c

Comment by Zhenyu Xu [ 13/Dec/12 ]

b2_1 patch tracking at http://review.whamcloud.com/4830

commit message
    LU-798 test: reduce performance test file number
    
    The performance test takes too much time deleting all the test files
    which timedout the test suite.
    
    Test-Parameters: envdefinitions=SLOW=yes,ENABLE_QUOTA=yes testlist=performance-sanity
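
For anyone reproducing this by hand, the NUM_FILES override quoted earlier suggests the file count can also be capped from the environment. A hypothetical invocation along those lines (only the NUM_FILES default in mdsrate-lookup-1dir.sh is confirmed above; the exact way performance-sanity.sh picks up these variables, including the ONLY selector, is an assumption):

# hypothetical manual run, capping the file count near the ~450000 level at
# which earlier runs completed within the 2-hour limit
cd lustre/tests
NUM_FILES=450000 SLOW=yes ENABLE_QUOTA=yes ONLY=5 bash performance-sanity.sh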

Comment by Jian Yu [ 17/Dec/12 ]

Lustre Tag: v2_1_4_RC1

https://maloo.whamcloud.com/test_sets/07afddb8-4831-11e2-8cdc-52540035b04c
https://maloo.whamcloud.com/test_sets/af0b031e-482a-11e2-a406-52540035b04c

Comment by Jian Yu [ 26/Dec/12 ]

From the historical Maloo reports, I found that all of the passed performance-sanity test runs used about 450000 files, while all of the timed-out runs used 1000000 files due to the large OSTSIZE.

Comment by Jian Yu [ 26/Dec/12 ]

Hello Chris,

Oleg suggested adjusting the length of time allocated to the performance-sanity test so that it has time to complete. Could you please increase the autotest timeout value for the performance-sanity test? Thanks.

Comment by Chris Gearing (Inactive) [ 02/Jan/13 ]

The performance-sanity timeout is set to 2 hours for test_4, test_5 and test_7, so I fear the code might be broken in autotest. I need to figure out how to investigate.

Comment by Chris Gearing (Inactive) [ 03/Jan/13 ]

OK.

What confused me here is that this bug dates back to 2011, so many of the examples were fixed a long time ago.

I've increased the timeout for test_7 to 7200s, which makes it the same as tests 5, 6 and 8.

Comment by Chris Gearing (Inactive) [ 03/Jan/13 ]

Looking at the recent examples this issue should now be fixed.

Comment by Jian Yu [ 13/Jan/13 ]

Hi Chris,

The performance-sanity test timed out on the latest b1_8 build #241. This time it's sub-test 8:

https://maloo.whamcloud.com/test_sets/79fc558c-5ce9-11e2-afaf-52540035b04c

I did not find any Lustre error messages in the logs. So, it's still due to low performance and needs more time.

Comment by Andreas Dilger [ 13/Jul/16 ]

Closing this old bug.
