[LU-1233] Test failure on test suite parallel-scale, subtest test_compilebench, no space left Created: 19/Mar/12  Updated: 31/Dec/13  Resolved: 11/Dec/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.3.0, Lustre 2.1.5, Lustre 1.8.9, Lustre 2.4.1, Lustre 2.5.0, Lustre 2.6.0
Fix Version/s: Lustre 2.1.6, Lustre 2.6.0, Lustre 2.4.2, Lustre 2.5.1

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: Jian Yu
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 5184

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/7722bc68-70e8-11e1-a89e-5254004bbbd3.

The sub-test test_compilebench failed with the following error:

compilebench failed: 1

Info required for matching: parallel-scale compilebench



 Comments   
Comment by Sarah Liu [ 27/Mar/12 ]

another no space left: https://maloo.whamcloud.com/test_sets/8653c500-76ca-11e1-ae2e-5254004bbbd3

Comment by Jian Yu [ 12/Oct/12 ]

More instances:
https://maloo.whamcloud.com/test_sets/e19f266e-1474-11e2-8ca0-52540035b04c
https://maloo.whamcloud.com/test_sets/0631ab8c-1475-11e2-8ca0-52540035b04c
https://maloo.whamcloud.com/test_sets/99581f3c-1474-11e2-8ca0-52540035b04c

Comment by Jian Yu [ 06/Dec/12 ]

Lustre Branch: b2_1
Lustre Build: http://build.whamcloud.com/job/lustre-b2_1/139
Network: o2ib

performance-sanity: https://maloo.whamcloud.com/test_sets/c2a8267c-3ba0-11e2-b98e-52540035b04c
parallel-scale: https://maloo.whamcloud.com/test_sets/141e9cfc-3ba1-11e2-b98e-52540035b04c

Comment by Peter Jones [ 06/Dec/12 ]

Minh

Could you please look at this issue? We need to identify the OSTSIZE used for the IB test cluster and compare it with the one set on the TCP test cluster, so that the autotest settings for IB clusters can be adjusted accordingly.

Thanks

Peter

Comment by Minh Diep [ 06/Dec/12 ]

Chris confirmed that the OST devices on the IB clusters are only 2G each, which means 2G * 7 OSTs = 14G. That is a very small filesystem compared to 142G * 7 OSTs on TCP.

The PV on the cluster is:

[root@client-21-ib ~]# pvdisplay
  --- Physical volume ---
  PV Name               /dev/sda4
  VG Name               lvm-OSS
  PV Size               207.93 GiB / not usable 4.84 MiB
  Allocatable           yes
  PE Size               4.00 MiB
  Total PE              53230
  Free PE               49516
  Allocated PE          3714
  PV UUID               ytFmgj-Xy6w-Xa4Z-fYHo-cvvI-UzJg-lvPMYX

I suggest we increase the LVs for the OSTs to use more space. How about 10G each?
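Minh's suggestion can be sanity-checked with quick shell arithmetic, using the PE figures from the pvdisplay output above, to confirm that growing all seven OST LVs from 2G to 10G fits comfortably in the volume group's free extents:

```shell
# Sanity check: does growing each of the 7 OST LVs from 2 GiB to 10 GiB
# fit in the VG's free space? Figures taken from the pvdisplay output above.
pe_size_mib=4          # PE Size: 4.00 MiB
free_pe=49516          # Free PE
num_osts=7
old_lv_gib=2
new_lv_gib=10

extra_mib=$(( num_osts * (new_lv_gib - old_lv_gib) * 1024 ))
free_mib=$(( free_pe * pe_size_mib ))

echo "need ${extra_mib} MiB extra, have ${free_mib} MiB free"
[ "$extra_mib" -le "$free_mib" ] && echo "fits"
```

The extra ~56 GiB needed is well under the ~193 GiB of free extents, so the resize would not exhaust the volume group.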

Comment by Peter Jones [ 07/Dec/12 ]

Thanks Minh. Chris can you please comment?

Comment by Jian Yu [ 17/Dec/12 ]

Lustre Server: v2_1_4_RC1
Lustre Build: http://build.whamcloud.com/job/lustre-b2_1/159/

Lustre Client: 1.8.8-wc1
Lustre Build: http://build.whamcloud.com/job/lustre-b1_8/198

Distro/Arch: RHEL5.8/x86_64
Network: IB (in-kernel OFED)

The same issue occurred:
https://maloo.whamcloud.com/test_sets/431b1a6a-479c-11e2-876e-52540035b04c
https://maloo.whamcloud.com/test_sets/f2986dea-479b-11e2-876e-52540035b04c

Comment by Jian Yu [ 17/Dec/12 ]

In performance-sanity, after creating a large number of files hit the out-of-space issue, those files were not unlinked/removed successfully, so the test script also needs to be improved.

Comment by Jian Yu [ 18/Dec/12 ]

Lustre Client: v2_1_4_RC1
Lustre Server: 2.1.3
Distro/Arch: RHEL6.3/x86_64
Network: IB (in-kernel OFED)

https://maloo.whamcloud.com/test_sets/d03b0306-487d-11e2-8cdc-52540035b04c
https://maloo.whamcloud.com/test_sets/362d462e-487e-11e2-8cdc-52540035b04c
https://maloo.whamcloud.com/test_sets/709191e4-487e-11e2-8cdc-52540035b04c

This issue is blocking the Lustre 2.1.4 release testing on IB network in autotest runs.

Comment by Chris Gearing (Inactive) [ 20/Dec/12 ]

The OST size under autotest is the same for IB and TCP, so why would this issue only affect IB if it is an OST size issue?

Comment by Minh Diep [ 21/Dec/12 ]

Are the TCP runs using VMs or real hardware?

Comment by Jian Yu [ 20/Jan/13 ]

Lustre Branch: b1_8
Lustre Build: http://build.whamcloud.com/job/lustre-b1_8/249
Distro/Arch: RHEL5.8/x86_64(server), RHEL6.3/x86_64(client)
Network: TCP

MDSSIZE=2097152
OSTSIZE=7416428

The compilebench tests in parallel-scale{,-nfsv3,-nfsv4} all failed with:

LustreError: 23875:0:(filter.c:3459:filter_precreate()) create failed rc = -28
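For reference, rc = -28 in that LustreError line is the kernel's -ENOSPC ("No space left on device"); one way to confirm the errno mapping from a shell:

```shell
# Confirm that errno 28 is ENOSPC ("No space left on device"),
# matching the "create failed rc = -28" in the LustreError line above.
python3 -c 'import errno, os; print(errno.ENOSPC, os.strerror(errno.ENOSPC))'
```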

Maloo reports:
parallel-scale: https://maloo.whamcloud.com/test_sets/e021f142-6337-11e2-ae8b-52540035b04c
parallel-scale-nfsv3: https://maloo.whamcloud.com/test_sets/85799c1c-6338-11e2-ae8b-52540035b04c
parallel-scale-nfsv4: https://maloo.whamcloud.com/test_sets/ffb1cbd0-6338-11e2-ae8b-52540035b04c

With the following values, the same tests passed on the same Lustre b1_8 build over the TCP network on RHEL5.8/x86_64 distro/arch (both server and client):

MDSSIZE=2097152
OSTSIZE=11311139

Maloo report: https://maloo.whamcloud.com/test_sessions/820df576-6353-11e2-ae8b-52540035b04c
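For context, MDSSIZE and OSTSIZE are the target device sizes in KB that the Lustre test framework reads from its configuration (e.g. cfg/local.sh); a hedged sketch of an override with the passing values above (the exact config file and default-handling style are assumptions, not the actual autotest config):

```shell
# Hypothetical cfg/local.sh-style override; sizes are in KB.
MDSSIZE=${MDSSIZE:-2097152}     # 2 GiB MDT
OSTSIZE=${OSTSIZE:-11311139}    # ~10.8 GiB per OST
echo "MDSSIZE=$MDSSIZE OSTSIZE=$OSTSIZE"
```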

Comment by Jian Yu [ 29/Jan/13 ]

Lustre Branch: b2_1
Lustre Build: http://build.whamcloud.com/job/lustre-b2_1/164
Network: o2ib (in-kernel OFED)

MDSSIZE=2097152
OSTSIZE=31061817

performance-sanity: https://maloo.whamcloud.com/test_sets/ac776a56-68dd-11e2-ac0a-52540035b04c
parallel-scale: https://maloo.whamcloud.com/test_sets/d2cc2afc-68dd-11e2-ac0a-52540035b04c

Comment by Jian Yu [ 13/Mar/13 ]

Lustre Branch: b2_1
Lustre Build: http://build.whamcloud.com/job/lustre-b2_1/186
Distro/Arch: RHEL6.3/x86_64
Network: IB (in-kernel OFED)

The issue is still blocking the tests after performance-sanity in full test group from running under IB network configuration:
https://maloo.whamcloud.com/test_sessions/6f41a40a-8b40-11e2-aa18-52540035b04c

Comment by Jian Yu [ 22/Mar/13 ]

Lustre Client: 1.8.9-wc1
Lustre Client Build: http://build.whamcloud.com/job/lustre-b1_8/258/

Lustre Server: v2_1_5_RC1
Lustre Server Build: http://build.whamcloud.com/job/lustre-b2_1/191/

Network: TCP (1GigE)

MDSSIZE=2097152
OSTSIZE=149718677

The tests after performance-sanity were affected by the out of space issue:
https://maloo.whamcloud.com/test_sessions/3bb63464-92c7-11e2-b06e-52540035b04c

Dmesg on MDS node showed that:

Lustre: DEBUG MARKER: ===== mdsrate-stat-large.sh Test preparation: creating 1000000 files.
LustreError: 21464:0:(mdd_dir.c:1889:mdd_create()) error on stripe info copy -28 
LustreError: 21464:0:(mdd_dir.c:1889:mdd_create()) error on stripe info copy -28 
Lustre: DEBUG MARKER: /usr/sbin/lctl mark  performance-sanity test_8: @@@@@@ FAIL: test_8 failed with 1
Comment by Jian Yu [ 26/May/13 ]

Lustre Branch: b2_1
Lustre Build: http://build.whamcloud.com/job/lustre-b2_1/204
Distro/Arch: RHEL5.9/x86_64
Network: IB (in-kernel OFED)

The tests after performance-sanity were affected by the out of space issue:
https://maloo.whamcloud.com/test_sessions/bebd9a1c-c5c8-11e2-9bf1-52540035b04c

Comment by Jian Yu [ 30/May/13 ]

In performance-sanity, after creating a large number of files hit the out-of-space issue, those files were not unlinked/removed successfully, so the test script also needs to be improved.

Here is the patch to unlink the files created in performance-sanity.sh through mdsrate-{create,lookup,stat}-*.sh after create/lookup/stat operation fails:
http://review.whamcloud.com/6483

With the above patch, the issue was narrowed down: only performance-sanity test 4 hit the out-of-space issue, and the tests after test 4 were not affected.

The next step is to figure out why test 4 hits the out-of-space issue over the IB network.
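The cleanup idea behind http://review.whamcloud.com/6483 can be sketched as follows (the helper below is hypothetical, not the actual performance-sanity.sh code): unlink whatever a test phase created even when the create/lookup/stat step itself fails, so a full filesystem does not cascade into the tests that follow:

```shell
# Hedged sketch of cleanup-on-failure; run_phase is a hypothetical
# stand-in for an mdsrate-{create,lookup,stat} phase.
run_phase() {
    phase_dir=$1
    mkdir -p "$phase_dir"
    # Stand-in for the mdsrate create step.
    for i in $(seq 1 100); do : > "$phase_dir/file$i"; done
    return 1    # pretend the operation failed, as test_8 did with ENOSPC
}

testdir=$(mktemp -d)
if ! run_phase "$testdir"; then
    rm -rf "$testdir"   # unlink on failure too, so the space is reclaimed
fi
[ -e "$testdir" ] || echo "cleaned up"
```

The key point is that the removal runs on the failure path as well as the success path, instead of only after a successful create/lookup/stat.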

Comment by Jian Yu [ 03/Jun/13 ]

Lustre Client: 1.8.9-wc1
Lustre Client Build: http://build.whamcloud.com/job/lustre-b1_8/258/

Lustre Server: v2_1_6_RC1
Lustre Server Build: http://build.whamcloud.com/job/lustre-b2_1/208/

Network: TCP (1GigE)

performance-sanity test_8 failed with out of space issue:
https://maloo.whamcloud.com/test_sets/58eec84a-cb8c-11e2-a1fe-52540035b04c

Comment by Jian Yu [ 14/Nov/13 ]

Here is the patch to unlink the files created in performance-sanity.sh through mdsrate-{create,lookup,stat}-*.sh after create/lookup/stat operation fails: http://review.whamcloud.com/6483

The above patch has landed on Lustre b2_1 branch.
Here is the patch for master branch: http://review.whamcloud.com/8265. It also needs to be cherry-picked to Lustre b2_5 branch.
And here is the patch for Lustre b2_4 branch: http://review.whamcloud.com/8289.

Comment by Jian Yu [ 02/Dec/13 ]

Patch landed on Lustre b2_4 branch for 2.4.2 and on master branch for 2.6.0.

Comment by Jodi Levi (Inactive) [ 04/Dec/13 ]

Can this ticket be closed?

Comment by Jian Yu [ 05/Dec/13 ]

Can this ticket be closed?

I'll back-port the patch to Lustre b2_5 branch.

Comment by Jodi Levi (Inactive) [ 11/Dec/13 ]

Patches have landed on master. Jian Yu will backport them to the b2_5 branch.

Generated at Sat Feb 10 01:14:47 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.