[LU-3878] sanity-benchmark test fsx: Bus error Created: 04/Sep/13  Updated: 13/Oct/21  Resolved: 23/Nov/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.1, Lustre 2.5.0, Lustre 2.6.0
Fix Version/s: None

Type: Bug
Priority: Major
Reporter: Jian Yu
Assignee: Oleg Drokin
Resolution: Not a Bug
Votes: 0
Labels: None
Environment:

Lustre build: http://build.whamcloud.com/job/lustre-b2_4/44/ (2.4.1 RC1)
Distro/Arch: RHEL6.4/x86_64 + FC18/x86_64 (Server + Client)


Issue Links:
Duplicate
duplicates LU-2909 Failure on test suite sanity-benchmar... Resolved
Related
is related to LU-6729 sanity-benchmark test_fsx: Bus error Resolved
Severity: 3
Rank (Obsolete): 10059

 Description   

sanity-benchmark test fsx failed as follows:

== sanity-benchmark test fsx: fsx ==================================================================== 22:15:58 (1378185358)
debug=0
Using: fsx -c 50 -p 1000 -S 29278 -P /tmp -l 206139         -N 100000  /mnt/lustre/f0.fsxfile
Chance of close/open is 1 in 50
Seed set to 29278
truncating to largest ever: 0xd3af
/usr/lib64/lustre/tests/sanity-benchmark.sh: line 186: 12471 Bus error               $CMD
 sanity-benchmark test_fsx: @@@@@@ FAIL: fsx failed 

Maloo report: https://maloo.whamcloud.com/test_sets/becb9218-14ef-11e3-ac48-52540035b04c

This is a regression on the Lustre b2_4 branch.



 Comments   
Comment by Keith Mannthey (Inactive) [ 04/Sep/13 ]

This looks related to https://jira.hpdd.intel.com/browse/LU-2909, an earlier 2.4 blocker.

Comment by Andreas Dilger [ 04/Sep/13 ]

Sorry, it seems LU-2909 is closed, so presumably this is a new issue.

Comment by Jian Yu [ 05/Sep/13 ]

Lustre build: http://build.whamcloud.com/job/lustre-b2_4/44/ (2.4.1 RC1)
Distro/Arch: RHEL6.4/x86_64
FSTYPE=zfs

sanity-benchmark test fsx also hit the same failure:
https://maloo.whamcloud.com/test_sets/e004601a-1556-11e3-8938-52540035b04c

FYI, here are the query results for sanity-benchmark test_fsx runs with "FAIL" status on the Lustre b2_4 branch:
http://tinyurl.com/kxheu43

Comment by Jian Yu [ 05/Sep/13 ]

By searching on Maloo, I found that test fsx passed on FC18 on Lustre b2_4 build #40 and previous builds. Builds #41, #42, #43 were not tested on FC18. It seems that the culprit is in build #42.

Comment by Jian Yu [ 09/Sep/13 ]

After patch http://review.whamcloud.com/7481 was reverted from the Lustre b2_4 branch, the failure did not occur on Lustre 2.4.1 RC2.

Comment by Jian Yu [ 02/Nov/13 ]

Lustre build: http://build.whamcloud.com/job/lustre-b2_4/47/
Distro/Arch: RHEL6.4/x86_64
FSTYPE=zfs

sanity-benchmark test fsx hit the same failure again:
https://maloo.whamcloud.com/test_sets/5d2a22d4-43a9-11e3-942a-52540035b04c

Comment by Oleg Drokin [ 18/Nov/13 ]

Is this only happening on zfs?

Only on b2_4, but not on master?

Comment by Jian Yu [ 18/Nov/13 ]

Here is the search result on Maloo:
http://tinyurl.com/ozo5c7a

The failure occurred not only on zfs and b2_4, but also on ldiskfs and master/b2_5:
https://maloo.whamcloud.com/test_sets/87757614-44d1-11e3-8c03-52540035b04c
https://maloo.whamcloud.com/test_sets/6f3d0b5e-4cc7-11e3-826a-52540035b04c
https://maloo.whamcloud.com/test_sets/2582fa8c-3bbf-11e3-b062-52540035b04c

Comment by Jinshan Xiong (Inactive) [ 23/Nov/13 ]

The fsx failure is probably fallout from the earlier iozone failure. iozone used up all the disk space on the OSTs, so the client had no grants left, which made mkwrite() fail.
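
For context on why a failed mkwrite() shows up as "Bus error" rather than an errno: a store to an mmap'ed page triggers a write fault, and if the kernel cannot back that page, the process is sent SIGBUS. The Python sketch below is not Lustre-specific and does not reproduce the grant shortage. As a stand-in assumption, it makes a mapped page unbackable by truncating the file under the mapping, then shows how the parent sees the child's death. The names CHILD, run_child and describe are illustrative only.

```python
import signal
import subprocess
import sys
import textwrap

# Child program (stand-in trigger, not the Lustre code path): map one page
# of a temporary file, shrink the file to zero bytes, then store to the
# mapping. The page has no backing storage any more, so the write fault
# cannot be satisfied and the kernel delivers SIGBUS.
CHILD = textwrap.dedent("""
    import mmap, os, tempfile
    fd, path = tempfile.mkstemp()
    os.unlink(path)                    # file lives on through the open fd
    os.ftruncate(fd, mmap.PAGESIZE)    # give the file one full page
    m = mmap.mmap(fd, mmap.PAGESIZE)   # shared, writable mapping
    os.ftruncate(fd, 0)                # remove the backing for that page
    m[0] = 1                           # write fault on an unbacked page
""")


def run_child():
    """Run the child and return its exit status as seen by the parent."""
    return subprocess.run([sys.executable, "-c", CHILD]).returncode


def describe(returncode):
    """Turn a subprocess return code into a short human-readable string."""
    if returncode < 0:
        # A negative return code means the child was killed by that signal.
        return "killed by " + signal.Signals(-returncode).name
    return "exited with status %d" % returncode


if __name__ == "__main__":
    print(describe(run_child()))
```

On a Linux client the child is expected to die from SIGBUS, which is the condition the shell reports as "Bus error" for $CMD in sanity-benchmark.sh.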

Comment by Oleg Drokin [ 26/Nov/13 ]

But why does it stop after the patches are reverted?
Also, I don't think we see any ENOSPC errors for the fsx runs in the logs?

Comment by Jinshan Xiong (Inactive) [ 26/Nov/13 ]

The previous iozone run used up all the space.

I can't connect this symptom to that patch. But from what I have seen so far, you reverted that patch on Sep 24 but it still occurred after that.

Comment by Jian Yu [ 05/Dec/13 ]

Yes, the failure still occurred on the latest Lustre b2_4 branch with FSTYPE=zfs:
https://maloo.whamcloud.com/test_sets/17d950e4-58ba-11e3-83d7-52540035b04c
https://maloo.whamcloud.com/test_sets/4767b964-57ce-11e3-8d5c-52540035b04c
https://maloo.whamcloud.com/test_sets/f21a0152-4ab1-11e3-8252-52540035b04c

In the above test reports, all of the iozone tests failed as follows:

write: No space left on device

or

Write error No space left on device (rc = -1, len = 4194304)

So it seems that the out-of-space failure in iozone caused the fsx failure.
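
The correlation above can be checked mechanically when triaging further reports. The following Python sketch is only a rough aid: it assumes the iozone and fsx console logs have already been downloaded as text, and the function name triage and the three result strings are made up for illustration. It matches only the message strings quoted in this ticket.

```python
import re

# Message fragments taken from the logs quoted in this ticket.
ENOSPC_RE = re.compile(r"No space left on device")
SIGBUS_RE = re.compile(r"\bBus error\b")

# Hypothetical result labels, used only by this sketch.
LIKELY_FALLOUT = "fsx Bus error is likely fallout from iozone running out of space"
SEPARATE_ISSUE = "fsx Bus error without a preceding out-of-space error"
NO_BUS_ERROR = "no fsx Bus error found"


def triage(iozone_log, fsx_log):
    """Classify one test session from its iozone and fsx log text."""
    iozone_enospc = ENOSPC_RE.search(iozone_log) is not None
    fsx_sigbus = SIGBUS_RE.search(fsx_log) is not None
    if fsx_sigbus and iozone_enospc:
        return LIKELY_FALLOUT
    if fsx_sigbus:
        return SEPARATE_ISSUE
    return NO_BUS_ERROR


if __name__ == "__main__":
    iozone = "Write error No space left on device (rc = -1, len = 4194304)"
    fsx = "sanity-benchmark.sh: line 186: 12471 Bus error               $CMD"
    print(triage(iozone, fsx))
```

A session classified as a Bus error without a preceding out-of-space error would be the interesting case, since it would not fit the explanation given in this ticket.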
