[LU-3752] sanity-quota test_18: expect 104857600, got 42991616. Verifying file failed! Created: 13/Aug/13  Updated: 17/Jul/17

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.1, Lustre 2.5.0, Lustre 2.6.0, Lustre 2.5.1, Lustre 2.8.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Oleg Drokin
Resolution: Unresolved Votes: 0
Labels: yuc2

Severity: 3
Rank (Obsolete): 9677

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/73a518da-029e-11e3-b384-52540035b04c.

The sub-test test_18 failed with the following error:

expect 104857600, got 42991616. Verifying file failed!

Info required for matching: sanity-quota 18



 Comments   
Comment by Jian Yu [ 14/Aug/13 ]

I just found that this is a regression introduced by a patch in build #28 on the Lustre b2_4 branch.

Before build #28, sanity-quota test 18 always passed on the Lustre b2_4 branch. Since build #28 landed, there have been 6 full test runs on build #29 against RHEL6 and SLES11SP2 clients, and 2 of those runs hit the sanity-quota test 18 failure:

Failed test runs:
https://maloo.whamcloud.com/test_sets/f6c41656-0421-11e3-90ba-52540035b04c (RHEL6)
https://maloo.whamcloud.com/test_sets/1f547576-0282-11e3-a4b4-52540035b04c (RHEL6)

Passed test runs:
https://maloo.whamcloud.com/test_sets/8a5797c0-0248-11e3-a4b4-52540035b04c (RHEL6)
https://maloo.whamcloud.com/test_sets/31d735f2-02b0-11e3-a4b4-52540035b04c (SLES11SP2)
https://maloo.whamcloud.com/test_sets/b109a6b8-0259-11e3-b384-52540035b04c (RHEL6)
https://maloo.whamcloud.com/test_sets/e4b07626-039f-11e3-9824-52540035b04c (RHEL6)

On the master branch, the test passed on build #1582 and failed on build #1591. The builds between them were not tested. Comparing the patches landed between those builds with the patches in b2_4 build #28, the following are common to both:

LU-3643 ofd: get data version only if file exists
LU-3585 ptlrpc: Fix a crash when dereferencing NULL pointer
LU-3636 llapi: llapi_hsm_copy_end() on correct FID on restore.

Comment by Peter Jones [ 14/Aug/13 ]

Oleg

Can you please try to further identify the cause of this regression?

Thanks

Peter

Comment by Oleg Drokin [ 14/Aug/13 ]

A very suspicious common pattern is observed in those test results.

All successful test runs have:

running as uid/gid/euid/egid 60000/60000/60000/60000, groups:
 [dd] [if=/dev/zero] [bs=1M] [of=/mnt/lustre/d0.sanity-quota/d18/f.sanity-quota.18] [count=100] [oflag=direct]
CMD: client-20-ib sync; sync; sync
Filesystem           1K-blocks      Used Available Use% Mounted on
client-20-ib@o2ib:/lustre
                      14222720     13440  14150272   1% /mnt/lustre

All failing runs have:

Write 100M (directio) ...
running as uid/gid/euid/egid 60000/60000/60000/60000, groups:
 [dd] [if=/dev/zero] [bs=1M] [of=/mnt/lustre/d0.sanity-quota/d18/f.sanity-quota.18] [count=100] [oflag=direct]
CMD: client-26vm7 sync; sync; sync
Filesystem               1K-blocks   Used Available Use% Mounted on
client-26vm7@tcp:/lustre   1464484 264460   1118928  20% /mnt/lustre

So my question is: why do the newer (failing) test runs have 10x less disk space? I suspect that is why the test is now dying with an out-of-space error: striping is also not used, so with the previously present files there is simply not enough space under the new scheme of things.

Comment by Jian Yu [ 16/Aug/13 ]

For failed test runs:

MDSSIZE=1939865
OSTSIZE=223196

For passed test runs:

MDSSIZE=2097152
OSTSIZE=2097152

The real failure was:

dd: writing `/mnt/lustre/d0.sanity-quota/d18/f.sanity-quota.18': No space left on device

We need to improve the test script to check the available space.
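
For illustration, such a pre-write space check could look like the following minimal standalone sketch. This is an assumption-laden example, not the actual sanity-quota code: the `check_free_space` helper name, the `/tmp` target, and the 10MB margin are all hypothetical.

```shell
#!/bin/bash
# Hypothetical pre-write space check for a test that is about to
# dd 100MB with oflag=direct. Not the actual Lustre test helper.

check_free_space() {
	local dir=$1 needed_kb=$2
	# df -P prints one portable line per filesystem; field 4 is
	# the available space in 1K blocks.
	local avail_kb
	avail_kb=$(df -P "$dir" | awk 'NR==2 { print $4 }')
	if [ "$avail_kb" -lt "$needed_kb" ]; then
		echo "SKIP: only ${avail_kb}KB available in $dir, need ${needed_kb}KB"
		return 1
	fi
	return 0
}

# test_18 writes 100MB with dd; allow a 10MB margin for overhead.
check_free_space /tmp $((100 * 1024 + 10 * 1024)) || exit 0
echo "enough space, running dd"
```

A check like this would turn the misleading "Verifying file failed!" into an explicit skip when the OSTs are simply too small for the 100MB write.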

Comment by Bob Glossman (Inactive) [ 16/Aug/13 ]

space check added
http://review.whamcloud.com/7366

Still leaves open the question of why we're running out of space in the first place.

Comment by Sarah Liu [ 04/Dec/13 ]

Hit this issue in lustre-master build #1784; the client is running SLES11 SP3:

https://maloo.whamcloud.com/test_sets/67469a2c-5bbe-11e3-8d79-52540035b04c

Comment by Jian Yu [ 17/Jan/14 ]

Lustre client build: http://build.whamcloud.com/job/lustre-b2_4/70/ (2.4.2)
Lustre server build: http://build.whamcloud.com/job/lustre-b2_5/13/

The same failure occurred:
https://maloo.whamcloud.com/test_sets/56bce390-7e7e-11e3-925a-52540035b04c

Comment by Sarah Liu [ 20/Mar/14 ]

Hit this failure in lustre-master tag 2.5.57 (build #1945) while testing ZFS:
https://maloo.whamcloud.com/test_sessions/e29960fc-b031-11e3-9bc4-52540035b04c

In the previous builds, 1944 and 1943, this test passed:
https://maloo.whamcloud.com/test_sessions/28690920-af2e-11e3-bac7-52540035b04c
https://maloo.whamcloud.com/test_sessions/1af5c146-ae6d-11e3-a4ae-52540035b04c

Comment by Sarah Liu [ 20/Jan/16 ]

Hit this on current master build #3305, RHEL6.7, ZFS:
https://testing.hpdd.intel.com/test_sets/91fc1ebc-bc84-11e5-b3b7-5254006e85c2

Comment by Saurabh Tandan (Inactive) [ 04/Feb/16 ]

Encountered another instance for FULL - EL6.7 Server/EL6.7 Client - ZFS, master, build #3314:
https://testing.hpdd.intel.com/test_sets/9e6de21c-cb47-11e5-a59a-5254006e85c2

Another instance on master for FULL - EL7.1 Server/EL7.1 Client - ZFS, build #3314:
https://testing.hpdd.intel.com/test_sets/e109a106-cb88-11e5-b49e-5254006e85c2

Comment by Saurabh Tandan (Inactive) [ 10/Feb/16 ]

Another instance found for Full tag 2.7.66 - EL6.7 Server/EL6.7 Client - ZFS, build #3314:
https://testing.hpdd.intel.com/test_sets/9e6de21c-cb47-11e5-a59a-5254006e85c2

Another instance found for Full tag 2.7.66 - EL7.1 Server/EL7.1 Client - ZFS, build #3314:
https://testing.hpdd.intel.com/test_sets/e109a106-cb88-11e5-b49e-5254006e85c2

Generated at Sat Feb 10 01:36:36 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.