[LU-12232] replay-ost-single test 6 fails with ''space grew after dd: before:13442048 after_dd:13442048'' Created: 26/Apr/19  Updated: 12/Oct/20  Resolved: 12/Oct/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0, Lustre 2.12.1
Fix Version/s: Lustre 2.14.0

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: Hongchao Zhang
Resolution: Fixed Votes: 0
Labels: zfs

Issue Links:
Related
is related to LU-4265 replay-ost-single test_6: space grew ... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

replay-ost-single test 6 fails with 'space grew after dd: before:X after_dd:Y' for some values of X and Y.

Looking at the suite_log for a recent failure (logs at https://testing.whamcloud.com/test_sets/d53ee48a-665d-11e9-8bb1-52540065bddc), we see:

CMD: trevis-45vm9 lctl set_param fail_loc=0x80000119
fail_loc=0x80000119
before: 13442048 after_dd: 13442048 took 20 seconds
 replay-ost-single test_6: @@@@@@ FAIL: space grew after dd: before:13442048 after_dd:13442048 

In some of the failures the before and after values are the same; in others they differ.

There are no errors in any of the node console logs.

This failure looks like LU-4265. I've opened a new ticket because this test has failed with this error message six times in the past year and a half, and four of those six failures were seen this month. Something that landed recently may be increasing the frequency of this failure, or there may be a different, new cause.

There are several examples of this failure; here are a few additional links to logs:
https://testing.whamcloud.com/test_sets/bf51ebce-65f7-11e9-a6f9-52540065bddc
https://testing.whamcloud.com/test_sets/043e5b44-62fd-11e9-aeec-52540065bddc
https://testing.whamcloud.com/test_sets/4fd6a78e-6170-11e9-9720-52540065bddc



 Comments   
Comment by Peter Jones [ 29/Apr/19 ]

Hongchao

Could you please investigate?

Peter

Comment by Patrick Farrell (Inactive) [ 29/Apr/19 ]

Here's the failure check:

        log "before: $before after_dd: $after_dd took $i seconds"
        (( $before > $after_dd )) ||
                error "space grew after dd: before:$before after_dd:$after_dd" 

It would be nice to rewrite this a bit when we fix it. These are actually checks on free space: the test is verifying that free space didn't grow. It would be nice if the test made that clearer.
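
As an illustration of the rewrite suggested above, here is a minimal sketch of a check whose names say that free space is being compared. The helper free_space_shrank and the variable names are hypothetical and are not part of Lustre's test-framework.sh; the numbers are the ones from the failing run quoted in the description. Note that the original condition is a strict comparison, so equal values also fail.

```shell
# Hypothetical helper (not from test-framework.sh): succeeds only when
# free space strictly decreased between the two samples.
free_space_shrank() {
    [ "$1" -gt "$2" ]
}

# Values from the failing run in this ticket.
free_before=13442048
free_after_dd=13442048

if free_space_shrank "$free_before" "$free_after_dd"; then
    echo "OK: free space shrank: before:$free_before after_dd:$free_after_dd"
else
    echo "FAIL: free space did not shrink after dd: before:$free_before after_dd:$free_after_dd"
fi
```

With the values above the sketch takes the FAIL branch, because the two samples are equal.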

 

Comment by Hongchao Zhang [ 05/May/19 ]

This issue is caused by a side effect of the previous test: its transactions are not yet committed
when the "free disk space" value is read before "dd".
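
One way to read this explanation is with a toy model, sketched below. It is not Lustre code: the function simulate and all numbers are invented for illustration, and it assumes that space released by the previous test only shows up as free once its transactions commit. If that commit lands between the two samples, it can cancel out the space consumed by dd.

```shell
# Toy model only; every name and number here is an assumption.
# $1 = "yes" to commit the previous test's transactions before sampling.
simulate() {
    sync_first=$1
    committed_free=1000   # free space currently reported
    pending_release=50    # space freed by the previous test, not yet committed
    dd_size=50            # space consumed by this test's dd

    if [ "$sync_first" = yes ]; then
        # Commit first, so the "before" sample already includes the release.
        committed_free=$((committed_free + pending_release))
        pending_release=0
    fi

    before=$committed_free

    # dd consumes space, then any still-pending release commits late.
    committed_free=$((committed_free - dd_size))
    committed_free=$((committed_free + pending_release))
    pending_release=0

    after_dd=$committed_free
    echo "before:$before after_dd:$after_dd"
}

simulate no    # prints "before:1000 after_dd:1000"
simulate yes   # prints "before:1050 after_dd:1000"
```

In the first case the late commit hides the write and the two samples are equal, which matches the shape of the logged failure. In the second case the "before" sample is taken after the commit, so free space visibly shrinks.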

Comment by Gerrit Updater [ 05/May/19 ]

Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34808
Subject: LU-12232 test: commit before df
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: fbefd48a492768c1c877bf11a164e3fddb2e67f9

Comment by Gerrit Updater [ 21/May/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34808/
Subject: LU-12232 test: commit before df
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: f1cbfb96c820aa7e1e5a84176619679d696a117a

Comment by Peter Jones [ 21/May/19 ]

Landed for 2.13

Comment by Gerrit Updater [ 21/May/19 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34927
Subject: LU-12232 test: commit before df
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 584c4bb7dd05a6102bc1e567db03223b1b835bcf

Comment by Gerrit Updater [ 08/Jun/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34927/
Subject: LU-12232 test: commit before df
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 989217db39b2832c48ae58503363a4939c115d5a

Comment by James Nunez (Inactive) [ 15/Nov/19 ]

It looks like replay-ost-single test 6 is still failing, now with the modified error message 'free grew after dd: before:15371264 after_dd:15371264'. Please see https://testing.whamcloud.com/test_sets/6f49e8d2-07de-11ea-8e77-52540065bddc for one recent failure on b2_13.

Comment by Hongchao Zhang [ 16/Nov/19 ]

Searching the failures on Maloo shows that the new occurrences began on Sept 03, and all are with the ZFS backend.
The ZFS versions are 0.7.13 and 0.8.1.

Comment by Gerrit Updater [ 18/Nov/19 ]

Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36772
Subject: LU-12232 test: call dt_sync after dd
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 0d9758d6f907cdff65e0b4dd712a34d88cb89cbf

Comment by Gerrit Updater [ 12/Oct/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36772/
Subject: LU-12232 test: call dt_sync after dd
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 4ab3138c3342ccc6f33c896409cd4c8795832f12

Comment by Peter Jones [ 12/Oct/20 ]

Latest fix landed

Generated at Sat Feb 10 02:50:47 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.