[LU-4643] sanity-hsm test_60: FAIL: Expected progress update within 10 seconds Created: 18/Feb/14  Updated: 29/Sep/15  Resolved: 29/Sep/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0, Lustre 2.5.1
Fix Version/s: Lustre 2.6.0, Lustre 2.5.1

Type: Bug Priority: Critical
Reporter: Jian Yu Assignee: Bruno Faccini (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 12700

 Description   

While validating patch http://review.whamcloud.com/9290 on Lustre b2_5 branch, sanity-hsm test 60 failed as follows:

Updated after 9s: wanted '5242880' got '5242880'
 sanity-hsm test_60: @@@@@@ FAIL: Expected progress update within 10 seconds 

Maloo report: https://maloo.whamcloud.com/test_sets/b8823022-987c-11e3-98a2-52540035b04c

More instances occurred while validating patches on Lustre b2_5 branch:
http://review.whamcloud.com/9260
http://review.whamcloud.com/9103
http://review.whamcloud.com/9071

The same failure also occurred while validating patches on master branch:
http://review.whamcloud.com/9243
http://review.whamcloud.com/9221

The test was introduced by the patches for LU-4512.



 Comments   
Comment by Jian Yu [ 18/Feb/14 ]

Hi Michael,

This failure is a regression introduced by the patch for LU-4512. Could you please take a look at this failure? Thanks.

Comment by Jian Yu [ 18/Feb/14 ]

Another similar failure:

Update not seen after 100s: wanted '5242880' got ''
 sanity-hsm test_60: @@@@@@ FAIL: request on 0x400000401:0x1c1:0x0 has not made progress 5242880 on mds1

Maloo report: https://maloo.whamcloud.com/test_sets/8e7ec8f2-97b0-11e3-a9f6-52540035b04c

Comment by Jian Yu [ 19/Feb/14 ]

For "Expected progress update within 10 seconds" failure, I checked the outputs of all of the failed tests and found that:

Updated after 9s: wanted '5242880' got '5242880'
Updated after 9s: wanted '5242880' got '5242880'
Updated after 12s: wanted '5242880' got '5242880'
Updated after 12s: wanted '5242880' got '5242880'
Updated after 11s: wanted '5242880' got '5242880'
Updated after 11s: wanted '5242880' got '5242880'
Updated after 11s: wanted '5242880' got '5242880'
Updated after 13s: wanted '5242880' got '5242880'

The current values of "interval" and "progress_timeout" in sanity-hsm test_60() are as follows:

        local interval=5
        local progress_timeout=$((interval * 2))

I just uploaded patches to increase the value of "progress_timeout".
For master branch: http://review.whamcloud.com/9304
For b2_5 branch: http://review.whamcloud.com/9305
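
For illustration only, the change is along these lines (the exact timeout value chosen in #9304/#9305 is an assumption here):

        local interval=5
        # allow a couple of extra reporting periods to absorb scheduling
        # jitter on loaded test VMs before declaring a failure
        local progress_timeout=$((interval * 4))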

Comment by Jian Yu [ 19/Feb/14 ]

For "Update not seen after 100s: wanted '5242880' got ''" failure, I checked the outputs of all of the failed tests and found that the updating of "done=" values were all as follows:

Changed after 0s: from '' to '0'
Changed after 4s: from '0' to '4194304'
Changed after 9s: from '4194304' to '13631488'
Changed after 13s: from '13631488' to '28311552'
Changed after 17s: from '28311552' to '48234496'
Changed after 22s: from '48234496' to '73400320'
Changed after 26s: from '73400320' to '103809024'
Changed after 30s: from '103809024' to '139460608'
Changed after 34s: from '139460608' to ''

In sanity-hsm test_60(), the expected value of "data_moved" is:

        wait_request_progress $fid ARCHIVE 5242880

I'm confused about how "5242880" was calculated, because the size of the archived file was 39000000 bytes:

        dd if=/dev/urandom of=$file2 count=39 bs=1000000 conv=fsync ||
                error "cannot create $file2"

Comment by Bruno Faccini (Inactive) [ 19/Feb/14 ]

To allow for further debugging of the sanity-hsm/test_60 issues, I pushed two patches, http://review.whamcloud.com/9313 and http://review.whamcloud.com/9314, for master and b2_5 respectively, that disable the test from inside the sanity-hsm suite itself. It can then be re-enabled by a further patch that fixes the issues.

Possible issues to be addressed then are 1) a mismatch between the way the CT reports progress and how the CDT interprets/reports it in the log, 2) a too-short request grace_period causing the request to disappear from the log, 3) a wrong size of 5242880 being checked, …??

Comment by Michael MacDonald (Inactive) [ 24/Feb/14 ]

Hi, sorry I'm just now replying. Was on vacation last week.

So, I think the problem with this test is that it's too fragile. The intent of the test is to ensure that changing the copytool reporting interval via command-line arguments works. The bug that LU-4512 fixed was that the --report argument wasn't parsed correctly; I believe the interval was being interpreted as milliseconds rather than seconds. In any case, it wasn't working before, and it works now.

I believe that this is an intermittent failure due to load variations on the test VMs. The tests in this suite artificially limit bandwidth to 1MB/sec (see line 164 in copytool_setup()) which is fine when the cluster is actually capable of 1MB/sec. When it's not, test_60 can fail because the expected value isn't seen.
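
For context, a hypothetical copytool invocation showing the two options involved (the exact flags and values used by copytool_setup() in this branch are an assumption here):

        # --report sets the progress reporting interval in seconds (the option
        # LU-4512 fixed) and --bandwidth throttles the data rate so that
        # progress updates are observable by the test; not copied from
        # copytool_setup()
        lhsmtool_posix --daemon --hsm-root /tmp/arc --archive 1 \
                --report 5 --bandwidth 1 /mnt/lustre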

In hindsight, I see now that assuming the test could always count on 1MB/sec was a mistake. Also, the test is sort of testing the wrong thing. We don't really care what the value is, just that there was a progress report within the expected window. I will work on a patch to make this test more robust.
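
As a sketch only of the direction such a patch could take (not its actual content), the check could wait for any change in the reported progress inside the window rather than for an exact byte count:

        # hypothetical rework for illustration; the helper name is an
        # assumption, not taken from the actual patch
        local before=$(get_hsm_progress $fid)
        local updated=false
        local elapsed=0
        while [ $elapsed -lt $progress_timeout ]; do
                sleep 1
                elapsed=$((elapsed + 1))
                if [ "$(get_hsm_progress $fid)" != "$before" ]; then
                        updated=true
                        break
                fi
        done
        $updated ||
                error "no progress update within $progress_timeout seconds"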

Comment by Michael MacDonald (Inactive) [ 25/Feb/14 ]

proposed improvement:
master: http://review.whamcloud.com/#/c/9376/
b2_5: http://review.whamcloud.com/#/c/9378/

Comment by Bruno Faccini (Inactive) [ 28/Feb/14 ]

Michael, I'm not sure if you have seen that patches #9313/#9314, for master and b2_5 respectively, should land soon to disable sanity-hsm/test_60 until the problem is fixed.
This means that once they have landed you will need to re-enable the test inside your patches #9376/#9378.

Comment by Michael MacDonald (Inactive) [ 28/Feb/14 ]

Given that my patches have +1s for code review and are working through testing, I wonder why we would land the patches to disable the test? Wouldn't it be better to just land the fixed version of the test and not bother with disabling it?

Comment by Bruno Faccini (Inactive) [ 28/Feb/14 ]

Landing a patch to disable a failing test is a way we now frequently use to reduce how often test sessions are impacted by an issue, which allows enough time, without pressure, for debugging it. It is then easy to re-enable the test with an additional line/change (modifying the ALWAYS_EXCEPT setting) in the final patch/fix.
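
For reference, disabling a test this way is a one-line change near the top of the test script; a minimal sketch (the exact content of #9313/#9314 is an assumption here):

        # in sanity-hsm.sh: adding "60" to the exclusion list skips test_60
        # until a later patch removes it again
        ALWAYS_EXCEPT="$SANITY_HSM_EXCEPT 60"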

Comment by Michael MacDonald (Inactive) [ 28/Feb/14 ]

I understand, but as a hypothetical example, if the patch to disable the test and the patch to fix the test are both ready at the same time for landing, it doesn't make sense to me to disable the test and then re-enable it with two separate patches. Am I missing something?

http://review.whamcloud.com/#/c/9376/ has 2 +1 code-reviews, a +1 from jenkins, and a +1 from maloo. Just waiting for the gatekeeper to cherry-pick it and land it.

Comment by Bruno Faccini (Inactive) [ 28/Feb/14 ]

I have no problem at all with that, Michael. If you are sure/confident about the imminent landing of #9376/#9378, just abandon #9313/#9314 in that case.

Comment by Peter Jones [ 11/Mar/14 ]

Landed for 2.5.1 and 2.6

Comment by Saurabh Tandan (Inactive) [ 29/Sep/15 ]

Encountered a similar issue during interop testing for the 2.7.60 tag:
Client: master branch / build 3194, RHEL 7
Server: b2_7 branch

5242880 bytes copied in 3 seconds.
 sanity-hsm test_60: @@@@@@ FAIL: Expected progress update after at least 5 seconds 

Comment by Peter Jones [ 29/Sep/15 ]

Saurabh

Given how old this issue is, it would be much better to open a new ticket, even if the issue does appear similar to this older one.

Peter
