[LU-5079] conf-sanity test_47 timeout Created: 19/May/14  Updated: 27/Nov/14  Resolved: 27/Nov/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0
Fix Version/s: Lustre 2.7.0, Lustre 2.5.4

Type: Bug Priority: Critical
Reporter: Maloo Assignee: Jian Yu
Resolution: Fixed Votes: 0
Labels: llnl, patch

Issue Links:
Duplicate
is duplicated by LU-5773 obdfilter-survey test 1c: oom occurre... Resolved
Related
is related to LU-5077 insanity test_1: out of memory on MDT... Resolved
is related to LU-5805 tgt_recov blocked and "waking for gap... Resolved
is related to LU-4578 Early replies do not honor at_max Resolved
is related to LU-5358 Failure on test suite replay-vbr test_7e Resolved
is related to LU-5900 replay-dual test_11: rm: cannot remov... Resolved
is related to LU-5901 replay-dual test_15a: import is not i... Resolved
is related to LU-5902 replay-dual test_20: FAIL: recovery t... Resolved
is related to LU-5803 This server is not able to keep up wi... Resolved
is related to LU-5724 IR recovery doesn't behave properly w... Closed
Severity: 3
Rank (Obsolete): 14015

 Description   

This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

This issue relates to the following test suite runs:
http://maloo.whamcloud.com/test_sets/7f09a2f6-dd9d-11e3-9262-52540035b04c
https://maloo.whamcloud.com/test_sets/99ea9712-dc88-11e3-9450-52540035b04c

The sub-test test_47 failed with the following error:

test failed to respond and timed out

Info required for matching: conf-sanity 47



 Comments   
Comment by Andreas Dilger [ 20/May/14 ]

It seems that replies are being dropped after only 1s due to at_max. This was probably caused by the recently landed patch http://review.whamcloud.com/9100, which changes the adaptive timeout (AT) code for the first time in a long while.

Comment by James A Simmons [ 20/May/14 ]

Is at_max too low?

Comment by Bob Glossman (Inactive) [ 13/Jun/14 ]

Another failure on master:
https://maloo.whamcloud.com/test_sets/e70d95f4-f2f8-11e3-b88b-52540035b04c

Comment by Andreas Dilger [ 09/Jul/14 ]

Still failing occasionally: https://testing.hpdd.intel.com/test_sessions/a356cb0e-075d-11e4-92f3-5254006e85c2

Comment by Chris Horn [ 23/Jul/14 ]

http://review.whamcloud.com/9100 introduced a regression for the case where we're sending early replies in recovery. The deadline increase (service estimate increase) is calculated as:

Recovery case:

                at_measured(&svcpt->scp_at_estimate, min(at_extra,
                            req->rq_export->exp_obd->obd_recovery_timeout / 4));

Normal case is:

                at_measured(&svcpt->scp_at_estimate, at_extra +
                            cfs_time_current_sec() -
                            req->rq_arrival_time.tv_sec);

We probably want something like the following for the recovery case:

                at_measured(&svcpt->scp_at_estimate, min(at_extra,
                            req->rq_export->exp_obd->obd_recovery_timeout / 4) +
                            cfs_time_current_sec() - req->rq_arrival_time.tv_sec);
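
For illustration only, here is a minimal standalone sketch with assumed numbers (at_extra, obd_recovery_timeout and the elapsed time are made up, not taken from the logs), treating the value fed to at_measured() as the resulting estimate for simplicity:

    /*
     * Hedged illustration of the two formulas above with assumed numbers;
     * not code from the Lustre tree.
     */
    #include <stdio.h>

    #define MIN(a, b) ((a) < (b) ? (a) : (b))

    int main(void)
    {
            long at_extra = 30;              /* assumed at_extra (seconds) */
            long obd_recovery_timeout = 300; /* assumed recovery window (seconds) */
            long elapsed = 120;              /* assumed now - rq_arrival_time (seconds) */

            /* recovery-case extension after http://review.whamcloud.com/9100 */
            long old_extension = MIN(at_extra, obd_recovery_timeout / 4);

            /* proposed fix: also count the time the request has already waited */
            long new_extension = old_extension + elapsed;

            printf("old: estimate %lds, deadline %lds in the past\n",
                   old_extension, elapsed - old_extension);
            printf("proposed: estimate %lds, deadline %lds in the future\n",
                   new_extension, new_extension - elapsed);
            return 0;
    }

With these numbers the old recovery-case formula feeds only min(30, 75) = 30s into the estimate, so the extended deadline (roughly arrival time + estimate) is already 90s in the past and the early reply buys nothing; the proposed form adds the 120s already elapsed, pushing the deadline 30s past the current time, matching the behaviour of the normal case.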

Alex Boyko at Xyratex pointed out the regression to me, and I believe they will submit a patch.

Comment by Alexander Boyko [ 24/Jul/14 ]

patch http://review.whamcloud.com/#/c/11213/

Comment by James A Simmons [ 17/Oct/14 ]

This appears to be related to the LU-5077 problems. Since the patch for LU-4578 is in the b2_5 branch, I believe it might be the source of our recovery problems. I was seeing recovery issues on 2.5 in my test bed until I applied the patch from this ticket.

Comment by Jian Yu [ 21/Oct/14 ]

Here is the back-ported patch for the Lustre b2_5 branch: http://review.whamcloud.com/12365

Comment by Jian Yu [ 24/Oct/14 ]

patch http://review.whamcloud.com/#/c/11213/

The above patch introduced a regression failure in replay-vbr test 7e.

On master branch:
https://testing.hpdd.intel.com/test_sets/92fe44e0-5b80-11e4-a35f-5254006e85c2

On b2_5 branch:
https://testing.hpdd.intel.com/test_sets/4d02235e-59c2-11e4-aa32-5254006e85c2
https://testing.hpdd.intel.com/test_sets/25213bfe-59c2-11e4-aa32-5254006e85c2
https://testing.hpdd.intel.com/test_sets/3238224a-59bc-11e4-816e-5254006e85c2
https://testing.hpdd.intel.com/test_sets/166cc778-59bc-11e4-816e-5254006e85c2

Comment by Christopher Morrone [ 24/Oct/14 ]

I tried http://review.whamcloud.com/12365 on top of LLNL's 2.5.3-1chaos tag and saw lots of problems. I don't know whether they were actually caused by it or not. The experience is recorded in LU-5805.

Comment by Jian Yu [ 28/Oct/14 ]

patch http://review.whamcloud.com/#/c/11213/

I manually ran the replay-vbr test and found that it eventually passed, taking about 8 hours:
https://testing.hpdd.intel.com/test_sets/18fb6cc0-5eb1-11e4-a2a3-5254006e85c2

Among the sub-tests, test 7e took 3876s, which exceeded the 3600s timeout value set by the autotest system. That was why the test was stopped in autotest runs.

Comment by Jodi Levi (Inactive) [ 28/Oct/14 ]

Patch landed to master.

Comment by Jian Yu [ 30/Oct/14 ]

Here is the patch for the master branch to speed up replay-vbr test 7*: http://review.whamcloud.com/12490
And here is the test result: https://testing.hpdd.intel.com/test_sets/0ae4ab72-5fcb-11e4-895a-5254006e85c2

With the above patch, the total run time for replay-vbr test 7* was reduced from 18796s to 4742s.

Comment by Andreas Dilger [ 31/Oct/14 ]

Reopening this bug until the replay-vbr test patch has landed to master; otherwise it will not be tracked properly.

Comment by Peter Jones [ 04/Nov/14 ]

It has landed to master now.

Comment by Jian Yu [ 12/Nov/14 ]

The patches caused regression failures LU-5900, LU-5901, and LU-5902 on the master and b2_5 branches.

Comment by Andreas Dilger [ 12/Nov/14 ]

Alexander, could you please take a look at the LU-5900, LU-5901, and LU-5902 failures? They appear to be caused by the http://review.whamcloud.com/11213 patch backported to b2_5 (http://review.whamcloud.com/12365). Is this something that just causes the tests to fail and won't affect real users, or is there something bad in b2_5 that will cause recovery problems for users as well?

Comment by Jian Yu [ 14/Nov/14 ]

I just pushed a patch to the Lustre b2_5 branch to fix the calculation of service_time in max_recovery_time(): http://review.whamcloud.com/12714.
I'll check the test results.

Comment by Gerrit Updater [ 15/Nov/14 ]

Jian Yu (jian.yu@intel.com) uploaded a new patch: http://review.whamcloud.com/12714
Subject: LU-5079 tests: fix service_time in max_recovery_time()
Project: fs/lustre-release
Branch: b2_5
Current Patch Set: 3
Commit: 264ea0643d46240b8b45dd673aeff3b4fe76bd10

Comment by Gerrit Updater [ 15/Nov/14 ]

Jian Yu (jian.yu@intel.com) uploaded a new patch: http://review.whamcloud.com/12724
Subject: LU-5079 tests: fix service_time in max_recovery_time()
Project: fs/lustre-release
Branch: master
Current Patch Set: 2
Commit: 4c0c53265dc9fb17bb8999548d656103ced58928

Comment by Jian Yu [ 15/Nov/14 ]

Test results showed that increasing the time computed in max_recovery_time() in the test framework resolved the failures in LU-5900 and LU-5901, and that adding some margin to the recovery time comparison condition resolved the failure in LU-5902.

Comment by Gerrit Updater [ 16/Nov/14 ]

Jian Yu (jian.yu@intel.com) uploaded a new patch: http://review.whamcloud.com/12724
Subject: LU-5079 tests: fix service_time in max_recovery_time()
Project: fs/lustre-release
Branch: master
Current Patch Set: 3
Commit: 1c58e970d684edfefba3725f869ffd36c6a4475e

Comment by Gerrit Updater [ 27/Nov/14 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12724/
Subject: LU-5079 tests: fix service_time in max_recovery_time()
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 9bb24bf1ce4977b32d4bf9b55cef5a25072cef5e

Comment by Gerrit Updater [ 27/Nov/14 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12714/
Subject: LU-5079 tests: fix service_time in max_recovery_time()
Project: fs/lustre-release
Branch: b2_5
Current Patch Set:
Commit: 4e8def3e32ad76808a5b8336d43430a5318e20aa
