[LU-5079] conf-sanity test_47 timeout Created: 19/May/14 Updated: 27/Nov/14 Resolved: 27/Nov/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.6.0 |
| Fix Version/s: | Lustre 2.7.0, Lustre 2.5.4 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Maloo | Assignee: | Jian Yu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | llnl, patch |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 14015 |
| Description |
|
This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>
This issue relates to the following test suite run:
The sub-test test_47 failed with the following error:
Info required for matching: conf-sanity 47 |
| Comments |
| Comment by Andreas Dilger [ 20/May/14 ] |
|
Seems that replies are being dropped after only 1s due to at_max. This is probably caused by the recently landed patch http://review.whamcloud.com/9100 that changes AT for the first time in a long time. |
| Comment by James A Simmons [ 20/May/14 ] |
|
Is the at_max too low? |
| Comment by Bob Glossman (Inactive) [ 13/Jun/14 ] |
|
another in master: |
| Comment by Andreas Dilger [ 09/Jul/14 ] |
|
Still failing occasionally: https://testing.hpdd.intel.com/test_sessions/a356cb0e-075d-11e4-92f3-5254006e85c2 |
| Comment by Chris Horn [ 23/Jul/14 ] |
|
http://review.whamcloud.com/9100 introduced a regression for the case where we're sending early replies in recovery. The deadline increase (service estimate increase) is calculated as:
Recovery case:
    at_measured(&svcpt->scp_at_estimate, min(at_extra,
                req->rq_export->exp_obd->obd_recovery_timeout / 4));
Normal case is:
    at_measured(&svcpt->scp_at_estimate, at_extra +
                cfs_time_current_sec() -
                req->rq_arrival_time.tv_sec);
We probably want something like the following for recovery case:
    at_measured(&svcpt->scp_at_estimate, min(at_extra,
                req->rq_export->exp_obd->obd_recovery_timeout / 4) +
                cfs_time_current_sec() - req->rq_arrival_time.tv_sec);
Alex Boyko at Xyratex pointed out the regression to me, and I believe they will submit a patch. |
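To make the difference between the three expressions concrete, here is a minimal standalone sketch of the value that would be passed to at_measured() in each case. It is not Lustre code: the function names and plain long parameters are hypothetical stand-ins for at_extra, obd_recovery_timeout, cfs_time_current_sec() and rq_arrival_time.tv_sec, and the numbers in the usage note are illustrative only.

```c
#include <stdio.h>

/* Hypothetical helper; the expressions quoted above use min() on the same operands. */
long min_long(long a, long b)
{
	return a < b ? a : b;
}

/* Recovery case as quoted above: the time the request has already
 * waited is not part of the value. */
long recovery_increase_current(long at_extra, long recovery_timeout)
{
	return min_long(at_extra, recovery_timeout / 4);
}

/* Normal case as quoted above: at_extra plus the time elapsed since
 * the request arrived. */
long normal_increase(long at_extra, long now, long arrival)
{
	return at_extra + now - arrival;
}

/* Recovery case as proposed above: the bounded extra time plus the
 * time elapsed since the request arrived. */
long recovery_increase_proposed(long at_extra, long recovery_timeout,
				long now, long arrival)
{
	return min_long(at_extra, recovery_timeout / 4) + now - arrival;
}
```

With assumed values at_extra = 30, a recovery timeout of 300 and a request that arrived 40 seconds ago, the current recovery expression yields 30, while both the normal case and the proposed recovery expression yield 70. The gap is exactly the elapsed time that the current recovery expression leaves out.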
| Comment by Alexander Boyko [ 24/Jul/14 ] |
| Comment by James A Simmons [ 17/Oct/14 ] |
|
This appears to be related to the |
| Comment by Jian Yu [ 21/Oct/14 ] |
|
Here is the back-ported patch for Lustre b2_5 branch: http://review.whamcloud.com/12365 |
| Comment by Jian Yu [ 24/Oct/14 ] |
|
The above patch introduced a regression failure in replay-vbr test 7e. On master branch: On b2_5 branch: |
| Comment by Christopher Morrone [ 24/Oct/14 ] |
|
I tried http://review.whamcloud.com/12365 on top of LLNL's 2.5.3-1chaos tag, and saw lots of problems. I don't know if they were actually caused by it or not. Experience is recorded in |
| Comment by Jian Yu [ 28/Oct/14 ] |
|
I ran the replay-vbr test manually and found that it eventually passed, taking about 8 hours: Among the sub-tests, test 7e took 3876s, which exceeded the 3600s timeout value set by the autotest system. That was why the test was stopped in autotest runs. |
| Comment by Jodi Levi (Inactive) [ 28/Oct/14 ] |
|
Patch landed to Master. |
| Comment by Jian Yu [ 30/Oct/14 ] |
|
Here is the patch for master branch to speed up replay-vbr test 7*: http://review.whamcloud.com/12490 With the above patch, total run time for replay-vbr test 7* was reduced from 18796s to 4742s. |
| Comment by Andreas Dilger [ 31/Oct/14 ] |
|
Reopen this bug until the replay-vbr test patch has been landed to master, otherwise it will not be tracked properly. |
| Comment by Peter Jones [ 04/Nov/14 ] |
|
It has landed to master now |
| Comment by Jian Yu [ 12/Nov/14 ] |
|
The patches caused regression failures |
| Comment by Andreas Dilger [ 12/Nov/14 ] |
|
Alexander, could you please take a look at |
| Comment by Jian Yu [ 14/Nov/14 ] |
|
I just pushed a patch to Lustre b2_5 branch to fix the calculation of service_time in max_recovery_time(): http://review.whamcloud.com/12714. |
| Comment by Gerrit Updater [ 15/Nov/14 ] |
|
Jian Yu (jian.yu@intel.com) uploaded a new patch: http://review.whamcloud.com/12714 |
| Comment by Gerrit Updater [ 15/Nov/14 ] |
|
Jian Yu (jian.yu@intel.com) uploaded a new patch: http://review.whamcloud.com/12724 |
| Comment by Jian Yu [ 15/Nov/14 ] |
|
Test results showed that by increasing the time in max_recovery_time() in test framework, the failures in |
| Comment by Gerrit Updater [ 16/Nov/14 ] |
|
Jian Yu (jian.yu@intel.com) uploaded a new patch: http://review.whamcloud.com/12724 |
| Comment by Gerrit Updater [ 27/Nov/14 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12724/ |
| Comment by Gerrit Updater [ 27/Nov/14 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12714/ |