[LU-868] Add early reply and netlatency into the timeout of waiting next replay. Created: 20/Nov/11 Updated: 03/Jun/16 Resolved: 03/Jun/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.2.0 |
| Type: | Improvement | Priority: | Minor |
| Reporter: | Di Wang | Assignee: | Di Wang |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Rank (Obsolete): | 4771 |
| Description |
|
In the current handle_recovery_req, the timeout for waiting for the next replay is set with to = lustre_msg_get_timeout(req->rq_reqmsg); This does not account for netlatency, or for early replies that might extend the request timeout on the client side, which can cause some replay tests to fail, such as replay-dual 9 and 10. |
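The adjustment described above might be sketched roughly as follows. This is only an illustration under stated assumptions, not the patch itself: `struct replay_req_info`, its fields, and `next_replay_timeout()` are hypothetical names introduced here, and the real code works on the Lustre request structures rather than on plain integers.

```c
/* Hypothetical stand-in for the values that matter for one replayed request. */
struct replay_req_info {
    unsigned int req_timeout;   /* seconds, the value lustre_msg_get_timeout() would return */
    unsigned int net_latency;   /* seconds, assumed network latency estimate */
    unsigned int early_margin;  /* seconds, assumed extension gained from early replies */
};

/*
 * Timeout for waiting for the next replay. The request timeout alone can be
 * too short, because the client also waits for network latency and may have
 * its deadline extended by an early reply, so both are added on top.
 */
static unsigned int next_replay_timeout(const struct replay_req_info *info)
{
    return info->req_timeout + info->net_latency + info->early_margin;
}
```

With a 60s request timeout, 5s of latency and a 10s early-reply margin, the server in this sketch would wait 75s for the next replay instead of 60s.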
| Comments |
| Comment by Jinshan Xiong (Inactive) [ 22/Nov/11 ] |
|
There is a quite similar scenario at: https://maloo.whamcloud.com/test_sets/eb23c4ba-1510-11e1-b669-52540025f9af The failure on patch set 2 is quite similar to 0. The server started the recovery window at 1321951273.840132, and the recovery timeout was set to 60s. It looks like extending the recovery window by rq_timeout alone is not enough. We should add the RECONNECT time as well, so that the client has enough time to reconnect and replay again. Another question is why the rq_timeout of the replaying request was set to 60 seconds; I think it should be much shorter. |
| Comment by Mikhail Pershin [ 24/Nov/11 ] |
|
I agree that the reconnect time is good to include, but then it should also be added to the initial recovery timeout. As for the 60s timeout, this can be due to network latency; I recall from a discussion with Eric that delays of up to 60s are normal, e.g. with routers. In the current situation, a recovery time of 60s looks more unnatural than the request timeout; it is just too short. |
| Comment by Mikhail Pershin [ 24/Nov/11 ] |
|
The check_and_start_recovery_timer() looks pretty reasonable, but at the end it has:
        service_time -= obd->obd_recovery_timeout;
        if (service_time > 0)
                extend_recovery_timer(obd, service_time);
I think the Maloo issue you are referring to predates your fixes for extend_recovery_timer(), so I suspect this code could produce an incorrect total recovery time, making it too small. What do you think? |
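To make the arithmetic in the quoted lines concrete, here is a small model of that tail. It is a sketch only: `struct recovery_model`, `model_extend()` and `model_check_tail()` are hypothetical names, and `model_extend()` simply adds seconds to the window, which is an assumption and may not match how the real extend_recovery_timer() behaves.

```c
/* Hypothetical model of the recovery state; not the real struct obd_device. */
struct recovery_model {
    int recovery_timeout;   /* seconds, current length of the recovery window */
};

/* Assumed stand-in for extend_recovery_timer(): lengthen the window by `extend` seconds. */
static void model_extend(struct recovery_model *m, int extend)
{
    m->recovery_timeout += extend;
}

/* Mirrors the quoted tail: only the excess over the current window is added. */
static void model_check_tail(struct recovery_model *m, int service_time)
{
    service_time -= m->recovery_timeout;
    if (service_time > 0)
        model_extend(m, service_time);
}
```

In this model, a service time equal to the current 60s window adds nothing, and a service time of 90s only brings the window up to 90s, so no extra room is left for reconnect or latency.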
| Comment by Build Master (Inactive) [ 11/Jan/12 ] |
|
Integrated in Result = SUCCESS
|