[LU-1944] extend recovery timer even during the request queue phase Created: 14/Sep/12 Updated: 20/Oct/14 Resolved: 18/Apr/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.4.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Di Wang | Assignee: | Di Wang |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 6324 |
| Description |
|
We should extend the recovery timer as soon as we receive the replay request, rather than waiting until the request is being processed. We also need to add another net_latency to the timer extension: one to balance rq_deadline (see ptl_send_rpc), and one to allow the request to be resent to the server. |
| Comments |
| Comment by Mikhail Pershin [ 15/Sep/12 ] |
|
Could you describe in a bit more detail why this bug occurs and how you discovered that these changes are needed? |
| Comment by Di Wang [ 17/Sep/12 ] |
|
I found this problem while running replay-dual 9 and 10 (resending create/unlink replay requests). In these tests the MDS drops the replay request and waits for the client to resend it, so the client needs to resend the replay request before the server times out and evicts the client. In current master, the server calculates its timeout as at_estimate_srv_time + net_latency, and the client calculates its request timeout as at_estimate_srv_time + net_latency as well, so the two are roughly the same. As a result, the client is actually evicted on the server side before it resends the replay request. I saw a few failures because of this in replay-dual 9 and 10 in my local testing. We need to add another net_latency to the server timeout so that, when such a drop happens, the resent replay request can reach the server in time. That is why the patch uses 2 * net_latency. I am not sure whether we should also extend the timeout when we queue the request. What do you think? |
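The timing argument above can be sketched as a small model. This is only an illustration, not Lustre code: the function names, the `latency_count` parameter, and the sample values are hypothetical, and it assumes both sides start counting from the same moment and that a resend takes one net_latency to reach the server.

```python
def client_timeout(service_estimate, net_latency):
    """Client-side request timeout: service estimate plus one net latency."""
    return service_estimate + net_latency


def server_deadline(service_estimate, net_latency, latency_count):
    """Server-side recovery deadline with `latency_count` net latencies added.

    latency_count=1 models the behaviour described for current master,
    latency_count=2 models the proposed change (hypothetical parameter).
    """
    return service_estimate + latency_count * net_latency


def resend_arrives_in_time(service_estimate, net_latency, latency_count):
    """Return True if a resent replay request reaches the server before eviction.

    Assumption: the client resends when its own timeout fires, and the
    resend needs one further net_latency to arrive at the server.
    """
    resend_arrival = client_timeout(service_estimate, net_latency) + net_latency
    deadline = server_deadline(service_estimate, net_latency, latency_count)
    return resend_arrival <= deadline


if __name__ == "__main__":
    # Illustrative values in seconds; not taken from any real run.
    print(resend_arrives_in_time(10, 2, latency_count=1))
    print(resend_arrives_in_time(10, 2, latency_count=2))
```

Under these assumptions, a single net_latency leaves the server deadline equal to the client timeout, so the resend always arrives late, while a second net_latency covers exactly the extra trip.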
| Comment by Di Wang [ 17/Sep/12 ] |
|
Just updated the patch so that it does not extend the timeout during the queue phase. I will add that back once this is clear. |
| Comment by Di Wang [ 17/Sep/12 ] |
| Comment by John Hammond [ 20/Oct/14 ] |