Details
-
Bug
-
Resolution: Fixed
-
Minor
-
None
-
3
-
12504
Description
There seems to be a flaw in the logic which extends the request deadline when sending early reply. This bit of code is supposed to check whether we're able to extend the deadline:
/* Fake our processing time into the future to ask the clients * for some extra amount of time */ at_measured(&svcpt->scp_at_estimate, at_extra + cfs_time_current_sec() - req->rq_arrival_time.tv_sec); /* Check to see if we've actually increased the deadline - * we may be past adaptive_max */ if (req->rq_deadline >= req->rq_arrival_time.tv_sec + at_get(&svcpt->scp_at_estimate)) { DEBUG_REQ(D_WARNING, req, "Couldn't add any time " "(%ld/%ld), not sending early reply\n", olddl, req->rq_arrival_time.tv_sec + at_get(&svcpt->scp_at_estimate) - cfs_time_current_sec()); RETURN(-ETIMEDOUT); }
This logic looks sound to me. The service estimate is bounded by at_max, so arrival time + service estimate is guaranteed to be less than or equal to at_max seconds after arrival time. If the current deadline is equal or greater than that, we cannot extend the deadline. The problem is with calculating the new deadline when the current deadline is less than the arrival time + service estimate sum:
newdl = cfs_time_current_sec() + at_get(&svcpt->scp_at_estimate);
Here we offset the current time rather than the arrival time. This is not only inconsistent with the above logic, but it can also result in a situation where we extend the deadline well past at_max seconds from the arrival time. In one instance we noticed timeouts on servers of 1076 seconds, and associated timeouts on clients of over 1300 seconds.
Similarly, on the client side, when an early reply is received we currently extend the deadline thusly:
req->rq_deadline = cfs_time_current_sec() + req->rq_timeout + ptlrpc_at_get_net_latency(req);
There are two things I think are potentially wrong with this. The first is that we again are offsetting the current time rather than, say, rq_sent or rq_arrival, which would make more sense to me. Secondly, and more importantly, the sum of req->rq_timeout and net latency can exceed at_max seconds. If my reading of the code is correct, both rq_timeout and net latency are bounded by at_max, so in worst case the sum could be 2*at_max.
I plan to push a patch to fix the server side issue.
- newdl = cfs_time_current_sec() + at_get(&svcpt->scp_at_estimate); + newdl = req->rq_arrival_time.tv_sec + at_get(&svcpt->scp_at_estimate);
One concern I have with this change is how it might affect things during recovery. I'm not as familiar with the recovery mode, so there may be some assumptions in there that this patch violates. Hopefully Maloo will let me know if this is the case
I have similar patch for the client side
- req->rq_deadline = cfs_time_current_sec() + req->rq_timeout + - ptlrpc_at_get_net_latency(req); + req->rq_deadline = req->rq_sent + min_t(int, at_max, req->rq_timeout + + ptlrpc_at_get_net_latency(req));
Same sort of concern w.r.t recovery mode. For now, I'm just planning on pushing the server side change. Thanks in advance for any feedback.