Details
-
Improvement
-
Resolution: Unresolved
-
Major
Description
The adaptive timeout (AT) code currently works at a granularity of whole seconds and ignores timeouts of "0". Service times that are recorded as 0 seconds are therefore dropped, so the adaptive timeout code on the MDS doesn't really adjust the timeouts there.
For example, the bl_ast timeout (ldlm_bl_timeout) stays at the default value of 100 seconds * 1.5, i.e. 150 seconds.
That is a very long time to wait, and the AT code is supposed to shorten it.
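To make the effect concrete, here is a toy model of the behavior described above. It is only a sketch: the real AT code keeps a history of samples, while this hypothetical `WholeSecondEstimator` class simply tracks the largest sample seen. The truncation to whole seconds and the skipping of zero samples are the only parts taken from the description.

```python
DEFAULT_ESTIMATE_S = 100  # default timeout from the description, in seconds


class WholeSecondEstimator:
    """Toy model (hypothetical): largest whole-second sample, zeros ignored."""

    def __init__(self, initial_s=DEFAULT_ESTIMATE_S):
        self.estimate_s = initial_s
        self.have_sample = False

    def measure(self, elapsed_s):
        sample = int(elapsed_s)  # truncate to whole seconds
        if sample == 0:
            return  # zero samples are ignored, as in the current code
        if not self.have_sample:
            # The first real sample replaces the default.
            self.estimate_s = sample
            self.have_sample = True
        else:
            self.estimate_s = max(self.estimate_s, sample)


est = WholeSecondEstimator()
for elapsed in (0.02, 0.3, 0.9):  # typical sub-second service times
    est.measure(elapsed)

# The estimate never moved off the default, so the 1.5x timeout is still 150.
print(est.estimate_s, est.estimate_s + est.estimate_s // 2)
```

With only sub-second samples the estimate never leaves its default, which matches the 150-second figure above.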
There are two obvious approaches here.
- Stop ignoring "0" values in the adaptive timeout code, and set a non-zero default at_min. Setting it to 1 second should mean no behavioral change, as that's the current minimum real value. This solution should be simple and shouldn't affect existing installs too much, since configuring at_min is pretty common anyway.
- Update the adaptive timeout code to use more precise time intervals than 1 second.
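The two approaches can be compared with a rough sketch. Both functions below are hypothetical illustrations, not the actual AT implementation: the first keeps zero samples and clamps the result to at_min, the second records samples in milliseconds instead of whole seconds. Using the largest sample as the estimate is a simplifying assumption.

```python
def estimate_with_floor_s(samples_s, at_min_s=1):
    """Approach 1 (sketch): keep zero samples, then clamp to at_min."""
    whole_seconds = [int(s) for s in samples_s]  # still whole-second granularity
    return max(max(whole_seconds, default=0), at_min_s)


def estimate_subsecond_ms(samples_s):
    """Approach 2 (sketch): track samples at millisecond granularity."""
    return max((round(s * 1000) for s in samples_s), default=0)


samples = [0.02, 0.3, 0.9]  # sub-second service times, in seconds

print(estimate_with_floor_s(samples))               # clamped to at_min = 1 s
print(estimate_with_floor_s(samples, at_min_s=40))  # clamped to at_min = 40 s
print(estimate_subsecond_ms(samples))               # largest sample, in ms
```

In the first approach the result is bounded below by at_min, so the choice of default matters; in the second the estimate follows the measured values directly.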
I'm inclined toward the first approach. However, in real configurations at_min is generally recommended to be something like 40 seconds, so perhaps we should default to that instead.
Note that ldlm_bl_timeout specifically takes the max() of the adaptive value and ldlm_enqueue_min (which defaults to OBD_TIMEOUT_DEFAULT, 100 seconds), so the timeout will only get down to that value there.
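The floor imposed by ldlm_enqueue_min can be sketched as follows. This is an illustration rather than the kernel code: the function name `bl_timeout_s` is hypothetical, and it assumes the value compared against ldlm_enqueue_min is the AT estimate scaled by 1.5, in line with the "100 seconds * 1.5" figure above.

```python
OBD_TIMEOUT_DEFAULT_S = 100  # default for ldlm_enqueue_min, per the description


def bl_timeout_s(at_estimate_s, enqueue_min_s=OBD_TIMEOUT_DEFAULT_S):
    """Sketch: 1.5x the AT estimate, but never below ldlm_enqueue_min."""
    scaled = at_estimate_s + at_estimate_s // 2  # integer form of estimate * 1.5
    return max(scaled, enqueue_min_s)


for estimate in (1, 40, 100):
    print(estimate, bl_timeout_s(estimate))
```

Even with an estimate of 1 or 40 seconds the result stays at 100 seconds unless ldlm_enqueue_min is lowered as well, for example `bl_timeout_s(40, enqueue_min_s=10)`.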
A few open questions here.
One proposal that might help here (and in other places) is for the servers to persistently track the maximum number of connected clients, so that the MDS/OSS knows after a restart how many clients might connect and can set at_min to an appropriate value right from the start.