Details
-
Improvement
-
Resolution: Unresolved
-
Major
Description
The adaptive timeout (AT) code currently works at a granularity of whole seconds and ignores timeouts of "0". Because those "0" values are discarded, the MDS adaptive timeout code doesn't really adjust the timeouts there.
For example, the bl_ast timeout stays at the default value of 100 seconds * 1.5 (ldlm_bl_timeout), i.e. 150 seconds.
That is a very long time to wait, and the AT code is supposed to shorten it.
There are two obvious approaches here.
- Stop ignoring "0" values in the adaptive timeout code, and set a default non-zero at_min. Setting it to 1 second should mean no behavioral change, as that's the current minimum real value. This solution should be simple and shouldn't affect existing installs too much, since configuring at_min is pretty common anyway.
- Update the adaptive timeout code to use more precise time intervals than 1 second.
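To make the first option concrete, here is a minimal sketch of the idea, not the actual Lustre code. All names (`sketch_at`, `sketch_at_measured`, the `ignore_zero` flag) are hypothetical, and the estimator is deliberately simplified: the first accepted sample replaces the default, and later samples can only raise the estimate.

```c
#include <stdbool.h>

/* Hypothetical, simplified estimator; not the real Lustre AT structure. */
struct sketch_at {
	unsigned int current;     /* current estimate, in whole seconds */
	unsigned int at_min;      /* floor applied when reading the estimate */
	bool have_sample;         /* false until a sample has been accepted */
};

static void sketch_at_init(struct sketch_at *at, unsigned int initial,
			   unsigned int at_min)
{
	at->current = initial;
	at->at_min = at_min;
	at->have_sample = false;
}

/* Record a measured time. Returns true if the sample was accepted. */
static bool sketch_at_measured(struct sketch_at *at, unsigned int sample,
			       bool ignore_zero)
{
	/* Behaviour described in the ticket: "0" samples are dropped. */
	if (ignore_zero && sample == 0)
		return false;

	/* First accepted sample replaces the default; later ones only raise it. */
	if (!at->have_sample || sample > at->current)
		at->current = sample;
	at->have_sample = true;
	return true;
}

/* Read the estimate, clamped to at_min (assumed non-zero under option 1). */
static unsigned int sketch_at_get(const struct sketch_at *at)
{
	return at->current > at->at_min ? at->current : at->at_min;
}
```

With `ignore_zero` set, a stream of sub-second operations never moves the estimate off its 100 second default. With it cleared and `at_min` set to 1, the same stream brings the estimate down to 1 second.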
I'm inclined toward the first option. But in real configs, the generally recommended at_min is something like 40 seconds, so perhaps we should default to that instead.
Note that ldlm_bl_timeout specifically takes the max() of this value and ldlm_enqueue_min (which defaults to OBD_TIMEOUT_DEFAULT, 100 seconds), so the timeout there will only get down to ldlm_enqueue_min.
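The arithmetic can be sketched as follows. This is an illustration based only on the description in this ticket, with a hypothetical helper name (`sketch_bl_timeout`); it is not copied from the Lustre source.

```c
/*
 * Hypothetical helper: scale the AT estimate by 1.5 and apply the
 * ldlm_enqueue_min floor, as described above. Values are whole seconds.
 */
static unsigned int sketch_bl_timeout(unsigned int at_estimate,
				      unsigned int enqueue_min)
{
	/* 150% of the estimate, using integer arithmetic */
	unsigned int scaled = at_estimate + at_estimate / 2;

	return scaled > enqueue_min ? scaled : enqueue_min;
}
```

Under these assumptions, an estimate stuck at the 100 second default gives 150 seconds. An estimate of 1 second (or a 40 second at_min, which scales to 60) still gives 100 seconds because of the floor.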
A few open questions here.