Details
- Type: New Feature
- Resolution: Unresolved
- Priority: Critical
Description
This is to capture my discussion with vitaly_fertman and perhaps let other interested parties chime in.
We have a bit of a problem: some operations require multiple locks held across multiple targets (e.g. writes to files opened with O_APPEND, truncate, striped dir operations, ...), and one or more of those targets can go down. Partially acquired locks are then subject to conflicts, but cannot be released by the clients that hold them, because those clients are stuck trying to acquire the remaining locks or trying to flush some state to other servers and the like.
Now, while it's possible to proclaim "let's get rid of such dependencies", that seems to be mostly a pipe dream, especially considering potential second- and third-order inter-dependencies (see LU-13131 as an example).
Some might advocate "let's just increase the ldlm lock cancel timeout universally", and while that would help somewhat, it would not help entirely in situations where servers take an extremely long time to come back up, or where there are genuine client-side problems such as deadlocks or other logic flaws.
Here's my rough idea of how this could be solved; it most likely needs additional fleshing out:
- Disable evicting clients on lock cancel timeouts as long as there are "unhealthy" targets in the system, either down or in recovery (see the sketch after this list).
- The MGS would be tasked with determining this state and distributing it to the servers.
- There are multiple approaches to how the MGS could get this information, from the mundane HA monitoring already universally in place to detect dead servers and trigger failover (but which does not tell the other Lustre parts about it) to advanced "health network" implementations of the future.
- Additionally, a more rapid ping evictor mechanism between the MGS and the servers could be implemented if desired.
- An extra change would be needed to notify the MGS when servers go down voluntarily (especially for failover).
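To make the first point concrete, here is a minimal sketch of what the server-side check could look like, assuming a hypothetical cluster-state structure fed by the MGS; none of these names are existing Lustre symbols.
{code:c}
/*
 * Minimal sketch only, assuming a hypothetical cluster-state structure
 * fed by the MGS; struct cluster_state and should_evict_on_lct() are
 * illustrative names, not existing Lustre symbols.
 */
#include <stdbool.h>
#include <stdio.h>

struct cluster_state {
	int unhealthy_targets;	/* down/in-recovery targets, as reported by the MGS */
};

/* Called when a client has missed its lock cancel timeout. */
static bool should_evict_on_lct(const struct cluster_state *cs)
{
	/*
	 * While any target is down or in recovery, the missed cancel may
	 * well be a consequence of that target and not a client fault,
	 * so suppress the eviction for now.
	 */
	return cs->unhealthy_targets == 0;
}

int main(void)
{
	struct cluster_state cs = { .unhealthy_targets = 1 };

	printf("evict on lock cancel timeout: %s\n",
	       should_evict_on_lct(&cs) ? "yes" : "no (cluster unhealthy)");
	return 0;
}
{code}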
Notably, this is only half the problem. The other half is that if we allow locks to be stuck indefinitely while servers are in failover, all available server threads would quickly end up stuck waiting for such locks.
To address this second half, I propose marking resources that encountered such "stuck" locks in a special way: any attempt at taking server-side locks there would return an error similar to EAGAIN, which would always lead to aborting the current operation processing and returning the EAGAIN (or similar) state back to the client, to be retried at some later time.
It is OK to allow normal (non-intent) client lock requests on such resources, since those return an ungranted lock to the client without plugging a server thread (granted, there are timeout implications here too, so it might also be possible to universally turn away any lock requests to such resources).
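A rough sketch of that resource marking, with made-up names (ldlm_res_stub, RES_FLAG_STUCK_LOCKS) standing in for whatever the real implementation would use:
{code:c}
/*
 * Illustrative only; the structure and flag below are invented for this
 * sketch and are not existing Lustre code.
 */
#include <errno.h>
#include <stdbool.h>

#define RES_FLAG_STUCK_LOCKS	0x1	/* resource has seen a "stuck" lock */

struct ldlm_res_stub {
	unsigned int	flags;
};

/*
 * Server-side (intent) lock attempt on a marked resource: abort the
 * operation right away and hand -EAGAIN back to the client for a later
 * retry instead of parking a server thread on the conflict.
 */
static int server_lock_resource(struct ldlm_res_stub *res, bool intent)
{
	if ((res->flags & RES_FLAG_STUCK_LOCKS) && intent)
		return -EAGAIN;

	/*
	 * Plain (non-intent) client enqueues may still be accepted here
	 * and returned ungranted; granting/blocking logic is omitted.
	 */
	return 0;
}
{code}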
And lastly, there's a potential AT consideration: we might want to exclude the elevated processing times seen for requests while the cluster is unhealthy, though considering that AT self-corrects within 600 seconds, it might not be as important.
Attachments
Issue Links
- is related to LU-16770 Client evictions with overstriped files (Resolved)
After giving it more thought, I have a couple more ideas:
First of all, once more, why the lock callback timeout is needed:
1. deadlock on ldlm locks
2. client error - a client-side deadlock, or a client that is not going to send the proper RPCs to servers
3. unhealthy target
Where it is already not needed:
4. dead client - leaving it for the pinger to resolve
And, in contrast, a case where the timeout is hit without a real fault:
5. everything is very slow (resends) or there is a lot of work (wide striping)
== idea #1.
This is what was discussed above: leave the lock callback timeout (LCT) relatively small and introduce prolong RPCs to extend it while a slow IO (case 5) is in progress. Such a prolong is sent only if a new step has been completed by the IO job; e.g. if a client gets stuck on a mutex: no progress, no prolong.
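A minimal sketch of that progress-gated prolong, assuming an invented per-IO progress counter and a send_prolong_rpc() placeholder (neither exists in Lustre under these names):
{code:c}
/*
 * Sketch only: io_progress and send_prolong_rpc() are assumed names to
 * show the "prolong only on progress" rule.
 */
#include <stdbool.h>
#include <stdint.h>

struct io_progress {
	uint64_t steps_done;		/* bumped each time the IO completes a step */
	uint64_t steps_at_last_check;	/* snapshot taken at the previous check */
};

static void send_prolong_rpc(void)
{
	/* placeholder: ask the server to extend the lock callback timeout */
}

/* Periodic check on the client while the IO still holds DLM locks. */
static bool maybe_prolong(struct io_progress *p)
{
	/* stuck (e.g. on a mutex): no new step, no prolong, the LCT runs out */
	if (p->steps_done == p->steps_at_last_check)
		return false;

	p->steps_at_last_check = p->steps_done;
	send_prolong_rpc();
	return true;
}
{code}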
== idea #2.
It is a pity that in the case of an unhealthy target (case 3) the client keeps trying to connect and finally gets evicted (no new steps completed, no progress, no prolong), even though it was not this client's fault. Therefore a prolong RPC alone is not enough. Oleg already suggested informing the servers about an unhealthy target through the MGS and disabling the LCT, but that leads to locks stuck in the cluster. I have another idea:
The client could control how the connect is going and, in case of connect timeouts/errors, not repeat it endlessly but have a global timeout for establishing the connection, or a connection attempt counter. Once that expires, the operation is interrupted, to be restarted later. This way:
- it lets us avoid evictions entirely in the case of an unhealthy target,
- there are not even errors in the enqueue phase (which takes most of the time).
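Here is what the bounded connect attempt could look like; try_connect(), the deadline and the attempt limit are all placeholders, not existing Lustre interfaces:
{code:c}
/*
 * Sketch of the bounded connect from idea #2; all names and limits here
 * are illustrative placeholders.
 */
#include <errno.h>
#include <time.h>

#define CONNECT_DEADLINE_SEC	300	/* example global connect budget */
#define CONNECT_MAX_ATTEMPTS	10	/* or bound by an attempt counter */

static int try_connect(void)
{
	return -ETIMEDOUT;		/* stub for the real connect RPC */
}

/*
 * Instead of retrying forever (and eventually being evicted through no
 * fault of its own), the client gives up after a bounded effort so the
 * IO can be interrupted and restarted later.
 */
static int bounded_connect(void)
{
	time_t deadline = time(NULL) + CONNECT_DEADLINE_SEC;
	int attempts = 0;
	int rc;

	while ((rc = try_connect()) != 0) {
		if (++attempts >= CONNECT_MAX_ATTEMPTS ||
		    time(NULL) >= deadline)
			return -EAGAIN;	/* interrupt the IO, retry it later */
	}
	return rc;
}
{code}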
== idea #3.
While waiting for an RPC (enqueue, truncate, connect) to complete or time out, despite the fact that there is no progress, the prolong mechanism must guarantee the locks stay prolonged until RPC completion or timeout; only then is the decision made whether prolongation is to be disabled.
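In sketch form (again with invented names), the rule is simply that an outstanding RPC keeps prolongation alive, and the disable decision is taken only on completion or timeout:
{code:c}
/* Sketch of idea #3; io_state and its fields are assumed names. */
#include <stdbool.h>

struct io_state {
	bool rpc_in_flight;	/* enqueue, truncate or connect outstanding */
	bool prolong_enabled;	/* may only be cleared once the RPC is done */
};

static bool should_prolong(const struct io_state *s)
{
	/* no visible progress while the RPC is in flight, prolong anyway */
	return s->rpc_in_flight || s->prolong_enabled;
}

static void rpc_finished(struct io_state *s, bool timed_out)
{
	s->rpc_in_flight = false;
	/* only now decide whether prolongation is to be disabled */
	if (timed_out)
		s->prolong_enabled = false;
}
{code}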
== idea #4.
An interesting case here is lock completion. In contrast to an RPC reply, it may take a much longer time to arrive, especially with this prolong mechanism in place. If we do not prolong, the client will be evicted while the other, conflicting client is slowly progressing; if we do prolong, a deadlock lasts until the LCT hard limit expires. An idea here is to introduce a client-side timeout similar to idea #2, of the same order of magnitude as the current server-side LCT. Once it expires, the operation is interrupted and restarted as above; while it has not expired, the prolonging keeps going.
This way we can avoid evictions, and an ldlm deadlock will be resolved too.
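A sketch of that client-side completion wait, with made-up names and an example timeout value of the same order as the server-side LCT:
{code:c}
/*
 * Sketch of idea #4; the helpers and the timeout value are illustrative
 * placeholders only.
 */
#include <errno.h>
#include <stdbool.h>
#include <time.h>
#include <unistd.h>

#define CLIENT_COMPLETION_TIMEOUT_SEC	100	/* ~ the server-side LCT */

static bool completion_arrived(void)
{
	return false;			/* stub for the completion AST check */
}

static void send_prolong_rpc(void)
{
	/* placeholder: keep the held locks' callback timeout extended */
}

/*
 * Wait for the lock completion; keep prolonging the locks we already
 * hold while waiting.  Once the client-side timeout expires, interrupt
 * the IO and restart it: this avoids the eviction and also breaks an
 * ldlm deadlock without waiting for the LCT hard limit.
 */
static int wait_for_completion(void)
{
	time_t deadline = time(NULL) + CLIENT_COMPLETION_TIMEOUT_SEC;

	while (!completion_arrived()) {
		if (time(NULL) >= deadline)
			return -EAGAIN;	/* interrupted and restarted as above */
		send_prolong_rpc();
		sleep(1);		/* placeholder for the real wait */
	}
	return 0;
}
{code}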
== idea #5.
Due to #3 and #4, another idea appears:
Instead of introducing a cross-OSC lock chain, we may tell ldlm at the beginning of the IO (cl_io_lock) to prolong this IO's locks until told otherwise, and cancel that instruction at the end (cl_io_unlock). The implementation looks simpler.
A possible downside is an error in the code (case 2): eviction will then occur only at the LCT hard limit, as the prolonging will keep going. As this is an error case, it might be considered OK. The potential problem Oleg pointed to here is the inability to get logs for this whole, very long period of time.
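A sketch of how small that change could be; the io handle and its flag are invented for illustration, and only cl_io_lock/cl_io_unlock refer to the real CLIO phases mentioned above:
{code:c}
/*
 * Sketch of idea #5; the io handle and the flag are invented for
 * illustration.
 */
#include <stdbool.h>

struct io_handle {
	bool prolong_locks;	/* "keep prolonging this IO's locks" marker */
};

/* Set at the cl_io_lock stage: ldlm prolongs all of this IO's locks. */
static void io_lock_stage(struct io_handle *io)
{
	io->prolong_locks = true;
}

/*
 * Cleared at the cl_io_unlock stage.  If a coding error (case 2) leaves
 * the flag set, the eviction still happens, but only at the LCT hard
 * limit, hence the very long window of logs to collect.
 */
static void io_unlock_stage(struct io_handle *io)
{
	io->prolong_locks = false;
}
{code}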
Thoughts?