[LU-14519] Cascading evictions due to lock timeouts when some of the servers are unhealthy Created: 13/Mar/21 Updated: 25/Apr/23 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | New Feature | Priority: | Critical |
| Reporter: | Oleg Drokin | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
This is to capture my discussion with vitaly_fertman and perhaps let other interested parties chime in.

We have a bit of a problem: some operations require multiple locks held from multiple targets (e.g. writes under O_APPEND opened files, truncate, striped dir operations, ...) and one or more of those targets can go down. Partially acquired locks are then subject to conflicts, but cannot be released by the clients that hold them, because those clients are stuck trying to acquire the other locks or trying to flush some state to other servers and the like.

Now, while it's possible to proclaim "let's get rid of such dependencies", that seems to be mostly a pipe dream, esp. considering potential second and third order inter-dependencies (see ...). Some might advocate "let's just increase the ldlm lock cancel timeout universally", and while that would help some, it would not help entirely in situations where servers take an extremely long time to come back up, or if there are genuine client side problems with deadlocks or other logic flaws.

Here's my rough idea of how this could be solved, which most likely needs additional fleshing out: disable evicting clients for lock cancel timeouts as long as there are "unhealthy" targets in the system (either down or in recovery).

Notably this is only half the problem. The other half is that if we allow locks to be stuck indefinitely while servers are in failover, that would quickly lead to all available server threads being stuck waiting for such locks. To address this second half, I propose we mark resources that encountered such "stuck" locks in a special way: any attempt at getting a server side lock there would return an error similar to EAGAIN, which would always lead to aborting the current operation processing and returning the EAGAIN-like state back to the client, to be retried at some later time. It is ok to allow normal (non-intent) client lock requests on such resources, since this returns an ungranted lock to the client without plugging a server thread (granted, there are timeout implications here too, so it might also be possible to universally turn away any lock requests to such resources).

And lastly there's a potential AT consideration where we might want to exclude the elevated processing time of requests while the cluster is unhealthy, though considering AT self-corrects within 600 seconds, it might not be as important. |
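To make the resource-marking half of the proposal above more concrete, here is a minimal user-space sketch in plain C. All names (struct resource, RES_STUCK_CANCEL, intent_enqueue, cluster_unhealthy) are hypothetical and do not refer to existing Lustre symbols; the real change would live in the ldlm server-side enqueue path, the sketch only shows the proposed control flow.

```c
/*
 * Sketch only, not Lustre code: resources with a cancel stuck while some
 * target is unhealthy get flagged, and intent enqueues on them are turned
 * away with -EAGAIN instead of pinning a server thread.
 */
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

struct resource {
	unsigned int res_flags;		/* hypothetical per-resource flags */
};

#define RES_STUCK_CANCEL 0x1		/* a cancel on this resource is stuck */

static bool cluster_unhealthy;		/* some target is down or in recovery */

/* Called when a blocking callback times out, instead of evicting the client. */
static void mark_resource_stuck(struct resource *res)
{
	if (cluster_unhealthy)
		res->res_flags |= RES_STUCK_CANCEL;	/* defer eviction, remember state */
}

/* Server-side intent enqueue: bail out early so no server thread is pinned. */
static int intent_enqueue(struct resource *res)
{
	if (res->res_flags & RES_STUCK_CANCEL)
		return -EAGAIN;		/* client aborts the operation and retries later */
	/* ... normal intent processing would go here ... */
	return 0;
}

int main(void)
{
	struct resource res = { 0 };

	cluster_unhealthy = true;	/* e.g. an OST entered recovery */
	mark_resource_stuck(&res);
	printf("intent_enqueue: %d\n", intent_enqueue(&res));	/* prints -11 */
	return 0;
}
```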
| Comments |
| Comment by Andreas Dilger [ 13/Mar/21 ] |
|
I think there are a few options here:
Depending on an external framework for correctness is not great, IMHO. |
| Comment by Oleg Drokin [ 13/Mar/21 ] |
|
We actually discussed ERESTARTSYS and it seems to be very nontrivial. Depending on external frameworks is not great, but we already do it, so why not get some extra direct benefits and have it provide direct Lustre feedback too? Think of evict-by-nid that was used to preemptively evict dead clients, for example; the idea is similar, and everything still works in the absence of this 3rd party input. Like I said, we could have an internal mechanism of more frequent pings (for the ping evictor to work faster for the MGS) plus proactive notifications by targets when they go down. |
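As an illustration of the proactive-notification idea above (not an existing Lustre mechanism; every name below is hypothetical), the eviction decision on a lock cancel timeout could consult a health table that such notifications keep up to date:

```c
/*
 * Hedged sketch in plain C: targets report themselves down/up, and the
 * lock-cancel-timeout path skips eviction while anything is unhealthy.
 */
#include <stdbool.h>
#include <stdio.h>

#define MAX_TARGETS 16

static bool target_unhealthy[MAX_TARGETS];	/* hypothetical health table */

/* Notification handlers (assumed, not an existing Lustre API). */
static void note_target_down(int idx)	{ target_unhealthy[idx] = true; }
static void note_target_up(int idx)	{ target_unhealthy[idx] = false; }

static bool any_target_unhealthy(void)
{
	for (int i = 0; i < MAX_TARGETS; i++)
		if (target_unhealthy[i])
			return true;
	return false;
}

/* Eviction decision on a lock cancel timeout. */
static bool should_evict_on_cancel_timeout(void)
{
	/* Skip the eviction while any target is down or in recovery. */
	return !any_target_unhealthy();
}

int main(void)
{
	note_target_down(3);	/* e.g. OST0003 failed over */
	printf("evict? %d\n", should_evict_on_cancel_timeout());	/* 0 */
	note_target_up(3);
	printf("evict? %d\n", should_evict_on_cancel_timeout());	/* 1 */
	return 0;
}
```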
| Comment by Vitaly Fertman [ 15/Mar/21 ] |
|
The problem concerns not only the lock enqueue but the operation under the taken locks, which means we cannot simply return ERESTARTSYS and re-try: we need to remember the used xids and re-use them, otherwise who knows whether the operation has already been applied on the server and the retry would just be a new one.

The approach Oleg suggests looks problematic because it requires some time to detect a node failure, report it to the MGS, and report it server-wide. The AT, which may even be gathered somewhere, is unknown to the current node waiting for a lock cancel, thus it is 3 obd_timeout (ok, considering just 1 delivery failure at a time - 1 obd_timeout) - by this time the client will already be evicted. Having this in mind, I also suggested to Oleg what Andreas described above - the problem client has the AT of communication with all the servers and can report to the involved OSTs about a failure that appeared in the system.

At the same time, looking for a simpler solution, my original question was - why is a large enough lock callback timeout so bad? How may a small timeout be useful? e.g.: ...

If so, probably the only question is what if a large timeout is still not large enough to avoid evictions during failover? A simple answer could be a tunable parameter for a dedicated system to survive just 1 server failover, or the hard failover limit? |
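To illustrate the "failover-sized timeout" question at the end of this comment, here is a hedged sketch in plain C. obd_timeout and recovery_time_hard stand in for the Lustre tunables of those names, while ldlm_cancel_margin and lock_cancel_timeout() are invented purely for the illustration; the point is only that the cancel timeout would be derived from how long one full failover may take rather than from the per-request AT alone.

```c
/* Illustration only, not Lustre code. */
#include <stdio.h>

static unsigned int obd_timeout = 100;		/* seconds, illustrative value */
static unsigned int recovery_time_hard = 900;	/* max allowed failover, illustrative */
static unsigned int ldlm_cancel_margin = 2;	/* hypothetical safety factor */

/* Pick a cancel timeout that survives one full failover of one server. */
static unsigned int lock_cancel_timeout(unsigned int at_estimate)
{
	unsigned int base = at_estimate > obd_timeout ? at_estimate : obd_timeout;
	unsigned int failover = recovery_time_hard + obd_timeout;

	return ldlm_cancel_margin * (base > failover ? base : failover);
}

int main(void)
{
	printf("cancel timeout: %us\n", lock_cancel_timeout(30));	/* 2000s */
	return 0;
}
```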
| Comment by Vitaly Fertman [ 15/Jul/21 ] |
|
Another problem here happens on a healthy cluster: with 256 stripes it is enough to have 1-2 resends for enqueues, while trying to get all these locks taken in order, to hit the lock cb timeout on the first lock. |
| Comment by Vitaly Fertman [ 03/Aug/21 ] |
|
After giving it more thought I have a couple more ideas. First of all, once more, why the lock callback timeout is needed:

== idea #1 ==

== idea #2 ==
The client could control how the connect is going, and in case of connect timeouts/errors it would not repeat endlessly but have a global timeout for establishing a connection, or a connection counter. Once expired:
This lets us avoid evictions entirely in case of an unhealthy target.

== idea #3 ==

== idea #4 ==
This way we can avoid evictions.

== idea #5 ==
Instead of introducing a cross-OSC lock chain, we may tell ldlm at the beginning of the IO (cl_io_lock) to prolong the locks of this IO until told otherwise, and cancel this instruction at the end (cl_io_unlock); see the sketch after this comment. The implementation looks simpler. A possible downside is an error in the code (2): eviction will occur on the LCT hard limit in this case, as the prolong will keep going; since this is an error case it might be considered OK. The potential problem Oleg pointed to here is the inability to get logs for this whole very long period of time.

thoughts ? |
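A rough sketch of idea #5, referenced above, in plain C with hypothetical names (io_locks_start, io_locks_done, cancel_timer_expired are invented for the illustration; cl_io_lock()/cl_io_unlock() are only mentioned in comments as the intended hook points):

```c
/*
 * Sketch only: the IO entry point (what cl_io_lock() does in Lustre) marks
 * its locks "prolong until further notice"; the exit point (cl_io_unlock())
 * clears the mark so the normal callback timeout applies again.
 */
#include <stdbool.h>
#include <stdio.h>

struct io_lock {
	bool prolong;		/* keep extending the blocking-cb deadline */
	unsigned int deadline;	/* time (s) at which eviction would trigger */
};

/* Called where cl_io_lock() acquires the locks for an IO. */
static void io_locks_start(struct io_lock *locks, int n)
{
	for (int i = 0; i < n; i++)
		locks[i].prolong = true;
}

/* Called where cl_io_unlock() releases them. */
static void io_locks_done(struct io_lock *locks, int n)
{
	for (int i = 0; i < n; i++)
		locks[i].prolong = false;
}

/* Timer check: a prolonged lock gets its deadline pushed out instead of
 * triggering an eviction; the hard limit is not modelled here. */
static bool cancel_timer_expired(struct io_lock *lk, unsigned int now)
{
	if (lk->prolong) {
		lk->deadline = now + 150;	/* assumed prolong interval */
		return false;
	}
	return now >= lk->deadline;
}

int main(void)
{
	struct io_lock lk = { .prolong = false, .deadline = 100 };

	io_locks_start(&lk, 1);
	printf("expired at t=200? %d\n", cancel_timer_expired(&lk, 200));	/* 0 */
	io_locks_done(&lk, 1);
	printf("expired at t=400? %d\n", cancel_timer_expired(&lk, 400));	/* 1 */
	return 0;
}
```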