Lustre / LU-14519

Cascading evictions due to lock timeouts when some of the servers are unhealthy


Details

    • Type: New Feature
    • Resolution: Unresolved
    • Priority: Critical

    Description

      This is to capture my discussion with vitaly_fertman, and perhaps let other interested parties chime in.

      We have a bit of a problem: some operations require multiple locks held from multiple targets (e.g. writes to files opened with O_APPEND, truncate, striped dir operations, ...), and one or more of those targets may go down. Partially acquired locks are then subject to conflicts, but cannot be released by the clients holding them, because those clients are stuck trying to acquire the remaining locks, flush state to other servers, and the like.
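
      To make the failure mode concrete, here is a minimal standalone sketch (plain C with made-up struct and function names, not actual Lustre code) of how a write to an O_APPEND file that must lock every stripe can end up holding locks on the healthy OSTs while blocked on an unhealthy one:

      {code:c}
      /*
       * Hypothetical, simplified model of the problem.  An O_APPEND write has
       * to take an extent lock on every stripe object, one per OST.  If one
       * OST is down, the client blocks on that enqueue while still holding
       * the locks it already acquired on the healthy OSTs, so those locks
       * cannot be cancelled when a conflicting request triggers a blocking
       * callback.  None of the names below are real Lustre symbols.
       */
      #include <stdbool.h>
      #include <stdio.h>

      struct stripe_target {
              int  st_index;
              bool st_healthy;        /* target up and out of recovery */
              bool st_lock_held;      /* we already hold the extent lock */
      };

      /* Returns 0 on success, -1 if we got stuck on an unhealthy target. */
      static int append_write_lock_all(struct stripe_target *stripes, int count)
      {
              for (int i = 0; i < count; i++) {
                      if (!stripes[i].st_healthy) {
                              /*
                               * The enqueue to this OST hangs until failover
                               * completes; the locks taken in earlier
                               * iterations stay held and start running down
                               * their cancel timeouts on other clients'
                               * conflicting requests.
                               */
                              printf("stuck on OST%d, still holding %d lock(s)\n",
                                     stripes[i].st_index, i);
                              return -1;
                      }
                      stripes[i].st_lock_held = true;
              }
              return 0;
      }

      int main(void)
      {
              struct stripe_target stripes[] = {
                      { .st_index = 0, .st_healthy = true },
                      { .st_index = 1, .st_healthy = false }, /* OST1 in failover */
                      { .st_index = 2, .st_healthy = true },
              };

              return append_write_lock_all(stripes, 3) ? 1 : 0;
      }
      {code}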

      Now, while it's possible to proclaim "let's get rid of such dependencies", that seems to be mostly a pipe dream, especially considering potential second- and third-order inter-dependencies (see LU-13131 as an example).

      Some might advocate "let's just increase the ldlm lock cancel timeout universally", and while that would help somewhat, it would not help entirely in situations where servers take an extremely long time to come back up, or where there are genuine client-side problems such as deadlocks or other logic flaws.

      Here's my rough idea of how this could be solved; it most likely needs additional fleshing out:

      • Disable evicting clients for lock cancel timeouts as long as there are "unhealthy" targets in the system (either down or in recovery); a minimal sketch follows this list.
      • The MGS would be tasked with determining this state and distributing it to the servers.
      • There are multiple approaches to how the MGS could get this information, ranging from the mundane HA monitoring already universally in place to detect dead servers and trigger failover (but which does not tell the rest of Lustre about it) to advanced "health network" implementations in the future.
      • Additionally, a more rapid ping evictor mechanism between the MGS and the servers could be implemented if desired.
      • An extra change would be needed to notify the MGS when servers voluntarily go down (especially for failover).
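
      A minimal sketch of the first point above, assuming the servers receive some MGS-distributed "all targets healthy" flag. The names below (cluster_state, ldlm_should_evict, ...) are illustrative only and do not correspond to existing Lustre symbols:

      {code:c}
      /*
       * Sketch: the lock cancel timeout path consults a cluster-health flag
       * (pushed out by the MGS in this proposal) before evicting a client.
       * All names are hypothetical.
       */
      #include <stdbool.h>
      #include <time.h>

      struct cluster_state {
              bool cs_all_targets_healthy;    /* distributed by the MGS */
      };

      struct waiting_lock {
              time_t wl_deadline;             /* cancel timeout expiry */
      };

      /*
       * Evict only if the cancel timeout expired AND no target in the
       * filesystem is down or in recovery; otherwise keep waiting, so a
       * client stuck behind an unhealthy target is not evicted for it.
       */
      static bool ldlm_should_evict(const struct cluster_state *cs,
                                    const struct waiting_lock *lock, time_t now)
      {
              if (now < lock->wl_deadline)
                      return false;
              if (!cs->cs_all_targets_healthy)
                      return false;   /* defer eviction while unhealthy */
              return true;
      }

      int main(void)
      {
              struct cluster_state cs = { .cs_all_targets_healthy = false };
              struct waiting_lock lock = { .wl_deadline = 0 };  /* long expired */

              /* Timeout is past, but the cluster is unhealthy: no eviction. */
              return ldlm_should_evict(&cs, &lock, time(NULL)) ? 1 : 0;
      }
      {code}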

      Notably, this is only half of the problem. The other half is that if we allow locks to be stuck indefinitely while servers are in failover, all available server threads would quickly end up stuck waiting for such locks.

      To address this second half, I propose that we mark resources that encountered such "stuck" locks in a special way. Any attempt to take a server-side lock on such a resource would return an error similar to EAGAIN, which would always abort the current operation processing and result in the EAGAIN (or similar) status being returned to the client, to be retried at some later time.
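
      A rough sketch of that resource marking, again with invented names (RES_FL_STUCK, server_intent_enqueue); the point is just that an enqueue against a flagged resource fails fast with -EAGAIN instead of parking a server thread:

      {code:c}
      /*
       * Sketch: resources that have seen a "stuck" lock are flagged, and any
       * server-side/intent enqueue against such a resource returns -EAGAIN
       * immediately so the operation is aborted and retried by the client
       * later.  The flag and helper names are made up for illustration.
       */
      #include <errno.h>
      #include <stdio.h>

      #define RES_FL_STUCK    0x01    /* resource has an uncancellable lock */

      struct res_sketch {
              unsigned int lr_flags;
      };

      /*
       * Called when a lock on this resource blows its cancel timeout but the
       * holder cannot be evicted because the cluster is unhealthy.
       */
      static void resource_mark_stuck(struct res_sketch *res)
      {
              res->lr_flags |= RES_FL_STUCK;
      }

      /*
       * Server-side enqueue for intent operations: bail out early so -EAGAIN
       * propagates back to the client for a later retry, rather than tying up
       * a server thread waiting on the stuck lock.
       */
      static int server_intent_enqueue(struct res_sketch *res)
      {
              if (res->lr_flags & RES_FL_STUCK)
                      return -EAGAIN;

              /* ... normal grant/wait processing would go here ... */
              return 0;
      }

      int main(void)
      {
              struct res_sketch res = { 0 };

              resource_mark_stuck(&res);
              printf("enqueue rc = %d\n", server_intent_enqueue(&res));
              return 0;
      }
      {code}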

      It is OK to allow normal (non-intent) client lock requests on such resources, since those do not plug a server thread: the lock is simply returned to the client ungranted. (Granted, there are timeout implications here too, so it might also be possible to universally turn away any lock requests on such resources.)

      And lastly, there's a potential AT (adaptive timeouts) consideration: we might want to exclude the elevated request processing times seen while the cluster is unhealthy, though considering that AT self-corrects within 600 seconds, it might not be as important.


      People

        Assignee: WC Triage (wc-triage)
        Reporter: Oleg Drokin (green)
        Votes: 0
        Watchers: 5

      Dates

        Created:
        Updated: