Lustre / LU-17435

improved reliability in the face of intermittent network errors

Details

    • Improvement
    • Resolution: Unresolved
    • Minor
    • None
    • Lustre 2.14.0, Lustre 2.16.0
    • None
    • 3

    Description

      When client network interfaces are unstable, the reliability of the overall filesystem could be improved by tracking the history of RPC timeouts and resends to each peer NID. That history would determine how long a node should wait before resending (or not resending) an RPC to that peer, and when to give up completely even when the peer is partially responsive.
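      The per-peer history described above could be modeled roughly as follows. This is only a sketch in Python, not Lustre code: the class `PeerHistory`, its parameters (`window`, `base_delay`, `max_delay`, `give_up_ratio`), and the exponential backoff rule are all assumptions chosen for illustration, and the issue itself does not specify any of them.

```python
from collections import defaultdict, deque


class PeerHistory:
    """Hypothetical per-NID record of recent RPC outcomes (not Lustre code)."""

    def __init__(self, window=8, base_delay=1.0, max_delay=30.0, give_up_ratio=0.75):
        self.window = window                # how many recent RPCs to remember per peer
        self.base_delay = base_delay        # resend delay (seconds) for a healthy peer
        self.max_delay = max_delay          # upper bound on the resend delay
        self.give_up_ratio = give_up_ratio  # timeout fraction at which we stop resending
        self._outcomes = defaultdict(lambda: deque(maxlen=window))

    def record(self, nid, timed_out):
        """Remember whether the latest RPC to this peer timed out."""
        self._outcomes[nid].append(bool(timed_out))

    def timeout_ratio(self, nid):
        """Fraction of remembered RPCs to this peer that timed out."""
        outcomes = self._outcomes.get(nid)
        if not outcomes:
            return 0.0
        return sum(outcomes) / len(outcomes)

    def resend_delay(self, nid):
        """Assumed rule: double the delay for each consecutive recent timeout."""
        streak = 0
        for timed_out in reversed(self._outcomes.get(nid, ())):
            if not timed_out:
                break
            streak += 1
        return min(self.base_delay * (2 ** streak), self.max_delay)

    def should_give_up(self, nid):
        """Give up once a full window is mostly timeouts, even if some RPCs succeeded."""
        outcomes = self._outcomes.get(nid, ())
        return len(outcomes) == self.window and self.timeout_ratio(nid) >= self.give_up_ratio


history = PeerHistory()
flaky = "10.0.0.5@tcp"  # example NID, made up for this sketch
for timed_out in (True, True, False, True, True, True, True, True):
    history.record(flaky, timed_out)
```

      With this pattern the peer answered one RPC out of eight, so it is "partially responsive", yet `should_give_up(flaky)` is true because the timeout ratio is above the assumed threshold.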

      On the server side, if there are repeated RPC timeouts to a client that succeed with a resend (e.g. blocking AST), we might consider reducing the RPC timeout for that client and resending more often. Eventually the server could evict the client if it is repeatedly unresponsive to lock callbacks (while other clients are responsive during the same time period), even if the client eventually replies. While this would be "unfair" to that client, it would put the burden of bad behavior on that client instead of on other well-behaved clients that are also accessing the filesystem. It would also make it more obvious that there is a problem with a specific node, instead of hard-to-debug timeout issues distributed across all nodes in the cluster.
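      One way to picture the server-side decision is the sketch below. It is written in Python for brevity and is not Lustre code: the function `callback_policy`, the `slow_counts` mapping, and every threshold (`base_timeout`, `min_timeout`, `evict_after`, `healthy_max`) are hypothetical values chosen for illustration.

```python
def callback_policy(slow_counts, nid, base_timeout=20.0, min_timeout=5.0,
                    evict_after=4, healthy_max=1):
    """Return (timeout_seconds, evict) for the next lock callback sent to nid.

    slow_counts maps each client NID to the number of callbacks in the current
    period that timed out and only succeeded on resend (assumed bookkeeping).
    """
    mine = slow_counts.get(nid, 0)
    others = [count for peer, count in slow_counts.items() if peer != nid]

    # Halve the timeout for every slow callback, so a flaky client is retried sooner.
    timeout = max(base_timeout / (2 ** mine), min_timeout)

    # Evict only when this client stands out: if other clients are also slow,
    # the cause is more likely the server or the network than this one node.
    others_healthy = all(count <= healthy_max for count in others)
    evict = mine >= evict_after and others_healthy
    return timeout, evict


counts = {"a@tcp": 4, "b@tcp": 0, "c@tcp": 1}  # example NIDs, made up
```

      Here `callback_policy(counts, "a@tcp")` shortens the timeout to the floor and flags the client for eviction, while the same count on every client would not evict anyone.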

      We might also consider implementing a "deny list" to block specific client NIDs from connecting to the filesystem. This could be used by the peer history mechanism to semi-permanently (at least until reboot) block client NIDs from reconnecting to the filesystem after eviction, so that they are not flapping their Lustre mountpoint, but are "hard down". This might be implemented as part of LU-17217 "Allow server to control client connections".

      On the client side, a denied connection should produce a clear "refused connection" error message, similar to the case when a very old client connects to a newer server.
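      The deny list and the client-visible refusal could fit together roughly as sketched here. This is a Python illustration only: the `DenyList` class, its method names, the choice of `ECONNREFUSED`, and the message wording are all assumptions, and nothing here reflects the design of LU-17217.

```python
import errno


class DenyList:
    """Hypothetical in-memory deny list, keyed by client NID.

    Keeping it only in memory means it is cleared by a restart, which
    matches the "at least until reboot" lifetime suggested above.
    """

    def __init__(self):
        self._denied = {}

    def deny(self, nid, reason):
        """Block a NID, e.g. after the peer history triggered an eviction."""
        self._denied[nid] = reason

    def allow(self, nid):
        """Remove a NID from the list (no error if it was not listed)."""
        self._denied.pop(nid, None)

    def check_connect(self, nid):
        """Return (errno_value, message); 0 means the connection may proceed."""
        reason = self._denied.get(nid)
        if reason is None:
            return 0, "ok"
        message = f"refused connection: {nid} is on the server deny list ({reason})"
        return errno.ECONNREFUSED, message


denylist = DenyList()
denylist.deny("10.0.0.5@tcp", "evicted after repeated lock callback timeouts")
code, message = denylist.check_connect("10.0.0.5@tcp")
```

      Returning both an error code and a human-readable reason lets the client log an explicit message instead of retrying the mount indefinitely.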

            People

              wc-triage WC Triage
              adilger Andreas Dilger