Details
Improvement
Resolution: Unresolved
Minor
Description
Currently, the server may initiate a connection to a client if the existing connection has been lost and the server needs to notify the client, for example to deliver a DLM lock callback (AST). This can cause problems in cloud environments, because the server and client may not belong to the same virtual network.
The current workaround is to configure the firewall to allow connections from the server to the client. @Andreas shared this PR with me: https://review.whamcloud.com/41021 ("LU-14224 misc: add firewalld service configuration"), which makes this a little easier.
It would be helpful to avoid server-initiated connections entirely. That means clients should keep their connections alive for as long as they hold locally cached resources. If the server detects that a connection is lost, it should wait for the client to reconnect. LNet should notify the server when the connection is restored, in order to reduce latency.
As long as the server does not immediately evict a client that has lost its connection, the server will try to resend the DLM lock callback (AST) after a timeout. However, this can take tens or potentially hundreds of seconds (depending on the configured/adaptive timeout), during which time the server may be blocking access to the filesystem for other clients trying to access that lock.
In addition to LU-17493, which is trying to reduce the impact of clients with flaky networks by allowing the server to unilaterally cancel DLM locks, a mechanism for LNet to notify/wake the Lustre service thread when the client restores the connection would allow the lock callback to be resent immediately instead of waiting for the timeout to expire.
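To illustrate the difference between waiting out the resend timeout and being woken on reconnect, here is a minimal Python sketch. It is not Lustre or LNet code: `PendingCallback`, `notify_reconnected`, and `wait_to_resend` are hypothetical names, and the timeout values are placeholders far shorter than the tens or hundreds of seconds mentioned above. The thread that would resend the callback blocks on an event with a timeout, and the (simulated) network layer sets that event as soon as the client reconnects.

```python
import threading
import time


class PendingCallback:
    """Hypothetical stand-in for a lock callback that is waiting to be resent."""

    def __init__(self, resend_timeout):
        # Placeholder for the configured/adaptive timeout, in seconds.
        self.resend_timeout = resend_timeout
        self._reconnected = threading.Event()

    def notify_reconnected(self):
        """Called by the (hypothetical) network layer when the client reconnects."""
        self._reconnected.set()

    def wait_to_resend(self):
        """Block until reconnect or timeout; return (reason, seconds_waited)."""
        start = time.monotonic()
        # Event.wait() returns True if the event was set, False on timeout.
        woken = self._reconnected.wait(timeout=self.resend_timeout)
        waited = time.monotonic() - start
        return ("reconnect" if woken else "timeout", waited)


def demo():
    """Simulate a client that reconnects 0.1 s into a 2 s resend timeout."""
    cb = PendingCallback(resend_timeout=2.0)
    timer = threading.Timer(0.1, cb.notify_reconnected)
    timer.start()
    reason, waited = cb.wait_to_resend()
    timer.join()
    return reason, waited


if __name__ == "__main__":
    reason, waited = demo()
    print(f"resend triggered by {reason} after {waited:.2f}s")
```

With the notification in place, the resend happens shortly after the reconnect; without it, the waiter would always sit through the full timeout, which is the latency this ticket aims to remove.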