Details
- Bug
- Resolution: Fixed
- Minor
- Lustre 2.15.0
- 3
Description
When a server generates a blocking AST, it must send a request to the client holding the lock so that the lock can be handed over to another client. If the client holding the lock fails to respond to that request, it gets evicted. When Kerberos or SSK is enabled, all client-to-server communications must be authenticated, and so must server-to-client communications such as this blocking AST.

With Kerberos or SSK, client nodes are responsible for establishing a valid GSS context that is used to communicate with servers. Client nodes manage their GSS contexts, and servers associate with each client context what is called a reverse context. The expiration time of the client context is set slightly earlier than the Kerberos ticket expiration time or the SSK context expiration time. The expiration time of the reverse context is set to exactly the Kerberos ticket expiration time or the SSK context expiration time, so the reverse context outlives its client counterpart.

It is up to the client to refresh its GSS context in a timely fashion, and doing so updates the reverse context on the server side. Under normal operation, every time the client needs to communicate with a server, it checks the validity of the corresponding GSS context and, if the context has expired, renews it before contacting the server as it normally would. This is how the server learns about the context renewal and updates its reverse context. In the absence of I/O, the obd ping requests provide the opportunity to keep contexts up to date.
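To make the relationship between the two expiration times concrete, here is a minimal Python model of it. This is only a sketch of the behaviour described above, not Lustre code: the names `GssContextPair`, `establish` and `client_send` are hypothetical, and the 10-second margin is the client/reverse difference quoted in this ticket.

```python
from dataclasses import dataclass

# Assumption: margin between client context and reverse context expiry,
# taken from the 10-second figure quoted in this ticket.
CLIENT_EXPIRY_MARGIN_S = 10.0


@dataclass
class GssContextPair:
    """Hypothetical model of a client context and its server-side reverse context."""
    client_expiry: float   # client-side GSS context expiration (seconds)
    reverse_expiry: float  # server-side reverse context expiration (seconds)


def establish(credential_expiry: float) -> GssContextPair:
    """Model a context negotiation from a Kerberos ticket or SSK expiry time."""
    return GssContextPair(
        client_expiry=credential_expiry - CLIENT_EXPIRY_MARGIN_S,
        reverse_expiry=credential_expiry,
    )


def client_send(pair: GssContextPair, now: float,
                new_credential_expiry: float) -> GssContextPair:
    """Before sending a request, the client renews an expired context.

    The renewal is what refreshes the reverse context on the server side.
    """
    if now >= pair.client_expiry:
        return establish(new_credential_expiry)
    return pair
```

In this model the reverse context is never refreshed on its own: `reverse_expiry` only moves forward when `client_send` triggers a renewal.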
A consequence of the reverse context design is that servers cannot refresh their reverse contexts by themselves; a refresh can only result from a client context refresh. It is therefore crucial that clients refresh their GSS contexts in time. If they fail to do so, the following situation can arise: when a server needs to send a blocking AST and its reverse context has expired, it cannot communicate with the client that holds the lock. Because the request cannot be sent, the server declares that the blocking AST failed and evicts the client.
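The failure path can be summarised with a small Python sketch. It is a deliberately simplified model of the decision described above, assuming the only input is the reverse context expiration time; `AstResult` and `send_blocking_ast` are hypothetical names and do not correspond to actual server functions.

```python
from enum import Enum


class AstResult(Enum):
    SENT = "sent"
    CLIENT_EVICTED = "client_evicted"


def send_blocking_ast(reverse_expiry: float, now: float) -> AstResult:
    """Simplified server-side decision when a blocking AST must be sent.

    Assumption: an expired reverse context means the request cannot be
    sent at all, which the server treats as a failed blocking AST.
    """
    if now >= reverse_expiry:
        # The server has no way to refresh the reverse context itself,
        # so the AST fails and the lock holder is evicted.
        return AstResult.CLIENT_EVICTED
    return AstResult.SENT
```

Note that the client is evicted here without having done anything wrong at the lock level; it simply did not renew its context before the reverse context ran out.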
As explained, the difference in validity duration between the client-side context and the server-side reverse context usually provides a safety net, i.e. a time window that would normally leave plenty of time to refresh the context. This time difference is 10 seconds, but under some circumstances it can turn out to be too small.
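As a rough illustration of why 10 seconds may not be enough, the following Python sketch computes how long the server would be left with an expired reverse context. It assumes the refresh only happens when the client sends its first request after its own context has expired (for example an obd ping); the function name and the timings used are hypothetical.

```python
def reverse_context_exposure(client_expiry: float,
                             first_request_after_expiry: float,
                             margin: float = 10.0) -> float:
    """Seconds during which the reverse context is expired on the server.

    Assumptions: the reverse context expires `margin` seconds after the
    client context, and it is only refreshed by the first client request
    sent at or after `client_expiry`.
    """
    if first_request_after_expiry < client_expiry:
        raise ValueError("request must occur at or after client context expiry")
    reverse_expiry = client_expiry + margin
    return max(0.0, first_request_after_expiry - reverse_expiry)
```

For example, with a client context expiring at t=990, a request at t=995 refreshes in time (exposure 0), whereas a first request at t=1015 leaves a 15-second interval during which a blocking AST would fail.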