Details
Improvement
Resolution: Unresolved
Major
Lustre 2.14.0, Lustre 2.16.0, Lustre 2.17.0
Description
In the old days of Catamount on ASCI Red, liblustre ran in the Catamount OS, which did not have any CPU interrupts. That meant any server-to-client requests (such as DLM lock cancellations) had to be handled asynchronously on the client, whenever the application yielded the processor to filesystem administrative tasks.
In that environment, the server would assume that a DLM lock was cancelled as soon as the AST was sent for a lock with LDLM_FL_CANCEL_ON_BLOCK set, rather than waiting for the client to reply to the AST and actually cancel the lock. This avoided potentially significant delays for servers granting locks.
In large clusters, some locks are invariably highly contended (e.g. on ROOT/, /home/ or /project directories), either because many clients are holding a read lock and some client wants to modify the directory, or because of conflicting workloads (e.g. "ls -l" or "rm" in a directory (or directory tree) that is actively in use by other clients). If any client holding a contended lock has a problem, for example LU-17453/LU-17476, then other nodes accessing that lock may block for tens or hundreds of seconds until it is cancelled or the client is evicted.
It would be useful if LDLM_FL_CANCEL_ON_BLOCK were used for such highly-contended resources when locks are requested in LCK_PR mode, so that the server can send asynchronous ASTs to all clients, then cancel the DLM locks rapidly and perform the required operation without getting blocked by unresponsive clients. Any responsive client will receive the AST and will not even need to send the cancel RPC, while unresponsive clients are already unlikely to know or care whether the server sent the AST, so they will have an inconsistent local state until they next contact the server (as they already do today).
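To make the proposed behaviour concrete, here is a minimal toy model in C of the server-side decision. It is only a sketch: the struct, the enum, the flag value and the function name are all hypothetical stand-ins invented for illustration and are not the real Lustre LDLM definitions or code paths.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; NOT the real Lustre definitions. */
enum sketch_mode { SKETCH_PR, SKETCH_PW, SKETCH_EX };
#define SKETCH_FL_CANCEL_ON_BLOCK 0x1u

struct sketch_lock {
    enum sketch_mode mode;
    unsigned int flags;
    bool client_responsive; /* would this client reply to the AST? */
    bool ast_sent;
    bool granted;           /* server still counts the lock as held */
};

/*
 * Send an AST for every granted lock on a contended resource.
 * Returns the number of locks that would keep the server blocked
 * until a timeout or eviction.
 */
static size_t sketch_revoke_conflicts(struct sketch_lock *locks, size_t n)
{
    size_t stuck = 0;

    for (size_t i = 0; i < n; i++) {
        struct sketch_lock *lk = &locks[i];

        if (!lk->granted)
            continue;
        lk->ast_sent = true;

        if (lk->mode == SKETCH_PR &&
            (lk->flags & SKETCH_FL_CANCEL_ON_BLOCK)) {
            /* Assumed behaviour: treat the lock as cancelled on send. */
            lk->granted = false;
        } else if (lk->client_responsive) {
            /* Models the client's cancel reply arriving normally. */
            lk->granted = false;
        } else {
            /* Server has to wait for this client (or evict it). */
            stuck++;
        }
    }
    return stuck;
}
```

In this model an unresponsive client only delays the server when its lock was granted without the flag, which is the delay the proposal aims to avoid for highly-contended resources.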
This could potentially also be tied into "ls" (readdir()) being able to run with "LDLM_FL_CANCEL_ON_BLOCK" locks, or no DLM locks at all on the directory or inodes. Per comments in LU-3308, POSIX does not require readdir() to be fully cache coherent even among processes on the same node, only that the readdir cache is reset with rewinddir() and close().
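For reference, the directory-stream pattern in question can be sketched with the standard POSIX calls opendir(), readdir(), rewinddir() and closedir(). The helper names below are hypothetical. Under POSIX, whether an entry added or removed after the most recent opendir() or rewinddir() shows up in a later readdir() is unspecified, so the two passes are not guaranteed to agree if the directory changes between them.

```c
#include <dirent.h>
#include <stddef.h>

/* Count the entries remaining in an open directory stream. */
static long sketch_count_entries(DIR *dir)
{
    long n = 0;

    while (readdir(dir) != NULL)
        n++;
    return n;
}

/*
 * Scan a directory twice through the same stream, resetting it with
 * rewinddir() in between. Returns 0 on success, -1 on failure.
 */
static int sketch_scan_twice(const char *path, long *first, long *second)
{
    DIR *dir = opendir(path);

    if (dir == NULL)
        return -1;

    *first = sketch_count_entries(dir);
    rewinddir(dir); /* resets the stream to the start of the directory */
    *second = sketch_count_entries(dir);

    return closedir(dir);
}
```

An application that needs a fresh view of the directory has to call rewinddir() or reopen it, which is why weaker locking for readdir() may be acceptable here.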