Motivation: DLM lock overhead and scalability problem
In LU-16365, we discussed the significant overhead in the LDLM hash code. When performing "ls -l" to list the files in a large directory, the elapsed time remains roughly constant between calls (cached vs. uncached); the cached runs (the 2nd or 3rd "ls -l") were even slower than the uncached run (the 1st "ls -l"). After analyzing the traces, we found that when a node manages more than 100K locks, looking up the lock handle and resource by hash takes a large amount of time. Managing and searching LDLM locks has a scalability problem in Lustre, and as the number of DLM locks managed on a node grows, the problem becomes more severe.
Design and implementation
In this section, we propose Lustre with multiple consistency levels (MCL), which exposes the consistency/performance trade-off to the programmer or application. We use timeout-based consistency to provide relaxed cache semantics for Lustre.
Attribute and dentry caching
Our timeout-based consistency is similar to the implementation in NFS. Attributes and directory entries are cached for a duration determined by the client. At the first use after the predefined timeout expires, the client queries the server to see whether the file system object has changed. If the server reports that the file object has been deleted or its permissions have changed, the client invalidates the attribute cache or removes the dentry from the dcache.
In timeout-based consistency, data in the cache is expired after a specified timeout period, regardless of whether it has been updated at the back-end or not.
Timeout-based consistency suffers from the disadvantage of increased latency and network message overheads as the cache needs to validate the data with the back-end or fetch any modified data.
A timeout-based caching approach forces a client to release a potentially correct cache entry because it cannot be certain of its validity.
A timeout also allows a client to maintain (and return) an incorrect cache entry for a period of time, which can confuse applications or users.
The main advantage of timeout-based consistency is its simplicity. In this protocol, clients poll the server to find out when the file or directory was last modified and use that to determine whether the cached version is still valid. This scheme cannot keep caches fully coherent. However, it is simple in that servers keep no lock state and need to do nothing when a failure occurs.
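As an illustration, the client-side revalidation could look like the following sketch. This is a minimal user-space model rather than Lustre code: the structure name mcl_cached_attr, its fields, and the server_fetch_attr() helper are hypothetical; only the decision flow matters here, i.e. trust the cached attributes while they are fresh, otherwise revalidate with the server and either refresh or invalidate the entry.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical cached attribute entry; names and fields are illustrative only. */
struct mcl_cached_attr {
	struct timespec	fetched_at;	/* when the attributes were cached */
	long		timeout_sec;	/* client-chosen cache lifetime */
	bool		valid;		/* entry still usable? */
	/* ... cached mode, size, mtime, etc. ... */
};

/* Placeholder for an RPC to the server; returns 0 on success, nonzero if the
 * object was deleted or its permissions changed. */
int server_fetch_attr(const char *path, struct mcl_cached_attr *out);

/* Revalidate a cached entry before use, NFS-style timeout consistency. */
int mcl_revalidate(const char *path, struct mcl_cached_attr *attr)
{
	struct timespec now;

	clock_gettime(CLOCK_MONOTONIC, &now);

	/* Within the timeout window: trust the cache, no server round trip. */
	if (attr->valid &&
	    now.tv_sec - attr->fetched_at.tv_sec < attr->timeout_sec)
		return 0;

	/* Expired: ask the server whether the object changed or was removed. */
	if (server_fetch_attr(path, attr) != 0) {
		attr->valid = false;	/* drop attrs / remove dentry from dcache */
		return -1;
	}

	attr->fetched_at = now;
	attr->valid = true;
	return 0;
}
```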
Concurrency control for data I/O
NFS uses the technique of close-to-open (CTO) consistency for data caching and concurrency control on the client. It has provided sufficient consistency for most applications and users.
The main initial aim of the relaxed consistency is to optimize metadata performance for Lustre. Timeout-based consistency is therefore mainly used for metadata caching. For data I/O, the original locking protocol can still be used for concurrency control with strong consistency: extent DLM locking for data on OSTs, and DoM ibits locking for DoM files.
We can also implement a locking mechanism similar to the NFS delegation feature. In NFS, by granting a file delegation, the server voluntarily cedes control of operations on the file to a client for the duration of the client lease or until the delegation is recalled. When a file is delegated, all file access and modification requests can be handled locally by the client without sending any network requests to the server [5]. When a file is referenced by a single client, responsibility for handling all of the OPEN, CLOSE, READ and WRITE locking operations may be delegated to that client by the server. Since the server, when granting a delegation, guarantees the client that there can be no conflicting operations, the cached data is assumed valid. This can borrow the existing DLM ibits locking mechanism used for DoM. For read, the server grants <PR, OPEN|DATA> to the client; for write, the server grants <PW, OPEN|DATA> to the client. Here the DATA ibit lock is similar to the DoM ibit lock; the main difference is that the DoM lock can only be used for DoM files, while the DATA ibit lock can be used for all data layouts (DoM and data on OSTs).
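To make the grant concrete, here is a minimal sketch of how an open could be mapped to such a lock. The existing inodebits values mirror the Lustre headers, but the MDS_INODELOCK_DATA bit, its value, and the helper/enum names below are assumptions for illustration only.

```c
#include <fcntl.h>

/* Existing Lustre inodebits for context, plus the proposed DATA bit. */
#define MDS_INODELOCK_LOOKUP	0x000001
#define MDS_INODELOCK_UPDATE	0x000002
#define MDS_INODELOCK_OPEN	0x000004
#define MDS_INODELOCK_DOM	0x000040
#define MDS_INODELOCK_DATA	0x000100	/* proposed DATA ibit (assumed value) */

enum mcl_lock_mode { MCL_LCK_PR, MCL_LCK_PW };	/* stand-ins for LCK_PR/LCK_PW */

struct mcl_lock_req {
	enum mcl_lock_mode	mode;
	unsigned long		ibits;
};

/*
 * Map the open flags to the delegation-style lock the server would grant:
 *   read-only open  -> <PR, OPEN|DATA>
 *   write/rdwr open -> <PW, OPEN|DATA>
 * Unlike the DOM ibit, the DATA ibit is meant to cover any data layout.
 */
struct mcl_lock_req mcl_delegation_for_open(int open_flags)
{
	struct mcl_lock_req req = {
		.ibits = MDS_INODELOCK_OPEN | MDS_INODELOCK_DATA,
	};

	if ((open_flags & O_ACCMODE) == O_RDONLY)
		req.mode = MCL_LCK_PR;
	else
		req.mode = MCL_LCK_PW;

	return req;
}
```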
This DATA ibit lock can be piggybacked to the client on the open request, protecting all subsequent data access and thereby eliminating the lock traffic. It allows common patterns of limited sharing and read-only sharing to be handled efficiently, avoiding the extra latency associated with frequent communication with the server. When these access patterns are broken or no longer hold, i.e. the file is accessed with conflicts by multiple clients, the DATA ibit lock can be revoked and the normal client-side caching logic is used.
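The conflict path could look roughly like the sketch below. The structure and helper names are hypothetical, and the "blocking callback" stands in for the real DLM blocking AST machinery; the point is only that any conflicting open from another client recalls the DATA ibit lock and falls back to the normal caching and locking path.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical record of an outstanding DATA ibit lock (illustrative only). */
struct mcl_data_lock {
	uint64_t	owner_client;	/* client holding <PR|PW, OPEN|DATA> */
	bool		write_mode;	/* PW (write) vs PR (read) */
};

/* Placeholder for sending a blocking callback so the holder cancels the lock. */
int mcl_send_blocking_callback(uint64_t client_id);

/*
 * On a new open from another client, decide whether the piggybacked
 * DATA ibit lock must be revoked: a writer conflicts with anyone, and
 * a new writer conflicts with an existing reader.
 */
int mcl_check_conflict(struct mcl_data_lock *lck, uint64_t new_client,
		       bool new_is_write)
{
	if (lck->owner_client == new_client)
		return 0;			/* same client: no conflict */

	if (lck->write_mode || new_is_write)
		return mcl_send_blocking_callback(lck->owner_client);

	return 0;				/* two readers can share PR */
}
```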
A read delegation (DATA ibit lock) is awarded by the server to a client on a file OPENed for reading (one that does not deny read access to others). The decision to award a delegation can be made by the server based on a set of conditions that take into account the recent history of the file, or the client can request it explicitly. For example, the read delegation could be awarded on the second OPEN by the same client.
Similar to read delegations, write delegations are awarded by the server when a client opens a file for write (or read/write) access. While the delegation is held, all OPEN, READ, WRITE, CLOSE, LOCK, GETATTR and SETATTR requests for the file can be handled locally by the client.
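One possible server-side heuristic, following the "second OPEN by the same client" example above, is sketched here. The open-history structure and counters are hypothetical; a real policy would also have to account for conflicting opens and lock state on the server.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-file open history tracked on the MDS (illustrative only). */
struct mcl_open_history {
	uint64_t	last_client_id;		/* client that opened most recently */
	unsigned int	consecutive_opens;	/* opens in a row by that client */
	bool		other_client_active;	/* conflicting opens outstanding? */
};

enum mcl_deleg { MCL_DELEG_NONE, MCL_DELEG_READ, MCL_DELEG_WRITE };

/*
 * Decide whether to award a delegation (DATA ibit lock) on OPEN.
 * Policy sketched here: award on the second consecutive OPEN by the
 * same client, as long as no other client currently has the file open.
 */
enum mcl_deleg mcl_maybe_delegate(struct mcl_open_history *h,
				  uint64_t client_id, bool open_for_write)
{
	if (h->last_client_id == client_id) {
		h->consecutive_opens++;
	} else {
		h->last_client_id = client_id;
		h->consecutive_opens = 1;
	}

	if (h->other_client_active || h->consecutive_opens < 2)
		return MCL_DELEG_NONE;

	return open_for_write ? MCL_DELEG_WRITE : MCL_DELEG_READ;
}
```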
How is recovery handled when switching between DATA ibit locking and extent DLM locking for data I/O?
Client-side metadata writeback with relaxed consistency
To simplify the implementation, all directories are created on the MDT synchronously via reint operations. For regular files under a directory, client-side write-back caching of metadata is used to deliver very high throughput. As in MetaWBC, the regular files under a directory are first created in the client-side embedded memory file system (MemFS). After more than a certain number of files (e.g. 1024) have been created in MemFS, the client can flush the dirty metadata to the server asynchronously in a batched manner. The dirty inodes for regular files can also be checked and flushed to the MDT periodically via the kernel writeback mechanism.
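A minimal sketch of the flush trigger is shown below. The names (mcl_memfs_dir, mcl_batch_flush) and the per-directory counter are assumptions for illustration; the threshold of 1024 comes from the example above, and a real implementation would also hook into the kernel's periodic writeback.

```c
#include <stddef.h>

#define MCL_BATCH_CREATE_THRESHOLD	1024	/* example threshold from the text */

/* Hypothetical per-directory state for files created in MemFS. */
struct mcl_memfs_dir {
	unsigned int	dirty_creates;	/* files created but not yet on the MDT */
};

/* Placeholder: send one batched create RPC carrying all dirty entries. */
int mcl_batch_flush(struct mcl_memfs_dir *dir);

/*
 * Called after each local (MemFS) file creation.  Once the threshold of
 * files has accumulated, flush them to the MDT in one batched RPC;
 * otherwise leave them for the periodic writeback to pick up.
 */
int mcl_note_create(struct mcl_memfs_dir *dir)
{
	dir->dirty_creates++;

	if (dir->dirty_creates >= MCL_BATCH_CREATE_THRESHOLD) {
		int rc = mcl_batch_flush(dir);

		if (rc == 0)
			dir->dirty_creates = 0;
		return rc;
	}

	return 0;
}
```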
This metadata writeback strategy supports efficient batched creations, which benefits the IO500 mdtest-easy and mdtest-hard-write|read phases. For mdtest-hard-write, each sub-request in the batched creation RPC can return a <PW, OPEN|DATA> ibits lock to the client, and the client can then batch small writes from multiple files to send to the OST (if data is on OSTs) or the MDT (if data is on the MDT). Alternatively, we can batch the creation and the 3901-byte data payload for multiple DoM-only files to the server in a highly efficient way. However, global file system semantics may no longer be guaranteed, and it relies on the applications themselves to resolve access conflicts and cache consistency.
Capabilities and flags for relaxed consistency
Like CephFS, we can also define various capabilities for directories and files in Lustre.
Only a directory marked with LUSTRE_RELAXED_FL will be created and accessed with relaxed consistency. This flag is stored in the LMA xattr on the MDT, and all sub-files under this directory inherit the LUSTRE_RELAXED_FL flag and are accessed with relaxed consistency.
We can convert a directory from strong consistency to relaxed consistency level by level. It just needs to take a full EX lock on the directory to clear all DLM locks on it, set LUSTRE_RELAXED_FL on the directory, and then release the full EX lock. After that, all I/O under this directory can be performed in relaxed consistency mode.
To convert a directory from relaxed consistency back to the original strong consistency, it must first clear the LUSTRE_RELAXED_FL flag (level by level) and then wait for the maximal timeout period (the lease) so that all timeout-based caches on the clients are invalidated. After that, all data and metadata I/O under the directory operates with the original strong consistency.
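The two conversion paths could be sketched as below. Everything here is illustrative: the flag value, the lock and LMA helpers, and the lease handling are assumptions, and error handling plus level-by-level recursion over the directory tree is omitted.

```c
#include <unistd.h>

#define LUSTRE_RELAXED_FL	0x1	/* assumed flag value, stored in the LMA xattr */

/* Placeholders for the real MDT-side primitives (names are hypothetical). */
int  mdt_take_full_ex_lock(const char *dir);	/* cancels all DLM locks on dir */
void mdt_release_full_ex_lock(const char *dir);
int  mdt_set_lma_flag(const char *dir, unsigned int flag);
int  mdt_clear_lma_flag(const char *dir, unsigned int flag);

/* strong -> relaxed: take the full EX lock, set the flag, drop the lock. */
int mcl_convert_to_relaxed(const char *dir)
{
	int rc = mdt_take_full_ex_lock(dir);

	if (rc)
		return rc;
	rc = mdt_set_lma_flag(dir, LUSTRE_RELAXED_FL);
	mdt_release_full_ex_lock(dir);
	return rc;
}

/* relaxed -> strong: clear the flag, then wait out the maximal client lease. */
int mcl_convert_to_strong(const char *dir, unsigned int max_timeout_sec)
{
	int rc = mdt_clear_lma_flag(dir, LUSTRE_RELAXED_FL);

	if (rc)
		return rc;
	sleep(max_timeout_sec);	/* all timeout-based client caches have expired */
	return 0;
}
```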
References
[1] Vilayannur, M., Nath, P., & Sivasubramaniam, A. (2005). Providing Tunable Consistency for a Parallel File Store. In FAST '05.
[2] NFS manual page. https://linux.die.net/man/5/nfs
[3] Oplock Overview. Windows drivers documentation, Microsoft Learn. https://learn.microsoft.com/en-us/windows-hardware/drivers/ifs/oplock-overview
[4] CephFS Distributed Metadata Cache. https://docs.ceph.com/en/quincy/cephfs/mdcache/
[5] Gulati, A., Naik, M., & Tewari, R. (2007). Nache: Design and Implementation of a Caching Proxy for NFSv4. In FAST '07.
Very interesting discussion.
From a usability standpoint, I think that doing this per application is not really feasible. Lustre-specific changes are hard to get into real-life applications. Striping is usually the best we can do, and even that is not that simple. It is easier to sell a POSIX API or a new Linux-specific syscall than a Lustre-specific call. On the contrary, it is much easier for admins to change Lustre behavior per mount point, as admins more or less know what kind of apps will be running there and whether that behavior change is acceptable. I've seen more and more people using Lustre as a home directory, where NFS is typically used. So, changing a Lustre client's behavior, telling it to behave like NFS, will be almost transparent to applications. I'm not against having this controlled per app, but that should not be the only way, IMO. Per directory is acceptable.