
LU-17329: Relaxed POSIX Consistency for Lustre

Details

    • Type: New Feature
    • Resolution: Unresolved
    • Priority: Minor

    Description

      If performance is a criterion, consistency requirements for applications might be best decided by applications or users themselves. Forcing an application that has little or no sharing to use a strong or strict consistency model may lead to unnecessarily reduced I/O performance. Traditional techniques to provide strong file system consistency guarantees for both metadata and data use variants of locking. For example, Lustre and GPFS use DLM locking to implement POSIX with strong consistency. Rather than locking to enforce serialization for read-write or write-write sharing across the entire file system, we can use an optimistic concurrency control mechanism with the presumption that such sharing is rare. Avoiding distributed locking enhances the scalability and performance of the system.

      Since different applications can have different sharing behavior, designing for both performance and consistency forces the design to cater to all their needs simultaneously. Parallel cluster file systems (such as Lustre and GPFS) enforce data consistency by using byte-range distributed locking to allow simultaneous file access from multiple clients to their shared disks. Such fine-grained file locking schemes allow multiple processes to simultaneously write to different regions of a shared file. However, they also restrict scalability because of the overhead associated with maintaining the state of a large number of locks, eventually leading to performance degradation.

      In a POSIX-compliant distributed file system, the behavior of serving multiple processes on multiple client nodes should be the same as the behavior of a local file system. Lustre provides POSIX-compliant consistency. However, the POSIX consistency semantics could be carefully relaxed in some cases in order to better align with the needs of specific applications and to improve system performance. A user could define not only standard consistency policies like POSIX, but also custom policies like session, lease, and NFS, at a chosen granularity (subtree, file). A client could use several different consistency policies for different files, or even change the consistency policy for a given file at runtime, without having to restart the file system. Leaving the choice of consistency policy to the user and allowing it to be changed at runtime enables tuning performance at a very fine granularity.

      One approach for relaxing consistency is to decouple the namespace: a client can lock the subtree it wants exclusive access to in MetaWBC mode, and the file system can then optimize performance via a lockless I/O mode, merging updates. The file system could enter a mode for such a subtree in which operations are performed locally and their updates are bulk-merged at completion. This delayed merge (i.e. a form of eventual consistency) and relaxed durability improve performance and scalability by avoiding the costs of remote procedure calls (RPCs), synchronization, false sharing, and serialization.

      We present an API and framework that allow administrators to dynamically control the consistency guarantees for subtrees in the global file system namespace. Allowing different consistency semantics to co-exist in a global namespace scales further and performs better than systems that use a single POSIX consistency mode.

      Our initial goal is to improve IO500 performance by using a relaxed consistency model in Lustre similar to that of NFS.


          Activity


            adegremont_nvda Aurelien Degremont added a comment -

            Very interesting discussion.

            From a usability standpoint, I think that doing this per-application is not really feasible. Lustre-specific changes are hard to get added to real-life applications. Striping is usually the best we can do, and even that is not that simple. It is easier to sell a POSIX API or a new Linux-specific syscall than a Lustre-specific call. On the other hand, it is much easier for admins to change Lustre behavior per mount point, since admins will more or less know what kind of applications will be running there and whether that behavior change is acceptable. I've seen more and more people using Lustre as a home directory, where NFS is typically used. So changing a Lustre client's behavior, telling it to behave like NFS, will be almost transparent to applications. I'm not against having this be controllable per application, but that should not be the only way, IMO. Per directory is acceptable.


            adilger Andreas Dilger added a comment -

            Yingjin, thanks for writing this clear proposal document. I think there are a number of interesting areas here that could be explored, but I think it is important that all of these semantic changes are isolated to applications that opt in to the relaxed semantics, and do not become global defaults that affect other applications. While relaxed POSIX can be very beneficial for applications that understand what behavior they are asking for, many applications also depend on strict POSIX semantics, and we can't realistically set these on a full-mountpoint basis because it is uncommon that there is only one application running per mountpoint (unless this is done inside a single-application container mountpoint).

            Our timeout-based consistency is similar to the implementation in NFS. Attributes and directory entries are cached for a duration determined by the client.

            We already have "ldlm.namespaces.*.lru_max_age" that can be tuned on a per-mountpoint basis. Allowing applications to set a different LRU timeout for DLM locks on specific files or directory trees they are accessing (maybe via llapi_ladvise()) would be relatively straightforward to implement. This would not fundamentally change the consistency semantics of the lock handling, just expire unused locks more quickly. See also below about replacing the simple LRU with some other cache management so that frequently accessed locks (e.g. parent directory locks) are kept on the client longer even with a small lru_max_age.
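
            As an illustration only of what such a per-file request might look like: LU_LADVISE_LOCK_MAX_AGE and its argument encoding are hypothetical (no such advice type exists today), while llapi_ladvise() and struct llapi_lu_ladvise are the existing llapi entry points.

            #include <fcntl.h>
            #include <unistd.h>
            #include <lustre/lustreapi.h>

            /* Hypothetical advice type, for illustration only; not defined in Lustre. */
            #ifndef LU_LADVISE_LOCK_MAX_AGE
            #define LU_LADVISE_LOCK_MAX_AGE 0x100
            #endif

            /* Ask (hypothetically) for a shorter DLM lock lifetime on one file. */
            int set_short_lock_lifetime(const char *path, unsigned int max_age_sec)
            {
                    struct llapi_lu_ladvise advice = {
                            .lla_advice = LU_LADVISE_LOCK_MAX_AGE,
                            .lla_start  = 0,
                            .lla_end    = -1ULL,       /* through end of file */
                            .lla_value2 = max_age_sec, /* hypothetical timeout encoding */
                    };
                    int fd = open(path, O_RDONLY);
                    int rc;

                    if (fd < 0)
                            return -1;
                    rc = llapi_ladvise(fd, 0 /* flags */, 1 /* one advice entry */, &advice);
                    close(fd);
                    return rc;
            }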

            A timeout also results in a client maintaining (and returning) an incorrect cache entry for a period of time, resulting in application or user confusion.

            The main advantage of timeout-based consistency is its simplicity. In this protocol, clients poll the server to find out when the file or directory was last modified, and determine whether the cached version is valid. This scheme cannot keep caches coherent. However, it is simple in that servers keep no lock state and do nothing when a failure occurs.

            I definitely do not think it is good for clients to assume a DLM lock is valid just because it hasn't timed out yet. It would mean that the DLM LRU needs a more sophisticated data structure than just a linked list where we only check the age on the oldest lock, so that locks with a short max_age are put into a different list/tree than locks with the default max_age. At large scale, it is not practical to have clients polling servers every few seconds to refresh dozens or hundreds of DLM locks that they want to keep alive. Conversely, explicitly cancelling locks with a short lifetime is relatively inexpensive since multiple lock cancels can be piggy-backed on other RPCs.
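
            As an illustration only (the structure and names below are hypothetical, not Lustre internals), short-lifetime locks could be kept on their own age-ordered list per max_age value, so that expiry still only needs to inspect the oldest entry of each list:

            #include <stddef.h>
            #include <stdint.h>

            /* One age-ordered list per distinct max_age value (hypothetical sketch). */
            struct lock_entry {
                    uint64_t           le_last_used;  /* timestamp of last use (seconds) */
                    struct lock_entry *le_newer;      /* next (more recently used) lock */
            };

            struct lock_lru_bucket {
                    uint64_t           llb_max_age;   /* expiry age for locks in this bucket */
                    struct lock_entry *llb_oldest;    /* head: least recently used lock */
            };

            /* Expire aged-out locks: as with the current single LRU, only the oldest
             * entry needs checking, but short-lived locks no longer hide behind locks
             * that still have the default (long) max_age. */
            static void lru_bucket_expire(struct lock_lru_bucket *b, uint64_t now)
            {
                    while (b->llb_oldest != NULL &&
                           now - b->llb_oldest->le_last_used > b->llb_max_age) {
                            struct lock_entry *victim = b->llb_oldest;

                            b->llb_oldest = victim->le_newer;
                            /* a real implementation would cancel/free the lock here */
                    }
            }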

            The cached case (2nd or 3rd "ls -l") was even slower than the non-cached case (1st "ls -l"). After analyzing the traces, we found that when managing more than 100K locks on a node, looking up the lock handle and resource by hash takes a lot of time. Managing and searching LDLM locks has a scalability problem in Lustre: as the count of managed DLM locks on a node increases, the scalability issue becomes more severe.

            This sounds like a good reason to review and test patch https://review.whamcloud.com/45882 "LU-8130 ldlm: convert ldlm_resource hash to rhashtable" to see if it fixes the LDLM cache performance? Also, we currently use a simple LRU for the DLM cache, it would be useful to improve the LDLM lock cache management on the client to use a more sophisticated algorithm (this is tracked under LU-11509).

            Close-to-open cache consistency

            Normally, file sharing is completely sequential: first client A opens a file, writes something to it, then closes it; then client B opens the same file, and reads the changes.

            When an application opens a file stored on an NFS server, the NFS client checks that it still exists on the server and is permitted to the opener by sending a GETATTR or ACCESS request. When the application closes the file, the NFS client writes back any pending changes to the file so that the next opener can view the changes. This also gives the NFS client an opportunity to report any server write errors to the application via the return code from close(2). The behavior of checking at open time and flushing at close time is referred to as close-to-open cache consistency.

            Having applications request flush-on-close (and possibly also dropping the DLM lock on close) is relatively straight forward, and does not impact data consistency. This is tracked under LU-16049 and is a feature that is suitable for being set on a per-file, per-directory basis, or could be set by an application on a per-file-descriptor basis.
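
            For reference, an application can already approximate flush-on-close under plain POSIX by flushing explicitly before close() and checking both return codes; a minimal sketch (the function name is only for illustration):

            #include <fcntl.h>
            #include <unistd.h>
            #include <errno.h>

            /* Write a file and make sure the data is flushed before close(), so the
             * next opener (possibly on another client) sees the changes and any write
             * error is reported here instead of being lost. */
            int write_and_publish(const char *path, const void *buf, size_t len)
            {
                    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
                    int rc = 0;

                    if (fd < 0)
                            return -errno;
                    if (write(fd, buf, len) != (ssize_t)len)
                            rc = -EIO;
                    if (rc == 0 && fsync(fd) != 0)      /* flush dirty data and metadata */
                            rc = -errno;
                    if (close(fd) != 0 && rc == 0)      /* close() can also report errors */
                            rc = -errno;
                    return rc;
            }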

            Attribute caching

            Use the noac mount option to achieve attribute cache coherence among multiple clients. Almost every file system operation checks file attribute information. The client keeps this information cached for a period of time to reduce network and server load. When noac is in effect, a client's file attribute cache is disabled, so each operation that needs to check a file's attributes is forced to go back to the server. This permits a client to see changes to a file very quickly, at the cost of many extra network operations.

            Note that the "performance gain" of "noac" is directly a result of the timeout-based cache consistency of NFS. If attributes are cached for some fixed time, then they may be wrong for some time, and they may be cancelled too early even when still valid. The Lustre DLM ensures the correct attributes are always retrieved in the minimum time, since clients immediately know if the local attributes are valid, or if they need updated attributes from the server.

            Directory entry caching

            The Linux NFS client caches the result of all NFS LOOKUP requests. If the requested directory entry exists on the server, the result is referred to as a positive lookup result. If the requested directory entry does not exist on the server (that is, the server returned ENOENT), the result is referred to as negative lookup result.

            To detect when directory entries have been added or removed on the server, the Linux NFS client watches a directory's mtime. If the client detects a change in a directory's mtime, the client drops all cached LOOKUP results for that directory. Since the directory's mtime is a cached attribute, it may take some time before a client notices it has changed. See the descriptions of the acdirmin, acdirmax, and noac mount options for more information about how long a directory's mtime is cached.

            Caching directory entries improves the performance of applications that do not share files with applications on other clients. Using cached information about directories can interfere with applications that run concurrently on multiple clients and need to detect the creation or removal of files quickly, however. The lookupcache mount option allows some tuning of directory entry caching behavior.

            The Lustre client is currently even more strict than POSIX on caching directory contents, since it will revoke the entire directory entry contents if the directory is modified. According to POSIX, the client readdir() contents are valid until rewinddir() or close() is called. See the discussion in LU-3308 and in particular this comment, which means that the readdir() cache could be shifted from the inode to the open file descriptor if the DLM lock is revoked. That would allow "rm -r" to avoid ping-pong on the readdir contents while it is deleting entries in the directory. It should also be possible to do name->FID lookups directly from the client readdir cache (LU-10999) if this was fed into a proper data structure instead of just a linear list of entries.

            A directory marked with LUSTRE_RELAXED_FL will be created and accessed with relaxed consistency. This flag is stored in the LMA xattr on the MDT, and all sub-files under this directory inherit the LUSTRE_RELAXED_FL flag and are accessed with relaxed consistency.

            We can convert a directory with strong consistency into relaxed consistency level by level. It just needs to take a full EX lock on the directory to clear all DLM locks on the directory and set LUSTRE_RELAXED_FL on it, and then release the full EX lock. After that, all I/O under this directory can be performed in relaxed consistency mode.

            What does "relaxed" actually mean? It is not good to overload a lot of different meanings into one flag, since that makes it impossible to request only "partly" relaxed semantics. It also is not possible (or at least not safe) to change the meaning of "relaxed" in the future without breaking some clients. Instead, there should be different flags with very specific meanings assigned to each flag, and if there is a desire for "relaxed" semantics by an application then it will request multiple different flags that it understands the meaning of. If there is some common combination of flags (e.g. "NFS like") then multiple separate bits could be grouped together into a single name for convenience.

            Separately, storing a "LUSTRE_RELAXED_FL" on a directory tree may be problematic if this includes semantic changes that the filesystem cannot enforce itself (e.g. "timeout based locking"), since there may be applications accessing this directory tree (e.g. backup tools, tar, shells, etc.) that do not understand the relaxed semantics. So any persistent settings on files/directories should be for semantics that the filesystem also understands.

            I definitely agree that there is a lot of room for improvements in this area. I think each of these improvements should have its own LU ticket in Jira which describes how the change will be used and how it affects the behavior, so that it can be reviewed and prioritized separately, instead of being aggregated into a single huge ticket/patch.

            qian_wc Qian Yingjin added a comment -

            Motivation: DLM lock overhead and scalability problem

            In LU-16365, we discussed the problem that there is quite a bit of overhead in the LDLM hash code. When performing "ls -l" to list the files within a large directory, the time remains constant between calls (cached/uncached case). The cached case (2nd or 3rd "ls -l") was even slower than the non-cached case (1st "ls -l"). After analyzing the traces, we found that when managing more than 100K locks on a node, looking up the lock handle and resource by hash takes a lot of time. Managing and searching LDLM locks has a scalability problem in Lustre: as the count of managed DLM locks on a node increases, the scalability issue becomes more severe.

            Design and implementation

            In this section, we propose Lustre with multiple consistency levels (MCL), which exposes the consistency/performance trade-off to the programmer or application. We use timeout-based consistency to achieve relaxed-semantics caching for Lustre.

            Attribute and dentry caching

            Our timeout-based consistency is similar to the implementation in NFS. Attributes and directory entries are cached for a duration determined by the client. At the next use after the end of the predefined timeout, the client will query the server to see whether the file system object has changed. If the server reports that the file object has been deleted or its permissions have changed, the client will invalidate the attribute cache or remove the dentry from the dcache.
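
            As an illustration of this revalidation decision (the structure and names are hypothetical, not actual Lustre code):

            #include <stdbool.h>
            #include <stdint.h>

            /* Client-side cache entry for a file system object (illustrative only). */
            struct cached_object {
                    uint64_t co_cached_at;  /* when attrs/dentry were cached (seconds) */
                    uint64_t co_timeout;    /* relaxed-consistency cache timeout (seconds) */
                    uint64_t co_version;    /* object change counter seen at cache time */
            };

            /* Within the timeout window the cache is used without contacting the server. */
            static bool cache_within_timeout(const struct cached_object *co, uint64_t now)
            {
                    return now - co->co_cached_at < co->co_timeout;
            }

            /* After expiry the client queries the server: if the object was removed or
             * its permissions/version changed, the attributes are invalidated and the
             * dentry is dropped from the dcache; otherwise the timestamp is refreshed. */
            static bool cache_still_valid(struct cached_object *co, uint64_t now,
                                          bool exists, uint64_t server_version)
            {
                    if (!exists || server_version != co->co_version)
                            return false;
                    co->co_cached_at = now;
                    return true;
            }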

            In timeout-based consistency, data in the cache is expired after a specified timeout period, regardless of whether it has been updated at the back-end or not.

            Timeout-based consistency suffers from the disadvantage of increased latency and network message overheads as the cache needs to validate the data with the back-end or fetch any modified data.

            A timeout-based caching approach forces a client to release a potentially correct cache entry due to uncertainty about its validity.

            A timeout also results in a client maintaining (and returning) an incorrect cache entry for a period of time, resulting in application or user confusion.

            The main advantage of timeout-based consistency is its simplicity. In this protocol, clients poll the server to find out when the file or directory was last modified, and determine whether the cached version is valid. This scheme cannot keep caches coherent. However, it is simple in that servers keep no lock state and do nothing when a failure occurs.

            Concurrency control for data I/O

            NFS uses the technique of close-to-open (CTO) consistency for data caching and concurrency control on the client. It has provided sufficient consistency for most applications and users.

            The initial main aim of the relaxed consistency is to optimize metadata performance for Lustre. Timeout-based consistency is mainly used for metadata caching. For data I/O, the original locking protocol can still be used for concurrency control with strong consistency: extent DLM locking for data on OSTs and DoM ibits locking for DoM files.

            We can also implement a locking mechanism similar to the NFS delegation feature. In NFS, by granting a file delegation, the server voluntarily cedes control of operations on the file to a client for the duration of the client lease or until the delegation is recalled. When a file is delegated, all file access and modification requests can be handled locally by the client without sending any network requests to the server [5]. When a file is being referenced by a single client, responsibility for handling all of the OPEN, CLOSE, READ|WRITE and locking operations may be delegated to the client by the server. Since the server, on granting a delegation, guarantees the client that there can be no conflicting operations, the cached data is assumed valid. This can borrow the existing DLM ibits locking mechanism used for DoM. For read, the server grants <PR, OPEN|DATA> to the client; for write, the server grants <PW, OPEN|DATA> to the client. Here the DATA ibit lock is similar to the DoM ibit lock. The main difference is that the DoM lock can only be used for DoM files, while the DATA ibit lock can be used for all data layouts (DoM and data on OSTs).

            This DATA ibit lock can be piggybacked to the client with the open request, protecting all subsequent data access, and thus eliminating the lock traffic. It allows the common patterns of limited sharing and read-only sharing to be handled efficiently, avoiding the extra latency associated with frequent communication with the server. When these access patterns are broken or no longer hold, i.e. the file is accessed with conflicts by multiple clients, the DATA ibit lock can be revoked and the normal client-side caching logic is used.

            A read delegation (DATA ibit lock) is awarded by the server to a client on a file OPENed for reading (that does not deny read access to others). The decision to award a delegation can be made by the server based on a set of conditions that take into account the recent history of the file, or the client can request it explicitly. For example, the read delegation is awarded on the second OPEN by the same client.

            Similar to read delegations, write delegations are awarded by the server when a client opens a file for write (or read/write) access. While the delegation is granted, all OPEN, READ, WRITE, CLOSE, LOCK, GETATTR, SETATTR requests for the file can be handled locally by the client.
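
            A small sketch of the award heuristic described above (the state tracking and names are hypothetical; only the PR/PW lock modes and the OPEN ibit exist today, and the DATA ibit would be new):

            #include <stdbool.h>

            /* Hypothetical per-file open state tracked by the MDT. */
            struct file_open_state {
                    int fos_readers;          /* clients holding the file open for read */
                    int fos_writers;          /* clients holding the file open for write */
                    int fos_opens_by_client;  /* prior opens by the requesting client */
            };

            enum deleg_mode { DELEG_NONE, DELEG_READ, DELEG_WRITE };

            /* Grant a write delegation (<PW, OPEN|DATA>) only to a sole accessor;
             * grant a read delegation (<PR, OPEN|DATA>) when there are no writers and
             * the same client opens the file repeatedly (e.g. on its second OPEN). */
            static enum deleg_mode maybe_delegate(const struct file_open_state *s,
                                                  bool open_for_write)
            {
                    if (open_for_write) {
                            if (s->fos_readers == 0 && s->fos_writers == 0)
                                    return DELEG_WRITE;
                            return DELEG_NONE;
                    }
                    if (s->fos_writers == 0 && s->fos_opens_by_client >= 1)
                            return DELEG_READ;
                    return DELEG_NONE;
            }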

            An open question: how is recovery handled when switching between DATA ibit locking and extent DLM locking for data I/O?

            Client-side metadata writeback with relaxed consistency

            To simplify the implementation, all directories are created on the MDT synchronously via reint operations. For regular files under a directory, client-side write-back caching of metadata is used to deliver ultra-high throughput. As in MetaWBC, regular files under a directory are first created in the client-side embedded memory file system (MemFS). After more than a certain number of files (e.g. 1024) have been created in MemFS, the client can flush the dirty metadata to the server asynchronously in a batched manner. The dirty inodes for regular files can also be checked and flushed to the MDT periodically via the kernel writeback mechanism.
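
            A minimal sketch of the flush trigger, assuming the threshold from this proposal (the names are illustrative, not existing MetaWBC code):

            #include <stdbool.h>

            #define WBC_FLUSH_THRESHOLD 1024  /* flush after ~1024 locally created files */

            /* Per-directory MemFS state (illustrative only). */
            struct memfs_dir_state {
                    unsigned int mds_dirty_creates;  /* files created locally, not yet on MDT */
            };

            /* Called after each local file creation in MemFS; returns true when the
             * accumulated creates (and any pending small writes) should be sent to
             * the MDT in one batched RPC.  The kernel writeback mechanism would still
             * flush the remaining dirty inodes periodically. */
            static bool wbc_should_flush(struct memfs_dir_state *st)
            {
                    if (++st->mds_dirty_creates < WBC_FLUSH_THRESHOLD)
                            return false;
                    st->mds_dirty_creates = 0;
                    return true;
            }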

            This metadata writeback strategy can support efficient batched creations. This can benefit the IO500 mdtest-easy and mdtest-hard-write|read phases. For mdtest-hard-write, each sub-request in the batched creation RPC can return a <PW, OPEN|DATA> ibits lock to the client, and then the client can batch small writes from multiple files to send to the OST (if data is on an OST) or the MDT (if data is on the MDT). Alternatively, we can batch the creation and the 3901 bytes of data for multiple DoM-only files to the server in a highly efficient way. However, global file system semantics may no longer be guaranteed, and it relies on the applications themselves to resolve access conflicts and cache consistency.

            Capabilities and flags for relaxed consistency

            Like CephFS, we can also define various capabilities for directories and files in Lustre.

            A directory marked with LUSTRE_RELAXED_FL will be created and accessed with relaxed consistency. This flag is stored in the LMA xattr on the MDT, and all sub-files under this directory inherit the LUSTRE_RELAXED_FL flag and are accessed with relaxed consistency.

            We can convert a directory with strong consistency into relaxed consistency level by level. It just needs to take a full EX lock on the directory to clear all DLM locks on the directory and set LUSTRE_RELAXED_FL on it, and then release the full EX lock. After that, all I/O under this directory can be performed in relaxed consistency mode.

            To convert a directory with relaxed consistency back to the old strong consistency, it is necessary to first clear the LUSTRE_RELAXED_FL flag (level by level) and then wait for the maximal timeout period (lease) to invalidate all timeout-based caching on the clients. After that, all data and metadata I/O under the directory will operate with the old strong consistency.

            References

            [1] Vilayannur, M., Nath, P., & Sivasubramaniam, A. (2005). Providing Tunable Consistency for a Parallel File Store. In FAST '05.

            [2] nfs(5) man page. https://linux.die.net/man/5/nfs

            [3] Samba Oplock. Oplocks - Windows drivers | Microsoft Learn. https://learn.microsoft.com/en-us/windows-hardware/drivers/ifs/oplock-overview

            [4] CephFS Distributed Metadata Cache. https://docs.ceph.com/en/quincy/cephfs/mdcache/

            [5] Gulati, A., Naik, M., & Tewari, R. (2007). Nache: Design and Implementation of a Caching Proxy for NFSv4. In FAST '07.

