Details
-
New Feature
-
Resolution: Unresolved
-
Minor
Description
Cross-file readahead
There are several use cases for cross-file readahead:
= some AI/ML applications, or mdtest-hard-read, read small files in file naming index order:
open(); read(); close(); on mdtest.$rank.$i
= reading and writing files in readdir() order, e.g. grep, cp.
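The first use case can be reproduced from user space with a short script. The following is only an illustrative sketch of the access pattern, not part of mdtest or Lustre; the helper names, the payload contents, and the 4096-byte read size are assumptions made for the example.

```python
import os
import tempfile


def make_sample_files(directory, rank, count):
    """Create small files named mdtest.<rank>.<i> (hypothetical sample data)."""
    for i in range(count):
        path = os.path.join(directory, f"mdtest.{rank}.{i}")
        with open(path, "wb") as f:
            f.write(f"payload-{i}".encode())


def read_in_index_order(directory, rank, count, max_bytes=4096):
    """Issue open(); read(); close() on each file in naming index order."""
    data = []
    for i in range(count):
        path = os.path.join(directory, f"mdtest.{rank}.{i}")
        fd = os.open(path, os.O_RDONLY)          # open()
        try:
            data.append(os.read(fd, max_bytes))  # one small read()
        finally:
            os.close(fd)                         # close()
    return data


def demo():
    """Build three sample files in a temporary directory and read them back."""
    with tempfile.TemporaryDirectory() as d:
        make_sample_files(d, rank=0, count=3)
        return read_in_index_order(d, rank=0, count=3)
```

Each iteration pays a full open, read, and close round trip, which is the per-file latency that cross-file readahead tries to hide.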
The first version does not support batched RPCs, because it is still unclear how to handle replay recovery for batched open-ahead; instead, it uses one RPC per operation request.
After detecting that the I/O workload follows a certain access pattern that can be optimized via cross-file read-ahead, we can apply the following optimizations:
== open-ahead: Lustre has a client-side open cache mechanism designed to hide open latency. If a client knows in advance, or predicts, that it will open/close a file repeatedly or in the near future, it can request an OPEN ibits lock from the MDT. The cached open lock granted by the MDT protects the validity of the client's open handle. Thus we can asynchronously open-ahead the files that will be accessed in the future. After that, the client can perform open() locally without interacting with the MDS, as long as the corresponding open handle is valid and protected by the OPEN ibits lock.
== read-ahead for DoM-only files: During the open-ahead of a DoM-only regular file, both file data and attributes are packed into the inline reply buffer and a <PR, OPEN|UPDATE|PERM|DOM> lock is granted to the client. After receiving the prefetch reply, the client stores the prefetched data of the DoM file in the page cache.
== read-ahead for files with data on OSTs: After the open-ahead of a regular file with data on OSTs, we can use a work queue mechanism to prefetch the data into the page cache on the client. In the work queue thread, we take an extent DLM lock for the prefetched data (LU-15155(https://jira.whamcloud.com/browse/LU-15155)) and then read-ahead the data into the page cache. Subsequent read() system calls can then hit and reuse the cached pages.
== Simplify the I/O path for small files with data on OSTs: In the original implementation, the Lustre client must first obtain the corresponding extent DLM lock from the OSTs before it can read data from them. This lock latency is considerable for small I/Os. We can simplify the I/O path for small files as follows. The prefetch can be limited to reading only the first stripe size of data (less than 1MiB stripe size for larger files), so the read never crosses a stripe boundary. We can then compound the lock request and the data read into one RPC request (like the intent lock for open, and by using short I/O?), and this could also be the first step toward integrating with batched cross-file read-ahead.
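The work-queue prefetch and the first-stripe limit described in the list above can be modelled in user space. This is a toy sketch under stated assumptions, not Lustre code: a thread pool stands in for the work queue, a dictionary stands in for the client page cache, DLM locking is not modelled at all, and the class name `FirstStripePrefetcher` and the 1 MiB default are hypothetical.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

STRIPE_SIZE = 1 << 20  # assumed 1 MiB stripe size, for illustration only


class FirstStripePrefetcher:
    """Toy stand-in: thread pool as work queue, dict as client page cache."""

    def __init__(self, stripe_size=STRIPE_SIZE, workers=2):
        self.stripe_size = stripe_size
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._cache = {}  # path -> Future yielding the first stripe of data
        self.hits = 0
        self.misses = 0

    def _load(self, path):
        # Read at most one stripe so the prefetch never crosses a stripe boundary.
        with open(path, "rb") as f:
            return f.read(self.stripe_size)

    def prefetch(self, path):
        """Queue an asynchronous prefetch of the first stripe of `path`."""
        if path not in self._cache:
            self._cache[path] = self._pool.submit(self._load, path)

    def read(self, path, count):
        """Serve a read from offset 0, from the prefetched data when present."""
        count = min(count, self.stripe_size)
        future = self._cache.get(path)
        if future is not None:
            self.hits += 1
            return future.result()[:count]  # waits if the prefetch is in flight
        self.misses += 1
        with open(path, "rb") as f:
            return f.read(count)

    def close(self):
        self._pool.shutdown(wait=True)


def demo():
    """Prefetch a 10-byte and a 100-byte file with a 64-byte toy stripe size."""
    with tempfile.TemporaryDirectory() as d:
        small = os.path.join(d, "small")
        large = os.path.join(d, "large")
        with open(small, "wb") as f:
            f.write(b"a" * 10)
        with open(large, "wb") as f:
            f.write(b"b" * 100)
        pf = FirstStripePrefetcher(stripe_size=64)
        pf.prefetch(small)
        pf.prefetch(large)
        got_small = pf.read(small, 4096)
        got_large = pf.read(large, 4096)
        pf.close()
        return len(got_small), len(got_large), pf.hits, pf.misses
```

In this model the small file is served entirely from the prefetched data, while the larger file is served only up to the stripe limit; anything beyond that would have to go through the regular read path.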
Once a file is appropriately prefetched and validly cached on the client under the protection of the ibits lock granted by the MDT, subsequent user-space calls such as stat() and open() can both be performed locally without any interaction with the server. A subsequent read() can also read data directly from the prefetched cache pages on the client.
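The overall flow, from detecting a naming-index access pattern to serving later open() calls locally, can be sketched as a small simulation. Everything here is a hypothetical model rather than the actual Lustre implementation: `ToyMDT`, `ToyClient`, and `predict_next` are invented names, the detector only recognises a constant stride in a trailing numeric index, open-ahead is performed synchronously for simplicity, and each open-ahead costs one RPC per file, matching the non-batched first version described above.

```python
import re

_NAME_RE = re.compile(r"^(.*?)(\d+)$")  # prefix + trailing decimal index


def predict_next(history, window=3, ahead=2):
    """Predict upcoming names if the last `window` opens share a prefix and
    advance by a constant non-zero stride. Zero-padded indices are not handled."""
    if len(history) < window:
        return []
    parsed = [_NAME_RE.match(name) for name in history[-window:]]
    if any(m is None for m in parsed):
        return []
    if len({m.group(1) for m in parsed}) != 1:
        return []
    indices = [int(m.group(2)) for m in parsed]
    strides = {b - a for a, b in zip(indices, indices[1:])}
    if len(strides) != 1:
        return []
    stride = strides.pop()
    if stride == 0:
        return []
    prefix, last = parsed[0].group(1), indices[-1]
    candidates = [last + stride * k for k in range(1, ahead + 1)]
    return [f"{prefix}{i}" for i in candidates if i >= 0]


class ToyMDT:
    """Counts open RPCs and hands out opaque handles."""

    def __init__(self):
        self.rpc_count = 0

    def open_rpc(self, name):
        self.rpc_count += 1
        return self.rpc_count  # handle value is arbitrary in this model


class ToyClient:
    def __init__(self, mdt):
        self.mdt = mdt
        self.history = []
        self.cached = {}  # name -> handle, valid while the OPEN lock is held

    def lock_cancelled(self, name):
        """Model the MDT revoking the OPEN lock: the cached handle is dropped."""
        self.cached.pop(name, None)

    def open(self, name):
        """Return (handle, served_locally) and open-ahead predicted files."""
        local = name in self.cached
        handle = self.cached[name] if local else self.mdt.open_rpc(name)
        self.history.append(name)
        for nxt in predict_next(self.history):
            if nxt not in self.cached:
                self.cached[nxt] = self.mdt.open_rpc(nxt)  # one RPC per file
        return handle, local
```

In this model, opening f.0 through f.4 in order needs RPCs for the first three opens only; once the pattern is detected, later opens are served from the cached handles until the corresponding lock is cancelled.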
Issue Links
- is related to: LU-10280 DoM cross-file open+readahead via statahead (Open)