Metadata writeback cache support
(LU-10938)
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Technical task | Priority: | Minor |
| Reporter: | Qian Yingjin | Assignee: | Qian Yingjin |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Description |
|
For a file or directory flagged with the Protect(P) state under the protection of an EX WBC lock, the open() system call does not need to communicate with the MDS and can be executed locally in the MemFS of the Lustre WBC. Complete(C) means the directory is 1) cached in MemFS; 2) under the protection of a subtree EX lock; and 3) contains all of its direct children. In this state, the results of readdir() and lookup() operations under the directory can be obtained directly from the client-side MemFS, and all file operations can be performed in MemFS without communicating with the server. However, Lustre is a stateful filesystem: each open keeps state on the MDS, and we must remain transparent to applications once the EX WBC lock is cancelled. To achieve this, each local open will be recorded in the inode's open list (or maintained per dentry?); when the EX WBC lock is being cancelled, the files in the open list must be reopened on the MDS. For WBC directories, the ->readdir() call must be handled carefully, so we first need a method to bridge the two implementations. The proposed solution is to make readdir() return entries in the same hash order as current Lustre, as follows:
Besides the proposal above, which builds a memory-resident sorting tree for the whole lifetime of a directory, we can build the hashed-index rbtree at runtime during the readdir() call and return the dentries in hash order. Upon closing the file, the hashed-index rbtree is destroyed. This solution has a shortcoming:
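As a rough illustration of the runtime approach, the sketch below is a minimal userspace stand-in (all names are hypothetical): a plain qsort replaces the kernel rbtree, and a djb2-style hash stands in for Lustre's real directory name hash. It indexes the cached, creation-order entries by name hash so readdir() can return them in the same hash order the MDT would use:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Placeholder name hash (djb2); Lustre's real directory hash differs. */
static uint64_t name_hash(const char *name)
{
        uint64_t h = 5381;

        while (*name)
                h = h * 33 + (unsigned char)*name++;
        return h;
}

struct wbc_dirent {
        const char *name;
        uint64_t    hash;
};

static int hash_cmp(const void *a, const void *b)
{
        const struct wbc_dirent *x = a, *y = b;

        if (x->hash != y->hash)
                return x->hash < y->hash ? -1 : 1;
        return strcmp(x->name, y->name);   /* deterministic tie-break */
}

/* Build the hash-ordered index over the cached (creation-order) entries;
 * a kernel implementation would insert into an rbtree keyed by hash. */
static void build_hash_index(struct wbc_dirent *ents, size_t n)
{
        size_t i;

        for (i = 0; i < n; i++)
                ents[i].hash = name_hash(ents[i].name);
        qsort(ents, n, sizeof(*ents), hash_cmp);
}
```

After this indexing step, iterating the array (or an in-order rbtree walk) yields dentries in hash order, matching what a client would see from the MDT after a flush.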
Any comments and suggestions are welcome! |
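To make the per-inode open list in the description concrete, here is a minimal userspace C sketch (all names hypothetical; `reopen` is a stand-in for the real reopen RPC to the MDS): each local open is recorded on the inode, and on EX WBC lock cancellation the list is replayed:

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical record of a local (server-less) open under a WBC lock. */
struct wbc_open {
        int              flags;   /* open(2) flags used locally */
        struct wbc_open *next;
};

struct wbc_inode {
        struct wbc_open *opens;   /* opens to replay on lock cancel */
};

/* Record a local open performed entirely in MemFS. */
static int wbc_record_open(struct wbc_inode *inode, int flags)
{
        struct wbc_open *o = malloc(sizeof(*o));

        if (!o)
                return -1;
        o->flags = flags;
        o->next = inode->opens;
        inode->opens = o;
        return 0;
}

/* Dummy callback standing in for the reopen RPC to the MDS. */
static int noop_reopen(int flags)
{
        (void)flags;
        return 0;
}

/* On EX WBC lock cancellation, replay every recorded open against the
 * MDS via the supplied callback; returns the number replayed. */
static int wbc_reopen_all(struct wbc_inode *inode, int (*reopen)(int flags))
{
        int n = 0;

        for (struct wbc_open *o = inode->opens; o != NULL; o = o->next) {
                reopen(o->flags);
                n++;
        }
        return n;
}
```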
| Comments |
| Comment by Lai Siyao [ 06/May/20 ] |
|
For small directories (whose sub dir entries fit into one page), we can call dcache_readdir() to read all entries in one blow, while for large directories I'd suggest doing readdir via ll_readdir() directly, though that may force a flush of the cached data. |
| Comment by Qian Yingjin [ 06/May/20 ] |
|
Siyao, thanks for your comment. Please note that the EX WBC lock on the root WBC directory could be revoked at any time while directory entries are being read, after the directory has been opened. Once the root WBC EX lock is revoked, all sub-dentries are flushed to the MDT; after that, dentries are read in hash order, not the previous linear order used by dcache_readdir(). This could cause us to read repeated or wrong dentries... |
| Comment by Qian Yingjin [ 06/May/20 ] |
|
Of course, we can limit the number of child dentries created under a directory. When the limit is reached, we can decomplete the directory, which means flushing all child dentries to the MDT and unmasking the Complete(C) flag from the directory. |
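The limit-then-decomplete idea might look like the following userspace C sketch (struct fields, flag values, and function names are all illustrative, and the actual flush to the MDT is elided):

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical WBC directory state; the flag value is illustrative. */
#define WBC_STATE_COMPLETE 0x2u

struct wbc_dir {
        unsigned int flags;
        size_t       nr_children;
        size_t       max_children;   /* assumed tunable limit */
};

/* "Decomplete": flush the cached children to the MDT (elided here) and
 * clear the Complete(C) flag so later lookups go to the server. */
static void wbc_decomplete(struct wbc_dir *dir)
{
        /* ... flush all cached child dentries to the MDT ... */
        dir->flags &= ~WBC_STATE_COMPLETE;
}

/* Called on local create: returns true if the child may be created in
 * MemFS, false if the caller must fall back to creating it on the MDT. */
static bool wbc_may_create_local(struct wbc_dir *dir)
{
        if (!(dir->flags & WBC_STATE_COMPLETE))
                return false;
        if (dir->nr_children >= dir->max_children) {
                wbc_decomplete(dir);
                return false;
        }
        dir->nr_children++;
        return true;
}
```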
| Comment by Qian Yingjin [ 08/May/20 ] |
|
After discussing with Siyao, we think we can read all entries in one blow for small directories. The reasoning is as follows: to block any normal operations under a root WBC directory while the root EX WBC lock is being revoked, the client holds the write side of a read-write semaphore for the whole process: flushing all child files to the MDT, giving back the root EX WBC locks on these children, reopening the files, and so on. Normal operations such as open/close/read/write/readdir/seek/setattr/getattr just take the read side of the semaphore. Thus, we can account the total size of all child dentries whenever a file is created or deleted under a directory flagged Complete(C). If the prepared buffer is large enough to hold all the dentries, we can read all the entries in one blow. Fortunately, the default buffer size allocated for the getdents system call by coreutils tools such as ls is 32768 bytes (32K), which is usually large enough for 100~1000 entries.
When a directory is too large to return all its dentries in one blow, we can decomplete the directory first and then read the dentries from the MDT. |
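The size accounting described above could be sketched in userspace C as follows (names are hypothetical; the 19-byte header matches offsetof(struct linux_dirent64, d_name) on Linux, and records are rounded to 8 bytes as the getdents64 ABI does):

```c
#include <stddef.h>
#include <string.h>

/* Approximate the getdents64 record length for one entry:
 * 19-byte header (offsetof(struct linux_dirent64, d_name)) plus the
 * name and its NUL, rounded up to an 8-byte boundary. */
static size_t dirent64_reclen(const char *name)
{
        size_t len = 19 + strlen(name) + 1;

        return (len + 7) & ~(size_t)7;
}

/* Hypothetical per-directory accounting, updated on create/unlink
 * while the directory carries the Complete(C) flag. */
struct wbc_dir_acct {
        size_t dirent_bytes;   /* running sum of record lengths */
};

static void wbc_account_create(struct wbc_dir_acct *d, const char *name)
{
        d->dirent_bytes += dirent64_reclen(name);
}

static void wbc_account_unlink(struct wbc_dir_acct *d, const char *name)
{
        d->dirent_bytes -= dirent64_reclen(name);
}

/* Decide at readdir time whether the user buffer (32K by default for
 * coreutils' getdents calls) can take the whole directory in one blow. */
static int wbc_fits_one_blow(const struct wbc_dir_acct *d, size_t buf_size)
{
        return d->dirent_bytes <= buf_size;
}
```

If wbc_fits_one_blow() returns false, the directory would be decompleted and the dentries read back from the MDT instead.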
| Comment by Andreas Dilger [ 09/May/20 ] |
|
One comment about tracking local opens per inode: this would also be useful for the "imperative recovery" mechanism, to clean up/remove the complex saved-RPC replay mechanism that exists for opens. This should be done as a separate patch from WBC. |
| Comment by Andreas Dilger [ 09/May/20 ] |
|
Sorry, I meant that the "simplified interoperability" work in LU-5703 needed the cleanup of open handle replays. |
| Comment by Gerrit Updater [ 12/May/20 ] |
|
Yingjin Qian (qian@ddn.com) uploaded a new patch: https://review.whamcloud.com/38578 |