Type: Technical task
Affects Version/s: None
Fix Version/s: None
For a file or directory in the Protect(P) state, i.e. under the protection of an EX WBC lock, the open() system call does not need to communicate with the MDS; it can be executed locally in the MemFS of Lustre WBC.
Complete(C): the directory 1) is cached in MemFS; 2) is under the protection of a subtree EX lock; 3) contains the complete set of its direct subdirs. In this state, the results of readdir() and lookup() operations under the directory can be obtained directly from the client-side MemFS. All file operations can be performed on the MemFS without communicating with the server.
However, Lustre is a stateful filesystem: each open keeps state on the MDS. We must remain transparent to applications once the EX WBC lock is cancelled. To achieve this, each local open will be recorded in the inode's open list (or should it be maintained per dentry?). When the EX WBC lock is being cancelled, the client must reopen the files in the open list on the MDS.
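The open-list idea can be modelled in user space as follows. This is only a minimal sketch in Python, not kernel code: `WbcInode`, `OpenRecord`, `FakeMds`, and the `wbc_*` functions are hypothetical names introduced for illustration, and the MDS is reduced to a per-path open counter.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # identity comparison: each open is a distinct record
class OpenRecord:
    """One open handle; 'replayed' tells whether the MDS knows about it."""
    flags: int
    replayed: bool = False


@dataclass
class WbcInode:
    """Hypothetical client-side inode cached in MemFS."""
    path: str
    has_ex_lock: bool = True
    open_list: list = field(default_factory=list)


class FakeMds:
    """Stand-in for the MDS: it only counts open state per path."""
    def __init__(self):
        self.open_count = {}

    def open(self, path, flags):
        self.open_count[path] = self.open_count.get(path, 0) + 1

    def close(self, path):
        self.open_count[path] -= 1


def wbc_open(inode, mds, flags):
    rec = OpenRecord(flags)
    if inode.has_ex_lock:
        inode.open_list.append(rec)      # local open, no RPC to the MDS
    else:
        mds.open(inode.path, flags)      # normal path: MDS keeps the state
        rec.replayed = True
    return rec


def wbc_close(inode, mds, rec):
    if rec.replayed:
        mds.close(inode.path)            # MDS holds state for this open
    else:
        inode.open_list.remove(rec)      # never reached the MDS, drop locally


def wbc_cancel_ex_lock(inode, mds):
    """On lock cancellation, reopen every still-open local handle on the MDS."""
    for rec in inode.open_list:
        mds.open(inode.path, rec.flags)
        rec.replayed = True
    inode.open_list.clear()
    inode.has_ex_lock = False
```

In this model a handle that was opened and closed entirely under the EX lock never generates any MDS traffic; only handles still open at cancellation time are replayed.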
For WBC directories, the ->readdir() call must be handled carefully.
Currently, the mechanism adopted by MemFS (tmpfs) is simply to scan the in-memory child dentries of the directory in the dcache linearly to fill in the content returned to the readdir call: ->dcache_readdir().
Lustre's readdir implementation, in contrast, is much more complex. It performs readdir in hash order and uses the hash of a file name as the telldir/seekdir cookie stored in the file handle.
Thus, we must first find a way to bridge the two implementations.
The proposed solution is to make readdir() use the same hash order as current Lustre, as follows:
1. At connection time, decide the hash function used both by the OSD backend on the MDSes and by MemFS on the client (assuming that all MDT servers are homogeneous: all formatted as ldiskfs or ZFS and using the same name hash function).
2. Use an rbtree (or the in-memory Htree used by ext4/ldiskfs) to manage the child dentries under a directory, sorted by the hash value of the file name.
3. When creating a new file under a directory, add the corresponding dentry to the sorting tree.
4. When unlinking a file under a directory, remove the corresponding dentry from the sorting tree.
5. For readdir(), do the same as current Lustre does: fill in the dentries in hash order.
6. The memory used for the rbtree and its nodes should be accounted. When the WBC memory usage limit is reached, decomplete the directory.
7. When a directory is decompleted, all metadata operations under it must go to the server synchronously. At this point, destroy the sorting tree for the child dentries under the directory.
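The steps above can be sketched as a small user-space model. This is an illustration under several assumptions, not the proposed kernel implementation: a sorted Python list stands in for the rbtree, `zlib.crc32` stands in for whatever name hash is agreed at connection time (it is not the ldiskfs hash), and `HashedDir`, `NODE_COST`, and `mem_limit` are hypothetical names.

```python
import bisect
import zlib

NODE_COST = 64  # hypothetical number of bytes accounted per index node


def name_hash(name):
    # Stand-in for the hash agreed at connection time; NOT the ldiskfs hash.
    return zlib.crc32(name.encode())


class HashedDir:
    def __init__(self, mem_limit):
        self.mem_limit = mem_limit
        self.mem_used = 0
        self.complete = True
        self.index = []  # sorted list of (hash, name); models the rbtree

    def create(self, name):
        """Return True if handled locally, False if the caller must go to the MDS."""
        if not self.complete:
            return False
        key = (name_hash(name), name)
        i = bisect.bisect_left(self.index, key)
        if i < len(self.index) and self.index[i] == key:
            raise FileExistsError(name)
        if self.mem_used + NODE_COST > self.mem_limit:
            self.decomplete()            # memory limit reached
            return False
        self.index.insert(i, key)
        self.mem_used += NODE_COST
        return True

    def unlink(self, name):
        if not self.complete:
            return False
        key = (name_hash(name), name)
        i = bisect.bisect_left(self.index, key)
        if i == len(self.index) or self.index[i] != key:
            raise FileNotFoundError(name)
        del self.index[i]
        self.mem_used -= NODE_COST
        return True

    def decomplete(self):
        # Destroy the sorting tree; later operations go to the server.
        self.index = []
        self.mem_used = 0
        self.complete = False

    def readdir(self, cookie=0, count=2):
        """Return (names, next_cookie); next_cookie is None at end of directory."""
        if not self.complete:
            raise RuntimeError("decompleted directory: readdir must go to the MDS")
        i = bisect.bisect_left(self.index, (cookie, ""))  # first hash >= cookie
        end = min(i + count, len(self.index))
        # Never split a run of colliding hashes, so the cookie stays unambiguous.
        while i < end < len(self.index) and self.index[end][0] == self.index[end - 1][0]:
            end += 1
        names = [n for _, n in self.index[i:end]]
        nxt = self.index[end][0] if end < len(self.index) else None
        return names, nxt
```

Because the cookie is a hash value rather than a list position, a create or unlink between two readdir() calls does not shift the resume point, which is the property the linear ->dcache_readdir() scan lacks. Keeping colliding names in one batch is a simplification chosen for this sketch.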
Besides the proposal above, which builds a memory-resident sorting tree for the whole lifetime of a directory, we can instead build the hashed index rbtree at runtime during the readdir() call and return the dentries in hash order. Upon closing the file, the hashed index rbtree is destroyed.
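A minimal sketch of this on-demand variant follows, again as a user-space model with assumed names: `DirHandle` is hypothetical, a Python set models the unordered dcache child list, a sorted list models the rbtree, and `zlib.crc32` is only a placeholder hash.

```python
import bisect
import zlib


def name_hash(name):
    # Placeholder hash for illustration; NOT the ldiskfs hash.
    return zlib.crc32(name.encode())


class DirHandle:
    """Models an open directory file handle."""

    def __init__(self, children):
        self.children = children  # unordered child names (models the dcache list)
        self.index = None         # hashed index, built on first readdir()

    def readdir(self, cookie=0, count=2):
        """Return (names, next_cookie); next_cookie is None at end of directory."""
        if self.index is None:
            # Build the sorted index lazily, only when readdir() is first called.
            self.index = sorted((name_hash(n), n) for n in self.children)
        i = bisect.bisect_left(self.index, (cookie, ""))
        end = min(i + count, len(self.index))
        # Keep colliding hashes in the same batch.
        while i < end < len(self.index) and self.index[end][0] == self.index[end - 1][0]:
            end += 1
        names = [n for _, n in self.index[i:end]]
        nxt = self.index[end][0] if end < len(self.index) else None
        return names, nxt

    def close(self):
        # Destroy the hashed index when the directory handle is closed.
        self.index = None
```

In this sketch the index is a snapshot taken at the first readdir(), so entries created or unlinked while the handle is open are not reflected until the directory is reopened. Whether that is acceptable, or whether the index should be updated while it exists, is left open here.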
The proposed solution has the following shortcomings:
- In a heterogeneous MDT environment, the hash function used by each backend may differ, which makes this layer of the stack unstable.
- Implementing and managing a sorting tree keyed by the hash of the file name may be complex.
- Even if all MDTs are formatted as ldiskfs, each ldiskfs filesystem on an MDT may have a different hash seed. The MDTs must agree on the hash seed: either none of them uses a hash seed, or all of them use the same one.
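The agreement required by the first and last points could be checked once at connection time. The sketch below only illustrates that check: `agree_on_name_hash` and the `(hash_name, seed)` tuples are hypothetical, and nothing here reflects an existing Lustre connect flag or protocol field.

```python
def agree_on_name_hash(mdt_params):
    """Return the common (hash_name, seed) pair, or None if the MDTs disagree.

    mdt_params is a hypothetical list with one (hash_name, seed) tuple per MDT,
    as it might be reported at connection time.
    """
    distinct = set(mdt_params)
    if len(distinct) == 1:
        return distinct.pop()
    # Different hash functions or seeds: hash-ordered readdir in MemFS would
    # not match the servers, so the client should not serve readdir locally.
    return None
```

For example, `agree_on_name_hash([("half_md4", 0), ("half_md4", 0)])` returns `("half_md4", 0)`, while a list that mixes seeds or hash names returns `None`, in which case the client would fall back to server-side readdir.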
Any comments and suggestions are welcome!