[LU-16429] Create-ahead: create massive files by batched RPC Created: 23/Dec/22 Updated: 13/Jan/24 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | New Feature | Priority: | Minor |
| Reporter: | Qian Yingjin | Assignee: | Qian Yingjin |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
To improve efficiency and overall throughput, a create-ahead mechanism is proposed to optimize file creation in Lustre. With the traditional POSIX API, a user generally invokes the open() system call with the O_CREAT flag to create a new file. We use the open cache mechanism to cache the results of create-ahead files with strong consistency, so the application can use the valid cached open handle for a create-ahead file directly, without interacting with the server. To avoid dependencies on the metadata hierarchy, create-ahead only operates on files under the same directory.
To begin create-ahead, the client must know the file names in advance. Some applications with a batch access pattern follow fixed naming rules. The mdtest creation benchmark is a typical example, in which the file naming format is mdtest.$rank.$i. If the file names in the creation sequence are predictable, the kernel can automatically detect this pattern and enable create-ahead for the directory. Other applications, such as mpiFileUtils/dcp, usually take a file name list as input and do batch creations with irregular file names. For this case, a simple API is provided through which an application can pass the kernel the list of file names that will be created. The list must be given in the same order as the creation sequence in the application.
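As a rough illustration of the auto-detection idea, the user-space sketch below predicts upcoming names from a run of recent creations that share a prefix and end in a counter that increases by one. It is not the kernel implementation: `predict_next_names`, the three-sample threshold, and the regular expression are assumptions made for this example, and zero-padded counters are not handled.

```python
import re
from typing import List, Optional

# Hypothetical helper pattern: split a name such as "mdtest.3.17" into a
# fixed prefix ("mdtest.3.") and a trailing decimal counter (17).
_TRAILING_INDEX = re.compile(r"^(.*?)(\d+)$")


def predict_next_names(recent: List[str], count: int) -> Optional[List[str]]:
    """Return the next `count` names if `recent` ends in a +1 counter sequence.

    Returns None when the names do not share one prefix or the counters are
    not consecutive; create-ahead would stay disabled in that case.
    """
    if len(recent) < 3:  # assumed threshold: require a few samples first
        return None
    parsed = [_TRAILING_INDEX.match(name) for name in recent]
    if any(m is None for m in parsed):
        return None  # at least one name has no trailing counter
    prefixes = {m.group(1) for m in parsed}
    indices = [int(m.group(2)) for m in parsed]
    if len(prefixes) != 1:
        return None  # names do not share a single prefix
    if any(b - a != 1 for a, b in zip(indices, indices[1:])):
        return None  # counters are not consecutive
    prefix = prefixes.pop()
    start = indices[-1] + 1
    return [f"{prefix}{i}" for i in range(start, start + count)]
```

For example, after seeing mdtest.0.0, mdtest.0.1 and mdtest.0.2, a call with `count=2` yields mdtest.0.3 and mdtest.0.4. A guess like this can be wrong, so any real detector would also need a way to stop and to clean up files that were created but never opened.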
The create-ahead mechanism works as follows. Once create-ahead is enabled for a directory, either by automatic detection or via the API, the client starts a dedicated kernel thread to do the create-ahead work. The thread packs file names, builds a set of creations into one compound RPC, and sends it to the metadata server asynchronously. For each sub-creation request, the server first creates the file with the given name and then returns an open lock with extra inode bits as appropriate. If the file is created with a DoM layout and it is detected or hinted that the application will subsequently write the file, the server returns a <PW, OPEN|DOM> lock to the client; the DOM ibits lock can be reused for the later write, avoiding extra lock traffic in the future. Otherwise, it grants a <CR, OPEN> lock for batch read access or a <CW, OPEN> lock for batch write access.
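The batching and lock-selection rules above can be modelled in a few lines. This is a user-space sketch only: `pack_batches` and `choose_open_lock` are hypothetical names, lock modes and inode bits are represented as plain strings, and no RPC is actually sent.

```python
from typing import Iterator, List, Tuple


def choose_open_lock(dom_layout: bool, will_write: bool) -> Tuple[str, str]:
    """Pick the (lock mode, inode bits) pair described in the text."""
    if dom_layout and will_write:
        # The DOM bit lets the later write reuse this lock.
        return ("PW", "OPEN|DOM")
    if will_write:
        return ("CW", "OPEN")  # batch write access
    return ("CR", "OPEN")      # batch read access


def pack_batches(names: List[str], batch_size: int) -> Iterator[List[str]]:
    """Group file names, in creation order, into compound-RPC-sized batches."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(names), batch_size):
        yield names[start:start + batch_size]
```

Each list yielded by `pack_batches` stands for the sub-requests of one compound RPC. Keeping the names in creation order matters because the sliding window on the client assumes the same order.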
Since create-ahead files are created in the same order as the application creates them, a sliding window is used to control create-ahead progress. The control algorithm is the same as for statahead. Each directory with create-ahead enabled maintains a hash table of create-ahead files. When a file is created ahead, a corresponding file entry is inserted into the hash table with the file name as the key. When an application calls open() with the O_CREAT flag on a new file whose parent directory has create-ahead enabled, the client first searches the hash table. If there is no entry with the same name, the client must contact the server to perform the creation synchronously. Otherwise, if create-ahead for this entry has finished, the application can open the file locally without interacting with the server; if it has not finished, the application must wait for the result of the asynchronous compound RPC. A successful create-ahead hit moves the sliding window forward and drops the corresponding file entry from the hash table.
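The lookup path described above has three outcomes: a miss, a hit on a finished entry, and a hit on an entry still in flight. A minimal user-space model, assuming a fixed window size and using a Python dict as the hash table, might look like the following. The class and method names are hypothetical, and a real client would block on the pending RPC rather than return a "wait" status.

```python
from collections import deque
from typing import Deque, Dict, List

PENDING, DONE = "pending", "done"


class CreateAheadDir:
    """Per-directory create-ahead state (illustrative model only)."""

    def __init__(self, names: List[str], window: int) -> None:
        self.todo: Deque[str] = deque(names)  # names not yet sent, in order
        self.entries: Dict[str, str] = {}     # hash table keyed by file name
        self.window = window
        self._fill()

    def _fill(self) -> None:
        # Issue create-ahead requests until the window is full.
        while self.todo and len(self.entries) < self.window:
            self.entries[self.todo.popleft()] = PENDING

    def complete(self, name: str) -> None:
        # Called when the asynchronous compound RPC reply for `name` arrives.
        if name in self.entries:
            self.entries[name] = DONE

    def open_create(self, name: str) -> str:
        state = self.entries.get(name)
        if state is None:
            return "sync-create"  # miss: fall back to a synchronous RPC
        if state == PENDING:
            return "wait"         # entry exists but the reply is outstanding
        del self.entries[name]    # hit: drop the entry ...
        self._fill()              # ... and slide the window forward
        return "local-open"
```

With names a, b, c and a window of 2, opening a before its reply arrives returns "wait". After `complete("a")` the same open returns "local-open" and c enters the window.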
By using open cache and compounding, our create-ahead mechanism can save substantial latency by eliminating many costly round-trips, improving the creation performance significantly.
Combined with the batching of massive small buffered writes (LU-16355), it should improve the performance of IO500 mdtest-hard-write.
TODO: Open relay data for batched open/create... |
| Comments |
| Comment by Andreas Dilger [ 23/Dec/22 ] |
|
For auto-detected create-ahead, how does the client know when to stop creating files, and what happens to the extra files that are created when they shouldn't have been? Not only would the files need to be deleted on the MDT(s) but the OST objects allocated for those files would also need to be deleted, adding overhead to the operation. Auto-detection of "create ahead" filenames seems pretty risky and I'm not sure that this is worthwhile as a standalone feature. Instead, I think it would be better to implement batched file creates as part of WBC (where the client already knows the filenames). Also, it would be generally more useful to improve WBC to work with pre-existing directories by doing a readdir to fetch all of the filenames to the client after it gets the EX directory lock. |
| Comment by Qian Yingjin [ 24/Dec/22 ] |
|
Yes, auto-detected create-ahead does have the problem of extra files being created. It may still need a hint to tell the kernel how many files should be created ahead.
For WBC, we have already supported batching creations and setattrs. |
| Comment by Gerrit Updater [ 27/Dec/22 ] |
|
"Qian Yingjin <qian@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49519 |
| Comment by Gerrit Updater [ 04/Jan/23 ] |
|
"Qian Yingjin <qian@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49547 |
| Comment by Qian Yingjin [ 09/Jan/23 ] |
|
mkdir $dir || error "failed to mkdir $dir"
time $LFS ahead -c create -B 256 -s 0 -e 5000 -b $tfile.i -Y -d $dir
real 0m0.217s
user 0m0.000s
sys 0m0.005s
mdc.lustre-MDT0000-mdc-ffff99c2da964800.batch_stats=
snapshot_time: 1673235150.233498772 (secs.nsecs)
subreqs per batch   batchs   %     cum %
1:                  0        0     0
2:                  0        0     0
4:                  0        0     0
8:                  0        0     0
16:                 0        0     0
32:                 0        0     0
64:                 0        0     0
128:                0        0     0
256:                20       100   100
time createmany -o $DIR/$tdir/$tfile 5000
real 0m1.344s
user 0m0.039s
sys 0m0.465s
Time: 0.217s with create-ahead vs. 1.344s with createmany, roughly a 6x speedup.
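As a quick sanity check on these numbers, the arithmetic below assumes that `-B 256` is the batch size and that `-s 0 -e 5000` requests about 5000 files. Both readings are inferred from the command line above and are not confirmed option semantics.

```python
import math

files = 5000      # assumed from "-s 0 -e 5000"
batch_size = 256  # assumed from "-B 256"

# Number of compound RPCs needed if every batch is filled in order.
batches = math.ceil(files / batch_size)

# Ratio of the two "real" wall-clock times reported above.
speedup = 1.344 / 0.217

print(batches, round(speedup, 2))
```

Under those assumptions the 20 batches computed here agree with the 20 batches that batch_stats reports in the 256 bucket.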
|
| Comment by Gerrit Updater [ 09/Jan/23 ] |
|
"Qian Yingjin <qian@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49580 |
| Comment by Gerrit Updater [ 12/Jan/23 ] |
|
"Qian Yingjin <qian@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49615 |
| Comment by Gerrit Updater [ 09/Feb/23 ] |
|
"Qian Yingjin <qian@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49951 |
| Comment by Gerrit Updater [ 24/Feb/23 ] |
|
"Qian Yingjin <qian@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50131 |
| Comment by Gerrit Updater [ 28/Feb/23 ] |
|
"Qian Yingjin <qian@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50153 |
| Comment by Gerrit Updater [ 03/Mar/23 ] |
|
"Qian Yingjin <qian@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50190 |
| Comment by Gerrit Updater [ 06/Mar/23 ] |
|
"Qian Yingjin <qian@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50215 |
| Comment by Andreas Dilger [ 13/Jan/24 ] |
|
I think this batched create-ahead should only be done with an explicit request from the application and/or an "lfs" command (part of "ladvise"?) that specifies the number of files to create. This is a non-POSIX interface, but could be used effectively by libraries like MPI-IO. I don't think it is safe to just guess at filenames and create them, as this could cause errors for applications, or leave stray files that appear corrupted if the client crashes after creating too many files and before cleaning them up. |