[LU-16429] Create-ahead: create massive files by batched RPC Created: 23/Dec/22  Updated: 13/Jan/24

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: New Feature Priority: Minor
Reporter: Qian Yingjin Assignee: Qian Yingjin
Resolution: Unresolved Votes: 0
Labels: None

Rank (Obsolete): 9223372036854775807

 Description   

To improve efficiency and overall throughput, a create-ahead mechanism is proposed to optimize file creation in Lustre.

With the traditional POSIX API, a user generally invokes the open() system call with the O_CREAT flag to create a new file.

We use the open cache mechanism to cache the results of create-ahead files with strong consistency. The application can then use the valid cached open handle for a create-ahead file directly, without interacting with the server.

To avoid dependencies on the metadata hierarchy, our create-ahead mechanism only creates files ahead within the same directory. To begin create-ahead, the client must first know the file names in advance. Some applications with a batch access pattern follow certain naming rules. The mdtest creation benchmark is a typical example, in which the file naming format is mdtest.$rank.$i. If the file naming in the creation sequence is predictable, the kernel can automatically detect such an I/O pattern and then enable create-ahead for this directory.
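As an illustration of what "predictable naming" could mean, here is a minimal user-space sketch in Python. It is not the kernel detection logic from the patches; the helper names (split_name, predict_next) and the min_run threshold are assumptions made up for this example. It treats a sequence as predictable when the last few names share a prefix and their trailing integers advance by a constant positive stride.

```python
import re

# Hypothetical helper: split a name into (prefix, trailing integer).
_TRAILING_INT = re.compile(r"^(.*?)(\d+)$")


def split_name(name):
    """Return (prefix, number) or None if the name has no trailing digits."""
    m = _TRAILING_INT.match(name)
    if m is None:
        return None
    return m.group(1), int(m.group(2))


def predict_next(recent_names, count, min_run=3):
    """Return `count` predicted names, or [] if no pattern is detected.

    A pattern is detected when the last `min_run` names share one prefix
    and their trailing integers increase by one constant positive stride.
    Limitation of this sketch: zero padding (e.g. "f007") is not preserved.
    """
    if len(recent_names) < min_run:
        return []
    parts = [split_name(n) for n in recent_names[-min_run:]]
    if any(p is None for p in parts):
        return []
    if len({p[0] for p in parts}) != 1:
        return []
    nums = [p[1] for p in parts]
    strides = {b - a for a, b in zip(nums, nums[1:])}
    if len(strides) != 1:
        return []
    stride = strides.pop()
    if stride <= 0:
        return []
    prefix = parts[0][0]
    return [f"{prefix}{nums[-1] + stride * k}" for k in range(1, count + 1)]


if __name__ == "__main__":
    seen = ["mdtest.0.0", "mdtest.0.1", "mdtest.0.2"]
    print(predict_next(seen, 2))
```

Any such guess can be wrong, so a real implementation would also need a bound on how many files to create ahead.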

Other kinds of applications, such as mpiFileUtils/dcp, usually take a file name list as input and do batch creations with irregular file names. In this case, a simple API is provided through which an application can inform the kernel of the list of file names that will be created. The file name list must be given in the same order as the creation sequence in the application.

 

The create-ahead mechanism works as follows:

Once create-ahead is enabled for a directory, either by automatic detection or via the API, the client starts a dedicated kernel thread to do the create-ahead work. The thread packs the file names, builds a set of creations into one compound RPC, and then sends it to the metadata server asynchronously. For each sub-creation request, the server first creates the file with the given file name and then returns an open lock with extra inode bits accordingly. If the file is created with a DoM layout and it is detected, or the client is informed, that subsequent operations of the application will write the file, the server returns a <PW, OPEN|DOM> lock to the client, where the DOM ibits lock can be reused for the later write, avoiding extra lock traffic in the future. Otherwise, it grants a <CR, OPEN> lock for batch read access or a <CW, OPEN> lock for batch write access.
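The packing and lock-selection rules above can be sketched as follows. This is a plain Python model, not Lustre code: pack_batches, choose_lock, and the max_subreqs default of 256 are hypothetical names and values chosen for the example, and lock modes and inode bits are represented as simple strings.

```python
from itertools import islice


def pack_batches(names, max_subreqs=256):
    """Yield lists of at most `max_subreqs` names, preserving order.

    Each list stands for the sub-requests of one compound RPC.
    """
    it = iter(names)
    while True:
        batch = list(islice(it, max_subreqs))
        if not batch:
            return
        yield batch


def choose_lock(dom_layout, will_write):
    """Return (mode, ibits) for one sub-creation, per the description.

    DoM layout + expected write -> <PW, OPEN|DOM>
    otherwise, batch write      -> <CW, OPEN>
    otherwise, batch read       -> <CR, OPEN>
    """
    if dom_layout and will_write:
        return ("PW", frozenset({"OPEN", "DOM"}))
    if will_write:
        return ("CW", frozenset({"OPEN"}))
    return ("CR", frozenset({"OPEN"}))


if __name__ == "__main__":
    names = [f"f.{i}" for i in range(5000)]
    batches = list(pack_batches(names))
    print(len(batches), len(batches[-1]))
    print(choose_lock(dom_layout=True, will_write=True))
```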

 

Since the order of create-ahead files is the same as the order in which the application creates them, a sliding window is used to control the create-ahead progress. The control algorithm is the same as for statahead. Each directory with create-ahead enabled maintains a hash table of create-ahead files. When a file is created ahead, a corresponding file entry is inserted into the hash table using the file name as the key.

When an application calls open() with the O_CREAT flag on a new file and create-ahead is enabled for its parent directory, the client first searches the hash table. If there is no file entry with the same name in the hash table, the client must contact the server to perform the creation synchronously. Otherwise, if create-ahead for this file entry has finished, the application can open the file locally without interacting with the server; if it has not finished, the application must wait for the result of the asynchronous compound RPC. A successful create-ahead hit moves the sliding window forward and drops the corresponding file entry from the hash table.
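The lookup path and sliding window described above can be modelled in a few lines. This is a simplified, single-threaded Python sketch under stated assumptions: the class CreateAheadDir, the State enum, the window size, and the returned outcome strings are all invented for illustration, and the asynchronous RPC reply is simulated by calling complete() by hand.

```python
from enum import Enum


class State(Enum):
    IN_FLIGHT = 1   # compound RPC sent, reply not yet received
    DONE = 2        # file created, open handle cached on the client


class CreateAheadDir:
    """Toy model of one directory with create-ahead enabled."""

    def __init__(self, names, window=4):
        self._pending = iter(names)   # names in application creation order
        self.window = window          # max entries created ahead at once
        self.entries = {}             # hash table: file name -> State
        self._fill()

    def _fill(self):
        """Create ahead until the window is full or names run out."""
        while len(self.entries) < self.window:
            name = next(self._pending, None)
            if name is None:
                return
            self.entries[name] = State.IN_FLIGHT

    def complete(self, name):
        """Simulate the asynchronous reply for `name` arriving."""
        if name in self.entries:
            self.entries[name] = State.DONE

    def open_create(self, name):
        """Return what open(O_CREAT) on `name` would have to do."""
        state = self.entries.get(name)
        if state is None:
            return "sync-create"      # miss: contact the server synchronously
        outcome = "local-open" if state is State.DONE else "wait-for-rpc"
        del self.entries[name]        # hit: drop the entry ...
        self._fill()                  # ... and move the window forward
        return outcome


if __name__ == "__main__":
    d = CreateAheadDir(["a", "b", "c"], window=2)
    d.complete("a")
    print(d.open_create("a"), d.open_create("b"), d.open_create("zzz"))
```

In this model a hit on an unfinished entry is reported as "wait-for-rpc" and then treated like any other hit; a real client would block until the compound RPC reply arrives before opening the file.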

 

By using open cache and compounding, our create-ahead mechanism can save substantial latency by eliminating many costly round-trips, improving the creation performance significantly.

 

Combined with the batching of massive small buffered writes (LU-16355), it should improve the performance of IO500 mdtest-hard-write.

 

TODO:

Open replay data for batched open/create...



 Comments   
Comment by Andreas Dilger [ 23/Dec/22 ]

For auto-detected create-ahead, how does the client know when to stop creating files, and what happens to the extra files that are created when they shouldn't have been? Not only would the files need to be deleted on the MDT(s) but the OST objects allocated for those files would also need to be deleted, adding overhead to the operation. Auto-detection of "create ahead" filenames seems pretty risky and I'm not sure that this is worthwhile as a standalone feature.

Instead, I think it would be better to implement batched file creates as part of WBC (where the client already knows the filenames). Also, it would be generally more useful to improve WBC to work with pre-existing directories by doing a readdir to fetch all of the filenames to the client after it gets the EX directory lock.

Comment by Qian Yingjin [ 24/Dec/22 ]

Yes, auto-detected create-ahead does have the problem of extra files being created. It may still need a hint to the kernel indicating how many files should be created ahead.

 

For WBC, we already support batching of creations and setattrs.

Comment by Gerrit Updater [ 27/Dec/22 ]

"Qian Yingjin <qian@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49519
Subject: LU-16429 ahead: move statahead into more generic ahead
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: f9094fbdf5a8e8f61517474b24a86e4da18e445f

Comment by Gerrit Updater [ 04/Jan/23 ]

"Qian Yingjin <qian@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49547
Subject: LU-16429 ahead: basic framework for ahead operations
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: f12a91040b9dd94a6abfa1e09426b1a2d347f50e

Comment by Qian Yingjin [ 09/Jan/23 ]

 

mkdir $dir || error "failed to mkdir $dir"
time $LFS ahead -c create -B 256 -s 0 -e 5000 -b $tfile.i -Y -d $dir

 

real 0m0.217s
user 0m0.000s
sys 0m0.005s
mdc.lustre-MDT0000-mdc-ffff99c2da964800.batch_stats=
snapshot_time:         1673235150.233498772 (secs.nsecs)
subreqs per batch   batchs    %  cum %
1:                       0    0      0
2:                       0    0      0
4:                       0    0      0
8:                       0    0      0
16:                      0    0      0
32:                      0    0      0
64:                      0    0      0
128:                     0    0      0
256:                    20  100    100

 

 

time createmany -o $DIR/$tdir/$tfile 5000
real 0m1.344s
user 0m0.039s
sys 0m0.465s

 

Time: 0.217s with batched create-ahead vs. 1.344s for the unbatched run (about 6.2x faster).
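As a quick sanity check of the numbers above, the following Python arithmetic assumes that "-s 0 -e 5000" means 5000 files, that "-B 256" is the maximum number of sub-requests per batch, and that each batch_stats bucket counts batches with up to that many sub-requests. These readings of the options and of the histogram are assumptions, not something stated in the output.

```python
import math

files = 5000        # assumed from "-s 0 -e 5000"
batch_size = 256    # assumed from "-B 256"

# Number of compound RPCs needed and the size of the final, partial batch.
batches = math.ceil(files / batch_size)
last_batch = files - (batches - 1) * batch_size

# Wall-clock ratio of the two "real" times reported above.
speedup = 1.344 / 0.217

print(batches, last_batch, round(speedup, 1))  # prints: 20 136 6.2
```

Under these assumptions the 20 batches match the "256: 20" row of batch_stats, since the last batch of 136 sub-requests is larger than 128 and so also lands in the 256 bucket.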

 

 

Comment by Gerrit Updater [ 09/Jan/23 ]

"Qian Yingjin <qian@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49580
Subject: LU-16429 ahead: batch reint creations via lfs ahead
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a7be652826260aaa7786a9190b19a04dd7047fcd

Comment by Gerrit Updater [ 12/Jan/23 ]

"Qian Yingjin <qian@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49615
Subject: LU-16429 llite: batch intent create and open via lfs ahead
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 3dc3ffaecca35ff726999886c603aa5134545237

Comment by Gerrit Updater [ 09/Feb/23 ]

"Qian Yingjin <qian@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49951
Subject: LU-16429 llite: use sliding window for open|create ahead
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 3a91a4bb6dd941072e5e345f04b904d49193c9a0

Comment by Gerrit Updater [ 24/Feb/23 ]

"Qian Yingjin <qian@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50131
Subject: LU-16429 dom: batch open|create and write for DoM files
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 68e950380eaaf6be7c692c778e687e8207d41496

Comment by Gerrit Updater [ 28/Feb/23 ]

"Qian Yingjin <qian@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50153
Subject: LU-16429 recovery: resend recovery for batch intent creation
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 3fdb74660e1591f88dd15e5b9690fe791ae3e1e5

Comment by Gerrit Updater [ 03/Mar/23 ]

"Qian Yingjin <qian@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50190
Subject: LU-16429 mdt: add stats for batch resend count
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 2a5d9de08663fbe7a50d40d95f86a9db920e916a

Comment by Gerrit Updater [ 06/Mar/23 ]

"Qian Yingjin <qian@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50215
Subject: LU-16429 mdt: add batch index into open file data on MDT
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c0df187470b5eedfc388f380e66074adbc5e7a13

Comment by Andreas Dilger [ 13/Jan/24 ]

I think this batched create-ahead should only be done with an explicit request from the application and/or an "lfs" command (part of "ladvise"?) that specifies the number of files to create. This is a non-POSIX interface, but could be used effectively by libraries like MPI-IO. I don't think it is safe to just guess at filenames and create them, as this could cause errors for applications, or leave stray files that appear corrupted if the client crashes after creating too many files and before cleaning them up.
