Details
Type: New Feature
Resolution: Unresolved
Priority: Minor
Description
In buffered I/O mode, data can be cached on the client side until a flush is needed.
Small I/O is not well supported in current Lustre.
To improve small I/O performance, Lustre has already implemented the short I/O feature, in which the data is transferred in the inline buffer of an I/O RPC request. However, the performance improvement is limited.
Now that batched RPC has been introduced, many dirty pages from many small files can be batched at the OSC layer into one large RPC, with the data transferred in bulk I/O mode.
The maximum amount of dirty data allowed per OSC can reach 2G. Thus, an OSC can cache a large amount of dirty data from its objects before it hits the max dirty limit or runs out of space grant, either of which forces the data to be written out.
At the OSC layer, the client can scan the dirty objects, batch the dirty pages of these objects, and send the I/O requests in a batched way.
This feature is expected to benefit workloads that write many small files and call sync() at the end of the writes (e.g. mdtest-hard-write).
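The scan-and-batch step described above could look roughly like the following. This is only a minimal sketch in plain C, not Lustre code: `struct dirty_obj`, `struct batch`, `build_batches()` and the `max_pages` limit are hypothetical names introduced for illustration, and it assumes every object on the dirty list has at least one dirty page and fits within a single batch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an OSC object that holds cached dirty pages. */
struct dirty_obj {
	unsigned int obj_id;      /* illustrative object identifier */
	unsigned int dirty_pages; /* number of dirty pages cached for it */
};

/* Hypothetical batch: a run of consecutive dirty objects sent in one RPC. */
struct batch {
	size_t first;             /* index of the first object in the batch */
	size_t count;             /* number of objects in the batch */
	unsigned int pages;       /* total dirty pages carried by the batch */
};

/*
 * Walk the dirty objects in order and pack them into batches that carry
 * at most max_pages pages each. Returns the number of batches written to
 * out; stops early if out has no room for another batch.
 */
size_t build_batches(const struct dirty_obj *objs, size_t nobjs,
		     unsigned int max_pages,
		     struct batch *out, size_t max_batches)
{
	size_t nb = 0;

	for (size_t i = 0; i < nobjs; i++) {
		unsigned int np = objs[i].dirty_pages;

		/* Assumption of this sketch: small objects only. */
		assert(np > 0 && np <= max_pages);

		/* Start a new batch when the current one would overflow. */
		if (nb == 0 || out[nb - 1].pages + np > max_pages) {
			if (nb == max_batches)
				return nb;
			out[nb].first = i;
			out[nb].count = 0;
			out[nb].pages = 0;
			nb++;
		}
		out[nb - 1].count++;
		out[nb - 1].pages += np;
	}
	return nb;
}
```

In a real implementation the trigger for this scan would be sync(), the max dirty limit, or grant exhaustion; the sketch only covers the packing itself.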
There are two design choices:
1. Use the existing short I/O mechanism to store the data in the batched RPC.
The advantages are:
- It can store very small file data (e.g. less than 1024 bytes) much more efficiently, and does not need a whole page to hold the data of a small file.
- It integrates better with the batched RPC.
The disadvantage is that the data movement is not zero-copy. On the client side, the dirty pages must be copied into the inline buffer of the RPC; on the server side, the inline data must still be copied into the prepared page buffer before doing I/O to the backend filesystem.
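To make the client-side copy in choice 1 concrete, here is a minimal sketch of packing small file data into one inline buffer. The record layout (`struct inline_hdr`), `struct inline_buf` and `inline_pack()` are assumptions made up for this illustration; they are not the existing short I/O wire format.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical per-file record header placed in front of each payload. */
struct inline_hdr {
	uint64_t obj_id;  /* illustrative object identifier */
	uint64_t offset;  /* file offset of the payload */
	uint32_t len;     /* payload length in bytes */
	uint32_t pad;     /* keeps the header size a multiple of 8 */
};

/* Hypothetical inline buffer of one batched RPC. */
struct inline_buf {
	unsigned char *data;
	size_t cap;       /* total capacity in bytes */
	size_t used;      /* bytes already packed */
};

/*
 * Append one small write to the inline buffer. The payload is copied,
 * which is exactly the non-zero-copy cost of this design.
 * Returns 0 on success, -1 if the record does not fit.
 */
int inline_pack(struct inline_buf *b, uint64_t obj_id, uint64_t offset,
		const void *src, uint32_t len)
{
	struct inline_hdr h = { obj_id, offset, len, 0 };
	size_t need = sizeof(h) + len;

	assert(b->used <= b->cap);
	if (need > b->cap - b->used)
		return -1;

	/* memcpy avoids misaligned stores into the byte buffer. */
	memcpy(b->data + b->used, &h, sizeof(h));
	memcpy(b->data + b->used + sizeof(h), src, len);
	b->used += need;
	return 0;
}
```

With this layout a 3-byte file costs a header plus 3 bytes rather than a full page, at the price of one copy on the client and another on the server.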
2. Use the RDMA mechanism: bind the dirty page IOV from multiple objects directly to the bulk I/O on the client side, and transfer the data into the prepared page IOV on the server side.
The advantage of this mechanism is that all data movement is zero copy from a client to a server.
The disadvantages are:
- The implementation may be complex. The bulk IOV contains I/O pages from multiple objects, which may change the server-side I/O logic considerably.
- The minimum I/O per object is one page, which is inefficient for small objects of only a few bytes.
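For comparison, choice 2 can be sketched as a single IOV array that references pages from several objects without copying them. Everything here is hypothetical and for illustration only: `struct iov_ent`, `struct bulk_desc`, `bulk_add_pages()`, the 4096-byte page size and the 256-entry cap are assumptions, not the Lustre bulk descriptor API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u /* assumed page size */
#define SKETCH_MAX_IOV   256u  /* illustrative cap on IOV entries */

/* Hypothetical IOV entry: one page plus the object that owns it. */
struct iov_ent {
	void *page;       /* reference to the dirty page, not a copy */
	uint32_t len;     /* always a full page in this sketch */
	uint64_t obj_id;  /* lets the server map the page to its object */
};

/* Hypothetical bulk descriptor shared by many objects. */
struct bulk_desc {
	struct iov_ent iov[SKETCH_MAX_IOV];
	size_t niov;
};

/*
 * Bind the dirty pages of one object to the shared bulk descriptor.
 * Only pointers are stored, so no data is copied on the client.
 * Returns 0 on success, -1 if the descriptor is full.
 */
int bulk_add_pages(struct bulk_desc *d, uint64_t obj_id,
		   void *const *pages, size_t npages)
{
	assert(d->niov <= SKETCH_MAX_IOV);
	if (npages > SKETCH_MAX_IOV - d->niov)
		return -1;

	for (size_t i = 0; i < npages; i++) {
		d->iov[d->niov].page = pages[i];
		d->iov[d->niov].len = SKETCH_PAGE_SIZE;
		d->iov[d->niov].obj_id = obj_id;
		d->niov++;
	}
	return 0;
}

/* Total bytes the bulk transfer would move. */
size_t bulk_bytes(const struct bulk_desc *d)
{
	size_t total = 0;

	for (size_t i = 0; i < d->niov; i++)
		total += d->iov[i].len;
	return total;
}
```

The sketch shows both sides of the trade-off: pages are bound by reference, but even an object holding a few bytes occupies a full page entry, and the server has to demultiplex entries by object.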
Any suggestions and comments are welcome!