[LU-16355] batch dirty buffered write of small files Created: 30/Nov/22 Updated: 31/Aug/23 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | New Feature | Priority: | Minor |
| Reporter: | Qian Yingjin | Assignee: | Qian Yingjin |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
| Description |
|
In buffered I/O mode, writes can be cached on the client side until a flush is needed. Once batched RPC support is introduced, the client can batch the dirty pages of many small files at the OSC layer into a single large RPC and transfer the data in bulk I/O mode. This feature is expected to benefit workloads that write many small files and then call sync() at the end of the writes (e.g. mdtest-hard-write). There are two design choices here; any suggestions and comments are welcome! |
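To make the batching idea concrete, here is a minimal, self-contained C sketch of collecting the dirty extents of many small files into one bulk write RPC. All names here (sub_io, batched_write_rpc, batch_add, SUB_IO_MAX) are hypothetical illustrations under the assumptions above, not the actual OSC data structures from the patch.

```c
/*
 * Illustrative sketch only: the structures and names below are
 * hypothetical and simplified, not the actual Lustre OSC code.
 * The idea: instead of one OST_WRITE RPC per small file, the
 * client collects the dirty extents of many files into one bulk
 * descriptor and sends them together.
 */
#include <stdint.h>

#define SUB_IO_MAX	256	/* hypothetical per-RPC object limit */

struct sub_io {			/* one small file's dirty extent */
	uint64_t  si_fid;	/* file identifier (simplified) */
	uint64_t  si_offset;	/* start offset of the dirty range */
	uint32_t  si_count;	/* bytes of dirty data */
	void	 *si_pages;	/* dirty page(s) to transfer */
};

struct batched_write_rpc {	/* one bulk RPC carrying many files */
	uint32_t       bw_nr_subs;	/* sub-I/Os packed so far */
	struct sub_io  bw_subs[SUB_IO_MAX];
};

/*
 * Queue one file's dirty extent into the current batch.  Returns 0
 * on success, or -1 if the batch is full and must be flushed (sent
 * as a single bulk RPC) before retrying.
 */
static int batch_add(struct batched_write_rpc *rpc, const struct sub_io *io)
{
	if (rpc->bw_nr_subs >= SUB_IO_MAX)
		return -1;
	rpc->bw_subs[rpc->bw_nr_subs++] = *io;
	return 0;
}
```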
| Comments |
| Comment by Andreas Dilger [ 30/Nov/22 ] |
If we agree that zero-copy I/O is not needed/possible for very small writes (smaller than 4KB), then it would be possible to pack the dirty data from multiple files into a single RDMA transfer, and then copy the data out of those pages into the server-side inode pages again. Even with the extra memcpy() it would likely still be faster than sending separate RPCs for each file. This would also fit very well with WBC, since it could create DoM layouts directly for small files and skip the DoM component for larger files that will store data on OSTs.

It isn't very clear whether packing the pages would help with mdtest-hard-write, since those files are 3901 bytes, only 195 bytes smaller than a 4096-byte page (4%). However, packing multiple objects into a single RPC should hopefully improve performance.

If it is helpful: there was at one time support in the OBD_BRW_WRITE RPC for handling multiple objects, since there could be an array of struct obd_ioobj in the request. I think much of this support was removed, because there could only be a single struct obdo with file attributes (timestamps, UID, GID, PRJID, etc.) per RPC, so it didn't make sense to have writes to multiple objects. However, if the writes are (commonly) all from the same UID/GID/PRJID, then this might be possible.

Alternatively, having the batching at the RPC level is likely still helpful, and this would also allow mballoc to do a better job of aggregating the blocks of many small-file writes together in the filesystem (e.g. 256x4KB writes into a single 1MB group allocation on disk).

For very small files (< 512 bytes) it would be possible to use the "inline data" feature of ldiskfs (LU-5603) to store the data directly in the inode. This could be used with DoM files to store them most efficiently, but may need some added testing/fixing in ldiskfs to work correctly with other ldiskfs features. |
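For reference, a rough sketch of how such a multi-object bulk write request body could be laid out, loosely modeled on the obd_ioobj/niobuf_remote arrangement described above. These structures are simplified illustrations, not Lustre's real wire format; the shared attribute block shows why a single struct obdo restricts the batch to objects with a common UID/GID/PRJID.

```c
/*
 * Hypothetical sketch of a multi-object bulk write request body.
 * The field names are loosely modeled on the obd_ioobj/niobuf
 * layout described in the comment above, but are illustrative,
 * not the actual Lustre wire structures.
 */
#include <stdint.h>

struct ioobj {			/* per-object descriptor (cf. obd_ioobj) */
	uint64_t  oid;		/* object id on the OST */
	uint32_t  bufcnt;	/* how many niobufs belong to this object */
};

struct niobuf {			/* per-extent descriptor (cf. niobuf_remote) */
	uint64_t  offset;	/* byte offset within the object */
	uint32_t  len;		/* extent length in bytes */
};

struct brw_request {
	/*
	 * One shared attribute block (the "single struct obdo"
	 * limitation): these must be valid for every object in the
	 * batch, which only works if they are common to all writes.
	 */
	uint64_t	uid, gid, projid;
	uint32_t	nr_objects;
	struct ioobj   *objs;		/* nr_objects entries */
	struct niobuf  *niobufs;	/* sum of objs[i].bufcnt entries,
					 * grouped by object, in the same
					 * order as the pages in the bulk */
};
```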
| Comment by Gerrit Updater [ 08/Dec/22 ] |
|
"Qian Yingjin <qian@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49342 |
| Comment by Andreas Dilger [ 07/Jul/23 ] |
|
Have you looked at implementing batched read support, if the reads can be generated asynchronously (e.g. via AIO, io_uring, or statahead for mdtest-easy/hard-read)? |
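For context on the userspace side of "reads generated asynchronously", a minimal liburing example that submits several reads in one batch, so many small reads are in flight at once and could in principle be batched by the client. This is ordinary application code, not Lustre internals; the file names and sizes are placeholders.

```c
/*
 * Submit several reads asynchronously with io_uring (via liburing)
 * so the filesystem sees them concurrently.  File names and sizes
 * are placeholders for illustration.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <liburing.h>

#define NR_FILES	4
#define READ_SIZE	3901	/* mdtest-hard file size, for example */

int main(void)
{
	struct io_uring ring;
	char name[32];
	int i;

	if (io_uring_queue_init(NR_FILES, &ring, 0) < 0)
		return 1;

	/* queue one read per file without waiting in between */
	for (i = 0; i < NR_FILES; i++) {
		struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
		char *buf = malloc(READ_SIZE);
		int fd;

		snprintf(name, sizeof(name), "file.%d", i);
		fd = open(name, O_RDONLY);
		io_uring_prep_read(sqe, fd, buf, READ_SIZE, 0);
		io_uring_sqe_set_data(sqe, (void *)(long)i);
	}
	io_uring_submit(&ring);	/* all reads are now in flight at once */

	/* reap completions; errors show up in cqe->res per read */
	for (i = 0; i < NR_FILES; i++) {
		struct io_uring_cqe *cqe;

		if (io_uring_wait_cqe(&ring, &cqe) < 0)
			break;
		io_uring_cqe_seen(&ring, cqe);
	}
	io_uring_queue_exit(&ring);
	return 0;
}
```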
| Comment by Qian Yingjin [ 18/Jul/23 ] |
|
I have some thoughts about ahead operations (batched open + read-ahead for DoM files); they are not implemented yet, but we already have a framework for ahead operations. |
| Comment by Gerrit Updater [ 28/Aug/23 ] |
|
"Qian Yingjin <qian@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52129 |
| Comment by Gerrit Updater [ 31/Aug/23 ] |
|
"Qian Yingjin <qian@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52200 |