[LU-15223] Improve partial page read/write Created: 15/Nov/21 Updated: 14/Jun/23 |
|
| Status: | In Progress |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.0, Lustre 2.12.4 |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Minor |
| Reporter: | Xinliang Liu | Assignee: | Xinliang Liu |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | arm, arm-server, ppc |
| Environment: | Arch: aarch64 (client) |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Currently, when a user issues a partial page read/write, Lustre (on both the client side and the server side) converts it into a full page read/write. This is inefficient for small reads and writes: to read or write just a few bytes of a file, Lustre actually transfers a full page. The overhead is even worse with a large PAGE_SIZE such as 64KB. Make Lustre perform a true partial page read/write whose range runs exactly from the start to the end offset given by the user, so that small reads and writes become efficient. |
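As a minimal sketch of the current behaviour (plain userspace C with hypothetical constants, not Lustre code), the rounding that turns a small write into a full-page transfer looks like this:

rounding_sketch.c

#include <stdio.h>

#define PAGE_SIZE 65536UL /* 64K pages, as on the aarch64/ppc servers in question */

int main(void)
{
        unsigned long start = 100, count = 10; /* user writes 10 bytes */

        /* Round the range out to page boundaries, as the I/O stack does today. */
        unsigned long io_start = start & ~(PAGE_SIZE - 1);
        unsigned long io_end = (start + count + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1);

        printf("user range: [%lu, %lu) = %lu bytes\n", start, start + count, count);
        printf("I/O range:  [%lu, %lu) = %lu bytes\n", io_start, io_end,
               io_end - io_start);
        return 0;
}

For a 10-byte write this performs 65536 bytes of I/O; the goal of this ticket is to keep the I/O range equal to the user range.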
| Comments |
| Comment by Xinliang Liu [ 15/Nov/21 ] |
|
Copying the partial page write comments from the earlier discussion.

xinliang added a comment:

Test file created in the home dir:

$ getconf PAGESIZE
65536
$ echo "123456789" > ~/testfile
$ stat ~/testfile
  File: /root/testfile
  Size: 10          Blocks: 8          IO Block: 65536  regular file
Device: fc02h/64514d    Inode: 12863429    Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2021-09-16 02:51:21.268641287 +0000
Modify: 2021-09-16 03:08:05.382557951 +0000
Change: 2021-09-16 03:08:05.382557951 +0000
 Birth: -
$ stat -c %b ~/testfile
8
$ stat -c %B ~/testfile
512
$ stat -c %s ~/testfile
10
$ stat -f ~/testfile
  File: "/root/testfile"
    ID: fc0200000000 Namelen: 255     Type: xfs
Block size: 4096       Fundamental block size: 4096
Blocks: Total: 52272379   Free: 45840170   Available: 45840170
Inodes: Total: 104549824  Free: 104176363

Test file created in the Lustre dir:

$ getconf PAGESIZE
65536
$ echo "123456789" > /mnt/lustre/testfile
$ stat -c %s /mnt/lustre/testfile
10
$ stat -c %B /mnt/lustre/testfile
512
$ stat -c %b /mnt/lustre/testfile
128
$ stat /mnt/lustre/testfile
  File: /mnt/lustre/testfile
  Size: 10          Blocks: 128        IO Block: 4194304  regular file
Device: 2c54f966h/743766374d    Inode: 144115205272502274  Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2021-09-16 02:53:57.000000000 +0000
Modify: 2021-09-16 03:07:28.000000000 +0000
Change: 2021-09-16 03:07:28.000000000 +0000
 Birth: -
$ stat -f /mnt/lustre/testfile
  File: "/mnt/lustre/testfile"
    ID: 2c54f96600000000 Namelen: 255     Type: lustre
Block size: 4096       Fundamental block size: 4096
Blocks: Total: 78276    Free: 77931    Available: 71141
Inodes: Total: 100000   Free: 99726

But the Lustre test file's inode block count is 128. Is this wrong?

xinliang added a comment - 29/Oct/21 5:55 PM:

I found that this issue happens on Arm 64K PAGE_SIZE OST servers. When creating a file, blocks are allocated PAGE_SIZE-aligned; see function osd_ldiskfs_map_inode_pages(). E.g., on a 64K PAGE_SIZE Arm64 OST server, creating a file smaller than 64K actually allocates 128 blocks of 512 bytes each. We need to adjust the test for 64K PAGE_SIZE OST servers.

int osc_io_commit_async(const struct lu_env *env,
                        const struct cl_io_slice *ios,
                        struct cl_page_list *qin, int from, int to,
                        cl_commit_cbt cb)
{
...
        /* Handle partial page cases */
        last_page = cl_page_list_last(qin);
        if (oio->oi_lockless) {
                page = cl_page_list_first(qin);
                if (page == last_page) {
                        cl_page_clip(env, page, from, to);
                } else {
                        if (from != 0)
                                cl_page_clip(env, page, from, PAGE_SIZE);
                        if (to != PAGE_SIZE)
                                cl_page_clip(env, last_page, 0, to);
                }
        }

        ll_pagevec_init(pvec, 0);

Currently, it seems a normal write doesn't go into this "if (oio->oi_lockless) {" part of the code. Does anyone know why it is oi_lockless? @Andreas Dilger

I am not 100% sure I understand your question - are you saying it is oi_lockless? It should not be. This (commit_async) code is buffered, and lockless buffered is broken and also off by default. I have a patch to remove it, but it's normally off anyway. What are you looking for/hoping for here? Note we clip pages in other places too. Can you talk more about what you're thinking?

I am not quite sure what the implications of changing block allocation on the server would be for the client. Why does changing server block allocation filter back to the client like this?

More generally, about partial page I/O: in general, RDMA can be unaligned at the start and unaligned at the end, but that's it. This applies even when combining multiple RDMA regions - it's a limitation of the hardware/drivers. So we can do a truly unaligned I/O (with a partial page at the beginning and end), but then we can't combine it with other I/Os.

There is also a page cache limitation here. The Linux page cache insists on working with full pages - it will only allow partial pages at file_size. So, e.g., a 3K file is a single page with 3K in it, and we can write just 3K. But if we want to write 3K into a large 'hole' in a file, Linux will enforce writing PAGE_SIZE. This is not a restriction we can easily remove; it is an important part of the page cache.

I'm aware of the RDMA limitations, but I'm wondering if those can be bypassed (if necessary) by transferring a whole page over the network, but storing it into a temporary page and copying the data for a cached/unaligned read-modify-write on the server to properly align the data. The content at the start/end of the page sent from the client would be irrelevant, since it will be trimmed by the server anyway when the copy is done. While the copy might be expensive for very large writes, my expectation is that this would be most useful for small writes. That does raise the question of whether the data could be transferred in the RPC as a short write, but for GPU Direct we require RDMA to send the data directly from the GPU RAM to the OSS. Maybe it is just a matter of generalizing the short write handling to allow copying from the middle of an RDMA page?

For the ldiskfs backend filesystem, I see that if the user issues a partial page cached write, Lustre (including the client side and the server side) converts it into a full page write. I want to make Lustre do a real partial page write whose length is less than PAGE_SIZE, no matter whether the start offset is zero or non-zero, so that Lustre can handle the sanity test_317 partial page write below for a large PAGE_SIZE, e.g. 64KB, and pass the test. That's the problem I want to solve.
sanity.sh
test_317() {
...
	#
	# sparse file test
	# Create file with a hole and write actual two blocks. Block count
	# must be 16.
	#
	dd if=/dev/zero of=$DIR/$tfile bs=$grant_blk_size count=2 seek=5 \
		conv=fsync || error "Create file : $DIR/$tfile"
...
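A worked sketch of the block accounting behind the test's expectation (the grant_blk_size value of 4096 here is an assumption inferred from the expected count of 16; it is not taken from the test setup):

block_count_sketch.c

#include <stdio.h>

int main(void)
{
        const unsigned long st_blk = 512;          /* stat(2) reports 512-byte blocks */
        const unsigned long grant_blk_size = 4096; /* assumed; matches "must be 16" */

        /* dd writes two grant-size blocks: 2 * 4096 / 512 = 16 */
        printf("expected blocks:   %lu\n", 2 * grant_blk_size / st_blk);

        /* but osd_ldiskfs_map_inode_pages() allocates PAGE_SIZE-aligned, so a
         * 64K-page OST allocates at least one full page for the written
         * extent: 65536 / 512 = 128, which is what stat -c %b reported above */
        printf("64K-page OST sees: %lu\n", 65536UL / st_blk);
        return 0;
}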
I am trying to understand all the details and limitations, including some mentioned by you, e.g. RDMA partial page write, GPU Direct write, etc. I have a draft patch now which makes the client side send a niobuf containing the non-zero file start offset and the real file end offset to the server. This requires clipping the page on the client side. On the server side, only the necessary range is written (i.e. from the real non-zero file start offset to the file end offset). I will send the patch for review soon. Let's see if we can work out a solution. Thanks.

But say you write this clipped partial page - what happens when you read it on the client which wrote it? What is in the rest of the page? And, going on from there: basically what I am saying is that unless you get very clever, this will break the page cache. You would also need to mark this page as non-mergeable to avoid the RDMA issue, but that's easy to do. The real sticking point is the page cache.

This would mean the page was effectively uncached, which is a bit weird, but could work - I think the benefit is pretty limited since you can't easily combine these partial pages into larger writes (the RDMA issue again). But anyway, not setting written pages up to date turned out to be really complicated, and I decided it was unworkable. The write code assumes pages are up to date as part of writing them, and while I was able to work around a few things, I decided it felt like I was very much going against the intent of the code.

The benefit will be pretty limited if we can't also solve the RDMA issue. The benefit would only apply for sub-page writes, and each one would have to be sent to disk by itself. One way to solve the RDMA problem would be to send full pages over the network, but attach extra data in the RPC telling the server the actual range for each page. This would be very complicated, I think, and involve new ways of handling writes on the client and server. And this assumes we can solve the page cache issue! |
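To make the draft patch's idea concrete, here is a hedged sketch. The struct below is illustrative only: the field names are borrowed from Lustre's struct niobuf_remote, but the layout and the fill helper are assumptions, not the actual patch:

niobuf_sketch.c

#include <stdint.h>
#include <stdio.h>

/* Illustrative stand-in for Lustre's on-the-wire struct niobuf_remote. */
struct niobuf_sketch {
        uint64_t rnb_offset; /* file offset where the transfer starts */
        uint32_t rnb_len;    /* number of bytes to transfer */
        uint32_t rnb_flags;
};

int main(void)
{
        struct niobuf_sketch rnb;

        /* Today, a 10-byte write at offset 100 on a 64K-page client is
         * described page-rounded as [0, 65536). The draft patch instead
         * describes only the bytes the user actually wrote, so the server
         * writes just [100, 110) after the client has clipped the page. */
        rnb.rnb_offset = 100;
        rnb.rnb_len = 10;
        rnb.rnb_flags = 0;

        printf("niobuf: offset=%llu len=%u\n",
               (unsigned long long)rnb.rnb_offset, rnb.rnb_len);
        return 0;
}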
| Comment by Xinliang Liu [ 15/Nov/21 ] |
|
Hi paf0186, I do not understand all the details yet. As adilger said, partial write is complicated. But I will try my best to answer your questions. Jira is not convenient for threaded discussion, so I will paste your words and answer them.
------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Yeah, I think that is the thing I want to accomplish: a partial write at the start and at the end of each write, maybe because the page cache only records the start (offset) and the end (offset plus count).

------------------------------------------------------------------------------------------------------------------------------------------------------------------------

"But say you write this clipped partial page - what happens when you read it on the client which wrote it? What is in the rest of the page? And, going on from there: basically what I am saying is that unless you get very clever, this will break the page cache. You would also need to mark this page as non-mergeable to avoid the RDMA issue, but that's easy to do. The real sticking point is the page cache."

Some of these questions you can maybe get answers to from the patch. I will answer some of them.
------------------------------------------------------------------------------------------------------------------------------------------------------------------------

"This would mean the page was effectively uncached, which is a bit weird, but could work - I think the benefit is pretty limited since you can't easily combine these partial pages into larger writes. (RDMA issue again) But anyway, not setting written pages up to date turned out to be really complicated, and I decided it was unworkable. The write code assumes pages are up to date as part of writing them, and while I was able to work around a few things, I decided it felt like I was very much going against the intent of the code."

If a partial page cannot be combined, why not just leave it as a separate extent to write? This is the small read/write case. As far as I know, ll_do_tiny_write() will handle a partial write, updating only the range the user wants; a userspace analogy of the two strategies follows below.
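As a userspace analogy of the contrast being discussed (plain POSIX C, not Lustre code; the file name and offsets are made up):

partial_write_sketch.c

#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define PG 65536 /* 64K, as on the aarch64 client in question */

int main(void)
{
        int fd = open("testfile", O_RDWR | O_CREAT, 0644);
        char *page = malloc(PG);
        const char *data = "123456789";

        if (fd < 0 || page == NULL)
                return 1;

        /* Full-page path (today): fetch the whole page, patch 9 bytes,
         * write the whole page back - 128K of I/O for a 9-byte update. */
        pread(fd, page, PG, 0);
        memcpy(page + 100, data, 9);
        pwrite(fd, page, PG, 0);

        /* True partial write (this ticket): only the user's bytes move. */
        pwrite(fd, data, 9, 100);

        free(page);
        close(fd);
        return 0;
}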
" The benefit will be pretty limited if we can't also solve the RDMA issue. The benefit would only apply for < page writes, and each one would have to be sent to disk by itself. One way to solve the RDMA problem would be to send full pages over the network, but attach extra data in the RPC telling the server the actual range for each page. This would be very complicated, I think, and involve new ways of handling writes on the client and server. And this assumes we can solve the page cache issue! .
|
| Comment by Gerrit Updater [ 15/Nov/21 ] |
|
"xinliang <xinliang.liu@linaro.org>" uploaded a new patch: https://review.whamcloud.com/45569 |