[LU-16897] Optimize sparse file reads Created: 14/Jun/23 Updated: 12/Dec/23 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Minor |
| Reporter: | Andreas Dilger | Assignee: | Cyril Bordage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||
| Severity: | 3 | ||||
| Rank (Obsolete): | 9223372036854775807 | ||||
| Description |
|
Currently, when a client reads a sparse file, it does not know whether there is data at any particular file offset, so it pre-allocates pages and sets up the RDMA for the full range of the read. If the file is sparse (has a hole in the allocation) or has unwritten extents that return zeroes on read, the OST will zero-fill all of the requested pages and transfer the zeroes over the network. It would be desirable to avoid sending the pages of zeroes over the network, to reduce network bandwidth and the CPU overhead on the OSS of zeroing out the pages.

IIRC, the BRW WRITE reply returns an array of rc values to indicate success/failure for each page. I would expect BRW READ to return a special state for each page that indicates it is a hole. However, this is also complicated by the RDMA configuration, since it has already mapped the pages from the read buffer (which may be specific page cache pages).

The best solution would be for the LNet bulk transfer to "not send" those pages in the middle of the RDMA, and have LNet (or the RDMA engine) zero-fill the pages on the client without actually sending them over the wire, but I have no idea how easy or hard that is to implement. Failing that, if the holes are packed out of the server-side bulk transfer setup (and it is possible to send only the first pages in a shorter bulk transfer), then the client would need to memcpy() the data into the correct pages (from last page to first) and zero the hole pages itself. That would add CPU/memory overhead on the client, and would not work for RDMA offload like GDS. |
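To make the per-page idea concrete, here is a minimal client-side sketch, assuming a hypothetical per-page state array in the READ reply; the names brw_page_state, brw_read_reply, and client_fill_holes are illustrative only, not the actual Lustre wire format or API.

```c
/*
 * Hypothetical sketch only: if the BRW READ reply carried one state value
 * per page, the client could zero-fill hole pages locally instead of
 * receiving zeroes over the wire. None of these names exist in Lustre.
 */
#include <stddef.h>
#include <string.h>

enum brw_page_state {
	BRW_PAGE_DATA = 0,	/* page contents were transferred */
	BRW_PAGE_HOLE = 1,	/* page is a hole; nothing was sent */
};

struct brw_read_reply {
	int			 brr_npages;
	enum brw_page_state	*brr_state;	/* one entry per page */
};

/* zero-fill every page the server flagged as a hole */
static void client_fill_holes(const struct brw_read_reply *reply,
			      void **pages, size_t page_size)
{
	for (int i = 0; i < reply->brr_npages; i++)
		if (reply->brr_state[i] == BRW_PAGE_HOLE)
			memset(pages[i], 0, page_size);
}
```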
| Comments |
| Comment by Andreas Dilger [ 14/Jun/23 ] |
|
Serguei, Chris, I'm wondering if there is any mechanism like I describe above to avoid bulk transfer of zeroes over the wire in LNet/OFED/IB/OFI? It seems like something that might already exist (e.g. like WRITE_SAME in the SCSI layer, which can be used to zero-fill disk blocks without sending all of the zeroes over the bus). It might be useful to see how NVMe-oF implements ioctl(BLKZEROOUT) over the wire, to see whether there is protocol support for this.

Implementing this in the LNet bulk transfer layer (with suitable input from Lustre about which pages are zeroes) would be far less complex than doing it at the Lustre protocol level. That would also work with GPU Direct, since the pages could be zero-filled in the GPU without having to shift the data around on the client.

The other solution that comes to mind would be for the client to "pre-map" the list of valid pages before setting up the bulk transfer and do a sparse read, but that would likely add a lot of overhead (one extra RPC round trip for every read, plus waiting in the RPC queue under congestion) and would be racy unless the whole RPC range is locked at the time. |
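For reference only, this is the kind of data/hole extent information such a pre-mapping would need; in user space it can be enumerated with lseek(SEEK_DATA)/lseek(SEEK_HOLE). This is just an illustration of sparse-extent layout, not a proposed client mechanism.

```c
/* Illustration: list the data extents of a sparse file using
 * lseek(SEEK_DATA)/lseek(SEEK_HOLE). Not part of any proposed change. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	if (argc < 2)
		return 1;

	int fd = open(argv[1], O_RDONLY);
	if (fd < 0)
		return 1;

	off_t end = lseek(fd, 0, SEEK_END);
	off_t pos = 0;

	while (pos < end) {
		off_t data = lseek(fd, pos, SEEK_DATA);

		if (data < 0)	/* only a trailing hole remains */
			break;
		off_t hole = lseek(fd, data, SEEK_HOLE);
		printf("data: %lld..%lld (hole follows)\n",
		       (long long)data, (long long)hole);
		pos = hole;
	}
	close(fd);
	return 0;
}
```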
| Comment by Andreas Dilger [ 19/Jun/23 ] |
|
Cyril said he would look into what is possible in this area. |
| Comment by Patrick Farrell [ 19/Jun/23 ] |
|
So, a note from the person who thinks about the page cache all the time (it seems Andreas knows this already based on his comments, but I wanted to write it out explicitly): it's a requirement that we end up with real, zero-filled pages in the page cache. The Linux page cache doesn't know about holes - it expects the layers below it to hand back zero-filled pages when it reads a hole in a file, and it will just keep those zero-filled pages around. This isn't something we can change.

So if we can zero-fill the pages efficiently somehow in the LNet/RDMA layers without sending zeroes over the wire, great; otherwise we'll have to capture the info in the Lustre protocol and have the client (I think at the BRW level) zero-fill the pages locally. Either way we have to zero-fill those pages. (This is also complicated because there's no obvious way to express hole boundaries in the protocol, since the most important holes for us are probably the holes created by compression, and those holes are only part of a niobuf.) |
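A tiny user-space illustration of this behaviour (the demo file name is arbitrary): reading inside a hole returns zero-filled data, and those zero pages end up cached like any other page.

```c
/* Demo: the kernel hands back zero-filled pages when a hole is read. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	char buf[4096], zero[4096] = { 0 };
	int fd = open("sparse_demo", O_RDWR | O_CREAT | O_TRUNC, 0644);

	if (fd < 0)
		return 1;
	ftruncate(fd, 1 << 20);			 /* 1 MiB file, all hole */
	pread(fd, buf, sizeof(buf), 128 * 1024); /* read from inside the hole */
	/* every byte reads back as zero even though nothing was written */
	printf("all zero: %s\n",
	       memcmp(buf, zero, sizeof(buf)) == 0 ? "yes" : "no");
	close(fd);
	unlink("sparse_demo");
	return 0;
}
```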
| Comment by Cyril Bordage [ 23/Jun/23 ] |
|
Just a little update. adilger, as far as I know, there is no such mechanism to avoid bulk transfer of zeroes over the wire in LNet/OFED/IB/OFI. I studied BLKZEROOUT in NVMe-oF and haven't found anything interesting, although some details are still missing from my analysis; I will keep working on that. At the same time, I am studying how Lustre uses LNet, to be sure of what we would have to modify in LNet and how. Do you have good pointers for that? |
| Comment by Andreas Dilger [ 24/Jun/23 ] |
|
The interesting interaction on the client in this case is ptlrpc_register_bulk(), where the pages are registered with LNet for RDMA transfer. There is a corresponding function on the server, whose name escapes me at this point. I'll take a look at the code to find it. |
| Comment by Cyril Bordage [ 26/Jul/23 ] |
|
To limit the modifications to Lustre, I chose not to modify ptlrpc_register_bulk, but instead to make changes closer to the LNet level. I wrote a proof of concept where I modified lnet_md_build to take sparse pages into account when building the lnet_libmd. With a change at this location, I expect fewer modifications will be needed and it will be more transparent. For now, I am trying to fix some issues related to the size handling. My next step will be to retrieve the sparseness information. |
| Comment by Colin Faber [ 25/Sep/23 ] |
|
Hi cbordage, how is this progressing? Do you have a patch ready to submit? |
| Comment by Gerrit Updater [ 03/Nov/23 ] |
|
"Cyril Bordage <cbordage@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52970 |
| Comment by Cyril Bordage [ 03/Nov/23 ] |
|
Hello, I wanted to give some updates before my two-week leave. I do not know yet how to retrieve the sparseness information; in my tests, I simulated it by detecting pages filled with zeroes (in ptlrpc_fill_bulk_md). Of course, that won't be used in the final code (a sketch of that stand-in follows this comment). Here is my current strategy for TCP:
- On the OSS side
- On the client side
Previously, in the code I pushed (even though some parts no longer work), I was using a "sparse" page in the kiov to store the sparseness information; that won't be the case in the new implementation. |
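A minimal sketch of the zero-page stand-in mentioned above, assuming the kernel helpers kmap_local_page() and memchr_inv(); the helper name and its call site (e.g. inside ptlrpc_fill_bulk_md) are assumptions for illustration.

```c
/* Sketch of the temporary stand-in: treat an all-zero page as "sparse"
 * while the real sparseness source is still being worked out. */
#include <linux/highmem.h>
#include <linux/mm.h>
#include <linux/string.h>

static bool page_is_all_zeroes(struct page *page)
{
	void *kaddr = kmap_local_page(page);
	/* memchr_inv() returns NULL when every byte equals 0 */
	bool zero = memchr_inv(kaddr, 0, PAGE_SIZE) == NULL;

	kunmap_local(kaddr);
	return zero;
}
```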
| Comment by Patrick Farrell [ 03/Nov/23 ] |
|
Cyril, This sounds good to me, and a header seems better than a page in the kiov. What exactly is the issue with the source of the sparseness information? That is, when and where in the code do you need it? (Maybe it's something someone could help with, which is why I ask.) |
| Comment by Andreas Dilger [ 03/Nov/23 ] |
|
Are there 256 bits in the header to indicate sparseness? That is 32 bytes per 1MB RPC, which is not too bad. The size of this bitmap should be flexible/stored in the header itself, in case the RDMA size changes? Mind you, even for a large PAGE_SIZE the number of pages will be the same... As for the sparseness information, that is available in osd-ldiskfs (and possibly in OFD) via FIEMAP. It could be pushed down to LNet as part of the bulk descriptor setup, rather than having LNet "detect" it itself. |
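For illustration, a variable-size sparseness bitmap along those lines could look like the sketch below; the struct and field names are assumptions, not an on-wire format. With 4 KiB pages, a 1 MiB transfer covers 256 pages, so the bitmap is 256 bits = 32 bytes.

```c
/* Sketch of a self-describing sparseness bitmap header; the size is stored
 * in the header itself so it can follow the RDMA/RPC size. Names invented. */
#include <linux/types.h>

struct sparse_hdr {
	__u32 sh_npages;	/* number of pages this bitmap covers */
	__u8  sh_bitmap[];	/* sh_npages bits, rounded up to whole bytes */
};

static inline void sparse_set_hole(struct sparse_hdr *h, unsigned int page)
{
	h->sh_bitmap[page / 8] |= 1u << (page % 8);
}

static inline bool sparse_is_hole(const struct sparse_hdr *h, unsigned int page)
{
	return (h->sh_bitmap[page / 8] >> (page % 8)) & 1;
}

/* example: 1 MiB bulk / 4 KiB pages = 256 pages -> 256 bits -> 32 bytes */
```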
| Comment by Patrick Farrell [ 30/Nov/23 ] |
|
Cyril, Any update here? Just curious. While working on something related, I learned how to get the needed sparseness info, and I think I know how we could present it to LNet (we could know whether a page is sparse and communicate that when "add_kiov_frag" is called on the server to populate the RDMA). You mentioned TCP; is there any part of this that wouldn't work on o2ib/InfiniBand? That's our most important target. In fact, I could even give you a code snippet that provides this information when add_kiov_frag is called, if that would be helpful - it's quite easy to detect when you're reading from a hole, at least for ldiskfs. (ZFS is different; I haven't sorted that out yet.) |
| Comment by Cyril Bordage [ 30/Nov/23 ] |
|
Hello Patrick, I was on holiday and only had time to work a little on how to get the sparseness information, so your message comes at the right moment. For RDMA, the strategy is not the same: on the server, the part that changes is kiblnd_init_rdma, where we skip some destination pages to match the server layout. On its side, the client will receive the list of holes so it can zero-fill the skipped pages. Thank you. |
| Comment by Patrick Farrell [ 30/Nov/23 ] |
|
OK, I can send over a small patch - it can mark the pages as 'holes' when they're read from disk (on ldiskfs; not sure about ZFS, but that can come later) and then 'un-hole' them when we write to them. I think that should suffice. It would be good, if possible, to use the same approach for all LNDs - I'm not sure that's possible, though; it sounds like it's not. I guess the issue is that the client makes a read request, and the server does not have the sparseness information until it is setting up the transfer. (BTW, if it makes things easier, I suggest ignoring sparseness for 'short_io' transfers, at least in an initial version; it is a less important case.) I'll post that small patch a little later. |
| Comment by Gerrit Updater [ 30/Nov/23 ] |
|
"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53297 |
| Comment by Patrick Farrell [ 30/Nov/23 ] |
|
Cyril, I just pushed that code here. The commit message includes some additional information that will hopefully be useful. Here's a quick example of this working:

# This example uses a slightly larger file so we don't do short_io.
# 4K at start of file
dd if=/dev/zero bs=4K count=1 of=./newfile conv=notrunc
sync
# 4K at offset of 128K - data from 0-4K, hole from 4K to 128K, data from 128K to 132K
dd if=/dev/zero bs=4K seek=32 count=1 of=./newfile conv=notrunc
sync
# clear cache to force read
echo 3 > /proc/sys/vm/drop_caches
lctl set_param *debug=-1 debug_mb=10000
lctl clear
# read all of the file at once
dd if=./newfile bs=1M count=1 of=/tmp/newfile_copy
lctl dk > /tmp/out
lctl set_param *debug=0 debug_mb=0
grep "hole " /tmp/out

Here's what the output of that grep looks like:

00080000:00000002:2.0:1701363786.690057:0:5679:0:(osd_io.c:450:osd_do_bio()) hole at page_idx 1, block_idx 1, at offset 4096
00080000:00000002:2.0:1701363786.690058:0:5679:0:(osd_io.c:450:osd_do_bio()) hole at page_idx 2, block_idx 2, at offset 8192
[... matching "hole at page_idx N" lines continue for page_idx 3 through 30, offsets 12288 through 122880 ...]
00080000:00000002:2.0:1701363786.690082:0:5679:0:(osd_io.c:450:osd_do_bio()) hole at page_idx 31, block_idx 31, at offset 126976
00000020:00000002:2.0:1701363786.690292:0:5679:0:(tgt_handler.c:2438:tgt_brw_read()) lnb 0, at offset 0, hole 0
00000020:00000002:2.0:1701363786.690292:0:5679:0:(tgt_handler.c:2438:tgt_brw_read()) lnb 1, at offset 4096, hole 1
[... matching "hole 1" lines continue for lnb 2 through 30, offsets 8192 through 122880 ...]
00000020:00000002:2.0:1701363786.690302:0:5679:0:(tgt_handler.c:2438:tgt_brw_read()) lnb 31, at offset 126976, hole 1
00000020:00000002:2.0:1701363786.690302:0:5679:0:(tgt_handler.c:2438:tgt_brw_read()) lnb 32, at offset 131072, hole 0
|
| Comment by Cyril Bordage [ 04/Dec/23 ] |
|
Hello Patrick, thank you for the code snippet. I had tested reading the sparseness information in osd_do_bio as you did, but, as I said in a previous comment, I had a problem on a second read (because of some cache mechanism). However, I hadn't tried checking it from tgt_brw_read, as you do in your code. I was glad to see the value in tgt_brw_read was right, so I no longer have the cache problem. |
| Comment by Andreas Dilger [ 04/Dec/23 ] |
|
Patrick, it looks like your patch will note holes in the file on the initial read from the backing OSD inode, but there is no way (yet) to save this across RPCs. So if two clients read the same file, only the first one would exclude the holes from the bulk RPC transfer. It would be very useful in this case to mark the pages in the cache with a "hole" flag, so that they can be skipped for all clients that understand this sparse RPC mechanism. |
| Comment by Patrick Farrell [ 04/Dec/23 ] |
|
Cyril, Andreas actually just explained the cache issue I was referring to. I'll give my own version just to be extra clear. But before I explain the cache thing, I want to note that this shouldn't affect you except at the very edge. The key point for LNet/LND is that the 'hole' information will be available when you call add_kiov_frag. Exactly how we get that information is a separate problem, and the current patch is good enough to do the LNet development against. (Especially because it will be easy to swap in a better way to handle the hole info, since we'll present it to you at the same place.)

OK, the issue is: I'm marking the 'hole' information in the lnb. lnbs are associated with pages, but lnbs only exist for one IO. When we create an lnb, we get a page for it, and that page might already be in cache, so we don't read it from disk. But we only mark the lnb as being a hole when we read from disk, and this information is lost when the lnb is destroyed after that specific read is complete. So if we're using the cache, the next read of that page will not do a read from disk, and it misses that the page is a hole.

The tricky thing here, Andreas, is that we don't use page->private on the server, and I don't think we could safely use it with ZFS pages (using it for ldiskfs would be possible but would take some work). So there's nowhere for us to store Lustre-specific info attached to the page. We might be able to steal some Linux page flag... We could probably use PagePrivate2, like we did for encryption on the client? That seems likely to be safe. |
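A minimal sketch of that last idea, assuming the kernel's PG_private_2 flag (PagePrivate2/SetPagePrivate2/ClearPagePrivate2) could be repurposed as a server-side "hole" marker; the helper names and their osd-ldiskfs call sites are assumptions.

```c
/* Sketch: remember "this cached page is a hole" across reads by stealing
 * the PG_private_2 page flag, as suggested above. Helper names invented. */
#include <linux/mm.h>
#include <linux/page-flags.h>

/* a read from disk found no mapped block backing this page */
static inline void osd_mark_page_hole(struct page *page)
{
	SetPagePrivate2(page);
}

/* the page has been (re)written, so it is no longer a hole */
static inline void osd_clear_page_hole(struct page *page)
{
	ClearPagePrivate2(page);
}

/* later reads served from cache can still skip this page in the bulk */
static inline bool osd_page_is_hole(struct page *page)
{
	return PagePrivate2(page);
}
```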
| Comment by Cyril Bordage [ 05/Dec/23 ] |
|
I was not sure about the cache issue because of the tests I ran. I tested exactly what you described (two clients reading the same file) but saw no issue with that: the holes were detected for both of them. Do I have to enable some option to make that fail? |
| Comment by Patrick Farrell [ 05/Dec/23 ] |
|
Cyril, This is likely related to whether or not your test system is running on non-rotational media, which controls whether the server-side page cache is enabled by default. But I suggest ignoring this for now while you work on the sparse read support - presenting the hole information correctly is something I think I, Artem, or someone else from the file system side can take care of relatively easily. |
| Comment by Cyril Bordage [ 07/Dec/23 ] |
|
You can find the wiki page here. |
| Comment by Patrick Farrell [ 07/Dec/23 ] |
|
Cyril, Can you adjust the permissions on that? I can't access it even when logged in. You might just want to move it to a different part of the wiki - we've got a (rarely used ...). Here's the error I got - note the "restrictions from a parent page" part: "This is because it's inheriting restrictions from a parent page. A space admin or the person who shared this page may be able to give you access." |