Details

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Minor

    Description

      Currently, when a client is reading a sparse file, it doesn't know whether there is data at any particular file offset, so it pre-allocates pages and sets up the RDMA for the full range of the read. If the file is sparse (has a hole in the allocation) or has unwritten extents that return zeroes on read, the OST will zero-fill all of the requested pages and transfer the zeroes over the network.

      It would be desirable to avoid sending the pages of zeroes over the network, to reduce network bandwidth and some CPU overhead on the OSS to zero out the pages. IIRC, the BRW WRITE reply returns an array of rc values, one per page, to indicate success/failure for each one. I would expect BRW READ to return a special state for each page that indicates that it is a hole.
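      A minimal sketch of how a client might consume such a per-page status array (all identifiers here are invented for illustration; this is not the actual BRW wire protocol):

          #include <string.h>

          #define SPARSE_PAGE_SIZE  4096
          #define PAGE_RC_HOLE      (-2)  /* hypothetical "whole page is a hole" marker */

          /* Zero-fill any page the (hypothetical) per-page reply marks as a hole. */
          static void brw_read_fill_holes(char **pages, const int *page_rc, int npages)
          {
                  int i;

                  for (i = 0; i < npages; i++) {
                          if (page_rc[i] == PAGE_RC_HOLE)
                                  memset(pages[i], 0, SPARSE_PAGE_SIZE);
                          /* otherwise page_rc[i] is the per-page success/failure code */
                  }
          }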

      However, this is also complicated by the RDMA configuration, since the client has already mapped the pages from the read buffer (which may be specific page cache pages). The best solution would be for the LNet bulk transfer to "not send" those pages in the middle of the RDMA, and have LNet (or the RDMA engine) zero-fill the pages on the client without actually sending them over the wire, but I have no idea how easy or hard that is to implement.

      Failing that, if the holes are packed out in the server-side bulk transfer setup (and it is possible to send only the leading pages in a shorter bulk transfer), then the client would need to memcpy() the data into the correct pages (working from the last page to the first) and zero the hole pages itself, as sketched below. That would add CPU/memory overhead on the client, and would not work for RDMA offload like GDS.
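      A rough sketch of that client-side unpack, under the assumption that the server packs the non-hole pages at the front of the pre-allocated page array and tells the client which logical pages are holes (none of these names are real Lustre symbols). Walking from the last logical page to the first means the destination index is never smaller than the source index, so packed data is never overwritten before it has been moved:

          #include <stdbool.h>
          #include <string.h>

          #define SPARSE_PAGE_SIZE  4096

          /*
           * pages[]   - npages pre-allocated client pages; the first ndata of them
           *             currently hold the packed (hole-free) data from the bulk.
           * is_hole[] - per logical page, true if the server reported a hole.
           */
          static void sparse_read_unpack(char **pages, const bool *is_hole,
                                         int npages, int ndata)
          {
                  int src = ndata - 1;            /* last packed data page */
                  int dst;

                  for (dst = npages - 1; dst >= 0; dst--) {
                          if (is_hole[dst]) {
                                  /* POSIX: a hole must read back as zeroes */
                                  memset(pages[dst], 0, SPARSE_PAGE_SIZE);
                          } else {
                                  if (dst != src)
                                          memcpy(pages[dst], pages[src],
                                                 SPARSE_PAGE_SIZE);
                                  src--;
                          }
                  }
          }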

          Activity

            [LU-16897] Optimize sparse file reads

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/55404/
            Subject: LU-16897 idl: add OBD_CONNECT2_SPARSE flag
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: bc9261874c3cdd68c2bb452eac381ab98105badd

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/55404/ Subject: LU-16897 idl: add OBD_CONNECT2_SPARSE flag Project: fs/lustre-release Branch: master Current Patch Set: Commit: bc9261874c3cdd68c2bb452eac381ab98105badd

            "Cyril Bordage <cbordage@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/56788
            Subject: LU-16897 lnet: handle sparse file
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 713b5175b90f3f4ca08a23ab79d4ee7b92c6087e

            gerrit Gerrit Updater added a comment - "Cyril Bordage <cbordage@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/56788 Subject: LU-16897 lnet: handle sparse file Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 713b5175b90f3f4ca08a23ab79d4ee7b92c6087e

            cbordage Cyril Bordage added a comment -

            It could be interesting and quick to run simple tests comparing Lustre performance with the regular header against Yingjin's header version, in order to quantify the impact of the change and decide which approach to take.

            adilger Andreas Dilger added a comment -

            I don't think Cyril's HLD document reflects the implementation in Yingjin's patch, which is an alternative to the patch produced by Cyril.

            With Yingjin's patch, no changes are needed in the LND/network layer, because the server replies to the client's BRW request with a bitmap of pages that are entirely holes, and the client changes the bulk request to avoid requesting/transferring those pages at all, instead zeroing them locally. This ends up as a "smart" BRW read that doesn't request bulk pages containing no data, saving network bandwidth at the cost of an extra LNet message to transfer the 32-byte bitmap (one bit per 4KB page in a 1MB bulk). This is the same BRW that the client would do today for a sparse write, except that in the write case the client knows the sparseness in advance.

            If we adopt Yingjin's implementation (which I think is probably preferable) then the HLD should be updated to reflect this. The main reason not to use Yingjin's patch would be if the performance/latency is significantly worse than Cyril's because of the extra LNet reply message with the sparseness bitmap. I don't think this is likely, however.
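            As a hedged illustration of the arithmetic and the client-side handling described above (the names and layout are assumptions, not the actual wire format): 1MB / 4KB = 256 pages, so one bit per page is a 32-byte bitmap; the client zeroes the hole pages locally and only queues the remaining pages for the bulk:

                #include <stdint.h>
                #include <string.h>

                #define BULK_PAGES        256                   /* 1MB bulk / 4KB pages */
                #define HOLE_BITMAP_BYTES (BULK_PAGES / 8)      /* = 32 bytes */
                #define SPARSE_PAGE_SIZE  4096

                static inline int page_is_hole(const uint8_t *bitmap, int idx)
                {
                        return (bitmap[idx / 8] >> (idx % 8)) & 1;
                }

                /* Returns how many pages still need to be fetched via the bulk. */
                static int apply_hole_bitmap(char **pages, int npages,
                                             const uint8_t *bitmap, int *xfer_idx)
                {
                        int i, nxfer = 0;

                        for (i = 0; i < npages; i++) {
                                if (page_is_hole(bitmap, i))
                                        memset(pages[i], 0, SPARSE_PAGE_SIZE); /* hole: zero locally */
                                else
                                        xfer_idx[nxfer++] = i;  /* page to request in the bulk */
                        }
                        return nxfer;
                }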
            cfaber Colin Faber added a comment -

            Just a reminder, there is an HLD which cbordage produced here. It would be highly useful for comments to be reflected there.


            shadow Alexey Lyashkov added a comment -

            Yingjin,

            You have probably misunderstood what a GAP is. The IB specification says that once you create an MR sized for 256 pages, you should fill it with 256 pages, and all pages must be at contiguous logical addresses. If you have fewer pages, or the logical addresses have a hole, that is what the IB specification calls a "GAP".

            >Thus we can zero hole pages
            Not "can" but "must". Anything else would violate POSIX.

            >The patch is workable.
            It might work in our environment, but as I said earlier, that may not be true everywhere else. One example is the GPU case, because GPU pages are not accessible after the LND transfer is done and the pages are unmapped.

            You should discuss this with the Cray GNI/KFI developers to be sure you don't break anything for them.
            qian_wc Qian Yingjin added a comment - - edited

            Alexey,

            It seems that you have a misunderstanding about the difference between a GAP in a file and a GAP in the HCA virtual address mapping.

            For example, we can map the physical pages of a file (each PAGE_SIZE, but with logical offsets that are not contiguous in the file) into a contiguous virtual DMA address range in an IB HCA.

            Thus we can zero the hole pages and remove them from the MD in the LNet layer, and hand the modified MD (i.e. after removing holes, all pages are PAGE_SIZE except possibly the first and last) to the lower IB layer.

            We do the same operations on the corresponding MD on both the source and the sink.

            Thus the RDMA bulk only needs to transfer the needed data, filtering out the holes.

            The patch is workable. And please note that we test holes with checksums enabled: the client checks the read data against the checksum returned by the server.
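            For what it's worth, a simplified sketch of the filtering Yingjin describes, using an invented fragment vector rather than the real LNet MD/KIOV types: hole pages are zeroed and dropped from the descriptor handed to the lower (IB) layer, so the RDMA only moves real data:

                #include <stddef.h>
                #include <string.h>

                struct sparse_frag {
                        void    *sf_addr;       /* page address */
                        size_t   sf_len;        /* bytes used in this page */
                        int      sf_hole;       /* nonzero if this page is a hole */
                };

                /* Compact the fragment list in place; returns the new fragment count. */
                static int sparse_md_filter_holes(struct sparse_frag *frags, int nfrags)
                {
                        int i, kept = 0;

                        for (i = 0; i < nfrags; i++) {
                                if (frags[i].sf_hole) {
                                        /* zero the hole page so the reader sees zeroes */
                                        memset(frags[i].sf_addr, 0, frags[i].sf_len);
                                        continue;
                                }
                                frags[kept++] = frags[i];
                        }
                        return kept;    /* dense descriptor for the lower IB layer */
                }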

            shadow Alexey Lyashkov added a comment -

            Cyril,

            1. You haven't answered: who will zero the memory that you skip from the transfer, or will you violate the POSIX requirements?

            2. If the hole is in the middle of the requested area, removing any pages from the transfer will introduce a GAP. Do you disagree with that? And if it is a GAP, do you agree that it needs some special handling from the HW?

            paf Patrick Farrell (Inactive) added a comment -

            Ah, I see.  So an extra 32 bytes/256 bits.  How big is the LNet header currently (without the extra 256 bits)?  And if it's a big jump in header size...  is there some way to avoid this in common cases?  Like having two versions of the header, one used only for bulks?  That would eliminate the overhead for everything else, and bulks are generally large enough that this shouldn't matter.

            Compatibility is an issue with either approach, I think, so that's less concerning unless it's particularly hard to ensure compatibility in this case.

            cbordage Cyril Bordage added a comment -

            shadow, the modification is done at a higher level. The LND will work on a modified MD that doesn't have the holes, and it will handle it like a regular MD.

            paf, the performance issue is due to the fact that the LNet header is 256 bits bigger, even when we don't have sparse data. Also, we need to ensure compatibility with the old LNet header.

            shadow Alexey Lyashkov added a comment -

            @Andreas, the same question applies to non-GPU transfers. Once some pages are omitted from the transfer, who will zero those pages? Otherwise this breaks POSIX, which says that reading from an unallocated region must return a zeroed buffer.

            >The patch has passed the simple test with 4MiB bulk I/O.
            Did this test check that the returned data is zero?

            >Please note we are filtering out some pages from KIOV in a MD (LNet Layer), not from the lower layer IB MR (lower layer KLND).
            The lower layers create an IB MR based on the LNet MD, so if some pages are missing you will have a GAP in the MR. That may or may not work, depending on the HW and on the LND.

            People

              Assignee: cbordage Cyril Bordage
              Reporter: adilger Andreas Dilger
              Votes: 0
              Watchers: 12
