Details
- Type: Improvement
- Resolution: Unresolved
- Priority: Minor
Description
Currently, when a client is reading a sparse file, it doesn't know whether there is data at any particular file offset, so it pre-allocates pages and sets up the RDMA for the full range of the read. If the file is sparse (has a hole in the allocation) or has unwritten extents that return zeroes on read, the OST will zero-fill all of the requested pages and transfer the zeroes over the network.
It would be desirable to avoid sending the pages of zeroes over the network, both to reduce network bandwidth and to avoid the CPU overhead on the OSS of zeroing out the pages. IIRC, the BRW WRITE reply returns an array of rc values, one per page, to indicate success/failure for each page. I would expect BRW READ to similarly return a special per-page state indicating that the page is a hole.
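A minimal sketch of what the server side of that could look like, assuming a hypothetical BRW_PAGE_HOLE status value in the per-page rc array; hole detection here uses the generic lseek(SEEK_DATA) interface as a stand-in for whatever extent information the OST actually has available:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

#define PAGE_SIZE      4096
#define BRW_PAGE_OK    0   /* page holds real data */
#define BRW_PAGE_HOLE  1   /* hypothetical status: page lies entirely in a hole */

/* True if [off, off + PAGE_SIZE) contains no allocated data. */
static bool page_is_hole(int fd, off_t off)
{
	off_t data = lseek(fd, off, SEEK_DATA);

	if (data == (off_t)-1)
		return errno == ENXIO;        /* no data at or after off (past EOF) */
	return data >= off + PAGE_SIZE;       /* next data starts beyond this page */
}

/* Fill the reply's per-page status array for an npages read starting at 'start',
 * instead of zero-filling the hole pages and sending them over the wire. */
void brw_read_mark_holes(int fd, off_t start, unsigned npages, int *rc)
{
	for (unsigned i = 0; i < npages; i++)
		rc[i] = page_is_hole(fd, start + (off_t)i * PAGE_SIZE) ?
			BRW_PAGE_HOLE : BRW_PAGE_OK;
}
```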
However, this is also complicated by the RDMA configuration, since the pages from the read buffer (which may be specific page cache pages) have already been mapped. The best solution would be for the LNet bulk transfer to "not send" those pages in the middle of the RDMA, and have LNet (or the RDMA engine) zero-fill the pages on the client without actually sending them over the wire, but I have no idea how easy or hard that would be to implement.
Failing that, if the holes are packed out of the server-side bulk transfer setup (and it is possible to send only the leading pages in a shorter bulk transfer), then the client would need to memcpy() the data into the correct pages (working from the last page back to the first) and zero the hole pages itself. That would add CPU/memory overhead on the client, and would not work for RDMA offload like GDS.
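For illustration, a minimal sketch of that client-side unpacking step, assuming the server packed the ndata non-hole pages at the front of the client's page array and the client knows which logical pages are holes (the hole[] array and function name are hypothetical). Walking from the last logical page backwards guarantees a packed source page is never overwritten before it has been moved to its final slot:

```c
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE 4096

/* Unpack a "packed" sparse bulk read in place: the first ndata pages of
 * pages[] hold the non-hole data in logical order; hole[i] is nonzero if
 * logical page i is a hole. */
void brw_unpack_sparse(unsigned char (*pages)[PAGE_SIZE],
		       const unsigned char *hole, size_t npages, size_t ndata)
{
	size_t src = ndata;              /* one past the last unconsumed packed page */

	for (size_t i = npages; i-- > 0; ) {
		if (hole[i]) {
			memset(pages[i], 0, PAGE_SIZE);   /* zero-fill the hole */
		} else {
			src--;
			if (src != i)    /* already in place when src == i */
				memcpy(pages[i], pages[src], PAGE_SIZE);
		}
	}
}
```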
Issue Links
- is related to: LU-19119 Add umd_ prefix for fields in lnet_md (Open)
I don't think Cyril's HLD document reflects the implementation in Yingjin's patch, which is an alternative to the patch produced by Cyril.
With Yingjin's patch, no changes are needed in the LND/network layer: the server replies to the client's BRW request with a bitmap of pages that are entirely holes, and the client changes the bulk request to avoid requesting/transferring those pages at all, zeroing them locally instead. This ends up as a "smart" BRW read that doesn't request bulk pages containing no data, saving network bandwidth at the cost of an extra LNet message to transfer the 32-byte bitmap (one bit per 4KB page in a 1MB bulk). This is the same BRW the client would generate today for a sparsely-written file, except that the client knows the sparseness in advance.
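As a rough illustration of the client side of that scheme (the bitmap layout matches the description above, but bulk_desc and add_page_to_bulk are hypothetical stand-ins for the real ptlrpc bulk-descriptor API, not the actual patch code):

```c
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE    4096
#define BULK_PAGES   256                 /* 1 MB bulk = 256 x 4 KB pages */
#define BITMAP_BYTES (BULK_PAGES / 8)    /* 32-byte hole bitmap from the server */

struct bulk_desc;                                        /* hypothetical */
void add_page_to_bulk(struct bulk_desc *desc, void *pg); /* hypothetical */

/* Test bit 'n' in the hole bitmap returned by the server. */
static int bitmap_page_is_hole(const uint8_t *bitmap, unsigned n)
{
	return (bitmap[n / 8] >> (n % 8)) & 1;
}

/* Zero hole pages locally and collect only the data pages into the bulk
 * descriptor, so hole pages are never requested or transferred at all. */
void brw_setup_sparse_bulk(struct bulk_desc *desc,
			   unsigned char (*pages)[PAGE_SIZE],
			   const uint8_t *bitmap, unsigned npages)
{
	for (unsigned i = 0; i < npages; i++) {
		if (bitmap_page_is_hole(bitmap, i))
			memset(pages[i], 0, PAGE_SIZE);
		else
			add_page_to_bulk(desc, pages[i]);
	}
}
```

Since hole pages are never added to the bulk descriptor, they are never mapped for RDMA, which is where the wire-transfer saving comes from.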
If we adopt Yingjin's implementation (which I think is probably preferable) then the HLD should be updated to reflect this. The main reason not to use Yingjin's patch would be if the performance/latency is significantly worse than Cyril's because of the extra LNet reply message with the sparseness bitmap. I don't think this is likely, however.