Details

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Minor

    Description

      Currently, when a client reads a sparse file, it does not know whether there is data at any particular file offset, so it pre-allocates pages and sets up the RDMA for the full range of the read. If the file is sparse (has a hole in the allocation) or has unwritten extents that return zeroes on read, the OST will zero-fill all of the requested pages and transfer the zeroes over the network.

      It would be desirable to avoid sending the pages of zeroes over the network, to reduce network bandwidth and the CPU overhead on the OSS of zeroing out the pages. IIRC, the BRW WRITE reply returns an array of per-page rc values to indicate success or failure for each one; I would expect BRW READ to similarly return a special state for each page that indicates it is a hole.

      However, this is also complicated by the RDMA configuration, since it has already mapped the pages from the read buffer (which may be specific page cache pages). The best solution would be for the LNet bulk transfer to "not send" those pages in the middle of the RDMA, and have LNet (or the RDMA engine) zero-fill the pages on the client without actually sending them over the wire, but I have no idea how easy or hard that would be to implement.

      Failing that, if the holes are packed out of the server-side bulk transfer setup (and it is possible to send only the leading pages in a shorter bulk transfer), then the client would need to memcpy() the data into the correct pages (from last page to first) and zero the hole pages itself. That would add CPU/memory overhead on the client, and would not work for RDMA offload like GDS.
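      The packed-transfer fallback described above (server packs the non-hole pages at the start of a shorter bulk, client expands them back into place from last page to first and zeroes the holes) can be sketched roughly as follows. This is an illustration only; `unpack_sparse` and the bitmap convention are invented for the sketch and are not Lustre APIs:

```python
PAGE_SIZE = 4096

def unpack_sparse(buf, hole_bitmap, npages):
    """Expand a packed read buffer in place, back to front.

    `buf` is a bytearray of npages * PAGE_SIZE; the server packed the
    non-hole pages at the front.  `hole_bitmap` has one bit per page
    (bit set => hole).  Walking from the last page to the first lets us
    move data without overwriting packed pages we still need.
    """
    # Index of the last packed (non-hole) page in the front of the buffer.
    src = sum(1 for i in range(npages)
              if not (hole_bitmap >> i) & 1) - 1
    for dst in range(npages - 1, -1, -1):
        if (hole_bitmap >> dst) & 1:
            # Hole: the client must zero it itself (POSIX semantics).
            buf[dst * PAGE_SIZE:(dst + 1) * PAGE_SIZE] = bytes(PAGE_SIZE)
        else:
            if src != dst:
                buf[dst * PAGE_SIZE:(dst + 1) * PAGE_SIZE] = \
                    buf[src * PAGE_SIZE:(src + 1) * PAGE_SIZE]
            src -= 1
    return buf
```

      The backward walk is what makes the in-place expansion safe: a packed page is never overwritten before it has been copied out.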

          Activity

            [LU-16897] Optimize sparse file reads

            I don't think Cyril's HLD document reflects the implementation in Yingjin's patch, which is an alternative to the patch produced by Cyril.

            With Yingjin's patch, there are no changes needed in the LND/network layer, because it replies back to the client BRW request with a bitmap of pages that are entirely holes, and the client changes the bulk request to avoid requesting/transferring the pages at all, and instead zeroes those pages locally. This ends up as a "smart" BRW read that doesn't request bulk pages that contain no data, saving on network bandwidth, at the cost of an extra LNet message to transfer the 32-byte bitmap (one bit per 4KB page in a 1MB bulk). This is the same BRW as what the client would do today with a sparse write, except the client knows the sparseness in advance.
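            The bitmap arithmetic and the client-side split described above can be illustrated as follows (a sketch only, not the actual patch code; `split_bulk` is a hypothetical helper name):

```python
PAGE_SIZE = 4096
BULK_SIZE = 1024 * 1024

# One bit per 4KB page in a 1MB bulk: 256 bits, i.e. a 32-byte bitmap.
NPAGES = BULK_SIZE // PAGE_SIZE        # 256 pages
BITMAP_BYTES = NPAGES // 8             # 32 bytes

def split_bulk(hole_bitmap, npages=NPAGES):
    """Given the server's per-page hole bitmap (bit set => hole), return
    (pages_to_transfer, pages_to_zero): the client requests only the
    former over the bulk and zeroes the latter locally, so hole pages
    never cross the wire."""
    transfer, zero = [], []
    for i in range(npages):
        (zero if (hole_bitmap >> i) & 1 else transfer).append(i)
    return transfer, zero
```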

            If we adopt Yingjin's implementation (which I think is probably preferable) then the HLD should be updated to reflect this. The main reason not to use Yingjin's patch would be if the performance/latency is significantly worse than Cyril's because of the extra LNet reply message with the sparseness bitmap. I don't think this is likely, however.

            adilger Andreas Dilger added a comment -
            cfaber Colin Faber added a comment -

            Just a reminder, there is an HLD which cbordage produced here. It would be highly useful for comments to be reflected there.


            Yingjin,

            Probably you misunderstand what a GAP is. The IB specification says that once you create an MR with a size of 256 pages, you must fill it with 256 pages, and all pages must be at contiguous logical addresses. If you have fewer pages, or the logical addresses have a hole, that is what the IB specification calls a 'GAP'.

            > Thus we can zero hole pages
            Not "can" but "must". Anything else would violate POSIX.

            > The patch is workable.
            It might work in our environment, but as I said earlier, that may not be true everywhere else. One example is the GPU case: GPU pages are not accessible after the LND transfer is done and the pages are unmapped.

            You should discuss this with the Cray GNI/KFI developers to be sure you don't break anything for them.

            shadow Alexey Lyashkov added a comment -
            qian_wc Qian Yingjin added a comment - edited

            Alexey,

            It seems you have a misunderstanding about the difference between a GAP in a file and a GAP in the HCA virtual address mapping.

            For example, we can map the physical pages of a file (each of PAGE_SIZE, but with logical offsets that are not contiguous in the file) to a contiguous virtual DMA address range in an IB HCA.

            Thus we can zero the hole pages and remove them from the MD in the LNet layer, and give a modified MD (i.e. after removing holes, all pages are PAGE_SIZE except possibly the first and last) to the lower IB layer.

            We do the same operations for the corresponding MD on both the source and the sink.

            Thus the RDMA bulk only needs to transfer the real data, filtering out the holes.

            The patch is workable. And please note that we test holes with checksums enabled: the client verifies the read data against the checksum returned by the server.
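            The remapping Yingjin describes (non-contiguous file pages laid out back to back in the MR's virtual address space, on both source and sink) can be modeled like this; `remap_contiguous` is an invented name for illustration, not code from the patch:

```python
PAGE_SIZE = 4096

def remap_contiguous(data_pages):
    """Model of the HCA virtual-address remapping: physical pages whose
    file offsets are not contiguous (holes removed) are assigned
    back-to-back virtual offsets, so the lower IB layer sees a dense
    region with no GAP.  `data_pages` is the sorted list of non-hole
    file page indices; returns {file_page_index: virtual_offset}."""
    return {pg: i * PAGE_SIZE for i, pg in enumerate(data_pages)}
```

            Because both source and sink apply the same remapping to their MDs, the RDMA moves only real data while each side still knows which file pages the fragments belong to.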


            Cyril,
            1. You didn't answer: who will zero the memory you skip from the transfer, or will you violate the POSIX requirements?

            2. If the hole is in the middle of the requested area, removing any pages from the transfer introduces a GAP. Do you disagree with that? And if it is a GAP, do you agree that it needs some special handling from the HW?

            shadow Alexey Lyashkov added a comment -

            Ah, I see.  So an extra 32 bytes/256 bits.  How big is the LNet header currently (without the extra 256 bits)?  And if it's a big jump in header size...  is there some way to avoid this in common cases?  Like having two versions of the header, one used only for bulks?  That would eliminate the overhead for everything else and bulks are generally large enough this shouldn't matter.

            Compatibility is an issue with either approach, I think, so that's less concerning unless it's particularly hard to ensure compatibility in this case.

            paf Patrick Farrell (Inactive) added a comment -

            shadow, the modification is done at a higher level. The LND will work on a modified MD that does not contain the holes, but it will be handled like a regular MD.

            paf, the performance issue is due to the fact that the LNet header is 256 bits bigger even when we don't have sparse data. Also, we need to ensure compatibility with the old LNet header.

            cbordage Cyril Bordage added a comment -

            @Andreas, same question for non-GPU transfers. Once some pages are omitted from the transfer, who will zero those pages? Otherwise this breaks POSIX, which says that reading from an unallocated region must return a zeroed buffer.

            > The patch has passed the simple test with 4MiB bulk I/O.
            Did this test check that the returned data is zero?

            > Please note we are filtering out some pages from KIOV in a MD (LNet Layer), not from the lower layer IB MR (lower layer KLND).
            The lower layers create an IB MR based on the LNet MD; if some pages are missing, you will have a GAP in the MR. It may or may not work, depending on the HW and on the LND.

            shadow Alexey Lyashkov added a comment -
            qian_wc Qian Yingjin added a comment - edited

            Alexey,

            > That won't work. PtlRPC wants events for each MD, but you may filter some of them out, so PtlRPC will wait forever.

            Please note that we do not eliminate any LNet events for any MD, even when all pages in an MD are holes.

            The source sends LNET_MSG_PUT_SPARSE for each MD in a bulk read I/O with holes; after the I/O finishes, the sink sends a LNET_MSG_ACK to the source, which generates the corresponding LNet events to notify PtlRPC via callbacks.

            The patch has passed the simple test with 4MiB bulk I/O.

            > If your solution filters out some pages from the MR, you will have problems with IB cards.
            > It also needs to be checked with the other LNDs.

            Please note we are filtering out some pages from the KIOV in an MD (LNet layer), not from the lower-layer IB MR (lower-layer KLND).


            shadow, if there are issues with sparse reads for GDS then this feature could be disabled for GDS/GPU transfers. It is still possible and useful to optimize the non-GPU IO path, since it is not worse than what is done today.

            It might be possible to add a future workaround for GDS to RDMA a local zero page for all of the pages that are skipped by LNet, or call some kind of GDS "memset page to zero" but that can also be done in a later patch.

            adilger Andreas Dilger added a comment -

            Yingjin,

            I think you focused on o2ib, but if you add something, it should work with all LNDs.

            > Yes, the client and the server both sides are remapping KIOV pages (in lnet_libmd) to create the required mapping.
            > The server side filters out the hole pages from KIOV in lnet_libmd (MD).

            That won't work. PtlRPC wants events for each MD, but you may filter some of them out, so PtlRPC will wait forever.

            > Our solution is in LNet layer, only need to change the KIOV mapping in each MD (lnet_libmd).

            If your solution filters out some pages from the MR, you will have problems with IB cards. It also needs to be checked with the other LNDs.

            > Our solution is in LNet layer, only need to change the KIOV mapping in each MD (lnet_libmd).
            > No matter we enable SG_GAPS or not, it should still work, I think.

            This won't work with a GPU. The client provides DIO pages, which are remapped in o2iblnd to real addresses. If you filter out some pages, the GPU memory will contain random data instead of zeros for the filtered pages.

            shadow Alexey Lyashkov added a comment -

            People

              cbordage Cyril Bordage
              adilger Andreas Dilger
              Votes: 0
              Watchers: 12
