Details

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Minor

    Description

      Currently, when a client is reading a sparse file, it doesn't know whether there is data at any particular file offset, so it pre-allocates pages and sets up the RDMA for the full range of the read. If the file is sparse (has a hole in the allocation) or has unwritten extents that return zeroes on read, the OST will zero-fill all of the requested pages and transfer the zeroes over the network.

      It would be desirable to avoid sending the pages of zeroes over the network, to reduce network bandwidth and the CPU overhead on the OSS of zeroing out the pages. IIRC, the BRW WRITE reply returns an array of rc values, one per page, to indicate success/failure for each one. I would expect BRW READ to similarly return a special state for each page that indicates it is a hole.
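      As a rough illustration of that idea (a sketch only: OBD_BRW_HOLE, brw_read_reply and apply_read_reply below are hypothetical names, not the existing wire protocol or Lustre symbols), the READ reply could carry one state word per page, and the client would zero-fill any page marked as a hole instead of expecting bulk data for it:

      #include <string.h>

      #define OBD_BRW_HOLE  1    /* hypothetical "page is entirely a hole" state */

      struct brw_read_reply {
              int        npages;
              const int *page_state;    /* one entry per requested page */
      };

      /* Client side: apply the per-page states from the READ reply. */
      static void apply_read_reply(const struct brw_read_reply *reply,
                                   void **pages, size_t page_size)
      {
              int i;

              for (i = 0; i < reply->npages; i++) {
                      if (reply->page_state[i] == OBD_BRW_HOLE)
                              /* no data was sent for this page */
                              memset(pages[i], 0, page_size);
                      /* non-hole pages were filled by the bulk transfer */
              }
      }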

      However, this is also complicated by the RDMA configuration, since the client has already mapped the pages for the read buffer (which may be specific page cache pages). The best solution would be for the LNet bulk transfer to "not send" those pages in the middle of the RDMA, and have LNet (or the RDMA engine) zero-fill the pages on the client without actually sending them over the wire, but I have no idea how easy or hard that would be to implement.

      Failing that, if the data pages are packed together in the server-side bulk transfer setup (so that it is possible to send only the first pages in a shorter bulk transfer), then the client would need to memcpy() the data into the correct pages (from last page to first) and zero the hole pages itself. That would add CPU/memory overhead on the client, and would not work for RDMA offload like GDS.
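      A minimal sketch of that fallback, assuming the server packs the non-hole pages at the start of the same page array and the reply includes a per-page hole bitmap (all names here are illustrative, not existing Lustre code):

      #include <string.h>
      #include <stdbool.h>

      static bool page_is_hole(const unsigned char *bitmap, int idx)
      {
              return bitmap[idx / 8] & (1 << (idx % 8));
      }

      /*
       * Walk from the last page to the first so a packed data page is never
       * overwritten before it has been moved to its final position.
       */
      static void unpack_sparse_bulk(void **pages, int npages, size_t page_size,
                                     const unsigned char *hole_bitmap,
                                     int ndata_pages /* pages actually sent */)
      {
              int src = ndata_pages - 1;    /* last packed data page */
              int dst;

              for (dst = npages - 1; dst >= 0; dst--) {
                      if (page_is_hole(hole_bitmap, dst)) {
                              memset(pages[dst], 0, page_size);
                      } else {
                              if (dst != src)
                                      memcpy(pages[dst], pages[src], page_size);
                              src--;
                      }
              }
      }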

          Activity

            [LU-16897] Optimize sparse file reads

            "Cyril Bordage <cbordage@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/59195
            Subject: LU-16897 utils: tools for tuning SFO performance
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 4d5539a773cb8b0bd5b8f28e2d471945faaf6e2c


            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/53297/
            Subject: LU-16897 tgt: note 'hole' pages
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 4c3ddd1b8634c06579a4f4e77d93584076e1c5e1


            "Cyril Bordage <cbordage@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/57884
            Subject: LU-16897 llite: option to enable/disable SFO
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: b206d0358a03402ee45bd51f962d9cf5ba3bc80c


            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/55404/
            Subject: LU-16897 idl: add OBD_CONNECT2_SPARSE flag
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: bc9261874c3cdd68c2bb452eac381ab98105badd


            "Cyril Bordage <cbordage@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/56788
            Subject: LU-16897 lnet: handle sparse file
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 713b5175b90f3f4ca08a23ab79d4ee7b92c6087e


            cbordage Cyril Bordage added a comment -

            It could be interesting, and quick, to run simple tests comparing Lustre performance with the regular header against Yingjin's header version, in order to quantify the impact of the change and decide which approach to take.

            adilger Andreas Dilger added a comment -

            I don't think Cyril's HLD document reflects the implementation in Yingjin's patch, which is an alternative to the patch produced by Cyril.

            With Yingjin's patch, there are no changes needed in the LND/network layer, because the server replies to the client's BRW request with a bitmap of the pages that are entirely holes; the client then changes the bulk request to avoid requesting/transferring those pages at all, and instead zeroes them locally. This ends up as a "smart" BRW read that doesn't request bulk pages containing no data, saving network bandwidth at the cost of an extra LNet message to transfer the 32-byte bitmap (one bit per 4KB page in a 1MB bulk). This is the same BRW the client would do today with a sparse write, except that for writes the client knows the sparseness in advance.

            If we adopt Yingjin's implementation (which I think is probably preferable) then the HLD should be updated to reflect this. The main reason not to use Yingjin's patch would be if the performance/latency is significantly worse than Cyril's because of the extra LNet reply message with the sparseness bitmap. I don't think this is likely, however.
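            For illustration only (a sketch of the behaviour described above, not the actual patch): with 4KB pages a 1MB bulk has 256 pages, so the hole bitmap is 256 bits = 32 bytes; the client zeroes the flagged pages locally and puts only the remaining pages into the bulk transfer. The names below are hypothetical:

            #include <string.h>
            #include <stdbool.h>

            #define SPARSE_BULK_PAGES   256                     /* 1 MiB / 4 KiB */
            #define SPARSE_BITMAP_BYTES (SPARSE_BULK_PAGES / 8) /* = 32 bytes    */

            static bool hole_bit_set(const unsigned char *bm, int i)
            {
                    return bm[i / 8] & (1 << (i % 8));
            }

            /*
             * Zero the hole pages locally and collect the indices of the pages
             * that still need to be fetched; only those would be mapped into
             * the (shorter) bulk RDMA.  Returns the number of pages to transfer.
             */
            static int setup_sparse_read(void **pages, int npages, size_t page_size,
                                         const unsigned char *hole_bitmap,
                                         int *xfer_idx)
            {
                    int i, nxfer = 0;

                    for (i = 0; i < npages; i++) {
                            if (hole_bit_set(hole_bitmap, i))
                                    memset(pages[i], 0, page_size);
                            else
                                    xfer_idx[nxfer++] = i;
                    }
                    return nxfer;
            }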

            cfaber Colin Faber added a comment -

            Just a reminder, there is an HLD which cbordage produced here. It would be highly useful for comments to be reflected there.


            shadow Alexey Lyashkov added a comment -

            Yingjin,

            You are probably misunderstanding what a GAP is. The IB specification says that once you create an MR with a size of 256 pages, you must fill it with 256 pages, and all pages must be at contiguous logical addresses. If you have fewer pages, or the logical addresses contain a hole, that is called a 'GAP' in IB specification terms.

            >Thus we can zero hole pages
            Not "can" but "must". Anything else would violate POSIX.

            >The patch is workable.
            It might work in our environment, but as I said earlier, that might not be true everywhere else. One example is GPUs: GPU pages are not accessible after the LND transfer is done and the pages are unmapped.

            You should discuss this with the Cray GNI/KFI developers to make sure you don't break anything for them.
            qian_wc Qian Yingjin added a comment - edited

            Alexey,

            It seems you are conflating a GAP in a file with a GAP in the HCA virtual address mapping.

            For example, we can map the physical pages of a file (each of PAGE_SIZE, but whose logical offsets are not contiguous in the file) into a contiguous virtual DMA address range in an IB HCA.

            Thus we can zero the hole pages and remove them from the MD in the LNet layer, and hand a modified MD (i.e. all pages after removing holes are PAGE_SIZE except the first and last ones) to the lower IB layer.

            We do the same operations on the corresponding MD on both the source and the sink.

            Thus the RDMA bulk only needs to transfer the actual data, filtering out the holes.

            The patch is workable. And please note that we test holes with checksums enabled: the client checks the read data against the checksum returned by the server.
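            A rough sketch of that filtering step (illustrative types only, not the real LNet MD structures, and assuming one page-sized fragment per page with the hole locations known from the bitmap discussed above): both source and sink drop the hole fragments before handing the MD to the LND, so the remaining fragments still correspond one-to-one:

            #include <stddef.h>
            #include <stdbool.h>

            /* Illustrative fragment descriptor, not the real LNet MD layout. */
            struct md_frag {
                    void   *addr;
                    size_t  len;
            };

            /*
             * Compact the fragment list in place, dropping fragments that cover
             * hole pages.  The same filtering must be applied with the same
             * bitmap on both the source and the sink so the remaining fragments
             * still line up.  Returns the new fragment count.
             */
            static int md_drop_holes(struct md_frag *frags, int nfrags,
                                     const unsigned char *hole_bitmap)
            {
                    int i, kept = 0;

                    for (i = 0; i < nfrags; i++) {
                            bool hole = hole_bitmap[i / 8] & (1 << (i % 8));

                            if (!hole)
                                    frags[kept++] = frags[i];
                    }
                    return kept;
            }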

            People

              Assignee: cbordage Cyril Bordage
              Reporter: adilger Andreas Dilger
              Votes: 0
              Watchers: 12
