LU-16355: batch dirty buffered write of small files

Details

    • Type: New Feature
    • Resolution: Unresolved
    • Priority: Minor

    Description

      For buffered I/O mode, I/O can be cached on the client side until a flush is needed.
      Small I/O is not well supported in current Lustre.
      To improve small I/O performance, Lustre has already implemented the short I/O feature, in which the data is transferred in the inline buffer of an I/O RPC request. However, the performance improvement is limited.

      After batched RPCs are introduced, the client can batch the dirty pages of many small files at the OSC layer into one large RPC and transfer the I/O in bulk I/O mode.
      The maximum amount of dirty pages allowed by an OSC can reach 2 GiB. Thus, the client can cache a large amount of dirty data from OSC objects before hitting the max dirty limit or running out of space grant, at which point the data must be written out.
      At the OSC layer, the client can scan the dirty objects, batch their dirty pages, and send the I/O requests in a batched way.

      This feature is expected to benefit workloads that write many small files and call sync() at the end of the writes (e.g. mdtest-hard-write).
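
      As a very rough illustration of the OSC-level batching described above, here is a minimal userspace sketch (not Lustre code; all structure names and the batch limit are hypothetical simplifications): scan the dirty objects, accumulate their dirty pages into one aggregated request, and flush when the batch is full or at sync().

      /*
       * Minimal userspace sketch (not Lustre code): all struct names and the
       * batch limit below are hypothetical simplifications.
       */
      #include <stdio.h>

      #define PAGE_SIZE       4096
      #define BATCH_MAX_PAGES 256     /* cap on pages per batched RPC */

      struct dirty_object {           /* stands in for one dirty OSC object */
          unsigned long long obj_id;
          unsigned int dirty_pages;
      };

      struct batch_rpc {              /* one aggregated write request */
          unsigned int nr_objects;
          unsigned int nr_pages;
      };

      /* Send the accumulated batch, e.g. when full or on sync()/flush. */
      static void batch_flush(struct batch_rpc *rpc)
      {
          if (rpc->nr_pages == 0)
              return;
          printf("send batched RPC: %u objects, %u pages (%u bytes)\n",
                 rpc->nr_objects, rpc->nr_pages, rpc->nr_pages * PAGE_SIZE);
          rpc->nr_objects = rpc->nr_pages = 0;
      }

      int main(void)
      {
          struct dirty_object objs[] = {
              { 0x1001, 1 }, { 0x1002, 1 }, { 0x1003, 2 }, { 0x1004, 1 },
          };
          struct batch_rpc rpc = { 0, 0 };
          size_t i;

          /* Scan the dirty objects and pack their pages into one batch. */
          for (i = 0; i < sizeof(objs) / sizeof(objs[0]); i++) {
              if (rpc.nr_pages + objs[i].dirty_pages > BATCH_MAX_PAGES)
                  batch_flush(&rpc);
              rpc.nr_objects++;
              rpc.nr_pages += objs[i].dirty_pages;
          }
          batch_flush(&rpc);          /* final flush, e.g. at sync() */
          return 0;
      }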

      There are two design choices here:

      1. Use the existing short I/O mechanism to store the data in the batched RPC (see the packing sketch after this list).
      The advantages are:
      • It can store the data of very small files (e.g. less than 1024 bytes) much more efficiently and does not need a whole page to hold the data of each small file.
      • It integrates better with the batched RPC.
      The disadvantage is that the data movement is not zero-copy: the dirty pages must be copied into the inline buffer of the RPC on the client side, and the inline data must be copied again into the prepared page buffer on the server side before doing I/O to the backend filesystem.

      2. Use the RDMA mechanism: bind the dirty page IOV from multiple objects directly to the bulk I/O on the client side, and transfer the data into the prepared page IOV on the server side.
      The advantage of this mechanism is that all data movement is zero-copy from client to server.
      The disadvantages are:
      • The implementation may be complex. The bulk IOV contains I/O pages from multiple objects, which may require significant changes to the server-side I/O logic.
      • The minimum I/O per object is the page size, which is not very efficient for small objects holding only a few bytes.
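
      Below is a minimal userspace sketch of design choice 1 (hypothetical structures, not the actual short I/O or batch RPC code): the dirty bytes of several sub-page files are copied into one inline buffer, each record prefixed by a small header, so a file smaller than a page does not consume a whole bulk page. On the server side the same records would be copied once more into the prepared page buffer, which is the non-zero-copy cost noted above.

      /*
       * Userspace sketch of design choice 1 (hypothetical structures, not the
       * actual short I/O code): pack the data of several tiny files into one
       * inline buffer, each record prefixed by a small header.
       */
      #include <stdio.h>
      #include <string.h>

      #define INLINE_BUF_SIZE 8192

      struct inline_rec_hdr {
          unsigned long long obj_id;  /* which object the bytes belong to */
          unsigned int len;           /* payload length, may be far below a page */
      };

      /* Append one (header, payload) record; return the new buffer offset. */
      static size_t pack_record(char *buf, size_t off, unsigned long long obj_id,
                                const void *data, unsigned int len)
      {
          struct inline_rec_hdr hdr = { obj_id, len };

          if (off + sizeof(hdr) + len > INLINE_BUF_SIZE)
              return off;                       /* buffer full: send it first */
          memcpy(buf + off, &hdr, sizeof(hdr)); /* first copy, on the client */
          memcpy(buf + off + sizeof(hdr), data, len);
          return off + sizeof(hdr) + len;
      }

      int main(void)
      {
          char inline_buf[INLINE_BUF_SIZE];
          size_t off = 0;

          off = pack_record(inline_buf, off, 0x2001, "tiny file one", 13);
          off = pack_record(inline_buf, off, 0x2002, "tiny file two", 13);
          printf("packed 2 sub-page files into %zu bytes of inline buffer\n", off);
          return 0;
      }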

      Any suggestions and comments are welcome!

          Activity

            gerrit Gerrit Updater added a comment -
            "Qian Yingjin <qian@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52200
            Subject: LU-16355 osc: batch small writes based on small object count
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: caf3701792f394b702bf2f260d8ab850304b94ba

            gerrit Gerrit Updater added a comment -
            "Qian Yingjin <qian@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52129
            Subject: LU-16355 osc: add tunable for batching small writes
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: bc7e9d919b3ad5cfc2f8339da108eae513478e0b

            qian_wc Qian Yingjin added a comment -

            I have some thoughts about ahead operations (batched open + read-ahead for DoM files); they are not implemented yet, but we do have an ahead-operations framework.
            However, that only works for batching small DoM-only files.
            For files with data on OSTs, we can open-ahead the file, but the current batch RPC does not support batched extent DLM locking. Instead, we can use asynchronous RPCs (one RPC per file read, not a batched RPC), similar to the asynchronous RPCs used for statahead/AGL after the files are opened; see the sketch below.
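
            A userspace analogy of the asynchronous per-file reads mentioned above (POSIX AIO is used here only for illustration, and the file paths are hypothetical): each file gets its own asynchronous read request instead of one batched request, and the caller waits for all completions.

            /*
             * Userspace analogy only (POSIX AIO, hypothetical paths), not the
             * Lustre client implementation: one asynchronous read per file
             * instead of a single batched request.
             */
            #include <aio.h>
            #include <errno.h>
            #include <fcntl.h>
            #include <stdio.h>
            #include <string.h>

            #define NFILES 2
            #define BUFSZ  4096

            int main(void)
            {
                const char *paths[NFILES] = { "/tmp/f1", "/tmp/f2" };
                struct aiocb cbs[NFILES];
                char bufs[NFILES][BUFSZ];
                int i;

                for (i = 0; i < NFILES; i++) {
                    memset(&cbs[i], 0, sizeof(cbs[i]));
                    cbs[i].aio_fildes = open(paths[i], O_RDONLY);
                    cbs[i].aio_buf = bufs[i];
                    cbs[i].aio_nbytes = BUFSZ;
                    if (cbs[i].aio_fildes >= 0)
                        aio_read(&cbs[i]);      /* issue one async read per file */
                }

                /* Wait for every outstanding read; each completes independently. */
                for (i = 0; i < NFILES; i++) {
                    const struct aiocb *const one[1] = { &cbs[i] };

                    if (cbs[i].aio_fildes < 0)
                        continue;
                    while (aio_error(&cbs[i]) == EINPROGRESS)
                        aio_suspend(one, 1, NULL);
                    printf("%s: %zd bytes\n", paths[i], aio_return(&cbs[i]));
                }
                return 0;
            }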

            adilger Andreas Dilger added a comment -
            Have you looked at implementing batched read support, if the reads can be generated asynchronously (e.g. via AIO, io_uring, or statahead for mdtest-easy/hard-read)?

            gerrit Gerrit Updater added a comment -
            "Qian Yingjin <qian@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49342
            Subject: LU-16355 osc: batch dirty buffered write of small files
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 11e13575d9370c66742ce44cffa0ce0cfedc1f63

            adilger Andreas Dilger added a comment -
            • the min I/O per object is page size, not much efficient for small objects just with several bytes.

            If we agreed that zero-copy IO was not needed/possible for very small writes (smaller than 4KB), then it would be possible to pack the dirty data from multiple files into a single RDMA transfer, and then copy the data out of the pages on the server into the server side inode pages again. Even with the extra memcpy() it would likely still be faster than sending separate RPCs for each file. This also would fit very well with WBC since it could create DoM layouts directly for small files, and skip the DoM component for larger files that will store data on OSTs.
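
            As a rough illustration of the pack-then-copy approach described above (hypothetical structures, not real OSS code), the server would walk the packed bulk buffer that arrived in one RDMA transfer and copy each record into the pages of the matching object, accepting one extra memcpy() per file.

            /*
             * Rough sketch of the server-side step (hypothetical structures,
             * not real OSS code): walk the packed bulk buffer and copy each
             * record into the pages of the matching object.
             */
            #include <stdio.h>
            #include <string.h>

            #define PAGE_SIZE 4096

            struct packed_rec {                 /* header before each file's bytes */
                unsigned long long obj_id;
                unsigned int len;
            };

            static void unpack_bulk(const char *bulk, size_t bulk_len,
                                    char dst_pages[][PAGE_SIZE], int nr_dst)
            {
                size_t off = 0;
                int idx = 0;

                while (off + sizeof(struct packed_rec) <= bulk_len && idx < nr_dst) {
                    struct packed_rec rec;

                    memcpy(&rec, bulk + off, sizeof(rec));
                    off += sizeof(rec);
                    if (rec.len == 0 || rec.len > PAGE_SIZE || off + rec.len > bulk_len)
                        break;
                    /* the extra copy into the destination (inode) page */
                    memcpy(dst_pages[idx], bulk + off, rec.len);
                    printf("object 0x%llx: wrote %u bytes\n", rec.obj_id, rec.len);
                    off += rec.len;
                    idx++;
                }
            }

            int main(void)
            {
                /* Build a tiny packed buffer with two records, as a client would. */
                char bulk[256], pages[2][PAGE_SIZE];
                struct packed_rec r1 = { 0x4001, 5 }, r2 = { 0x4002, 7 };
                size_t off = 0;

                memcpy(bulk + off, &r1, sizeof(r1)); off += sizeof(r1);
                memcpy(bulk + off, "hello", 5);      off += 5;
                memcpy(bulk + off, &r2, sizeof(r2)); off += sizeof(r2);
                memcpy(bulk + off, "goodbye", 7);    off += 7;

                unpack_bulk(bulk, off, pages, 2);
                return 0;
            }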

            It isn't very clear if packing the pages would help with mdtest-hard-write since those files are 3901 bytes, so only 195 bytes smaller than 4096-byte pages (4%). However, packing multiple objects into a single RPC should hopefully improve performance.

            If it is helpful, there was at one time support in the OBD_BRW_WRITE RPC for handling multiple objects, since there could be an array of struct obd_ioobj in the request, but I think much of this support was removed, because there could only be a single struct obdo with file attributes per RPC (timestamps, UID, GID, PRJID, etc), so it didn't make sense to have writes to multiple objects. However, if the writes are (commonly) all from the same UID/GID/PRJID then this might be possible.
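
            A small sketch of the constraint noted above (hypothetical structures, not the real OBD_BRW_WRITE wire format): with only one attribute block per RPC, a multi-object write batch is valid only when every object carries the same UID/GID/PRJID.

            /*
             * Sketch of the single-attribute-block constraint (hypothetical
             * structures, not the real OBD_BRW_WRITE wire format).
             */
            #include <stdbool.h>
            #include <stdio.h>

            struct shared_attrs {               /* one per RPC, like a single obdo */
                unsigned int uid, gid, projid;
            };

            struct obj_write_desc {             /* one per object in the batch */
                unsigned long long obj_id;
                unsigned int nr_pages;
                unsigned int uid, gid, projid;
            };

            static bool batch_attrs_compatible(const struct shared_attrs *sa,
                                               const struct obj_write_desc *objs,
                                               int n)
            {
                int i;

                for (i = 0; i < n; i++)
                    if (objs[i].uid != sa->uid || objs[i].gid != sa->gid ||
                        objs[i].projid != sa->projid)
                        return false;
                return true;
            }

            int main(void)
            {
                struct shared_attrs sa = { 1000, 1000, 0 };
                struct obj_write_desc objs[] = {
                    { 0x3001, 1, 1000, 1000, 0 },
                    { 0x3002, 1, 1000, 1000, 0 },
                };

                printf("multi-object batch allowed: %s\n",
                       batch_attrs_compatible(&sa, objs, 2) ? "yes" : "no");
                return 0;
            }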

            Alternately, having the batching at the RPC level is likely still helpful, and this would also allow mballoc to do a better job to aggregate the blocks of many small file writes together in the filesystem (e.g. 256x4KB writes into a single 1MB group allocation on disk).

            For very small files (< 512 bytes) it would be possible to use the "inline data" feature of ldiskfs (LU-5603) to store data directly in the inode. This could be used with DoM files to store them most efficiently, but may need some added testing/fixing in ldiskfs to work correctly with other ldiskfs features.


            People

              Assignee: qian_wc Qian Yingjin
              Reporter: qian_wc Qian Yingjin
              Votes: 0
              Watchers: 7
