LU-13798: Improve direct i/o performance with multiple stripes: Submit all stripes of a DIO and then wait

Details

    • Type: Improvement
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.15.0

    Description

      The AIO implementation created in LU-4198 is able to perform at extremely high speeds because it submits multiple i/os via the direct i/o path, in a manner similar to the buffered i/o path.

      Consider the case where we do 1 MiB AIO requests with a queue depth of 64 (i.e., 64 MiB in flight).  In this case, we submit 64 1 MiB DIO requests, and then we wait for them to complete.  (Assume we do only 64 MiB of i/o total, just for ease of conversation.)

      Critically, we submit all the i/o requests and then wait for completion.  We do not wait for completion of individual 1 MiB writes.
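      For illustration only (this sketch is not from the ticket): the AIO pattern described above looks roughly like the following from userspace with libaio.  The file name, alignment, and fixed sizes are arbitrary assumptions, and error handling is omitted.  Build with -laio.

      /* Sketch: 64 x 1 MiB O_DIRECT writes, all submitted before any wait. */
      #define _GNU_SOURCE
      #include <fcntl.h>
      #include <libaio.h>
      #include <stdlib.h>
      #include <string.h>
      #include <unistd.h>

      #define QD    64
      #define CHUNK (1024 * 1024)

      int main(void)
      {
              io_context_t ctx = 0;
              struct iocb iocbs[QD], *iocbps[QD];
              struct io_event events[QD];
              int fd = open("./aiofile", O_WRONLY | O_CREAT | O_DIRECT, 0644);

              io_setup(QD, &ctx);
              for (int i = 0; i < QD; i++) {
                      void *buf;

                      posix_memalign(&buf, 4096, CHUNK);    /* O_DIRECT alignment */
                      memset(buf, 'x', CHUNK);
                      io_prep_pwrite(&iocbs[i], fd, buf, CHUNK, (long long)i * CHUNK);
                      iocbps[i] = &iocbs[i];
              }
              io_submit(ctx, QD, iocbps);              /* submit all 64 requests... */
              io_getevents(ctx, QD, QD, events, NULL); /* ...then wait once for all */
              io_destroy(ctx);
              close(fd);
              return 0;
      }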

      Compare this now to the case where we do a single 64 MiB DIO write (or some smaller size, but still larger than the stripe size).  As LU-4198 originally noted, the performance of DIO does not scale when adding stripes.

      Consider a file with a stripe size of 1 MiB.

      This 64 MiB DIO generates 64 1 MiB writes, exactly the same as AIO with a queue depth of 64.

      Except that while the AIO request performs at ~4-5 GiB/s, the DIO request performs at ~300 MiB/s.

      This is because the DIO system submits each 1 MiB request and then waits for it:
      (Submit 1 stripe(1 MiB)) --> wait for sync, (Submit 1 stripe (1 MiB)) --> wait for sync ... etc, 64 times.

      AIO submits all of the requests and then waits, so:
      (Submit 1 stripe(1 MiB)) -> (Submit 1 stripe(1 MiB)) -> (Submit 1 stripe(1 MiB)) -> (Submit 1 stripe(1 MiB)) -> (Submit 1 stripe(1 MiB)) -> (Submit 1 stripe(1 MiB)) ->  ... (Wait for all writes to complete)

      There is no reason DIO cannot work the same way, and when we make this change, large DIO writes & reads jump in performance to the same levels as AIO with an equivalent queue depth.

      The change consists essentially of moving the waiting from the ll_direct_rw_* code up to the ll_file_io_generic layer, and waiting there for the completion of all submitted i/os rather than for one at a time.  It is a relatively simple change.
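      To make the shape of the change concrete, here is a conceptual sketch (hypothetical helper names and types, not the actual llite code):

      #include <stddef.h>
      #include <sys/types.h>

      struct dio_chunk {
              size_t size;                          /* one stripe-sized piece, e.g. 1 MiB */
      };

      /* assume these issue / wait on one stripe-sized chunk of the DIO */
      void submit_chunk(struct dio_chunk *c);
      void wait_for_chunk(struct dio_chunk *c);
      void wait_for_all_chunks(struct dio_chunk *chunks, int nr);

      /* Old behaviour: submit one stripe, wait for it, submit the next... */
      ssize_t dio_rw_serial(struct dio_chunk *chunks, int nr)
      {
              ssize_t count = 0;

              for (int i = 0; i < nr; i++) {
                      submit_chunk(&chunks[i]);
                      wait_for_chunk(&chunks[i]);   /* sync wait per chunk */
                      count += chunks[i].size;
              }
              return count;
      }

      /* New behaviour: submit every stripe, then wait once at the top level,
       * exactly as the AIO path already does. */
      ssize_t dio_rw_parallel(struct dio_chunk *chunks, int nr)
      {
              ssize_t count = 0;

              for (int i = 0; i < nr; i++)
                      submit_chunk(&chunks[i]);     /* queue all chunks first */

              wait_for_all_chunks(chunks, nr);      /* single wait for everything */

              for (int i = 0; i < nr; i++)
                      count += chunks[i].size;
              return count;
      }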

      The improvement is dramatic, from a few hundred MiB/s to roughly 5 GiB/s.

      Quick benchmark:

      mpirun -np 1 $IOR -w -r -t 256M -b 64G -o ./iorfile --posix.odirect
      Before:
      Max Write: 583.03 MiB/sec (611.35 MB/sec)
      Max Read:  641.03 MiB/sec (672.17 MB/sec)
       
      After (w/patch):
      Max Write: 5185.96 MiB/sec (5437.87 MB/sec)
      Max Read:  5093.06 MiB/sec (5340.46 MB/sec) 

      The basic patch is relatively simple, but there are a number of additional subtleties to work out around when to do this, what sizes to submit, etc.  A basic patch will be forthcoming shortly.


          Activity

            adilger Andreas Dilger made changes -
            Link New: This issue is related to EX-4334
            adilger Andreas Dilger made changes -
            Link New: This issue is related to LU-15371
            paf0186 Patrick Farrell made changes -
            Link New: This issue is related to LU-14828
            pjones Peter Jones made changes -
            Fix Version/s New: Lustre 2.15.0
            Resolution New: Fixed
            Status Original: Open New: Resolved
            pjones Peter Jones added a comment -

            Landed for 2.15


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39436/
            Subject: LU-13798 llite: parallelize direct i/o issuance
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: cba07b68f9386b6169788065c8cba1974cb7f712


            paf0186 Patrick Farrell added a comment -

            So I think Shilong was suggesting this, but it took me a bit to figure this out.

            We cannot reset the file size to the original size on error.  Not really.  It is possible it was updated by another client at almost any point during the write syscall, and we cannot figure out what updates were from the client which received an error vs another client.  (At least, not practically.)

            So I think the async DIO behavior will have to be like Shilong said - the same as a regular write + fsync().  It will return an error if there was an error, but it can't tell you how many bytes were written (or which bytes were written).

            So in this case, for a 5 MiB write to an empty file (W = a 1 MiB chunk that was written, X = a chunk that failed):
            W W X W W

            We would return an error, but the file size would be 5 MiB.

            ENOSPC is still handled 'correctly' with short writes, etc, because we have grant.

            Just wanted to state clearly the behavior we're going to have here.
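            To illustrate those semantics from an application's point of view (hypothetical file name, not code from the patch): write() returns an error, but the caller gets no byte count, and the file size may still have been extended.

            #define _GNU_SOURCE
            #include <fcntl.h>
            #include <stdio.h>
            #include <stdlib.h>
            #include <sys/stat.h>
            #include <unistd.h>

            int main(void)
            {
                    struct stat st;
                    void *buf;
                    int fd = open("./diofile", O_WRONLY | O_CREAT | O_DIRECT, 0644);

                    posix_memalign(&buf, 4096, 5 << 20);    /* 5 MiB, O_DIRECT aligned */
                    if (write(fd, buf, 5 << 20) < 0) {
                            /* A middle chunk failed: we get the error, but not a byte
                             * count.  The file size may still be 5 MiB, so the caller
                             * must not assume that nothing was written. */
                            perror("write");
                            fstat(fd, &st);
                            printf("size after failed write: %lld\n", (long long)st.st_size);
                    }
                    free(buf);
                    close(fd);
                    return 0;
            }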


            paf0186 Patrick Farrell added a comment -

            Yes, I agree - I think that's a good way to think of it.

            And yes, thank you for the reminder, I need to add that switch.  We already have to have the old non-parallel mode for pipes (next version of patch explains this), so it doesn't add any more code to the i/o path to make it switchable.


            wshilong Wang Shilong (Inactive) added a comment -

            I think even "return the contiguous bytes written at the beginning" does not totally fix the confusion, since some data was still written, e.g.:

            W W X W

            If we return 2 MiB here, it is still confusing, as we actually wrote another 1 MiB further on - it is still a bit different.  If we consider this parallel DIO as something like buffered IO + fsync() - write the dirty data and then call fsync() - then fsync() will return an error to the application if some write failed, but we have no idea how much data we really wrote...

            Maybe we could just take the easy way and return an error to the caller, but it would be better to add an option to disable parallel DIO in case it breaks some existing application?


            "If write is expanding file size, return error directly might be fine, as in ext4 expanding file size will be executed after IO, short write data will be discarded as file size was not updated, only question is if it is fine if IO apply on existed data."

            Well, we have to make sure the file size isn't updated, right?  I'm not quite sure when that occurs relative to error processing here...  OK, I'm going to add that to the list of things to verify.  (How does it work for AIO writes...?)

            My thinking is this:
            Because we can get a failure "in the middle", it's not realistic to do "short i/o" and return bytes written.  I think that's only useful if they're contiguous.

            So our failure cases are things like:
            X W W W W

            In that case, we could just return error, since we didn't write any bytes.

            OK, so now:
            W X W W W

            What do we return here?  1 MiB?

            Or:
            W W W X W

            3 MiB here?

            I think the only arguably correct choices are "just return an error" or "return the contiguous bytes written at the beginning".  Because we cannot accurately represent a write with a hole in it to the application.  There's no way to describe that.

            Just returning an error has these advantages:
            It is relatively simple.  No tracking which regions completed and sorting out contiguous bytes written.

            But it does not let users know whether we did write some contiguous bytes at the start.  The concern then is that they would assume we didn't write any other bytes...  This doesn't seem very dangerous in practice, though.

            For extending a file...  Similar behavior - We extend it as far as the contiguous bytes written allow us.

            I don't really like this: we would have to track every submitted RPC up at the top level so we can verify they're contiguous, and since completions can arrive in any order, we would basically have to track them all with some sort of extent map.

            This is necessary if we want to give "report contiguous bytes written" as our response.

            I would argue that the upstream kernel no longer does this for DIO, which suggests to me we can get away with just returning an error.  That is certainly much easier.
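            For reference, roughly what that bookkeeping would involve (a hypothetical sketch, not proposed code): record each completed range as it arrives, in any order, and then work out the contiguous prefix from the start of the write.

            #include <stddef.h>

            #define MAX_EXTENTS 64

            struct extent {
                    size_t start;
                    size_t end;                     /* exclusive */
            };

            struct extent_map {
                    struct extent ext[MAX_EXTENTS];
                    int nr;
            };

            /* Record one completed RPC; completions can arrive in any order. */
            void extent_map_add(struct extent_map *map, size_t start, size_t end)
            {
                    map->ext[map->nr].start = start;
                    map->ext[map->nr].end = end;
                    map->nr++;
            }

            /* Return the offset up to which data is contiguously written,
             * starting from 'from' (e.g. the start of the DIO).  O(n^2), run
             * once after all completions are in. */
            size_t contiguous_end(const struct extent_map *map, size_t from)
            {
                    int progress = 1;

                    while (progress) {
                            progress = 0;
                            for (int i = 0; i < map->nr; i++) {
                                    if (map->ext[i].start <= from &&
                                        map->ext[i].end > from) {
                                            from = map->ext[i].end;
                                            progress = 1;
                                    }
                            }
                    }
                    return from;
            }

            For the W X W W W case above this returns 1 MiB; for X W W W W it returns 0.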

            paf0186 Patrick Farrell added a comment - "If write is expanding file size, return error directly might be fine, as in ext4 expanding file size will be executed after IO, short write data will be discarded as file size was not updated, only question is if it is fine if IO apply on existed data. " Well, we have to make sure the file size isn't updated, right?  I'm not quite sure when that occurs relative to error processing here...  OK, I'm going to add that to the list of things to verify.  (How does it work for AIO writes...?) My thinking is this: Because we can get a failure "in the middle", it's not realistic to do "short i/o" and return bytes written.  I think that's only useful if they're contiguous. So our failure cases are things like: X W W W W In that case, we could just return error, since we didn't write any bytes. OK, so now: W X W W W What do we return here?  1 MiB? Or: W W W X W 3 MiB here? I think the only arguably correct choices are "just return an error" or "return the contiguous byte written at the beginning".  Because we cannot accurately represent a write with a hole in it to the application.  There's no way to describe that. Just returning an error has these advantages: It is relatively simple.  No tracking which regions completed and sorting out contiguous bytes written. But it does not let users know if we did write some contiguous bytes at the start.  The concern then is they assume that we didn't write any other bytes...  This doesn't seem very dangerous in practice, though. For extending a file...  Similar behavior - We extend it as far as the contiguous bytes written allow us. I don't really like this - we're going to have to track every submitted RPC up at the top level, so we can verify they're contiguous, and they can arrive in any order, so we're going to have to track them all basically with some sort of extent map. This is necessary if we want to give "report contiguous bytes written" as our response. I would argue that the upstream kernel no longer does this for DIO, which suggests to me we can get away with just returning an error.  That is certainly much easier.

            People

              Assignee: paf0186 Patrick Farrell
              Reporter: paf0186 Patrick Farrell
              Votes: 0
              Watchers: 15
