Lustre / LU-13798

Improve direct i/o performance with multiple stripes: Submit all stripes of a DIO and then wait

Details

    • Type: Improvement
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.15.0

    Description

      The AIO implementation created in LU-4198 is able to perform at extremely high speeds because it submits multiple i/os via the direct i/o path, in a manner similar to the buffered i/o path.

      Consider the case where we do 1 MiB AIO requests with a queue depth of 64.  In this case, we submit 64 1 MiB DIO requests and then wait for them to complete.  (Assume we do only 64 MiB of i/o total, just for ease of conversation.)

      Critically, we submit all the i/o requests and then wait for completion.  We do not wait for completion of individual 1 MiB writes.
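
      For reference, this is the pattern the AIO path gives userspace - submit a whole batch, then collect completions once.  A minimal libaio sketch (the file name, buffer handling and error checking are simplified for illustration; compile with -laio):

      #define _GNU_SOURCE
      #include <fcntl.h>
      #include <libaio.h>
      #include <stdlib.h>
      #include <string.h>
      #include <unistd.h>

      #define QD    64
      #define CHUNK (1024 * 1024)

      int main(void)
      {
              io_context_t ctx;
              struct iocb iocbs[QD], *iocbps[QD];
              struct io_event events[QD];
              void *buf;
              int fd, i;

              /* error checking omitted for brevity */
              fd = open("testfile", O_WRONLY | O_CREAT | O_DIRECT, 0644);
              posix_memalign(&buf, 4096, CHUNK);
              memset(buf, 0, CHUNK);
              memset(&ctx, 0, sizeof(ctx));
              io_setup(QD, &ctx);

              /* Submit all 64 x 1 MiB writes up front... */
              for (i = 0; i < QD; i++) {
                      io_prep_pwrite(&iocbs[i], fd, buf, CHUNK,
                                     (long long)i * CHUNK);
                      iocbps[i] = &iocbs[i];
              }
              io_submit(ctx, QD, iocbps);

              /* ...and only then wait, once, for all of them to complete. */
              io_getevents(ctx, QD, QD, events, NULL);

              io_destroy(ctx);
              close(fd);
              return 0;
      }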

      Compare this now to the case where we do a single 64 MiB DIO write (or some smaller size, but larger than the stripe size).  As LU-4198 originally noted, the performance of DIO does not scale when adding stripes.

      Consider a file with a stripe size of 1 MiB.

      This 64 MiB DIO generates 64 1 MiB writes, exactly the same as AIO with a queue depth of 64.

      Except that while the AIO request performs at ~4-5 GiB/s, the DIO request performs at ~300 MiB/s.

      This is because the DIO system submits each 1 MiB request and then waits for it:
      (Submit 1 stripe(1 MiB)) --> wait for sync, (Submit 1 stripe (1 MiB)) --> wait for sync ... etc, 64 times.

      AIO submits all of the requests and then waits, so:
      (Submit 1 stripe(1 MiB)) -> (Submit 1 stripe(1 MiB)) -> (Submit 1 stripe(1 MiB)) -> (Submit 1 stripe(1 MiB)) -> (Submit 1 stripe(1 MiB)) -> (Submit 1 stripe(1 MiB)) ->  ... (Wait for all writes to complete)

      There is no reason DIO cannot work the same way, and when we make this change, large DIO writes & reads jump in performance to the same levels as AIO with an equivalent queue depth.

      The change consists essentially of moving the waiting from the ll_direct_rw_* code up to the ll_file_io_generic layer and waiting for the completion of all submitted i/os rather than one at a time - It is a relatively simple change.
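
      Schematically, the change looks something like the sketch below.  The helper names and types are hypothetical stand-ins, not the actual Lustre symbols:

      #include <stddef.h>

      /* Hypothetical stand-ins for "submit one stripe-sized chunk of the DIO"
       * and "wait for completion"; these are not real Lustre functions. */
      struct dio_state;
      void submit_dio_chunk(struct dio_state *io, size_t offset, size_t len);
      void wait_for_chunk(struct dio_state *io);
      void wait_for_all_chunks(struct dio_state *io);

      /* Old behaviour: submit a stripe-sized chunk, then wait for it, repeatedly. */
      void dio_sync_per_chunk(struct dio_state *io, size_t count, size_t stripe)
      {
              size_t off;

              for (off = 0; off < count; off += stripe) {
                      submit_dio_chunk(io, off, stripe);
                      wait_for_chunk(io);
              }
      }

      /* New behaviour: submit every chunk, then wait once for the whole batch. */
      void dio_submit_then_wait(struct dio_state *io, size_t count, size_t stripe)
      {
              size_t off;

              for (off = 0; off < count; off += stripe)
                      submit_dio_chunk(io, off, stripe);

              wait_for_all_chunks(io);
      }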

      The improvement is dramatic, from a few hundred MiB/s to roughly 5 GiB/s.

      Quick benchmark:

      mpirun -np 1 $IOR -w -r -t 256M -b 64G -o ./iorfile --posix.odirect
      Before:
      Max Write: 583.03 MiB/sec (611.35 MB/sec)
      Max Read:  641.03 MiB/sec (672.17 MB/sec)
       
      After (w/patch):
      Max Write: 5185.96 MiB/sec (5437.87 MB/sec)
      Max Read:  5093.06 MiB/sec (5340.46 MB/sec) 

      The basic patch is relatively simple, but there are a number of additional subtleties to work out around when to do this, what sizes to submit, etc.  A basic patch will be forthcoming shortly.


          Activity


            wshilong Wang Shilong (Inactive) added a comment -

            I think even "return the contiguous bytes written at the beginning" does not totally fix the confusion, as some data is still written, e.g.:

            W W X W

            If we return 2 MiB, that is still confusing, as we actually wrote another 1 MiB further on; it is still a bit different, though.  If we consider this parallel DIO as something like buffered IO + fsync() - write dirty data and then call fsync() - then fsync() will return an error to the application if some write failed, but we have no idea how much data we really wrote...

             

            Maybe we could just take the easy way and return an error to the caller, but it would be better to add an option to disable parallel DIO in case it breaks some existing application?

             


            "If write is expanding file size, return error directly might be fine, as in ext4 expanding file size will be executed after IO, short write data will be discarded as file size was not updated, only question is if it is fine if IO apply on existed data."

            Well, we have to make sure the file size isn't updated, right?  I'm not quite sure when that occurs relative to error processing here...  OK, I'm going to add that to the list of things to verify.  (How does it work for AIO writes...?)

            My thinking is this:
            Because we can get a failure "in the middle", it's not realistic to do "short i/o" and return bytes written.  I think that's only useful if they're contiguous.

            So our failure cases are things like:
            X W W W W

            In that case, we could just return error, since we didn't write any bytes.

            OK, so now:
            W X W W W

            What do we return here?  1 MiB?

            Or:
            W W W X W

            3 MiB here?

            I think the only arguably correct choices are "just return an error" or "return the contiguous bytes written at the beginning", because we cannot accurately represent a write with a hole in it to the application.  There's no way to describe that.

            Just returning an error has these advantages:
            It is relatively simple.  No tracking which regions completed and sorting out contiguous bytes written.

            But it does not let users know if we did write some contiguous bytes at the start.  The concern then is they assume that we didn't write any other bytes...  This doesn't seem very dangerous in practice, though.

            For extending a file...  Similar behavior - We extend it as far as the contiguous bytes written allow us.

            I don't really like this - we're going to have to track every submitted RPC up at the top level so we can verify they're contiguous, and since they can arrive in any order, we're going to have to track them all with some sort of extent map.

            This is necessary if we want to give "report contiguous bytes written" as our response.

            I would argue that the upstream kernel no longer does this for DIO, which suggests to me we can get away with just returning an error.  That is certainly much easier.
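
            For illustration, this is roughly the extra bookkeeping that "return the contiguous bytes written" would require - a sorted extent list built from completions that arrive in any order, walked from the start of the write until the first hole.  This is a hypothetical sketch, not from any actual patch:

            #include <stddef.h>

            /* Hypothetical tracking structure: completed write extents, kept
             * sorted by start offset as RPC completions arrive. */
            struct dio_extent {
                    long long start;
                    long long end;               /* exclusive */
                    struct dio_extent *next;     /* list kept sorted by start */
            };

            /* Walk the sorted list from the start of the user's write and stop
             * at the first hole; everything before the hole is what we could
             * report back to the application. */
            static long long contiguous_bytes_written(const struct dio_extent *head,
                                                      long long io_start)
            {
                    long long pos = io_start;
                    const struct dio_extent *e;

                    for (e = head; e != NULL; e = e->next) {
                            if (e->start > pos)
                                    break;       /* gap: an RPC in the middle failed */
                            if (e->end > pos)
                                    pos = e->end;
                    }
                    return pos - io_start;
            }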


            "The first thing to check is what e.g. XFS does in such a situation (e.g. EIO from dm-flakey for a block in the middle of a large write)? I don't think error recovery in such a case is clean at all, because O_DIRECT may be overwriting existing data in-place, so truncating the file to before the start of the error is possibly worse than returning an error. However, I do believe that the VFS write() handler will truncate a file that returned a partial error, if it was doing an extending write, and discard any data written beyond EOF."

            I agree entirely - It's not clean at all.  I don't think truncation is a good answer except for extending writes.  And a key point here is we don't know which blocks were written successfully.  (We could figure that out, but then we're tracking that at the top level.  I would love to avoid writing that code, which seems like it would be significant, in that it requires awareness of i/o splitting among RPCs, among other things.  We're going to have to map the splitting of the write and see which chunks failed.)

            So not knowing which block fails means we would truncate off the entirety of the extending write in that case.  But what about when a write is partially extending?  Ew...

            For XFS...  I suspect if XFS DIO is split, it is submitted synchronously, ie, the failure granularity and the waiting granularity are the same.  So they would not have this issue.

            "Also, for buffered writes, this error should be returned to userspace if any write failed, but it would be returned via close() or fsync() from the saved error state on the file descriptor, and not write(), because the error isn't even detected until after write."

            Yes, agreed completely.  Sorry to be unclear on that - I meant it's not returned to the write() call.

            wshilong Wang Shilong (Inactive) added a comment - edited

            If the write is expanding the file size, returning an error directly might be fine, as in ext4 the file size update is executed after the IO, so short-write data will be discarded because the file size was not updated; the only question is whether it is fine if the IO applies to existing data.

             

            wshilong Wang Shilong (Inactive) added a comment - edited

            I checked the CentOS 7 kernel and the latest upstream Linux kernel; the behavior is a bit different.

            In the latest Linux kernel, direct IO is implemented using iomap:

            |->iomap_dio_rw()
               |->__iomap_dio_rw()
                  |->iomap_apply()

            If iomap_apply() fails in the middle, iomap_dio_set_error() will set the error code, and an error will be returned to the caller rather than the bytes already written.

             

            However, in CentOS 7:

            |->__generic_file_aio_write()

            We will return the bytes written as a short IO...

             

            I am not sure what the POSIX requirements are in this case; maybe the upstream code has a bug and misses the short IO?  Returning an error directly might confuse the application, because the application thinks the IO failed, but some data was actually written in place.
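
            In application-visible terms, the two behaviors described above look roughly like this.  Purely an illustrative snippet: "fd" is assumed to be an O_DIRECT file descriptor and "buf" a suitably aligned 5 MiB buffer.

            #include <stdio.h>
            #include <unistd.h>

            /* Illustrative only: the two outcomes an application might see for
             * a 5 MiB write that partially failed. */
            void check_write_result(int fd, void *buf)
            {
                    ssize_t rc = write(fd, buf, 5 << 20);

                    if (rc < 0)
                            perror("write");   /* whole write reported as an error */
                    else if (rc < (ssize_t)(5 << 20))
                            printf("short write: %zd bytes\n", rc);
            }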

             

            Any ideas?


            adilger Andreas Dilger added a comment -

            The first thing to check is what e.g. XFS does in such a situation (e.g. EIO from dm-flakey for a block in the middle of a large write)? I don't think error recovery in such a case is clean at all, because O_DIRECT may be overwriting existing data in-place, so truncating the file to before the start of the error is possibly worse than returning an error. However, I do believe that the VFS write() handler will truncate a file that returned a partial error, if it was doing an extending write, and discard any data written beyond EOF.

            Also, for buffered writes, this error should be returned to userspace if any write failed, but it would be returned via close() or fsync() from the saved error state on the file descriptor, and not write(), because the error isn't even detected until after write.

            wshilong Wang Shilong (Inactive) added a comment - edited

            This is the reason why I would suggest we add fault injection.  I think it is OK to just return an error to keep the code simple, since in most cases DIO should be fine except for ENOSPC.


            paf0186 Patrick Farrell added a comment -

            Related question...

            Consider a case like this, where we have a 5 MiB write to a file with a 1 MiB stripe size, and a single write RPC fails in the middle (not due to ENOSPC).  (W represents a successful 1 MiB write RPC, X represents a failed write RPC.)
            W W X W W

            This is an unusual situation - Reporting this error back is not possible with buffered writes, because they're completely async, so it would normally be silent.

            With async DIO, we can return an error.  But is it acceptable to return an error?  Or do we need to return 2 MiB, because we successfully wrote the first 2 MiB of data?  Determining exactly how much we wrote before the gap seems pretty tricky - It would be much easier if we could just return an error in this case....

            Is that acceptable?  Note also that returning 2 MiB also seems misleading because it suggests a short write, when in fact we also wrote data further along in the file...

            I am hoping the answer is "error is good".


            adilger Andreas Dilger added a comment -

            That was going to be my suggestion as well. Since patch https://review.whamcloud.com/39386 "LU-12687 osc: consume grants for direct I/O" was landed, the client should remain "fully stocked" with grant for O_DIRECT writes until the OST runs low/out of space, so this shouldn't cause any performance hit, unless (possibly) the O_DIRECT size is so large that it exceeds the total grant amount that the client has for the OST(s) the file is striped over. It may be worthwhile to check if the client would be given e.g. 1GB+ grant when doing 1GB O_DIRECT writes (assuming enough space in the filesystem)?


            paf0186 Patrick Farrell added a comment -

            Hmm, so I think I have this figured out.

            I asked originally because I thought working with grants would be complicated, but after thinking about it, I think the solution is very simple, and I will just implement it.

            DIO writes already consume grant if it is available, so we can just switch to per-RPC sync behavior if not enough grant is available.  So if there is a grant issue, we fall back to submitting each individual RPC synchronously.  This should solve the problem, and I don't think it should present a performance issue - When we are running out of grant, it is OK not to write at high speed.
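
            Roughly, the fallback could look like the sketch below.  This is pseudocode only - the helper names are hypothetical stand-ins, not the real Lustre/OSC grant interfaces:

            #include <stdbool.h>
            #include <stddef.h>

            /* Hypothetical stand-ins for the grant check and submission paths. */
            struct dio_state;
            bool grant_covers(struct dio_state *io, size_t len);
            void submit_async(struct dio_state *io, size_t offset, size_t len);
            void submit_and_wait(struct dio_state *io, size_t offset, size_t len);
            void wait_for_all_async(struct dio_state *io);

            /* If grant covers a chunk, submit it async and wait once at the end;
             * otherwise fall back to the old per-RPC sync behaviour for that chunk. */
            void dio_submit_with_grant_fallback(struct dio_state *io, size_t count,
                                                size_t stripe)
            {
                    size_t off;

                    for (off = 0; off < count; off += stripe) {
                            if (grant_covers(io, stripe))
                                    submit_async(io, off, stripe);
                            else
                                    submit_and_wait(io, off, stripe);
                    }
                    wait_for_all_async(io);
            }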

            paf0186 Patrick Farrell added a comment - edited

            Andreas, Shilong,
            The change from waiting for each DIO RPC individually vs waiting for the batch raises an interesting problem, similar to buffered i/o.

            Specifically, since every RPC was previously individually 'sync', we would always catch errors immediately.

            So the only possible write failures looked like this...  Say we tried to write 5 MiB to a file with a 1 MiB stripe size, which generates five 1 MiB RPCs.
            Success looks like this (W is a 1 MiB write RPC):

            W W W W W

            "X" is a failed write RPC, "-" is "we didn't try to write this 1 MiB". failure is always something like:

            W W W X -

            Or:

            W X - - -

            Failure is a short write, but there is never a gap, because we confirm each RPC is sync'ed before starting the next one.

            With this change, we wait for sync after all RPCs have been sent.  This means we can get a failure "in the middle", like this:

            W W X W W

            So now there is a gap, rather than just a short write.

            Still, I think this is probably fine in the general case.  This problem already exists for buffered writes, because they are async.  And the problem for buffered writes is worse, because they are 100% async, so the error happens after the syscall has completed.  With the modified DIO, we wait for sync before returning to userspace, so we can return an error.

            Since this is similar to buffered writes, I think it's OK for the general error case.

            Here is my actual concern.

            What about short writes due to ENOSPC? If one OST runs out of space, we could get a pattern like this (with the new DIO).

            Same as above, but "E" represents an RPC which failed with ENOSPC:

            W W E W W

            So we have a gap in the write due to ENOSPC.

            This is impossible with buffered writes, because we check grant for each write RPC before submitting it. So with buffered writes, the ENOSPC looks like this:

            W W E - -

            Where we stop when ENOSPC is encountered.

            So buffered writes hitting ENOSPC guarantee a "short" write, whereas with this change, DIO writes hitting ENOSPC can give a write with a "gap" in it.

            We can solve this by giving DIO the same "require grant, switch to sync if grant unavailable" behavior that is used for buffered i/o.

            My question:
            Do you think this is necessary to solve?  My instinct is yes: the "gap" write on ENOSPC is unacceptable, because users rely on running out of space generating a short write or an error.


            People

              Assignee: paf0186 Patrick Farrell
              Reporter: paf0186 Patrick Farrell
              Votes: 0
              Watchers: 15
