Details
Type: Improvement
Resolution: Fixed
Priority: Major
Description
The AIO implementation created in LU-4198 is able to perform at extremely high speeds because it submits multiple i/os via the direct i/o path, in a manner similar to the buffered i/o path.
Consider the case where we do 1 MiB AIO requests with a queue depth of 64. In this case, we submit 64 1 MiB DIO requests and then wait for them to complete. (Assume we do only 64 MiB of i/o total, just for ease of conversation.)
Critically, we submit all the i/o requests and then wait for completion. We do not wait for completion of individual 1 MiB writes.
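For reference, this submit-everything-then-wait behavior looks like the following from userspace. This is a minimal sketch assuming libaio (build with -laio); the file name, buffer alignment, and fill pattern are illustrative assumptions, not taken from this ticket:

/* Illustrative only: 64 x 1 MiB O_DIRECT writes at queue depth 64,
 * submitted all at once and then waited on all together.
 * Build with: gcc dio_aio_qd64.c -o dio_aio_qd64 -laio */
#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define QD    64                /* queue depth: 64 requests in flight */
#define CHUNK (1024 * 1024)     /* 1 MiB per request                  */

int main(void)
{
    int fd = open("./iorfile", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    io_context_t ctx = 0;
    if (io_setup(QD, &ctx) < 0) {
        fprintf(stderr, "io_setup failed\n"); return 1;
    }

    struct iocb iocbs[QD], *iocbp[QD];
    void *bufs[QD];

    /* Prepare 64 x 1 MiB O_DIRECT writes at consecutive offsets. */
    for (int i = 0; i < QD; i++) {
        if (posix_memalign(&bufs[i], 4096, CHUNK)) return 1;
        memset(bufs[i], 'x', CHUNK);
        io_prep_pwrite(&iocbs[i], fd, bufs[i], CHUNK, (long long)i * CHUNK);
        iocbp[i] = &iocbs[i];
    }

    /* Submit every request up front... */
    if (io_submit(ctx, QD, iocbp) != QD) {
        fprintf(stderr, "io_submit failed\n"); return 1;
    }

    /* ...and only then wait for the completions, all together. */
    struct io_event events[QD];
    int done = 0;
    while (done < QD) {
        int rc = io_getevents(ctx, 1, QD - done, events + done, NULL);
        if (rc < 0) { fprintf(stderr, "io_getevents failed\n"); return 1; }
        done += rc;
    }

    io_destroy(ctx);
    close(fd);
    return 0;
}

The point is simply that io_submit() queues all 64 requests before the first io_getevents() call, which is exactly the behavior the plain DIO path below lacks.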
Compare this now to the case where we do a 64 MiB DIO write (or some smaller size, but larger than the stripe size). As LU-4198 originally noted, DIO performance does not scale when adding stripes.
Consider a file with a stripe size of 1 MiB.
This 64 MiB DIO generates 64 1 MiB writes, exactly the same as AIO with a queue depth of 64 - except that while the AIO request performs at ~4-5 GiB/s, the DIO request performs at ~300 MiB/s.
This is because the DIO system submits each 1 MiB request and then waits for it:
(Submit 1 stripe (1 MiB)) --> wait for sync, (Submit 1 stripe (1 MiB)) --> wait for sync, ... etc., 64 times.
AIO submits all of the requests and then waits, so:
(Submit 1 stripe(1 MiB)) -> (Submit 1 stripe(1 MiB)) -> (Submit 1 stripe(1 MiB)) -> (Submit 1 stripe(1 MiB)) -> (Submit 1 stripe(1 MiB)) -> (Submit 1 stripe(1 MiB)) -> ... (Wait for all writes to complete)
There is no reason DIO cannot work the same way, and when we make this change, large DIO writes & reads jump in performance to the same levels as AIO with an equivalent queue depth.
The change consists essentially of moving the waiting from the ll_direct_rw_* code up to the ll_file_io_generic layer and waiting there for the completion of all submitted i/os rather than waiting for them one at a time. It is a relatively simple change.
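Structurally, the change can be sketched as follows. This is not the actual Lustre patch; submit_chunk(), wait_for_chunk(), and wait_for_all() are hypothetical stand-ins used only to show where the waiting moves relative to the ll_direct_rw_* / ll_file_io_generic split described above:

/* Structural sketch only: submit_chunk(), wait_for_chunk() and
 * wait_for_all() are hypothetical stand-ins, not Lustre functions. */
#include <stdio.h>

static void submit_chunk(int i)   { printf("submit chunk %d\n", i); }
static void wait_for_chunk(int i) { printf("wait for chunk %d\n", i); }
static void wait_for_all(void)    { printf("wait for all chunks\n"); }

/* Before: each stripe-sized chunk is submitted and then waited on,
 * so the pipeline stalls 64 times for a 64 MiB write. */
static void dio_write_per_chunk_wait(int nchunks)
{
    for (int i = 0; i < nchunks; i++) {
        submit_chunk(i);
        wait_for_chunk(i);
    }
}

/* After: all chunks are submitted first and the single wait happens
 * at the top level, exactly like the AIO path with queue depth 64. */
static void dio_write_submit_then_wait(int nchunks)
{
    for (int i = 0; i < nchunks; i++)
        submit_chunk(i);
    wait_for_all();
}

int main(void)
{
    dio_write_per_chunk_wait(64);
    dio_write_submit_then_wait(64);
    return 0;
}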
The improvement is dramatic, from a few hundred MiB/s to roughly 5 GiB/s.
Quick benchmark:
mpirun -np 1 $IOR -w -r -t 256M -b 64G -o ./iorfile --posix.odirect

Before:
Max Write: 583.03 MiB/sec (611.35 MB/sec)
Max Read: 641.03 MiB/sec (672.17 MB/sec)

After (w/patch):
Max Write: 5185.96 MiB/sec (5437.87 MB/sec)
Max Read: 5093.06 MiB/sec (5340.46 MB/sec)
The basic patch is relatively simple, but there are a number of additional subtleties to work out around when to do this, what sizes to submit, etc. A basic patch will be forthcoming shortly.
Attachments
Issue Links
- is related to
  - LU-14828 Remove extra debug from 398m (Resolved)
  - LU-4198 Improve IO performance when using DIRECT IO using libaio (Resolved)
- is related to
  - LU-13802 New i/o path: Buffered i/o as DIO (Open)
  - LU-13799 DIO/AIO efficiency improvements (Resolved)
  - LU-13805 i/o path: Unaligned direct i/o (Open)
  - LU-13814 DIO performance: cl_page struct removal for DIO path (Open)
"The first thing to check is what e.g. XFS does in such a situation (e.g. EIO from dm-flakey for a block in the middle of a large write)? I don't think error recovery in such a case is clean at all, because O_DIRECT may be overwriting existing data in-place, so truncating the file to before the start of the error is possibly worse than returning an error. However, I do believe that the VFS write() handler will truncate a file that returned a partial error, if it was doing an extending write, and discard any data written beyond EOF."
I agree entirely - it's not clean at all. I don't think truncation is a good answer except for extending writes. And a key point here is that we don't know which blocks were written successfully. (We could figure that out, but then we're tracking that at the top level. I would love to avoid writing that code, which seems like it would be significant, in that it requires awareness of i/o splitting among RPCs, among other things - we would have to map out how the write was split and see which chunks failed.)
So not knowing which blocks failed means we would truncate off the entirety of the extending write in that case. But what about when a write is only partially extending? Ew...
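To make the extending-write case concrete, here is a user-space sketch of the rollback semantics under discussion. It is illustrative only (not Lustre code), and extending_dio_write() is a hypothetical helper; note that it only handles the purely extending case, which is exactly why a partially extending write is awkward:

/* Illustrative only, not Lustre code: "truncate back on a failed
 * purely extending write".  extending_dio_write() is hypothetical. */
#define _GNU_SOURCE
#include <sys/stat.h>
#include <unistd.h>

static ssize_t extending_dio_write(int fd, const void *buf, size_t len,
                                   off_t off)
{
    struct stat st;

    if (fstat(fd, &st) < 0)
        return -1;

    ssize_t rc = pwrite(fd, buf, len, off);
    if (rc < (ssize_t)len && off >= st.st_size) {
        /* The write failed (or was short) and started at or past the
         * old EOF, so it was purely extending.  Since we do not know
         * which blocks actually landed, drop everything beyond the old
         * size rather than exposing an unknown mix of old and new data.
         * A partially extending write has no such clean rollback - that
         * is the open question above. */
        (void)ftruncate(fd, st.st_size);
    }
    return rc;
}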
For XFS... I suspect that if an XFS DIO is split, it is submitted synchronously, i.e., the failure granularity and the waiting granularity are the same, so it would not have this issue.
"Also, for buffered writes, this error should be returned to userspace if any write failed, but it would be returned via close() or fsync() from the saved error state on the file descriptor, and not write(), because the error isn't even detected until after write."
Yes, agreed completely. Sorry to be unclear on that - I meant it's not returned to the write() call.