[LU-13805] i/o path: Unaligned direct i/o Created: 20/Jul/20  Updated: 26/Jan/24

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major
Reporter: Patrick Farrell Assignee: Patrick Farrell
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
is related to LU-247 Lustre client slow performance on BG/... Resolved
is related to LU-13814 DIO performance: cl_page struct remov... Open
is related to LU-17450 sanity: interop test failures with ma... Open
is related to LU-13802 New i/o path: Buffered i/o as DIO Open
is related to LU-13799 DIO/AIO efficiency improvements Resolved
is related to LU-17156 sanityn test_16j: timeout Resolved
is related to LU-17215 sanity/398q should use $tfile Resolved
is related to LU-12550 automatic lockahead Open
is related to LU-17194 parallelize DIO submit Open
is related to LU-17422 unaligned DIO: use page pools Open
is related to LU-17433 async hybrid writes Open
is related to LU-13798 Improve direct i/o performance with m... Resolved
is related to LU-16964 I/O Path: Auto switch from BIO to DIO Closed

 Description   

With the DIO performance improvements in LU-13798 and LU-13799, it becomes interesting to do larger buffered i/o (BIO) using the DIO path, as in LU-13802.

LU-13802 covers the code for switching between the BIO and DIO paths, allowing BIO which meets the requirements for DIO to use the DIO path when appropriate.

The problem is that the requirements for DIO are sometimes hard to meet: i/o must be both page and size aligned.  This ticket is about how to do unaligned DIO, in order to let us do any BIO through the DIO path.
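
For reference, here is a minimal sketch of the alignment check such an i/o must pass (illustrative only; these are not the actual Lustre identifiers):

    #include <stdbool.h>
    #include <stddef.h>

    #define PAGE_SIZE 4096UL

    /*
     * Classic DIO requires the file offset, the total count, and the
     * userspace buffer address to all be page aligned.  An i/o failing
     * any of these tests is "unaligned" in the sense of this ticket.
     */
    static bool dio_is_aligned(unsigned long long offset, size_t count,
                               unsigned long user_addr)
    {
            return (offset & (PAGE_SIZE - 1)) == 0 &&
                   (count & (PAGE_SIZE - 1)) == 0 &&
                   (user_addr & (PAGE_SIZE - 1)) == 0;
    }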

 

This cannot be done with the existing Lustre i/o path.  There are a few minor issues, but the central problem is that if an i/o is unaligned, we no longer have a 1-to-1 mapping between a page on the client and a page in the file/on the server.  (Buffered i/o creates this 1-to-1 mapping by copying into an aligned buffer.)  This 1-to-1 mapping could possibly be removed, but doing so would require a significant rework of the Lustre i/o path.

So, one option is creating a new DIO path which permits unaligned i/o from userspace all the way to disk.

The other option comes from the following observation:
When doing buffered i/o, about 20% of the time is spent allocating the buffer and doing the memcpy() into that buffer.  Of the remaining 80%, something like 70% is page tracking of various kinds.
Because each page in the page cache can be accessed from multiple threads, including being flushed at any time from various threads (memory pressure via kswapd, lock cancellation, writeout...), it has to be on various lists and hold references on (effectively) the file it is part of, etc.

This work, not allocation and memcpy, is where most of the time goes.

So if we implement a simple buffering scheme - allocate an aligned buffer, then copy data to (or from) that buffer - and then do a normal DIO write (or read) from (or to) that buffer, this can be hugely faster than buffered i/o.

If we use the normal DIO path (i.e., sync write, and do not keep pages after read), this remains a buffer, not a cache, so we can keep the DIO path lockless.
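
As a rough illustration of the scheme, a userspace-style sketch (assuming the file descriptor was opened with O_DIRECT; a real implementation must also handle an unaligned file offset and the partial tail page, e.g. via read-modify-write, which is omitted here):

    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define PAGE_SIZE 4096UL

    /*
     * "Fast buffering": copy the user's unaligned data into a freshly
     * allocated aligned bounce buffer, then issue a normal aligned DIO
     * write from that buffer.  The buffer is private to this i/o, so
     * none of the page-cache tracking or locking is needed.
     */
    static ssize_t fast_buffered_write(int fd, const void *user_buf,
                                       size_t count, off_t offset)
    {
            size_t aligned = (count + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1);
            void *bounce;
            ssize_t rc;

            if (posix_memalign(&bounce, PAGE_SIZE, aligned))
                    return -1;

            memcpy(bounce, user_buf, count);        /* the ~20% copy cost */
            rc = pwrite(fd, bounce, aligned, offset);

            free(bounce);
            return rc;
    }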

Also, if we implement this correctly, we have a number of excellent options for speeding this up:

  1. Move the allocation (if we're not using pre-allocated buffers) and memcpy from the user thread to the ptlrpcd threads handling RPC submission - this lets us do these operations in parallel, which should dramatically improve speed.
  2. Use pre-allocated buffers.
  3. Potentially, since we control the entire copying path, we could enable the FPU to use vectorized memcpy; see the sketch below.  (Various aspects of the buffered i/o path in the kernel mean the FPU has to be turned on and off for each page; the cost of this outweighs the benefit of a vectorized memcpy.)
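
To illustrate point 3: because the copy loop is ours, the FPU could be enabled once around the whole multi-page copy rather than per page.  A kernel-style sketch for x86 (kernel_fpu_begin()/kernel_fpu_end() are the real x86 APIs; vectorized_memcpy() is a hypothetical stand-in for an SSE/AVX copy routine):

    #include <asm/fpu/api.h>        /* kernel_fpu_begin()/kernel_fpu_end() */
    #include <linux/mm.h>           /* PAGE_SIZE */

    /*
     * Enable the FPU once for the entire copy instead of once per page,
     * amortizing the save/restore cost that makes vectorized copies a
     * loss in the generic buffered i/o path.
     */
    static void copy_pages_vectorized(void *dst, const void *src,
                                      unsigned long npages)
    {
            kernel_fpu_begin();
            vectorized_memcpy(dst, src, npages * PAGE_SIZE);
            kernel_fpu_end();
    }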


 Comments   
Comment by Patrick Farrell [ 20/Jul/20 ]

Obviously, I favor this approach vs a full new unaligned i/o path, partly because I think this approach will be much simpler/easier.  But if someone wants to implement a new unaligned DIO path and can make it fast and efficient, then that would be good as well.

Comment by Patrick Farrell [ 20/Jul/20 ]

Also, I've done a quick proof of concept of this 'fast buffering' approach, and it seems to work fine.  It's not anything worth posting unless people are really interested in seeing the bare minimum required to make this work for writes.

Comment by Wang Shilong (Inactive) [ 21/Jul/20 ]

I think using an extra page allocation is a simple and good start; in fact, Oleg's lockless write patch has shown big improvements for this.

Comment by Wang Shilong (Inactive) [ 21/Jul/20 ]

I hesitated to mention it, but I'd like to raise this problem/question here:

The main thing we are trying to do here is make IO lockless, but this could change the locking behavior compared to before:

Let's say the stripe size is 4M and we are doing "dd if=/dev/zero of=/mnt/lustre/data bs=16M count=1".

With lockless IO, we don't take the lock on the client; instead, locking is done on the server side.  This means locks are held on the server per RPC, in arbitrary order, which could change the behavior compared to before.  Consider the following case:

Client1: dd if=all_zero_file of=/mnt/lustre/data bs=16M count=1

Client2: dd if=all_1_file of=/mnt/lustre/data bs=16M count=1

Since we fall back to DIO for large IO sizes, the above commands are done locklessly.  Without lockless IO, client1 holds a 16M LDLM lock and then writes its data; client2 tries to take the same lock, which conflicts with client1 and triggers a flush and cancel, so the final data is either all zeros or all ones.

But with lockless IO, locks are held per RPC - say the RPC size is 4M - so the final file contents might be totally random...

I am not saying this will cause a big problem, but it does break previous semantics that some users rely on, so it needs extra attention here, especially since we are considering falling back from buffered IO to DIO.

Comment by Wang Shilong (Inactive) [ 21/Jul/20 ]

I might be wrong about whether a lock can be held across stripes at one time - is it still handled per stripe?

But anyway, we cannot guarantee that one RPC always carries exactly one stripe's worth of data, can we?

Comment by Patrick Farrell [ 21/Jul/20 ]

Well, the goal for me is to make it much, much faster - Unaligned i/o through this path can be made very fast.  Lockless is also very nice, but it's not required. 

But, yes, absolutely:
Write atomicity for multi-stripe writes is broken by lockless i/o.  I sort of thought everyone realized that.

Write atomicity is the guarantee that you get all of one write or another.

So if we have two racing & overlapping 4 MiB writes in a 1 MiB stripe size file (with 4 stripes, to make it simple):
AAAA
BBBB

Write atomicity is the guarantee that we get AAAA or BBBB.
With lockless i/o, we can get any combination of A and B, like ABAB.

With locked i/o, I believe we take all the LDLM locks first, and then do the i/o - this would guarantee atomicity.  (I have to check logs to be 100% sure.)

Comment by Patrick Farrell [ 21/Jul/20 ]

Locking is per-stripe in the sense that each LDLM lock goes to one stripe object.  So, you can't hold one lock on > 1 stripe, but you can take multiple locks (like, the lock on stripe 1 and the lock on stripe 2) and hold them at the same time.

Comment by Patrick Farrell [ 21/Jul/20 ]

I guess I didn't share any performance data or projections in this LU...

So, my first, quick implementation goes at 3 GiB/s.  That is without any parallelism - i.e., all the allocation and copying is done in the userspace thread, not the ptlrpcd threads.

There's also still a bunch of overhead in the DIO path as well.

Basically, I would guess that we can do at least 5-10 GiB/s via this "fast buffering", and I think 15+ is possible.  I need to open the LU about the other work on DIO path efficiency; I haven't found time yet.

Comment by Wang Shilong (Inactive) [ 21/Jul/20 ]

I think 3 GiB/s for now is attractive enough to play with it - Ihara will want to kiss you.

Comment by Andreas Dilger [ 16/Feb/21 ]

It is worthwhile to note that we already have a pre-allocated IO buffer on the client, used for holding encrypted data before it is submitted to the server when Kerberos, SSK, or fscrypt is active. This bounce buffer was recently used by patch https://review.whamcloud.com/38967 "LU-12275 sec: O_DIRECT for encrypted file" for fscrypt, so that the source buffers are not modified as the data is encrypted to be sent to the server. It would be relatively straightforward to detect the case of unaligned incoming O_DIRECT writes and copy the data into the bounce buffers with the proper alignment to be sent to the servers. However, it isn't clear whether this client-side copy would be an "easy win" that only affects the client (i.e. no server-side changes) in all cases; it may be useful in some workloads, and may be worthwhile to implement if the effort is low.

Since we do not have (nor want) the start/end pages of the unaligned O_DIRECT buffers on the client, it is likely that the server would also have to do a data copy to overlay the partial-block write onto the full-sized pages (i.e. read-modify-write either to the page cache or OST preallocated buffers). Doing the data copy on both the client and the server seems undesirable, so it would in fact make more sense to just extend the OST_BRW_READ/WRITE RPC protocol to allow specifying an unaligned IO request, do the RDMA of the source pages with whatever data alignment the client currently has, then copy the data on the server into its final location.

There are several benefits of doing the unaligned data copy on the server side, despite the added CPU overhead it places on the server:

  • the client does not send RPCs to pre-read the start/end pages into its cache to generate full-page writes
  • the server may already have those pages in its cache from other adjacent unaligned writes, and can merge them very efficiently
  • the server can hold the extent lock on the object for the region being modified instead of the client

This would clearly be a tradeoff between added CPU overhead on the server vs. extra RPCs from the client, so this optimization would only make sense up to some particular IO size (e.g. < 64KB or cl_max_short_io_bytes, but subject to measurement). Beyond that threshold, doing the copy on the client to align the data would allow the intermediate pages to avoid the majority of the copying on the OSS. Beyond a higher threshold (e.g. >=1MB) it is better to just read the start/end pages to the client (in a single RPC!) and submit the whole RPC directly to the OSS as is done today.
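
A sketch of the three-way policy described above (the cutoffs and names are placeholders, subject to measurement as noted):

    #include <stddef.h>

    enum udio_strategy {
            UDIO_SERVER_COPY,   /* send unaligned; server copies into place */
            UDIO_CLIENT_BOUNCE, /* client copies into an aligned bounce buffer */
            UDIO_CLIENT_RMW,    /* client reads start/end pages, submits aligned RPC */
    };

    /* Hypothetical cutoffs, e.g. 64KB and 1MB per the discussion above. */
    #define UDIO_SERVER_COPY_MAX    (64 * 1024)
    #define UDIO_CLIENT_RMW_MIN     (1024 * 1024)

    static enum udio_strategy udio_pick_strategy(size_t io_bytes)
    {
            if (io_bytes < UDIO_SERVER_COPY_MAX)
                    return UDIO_SERVER_COPY;
            if (io_bytes >= UDIO_CLIENT_RMW_MIN)
                    return UDIO_CLIENT_RMW;
            return UDIO_CLIENT_BOUNCE;
    }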

Comment by Gerrit Updater [ 22/Feb/21 ]

Wang Shilong (wshilong@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41715
Subject: LU-13805 clio: bounce buffer to support unaligned DIO write
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: ce19114c327946a609f1d239bccf588a5be86112

Comment by Wang Shilong (Inactive) [ 22/Feb/21 ]

This is just a simple way to support unaligned DIO writes using a bounce buffer; it might be interesting to see how this could help IOR hard write performance.

 

Comment by Gerrit Updater [ 18/Nov/21 ]

"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45616
Subject: LU-13805 clio: bounce buffer for unaligned DIO
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e1b0da4c429116b824335567c9f29e8a30385bb8

Comment by Oleg Drokin [ 31/Jan/22 ]

I know I am late on this ticket, but why do we need a 1:1 mapping between client and server pages for IO, and therefore the need for bounce buffers?

We already had an "all io is direct io" patch that LLNL used on the 1.8 branch; basically, for LNet purposes you give it an iovec of the form "my data starts at this page and this offset and runs this long (spanning many pages)". Can't seem to find the ticket now, though.

Server-side, it lands in another buffer at a totally different alignment, which is not a problem. The only stretch is that you need to calculate the lnb/rnb as if your source data had a particular alignment, but it does not require any data movement.

We also had similar lockless IO, triggered by contention on a resource, in the 1.x branch.

Comment by Patrick Farrell [ 31/Jan/22 ]

So my understanding is as follows...  (Oleg, this will include some things you definitely know, I'm just trying to give a complete explanation for anyone reading this.)

Allowing unaligned I/O means userspace pages can have a non-1-to-1 mapping with file pages.  A write from a single page of userspace memory can span two file pages, since it doesn't have to be aligned.

As I understand it, the data eventually needs to be aligned for the write to disk.  That is, we can't write from the same page in memory to two pages on disk, or at least we can't do it without a significant cost.  (With an unaligned buffer, every memory page is unaligned with respect to disk pages.)

Perhaps I'm wrong about that and this is cheap.  That would change things a lot - but it would be a pretty big shift from how things work today...?

Anyway, assuming that's an issue, the next place you could solve alignment is the network, copying from the userspace buffer (unaligned) to some aligned blob of server memory (where 'aligned' means a 1-to-1 mapping with file/on-disk pages).  The problem is, again, the alignment:  My understanding is you cannot (efficiently) do RDMA on a non-1-to-1 page mapping.  If you have 1 page of memory on 1 side, it needs to map precisely to 1 page of memory on the other side.  Otherwise you end up with a complicated expression describing the memory.

This means you can't use RDMA to do the alignment for you.

I'm pretty confident about the RDMA thing (we should ask Amir, though - I may be wrong), but less sure about the disk thing - Is it possible to efficiently write from unaligned memory to disk? (in the sense of "pages in memory don't map 1-to-1 to pages in the file")  (I am assuming all the same issues apply for reads, I'm pretty sure this is the case.)
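
To make the mapping problem concrete, a small standalone sketch of how an unaligned buffer spans file pages:

    #include <stdio.h>

    #define PAGE_SIZE 4096UL

    /*
     * Data whose start is offset within a page overlaps one extra file
     * page: source page N contributes to file pages N and N+1, so there
     * is no 1-to-1 source/destination page mapping.
     */
    int main(void)
    {
            unsigned long file_off = 1024;          /* unaligned start */
            unsigned long len = 4 * PAGE_SIZE;      /* 4 pages of data */
            unsigned long first = file_off / PAGE_SIZE;
            unsigned long last = (file_off + len - 1) / PAGE_SIZE;

            /* prints 5: four pages of data touch five file pages */
            printf("data spans %lu file pages\n", last - first + 1);
            return 0;
    }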

Comment by Patrick Farrell [ 31/Jan/22 ]

I'd actually be quite excited to be wrong about this, if it's practical to do - this would be significantly faster, and when we implement buffered-as-direct, it would bring buffered-as-direct and DIO into line performance-wise without extra effort.  (It was probably possible, with cleverness and effort, to get them close, but this would spare that effort.)

Comment by Patrick Farrell [ 01/Feb/22 ]

Oleg, Andreas,

It would be great to get reflections on this - I've been working on this and this would be a notable shift if it's possible to take this approach.  (A good shift, but still a shift.)

Comment by Andreas Dilger [ 01/Feb/22 ]

Patrick, my understanding of LNet/IB RDMA details is limited, but it aligns with yours - that the source pages need to be aligned, and we can't do RDMA from arbitrary offsets. However, I could easily be wrong also.

Comment by Patrick Farrell [ 01/Feb/22 ]

Andreas,

I was actually hoping you could comment on the bit about requiring alignment for the eventual write to disk:
"

As I understand it, the data eventually needs to be aligned for the write to disk.  Like, we can't write from the same page in memory to two pages on disk, or we can't do it without a significant cost.  (Since every memory page is unaligned to disk pages)"

Any thoughts there?  That's the eventual source of the requirement, because otherwise we could RDMA to a set of pages with the same alignment as the user pages and then do the disk I/O from this (entirely unaligned with respect to the file on disk) buffer.

Comment by Andreas Dilger [ 01/Feb/22 ]

That would require that the OSS do data copies for every byte. I would rather have the client do the data copies and the OSS can do direct RDMA into the aligned pages and submit them directly to the filesystem.

For very small IO (few bytes to few pages unaligned) there is a benefit from doing the copy on the OSS because this avoids the partial-page copy to the client.

Comment by Oleg Drokin [ 01/Feb/22 ]

The alignment requirement for the write to disk is correct, but it is a red herring.

Basically unaligned pages on the client turn into aligned pages on the server because lnet is a copy from unaligned pages into aligned ones pretty much. Thus the server cache is your bounce buffer and you don't need another one on the client.

This was already done before - check the lnet interface: you really give it the beginning of the data (in the form of a page and offset) and a length. And if a perfect 1:1 page-to-page mapping were required on both ends, different-sized pages would never work, and they do.

Comment by Patrick Farrell [ 01/Feb/22 ]

"That would require that the OSS do data copies for every byte. I would rather have the client do the data copies and the OSS can do direct RDMA into the aligned pages and submit them directly to the filesystem.

For very small IO (few bytes to few pages unaligned) there is a benefit from doing the copy on the OSS because this avoids the partial-page copy to the client."

Thanks Andreas, that's what I thought - Just wanted to confirm.

Comment by Patrick Farrell [ 01/Feb/22 ]

"Basically unaligned pages on the client turn into aligned pages on the server because lnet is a copy from unaligned pages into aligned ones pretty much. Thus the server cache is your bounce buffer and you don't need another one on the client."

"because lnet is a copy from unaligned pages into aligned ones pretty much"
This isn't ever true today - On the client, page cache pages are aligned (in the sense that the nth byte of a page in memory is the nth byte of a page in the file, there is no offset between them), and so are DIO pages.  It may be that LNet can do this, but it doesn't today.

"This was already done before, and check the lnet interface - you really give it the beginning of data (in the form of page and offset) and length. And if a perfect 1:1 page to page mapping was required on both ends the different sized pages would never work and they do."

Hmm.  We'll need to ask Amir.  I've been saying a 1-to-1 mapping, but that was shorthand for an aligned mapping.  Sure, a 64K page can be mapped to 16 4K pages, but that's still an aligned mapping.  The larger page maps to a fixed number of smaller pages.

I may just be wrong about this and RDMA can fix alignment for us - that would be great.

Comment by Oleg Drokin [ 01/Feb/22 ]

Yes, the page cache is of course aligned; we are talking about data from userspace.

See commit e4472c979c809ee4ba8c14b933c2e53713af1845 for an unaligned direct IO implementation; there were some additional fixes afterwards.

Comment by Patrick Farrell [ 01/Feb/22 ]

ashehata ,

Question about how RDMA works.

If I have a sequence of data spanning some number of pages in memory on one node, can I efficiently RDMA that to a set of pages on another node and change the offset in the pages?  eg, say the data on the source node is offset 1K from where I wish it was, so byte 1024 in page 1 on the source would become byte 0 in page 1 at the destination.

Then, because the copy started at offset 1024 in page 1 on the source, bytes 3072-4096 of the copy will come from page 2 on the source (the first 1024 bytes of page 2 on the source), but will go to bytes 3072-4096 of page 1 at the destination.  So there's not a 1-to-1 mapping between source and destination pages.

Can that be done efficiently with most RDMA implementations?  My impression was the answer is no.

Comment by Patrick Farrell [ 01/Feb/22 ]

Oleg,

Thanks for the pointer, my point was just that all users today provide lnet with aligned memory because we don't support unaligned DIO.

Comment by Alexey Lyashkov [ 02/Feb/22 ]

@Patrick,

It's possible, with wr_sge == 2: the first SGE covers the first half page and the second SGE entry covers the second half page, but the addresses need to be physically contiguous unless the card has GAPS support.
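
A sketch of what this would look like with the kernel verbs struct ib_sge (illustrative; in a real work request ib_send_wr.num_sge would be 2, and the lkey would come from a registered MR):

    #include <rdma/ib_verbs.h>

    /*
     * Two scatter/gather entries describing one destination page whose
     * payload comes from the tail of source page A followed by the head
     * of source page B, per the description above.
     */
    static void fill_split_page_sge(struct ib_sge sge[2], u64 page_a,
                                    u64 page_b, u32 split, u32 lkey)
    {
            sge[0].addr   = page_a + split;     /* second part of page A */
            sge[0].length = PAGE_SIZE - split;
            sge[0].lkey   = lkey;

            sge[1].addr   = page_b;             /* first part of page B */
            sge[1].length = split;
            sge[1].lkey   = lkey;
    }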

Comment by Gerrit Updater [ 06/Feb/23 ]

"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49913
Subject: LU-13805 clio: Add csi_complete
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 56451eeaa985b274575013ae4e1eeb04282d8445

Comment by Gerrit Updater [ 06/Feb/23 ]

"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49915
Subject: LU-13805 llite: Fix return for non-queued aio
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 40cb4a09d4188f5ba8c4c88be365f823d4f041a0

Comment by Gerrit Updater [ 08/Feb/23 ]

"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49940
Subject: LU-13805 llite: Add copy of iovec to sub-dio
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 1f1a3de4492be7be5e86dffe9ba54a38481ca222

Comment by Gerrit Updater [ 08/Feb/23 ]

"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49947
Subject: LU-13805 llite: add mm to dio struct
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: cedcde6ddaf042e96b4d43b9e9c1f9535769636e

Comment by Gerrit Updater [ 14/Feb/23 ]

"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49987
Subject: LU-13805 clio: Trivial DIO cleanups
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: b0bc6fba34c5c6a96450672d5438b2a1d3d06f7e

Comment by Gerrit Updater [ 14/Feb/23 ]

"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49988
Subject: LU-13805 osc: Add debug
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 154b38a589a0b55e409372980c2108f167ad6376

Comment by Gerrit Updater [ 14/Feb/23 ]

"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49989
Subject: LU-13805 tests: add debug to aiocp
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c73e18132137e3d4f10aa65a048dd4ba08edb0c7

Comment by Gerrit Updater [ 14/Feb/23 ]

"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49990
Subject: LU-13805 tests: add unaligned io to multiop
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 5c68d2bd8e9a1d9b2d68e6db0acc45949b5483c9

Comment by Gerrit Updater [ 14/Feb/23 ]

"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49991
Subject: LU-13805 clio: Add write to sdio
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 1f9441fd8a77decd4252e6e26fa7f5d4565be9bc

Comment by Gerrit Updater [ 14/Feb/23 ]

"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49993
Subject: LU-13805 llite: unaligned direct_rw_pages
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 2b7c31f273da6c48e18177894c0ba25980c75e07

Comment by Gerrit Updater [ 20/Feb/23 ]

"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50067
Subject: LU-13805 osc: Don't include lock for srvlock
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: d0c89ad51745949681408d150d1be2ba984b5fc5

Comment by Gerrit Updater [ 20/Feb/23 ]

"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50075
Subject: LU-13805 tests: Add additional tests of DIO/BIO
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 48c161f73436629ed77a4f02014fc1c011a98bee

Comment by Gerrit Updater [ 01/Mar/23 ]

"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50167
Subject: LU-13805 llite: Improve sync_io comments
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: f2ab0390f0fc318efa74ffd6f9b17ac19bff781b

Comment by Gerrit Updater [ 01/Mar/23 ]

"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50168
Subject: LU-13805 llite: Convert allocate/get to use pvec
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 48830a00ceae8bafb6b74926265bde91c0101085

Comment by Gerrit Updater [ 01/Mar/23 ]

"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50170
Subject: LU-13805 llite: Rename ldp_aio to sdio
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 525b1b31cf91896bad6067703bc2fde16ccde649

Comment by Gerrit Updater [ 04/Apr/23 ]

"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50529
Subject: LU-13805 tests: Add racing tests of BIO, DIO, AIO
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 58e16cb9742999879ec69b041bdfa9a0bfe2c6b0

Comment by Gerrit Updater [ 07/Apr/23 ]

"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50577
Subject: LU-13805 tests: add racing tests of aio
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 1112c1fb45a049ee3f7da47bc673da6e047e4b3f

Comment by Gerrit Updater [ 13/Apr/23 ]

"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50621
Subject: LU-13805 osc: Make unaligned DIO async
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 98ee008afe8a08a4d91390b4ff47f131d40e1ea1

Comment by Andreas Dilger [ 13/Apr/23 ]

Because unaligned DIO is using a copy of the data from userspace, we can make the writes async on the OST side. Because DIO expects that data be 'safe', we cannot just go entirely async and not wait for RPC completion, but we can wait only for RPC completion and not force a commit sync on the server.

Consider - for buffered as DIO, we could just return to userspace once the copy is complete and not worry about the RPC. The problem with that is it creates a lot of issues around things like fsync. And ... actually, no, we couldn't do this with server side locking, because the data in the client buffer would be unlocked. It is of course possible in theory to do locked parallel DIO, though that would require some other changes.

DIO is expected to avoid caching, so that if write() returns to userspace, then a read() of the same offset expects to see the previous writer's data. If the write() call is held until the client has a reply to the RPC (not just RPC sent, but doesn't have to actually be committed to the OST) then the write/read ordering is maintained.

If the RPC-pinned data on the client is not visible to userspace, then there isn't an immediate consistency issue. If only the OST crashes, then the writes would be replayed (in transno order) from the clients so there shouldn't be any out-of-order writes or reads. If only the client crashes, then the writes would eventually commit on the OST.

 

The second expectation of DIO is data persistence after write() returns, and generally no need to call fsync() on a file. If BOTH a client and OST crash before the writes are committed, then that client's writes would be lost, and possibly also later writes from other clients due to recovery abort from missing transno.

I suspect that many applications using DIO would expect that data is persistent after a DIO write() call returns, so this behavior may need to be a configurable parameter (even though it isn't strictly required). That said, there is separately O_SYNC/O_DSYNC in addition to O_DIRECT, so files using only O_DIRECT should not necessarily expect data persistence after write() return without fsync().  The open(2) man page seems to confirm this:

O_DIRECT (Since Linux 2.4.10)
    The O_DIRECT flag on its own makes an effort to transfer data
    synchronously, but does not give the guarantees of the O_SYNC flag
    that data and necessary metadata are transferred.  To guarantee
    synchronous I/O, O_SYNC must be used in addition to O_DIRECT. 

    :
     
    The behaviour of O_DIRECT with NFS will differ from local file systems.
    Older kernels, or kernels configured in certain ways, may not support
    this combination.  The NFS protocol does not support passing the flag
    to the server, so O_DIRECT I/O will bypass the page cache only on the
    client; the server may still cache the I/O. The client asks the server
    to make the I/O synchronous to preserve the synchronous semantics of
    O_DIRECT.  Some servers will perform poorly under these circumstances,
    especially if the I/O size is small. Some servers may also be configured
    to lie to clients about the I/O having reached stable storage; this will
    avoid the performance penalty at some risk to data integrity in the event
    of server power failure.  The Linux NFS client places no alignment
    restrictions on O_DIRECT I/O.

If the IOR IO engine was using AIO/DIO, or io_uring, then delaying write completion until OST commit would avoid this issue completely, and would only be limited by the amount of RAM on the client. This would be similar if unaligned BIO write() was treated as lockless async DIO write() (as with this patch). There is no expectation for consistency after BIO write() return until fsync() is called (which should send an OST_SYNC RPC if no writes are pending, or send the last write RPC without BRW_ASYNC) and then wait for last_committed to drop all unstable writes for that file.

Comment by Patrick Farrell [ 13/Apr/23 ]

Andreas,

Yes, sorry - that was a bit stream of consciousness on my part; I blame the late hour.  When I started rambling about fsync, I was considering how to maintain the write visibility aspect if we held things in our DIO buffer, in order to make writes fully async.  It gets ugly quickly, since you have to implement a flushing mechanism, etc, etc, and you're fairly quickly staring at re-implementing the page cache, so you do a ton of work and lose most of the benefit of DIO along the way.  It's a bad idea.

I agree completely about the DIO and BIO guarantees.  (More shortly)

Comment by Patrick Farrell [ 13/Apr/23 ]

So, you make a comment here which worries me, I'd like to check it:
"There is no expectation for consistency after BIO write() return until fsync() is called"

If you mean "consistency" as in "O_SYNC" type guarantees, where data is 100% safely on disk, then yes, there is no such expectation for BIO.  If you mean write visibility, then no - any write() which has returned to userspace must be visible to any read which starts after that return to userspace.  (I think of this as "the forward progression of time is guaranteed", but that's special relativity leaking in.)

I just wanted to highlight this point - I may be misunderstanding you, but it's a very key point, so I want to make sure we're on the same page.

Comment by Patrick Farrell [ 13/Apr/23 ]

Yingjin asked "What about the aligned DIO which also wants not to force journal sync on OST for each write?"
Which is a fun question, and tied in to what we're talking about here, so I decided to put my explanation here.  It has a lot of overlap with what Andreas just said.
----------------
DIO has an expectation that the data be 'safe' after a write, basically similar to if they used O_SYNC.  This isn't in the spec (if you look it up, you can find people arguing about this), but in practice, people expect it.  This is, arguably, the only requirement for DIO, since the "no caching" thing is an expectation that isn't always met (DIO can fall back to BIO under certain circumstances in many Linux file systems).

So how do we guarantee data is 'safe' after a write completes?  Let's compare some things.
Buffered IO has a weak guarantee: a write completes once the data is in cache, so if the client crashes, data is lost.  If the server crashes/goes away before the client sends the RPC, the client just waits.  When the client sends an RPC, we 'pin' the pages so they cannot be removed from the page cache, but we do mark the pages as clean (so now the write is 'really' done from the VFS perspective).  Anyway, we pin the pages so we can replay the RPC if the server crashes, and then we unpin them on commit.

Normal direct IO has the strongest possible guarantee: A write completes once the server has fully committed the data and the associated journal.  There is no possibility of data loss once a write completes.  But, also note:
Normal DIO is done directly from userspace pages.  This means we can't keep them pinned for RPC replay after we return on write(), because then the user can change them.  This means we have to do a full sync/commit on the server.

Because unaligned direct I/O has its own buffer, we can do something different.  Since we have our own copy of the data, it is possible to complete the write after we have sent the RPC to the server and received a reply.  This means we are set up for replay.  So this gives a partial safety guarantee:
If the client crashes, the server already has the data and will finish committing it.
If the server crashes, the client will replay the RPC.

This is only possible because we have our own buffer.

So, basically, this is a 'reduced' version of the normal DIO 'sync'/safety guarantee: it is possible to lose data after a DIO write, but it should be very rare.  (This is why I consider it likely to be acceptable.)

As Andreas suggests, we should probably make this behavior tunable, because it is a reduction in the safety guarantee - it opens up a small-ish window for data loss.  It's still safer than buffered I/O, but less safe than 'standard' DIO.

(Answer continues in next post...)

Comment by Patrick Farrell [ 13/Apr/23 ]

Now, as Yingjin asked, what about aligned DIO that wants to skip the commit?
As explained above, the issue is that we must have our own copy of the data for replay, or we cannot skip the commit.  Once we return from write(), the user is allowed to change the data they submitted (obviously).  This means we need our own copy to safely replay an RPC.

So this means we would be allocating a buffer and doing a data copy for small direct I/O writes.  The copy has a cost, and for large DIO writes it would be much slower.  (This is why large unaligned DIO is much faster than buffered I/O, but much slower than large aligned DIO, which does not have to do the data copy.)

However, the cost of allocation + copy is small for small DIO, and we can then make the commit async.  So it probably does make sense to do 'small' DIO (for a definition of 'small' that we have to figure out...) as 'async commit'.  The idea of DIO with async commit is something I only thought of yesterday, and it hadn't occurred to me to apply it to other small DIO.  So I am still thinking about it.
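
A rough summary of the completion guarantees discussed in the last few comments (my reading of the thread, not an authoritative specification):

    Path                    write() completes when...           Client crash      Server crash
    ----                    -------------------------           ------------      ------------
    Buffered i/o            data copied into client cache       data lost         client replays RPC
    Aligned DIO             data + journal committed on OST     data safe         data safe
    Unaligned DIO (async)   RPC reply received from OST         OST commits       client replays RPC

As noted above, losing data in the unaligned async-commit case requires both the client and the OST to crash before the write commits.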

Comment by Andreas Dilger [ 13/Apr/23 ]

Patrick, you are right that for BIO I meant to say that users do not have any expectation about persistence after write() return, but POSIX requires read-after-write ordering guarantees that we cannot break. 

Comment by Gerrit Updater [ 10/May/23 ]

"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50920
Subject: LU-13805 tests: test for AIO fallback across stripe
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 40042cc2e39994df6c460f3707a7d985973e9196

Comment by Gerrit Updater [ 11/May/23 ]

"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50966
Subject: LU-13805 llite: wait for partially successful aio
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 312019d6302e2d4ef3c515a6ee294bbee781660f

Comment by Gerrit Updater [ 21/May/23 ]

"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51075
Subject: LU-13805 obd: Reserve unaligned DIO connect flag
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 40d5c59b89de496e837c9021a250b0dbb3b74588

Comment by Gerrit Updater [ 24/May/23 ]

"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51125
Subject: LU-13805 llite: add flag to disable unaligned DIO
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: f8e7ae17326a420f25b3249f3a7158f2876290a9

Comment by Gerrit Updater [ 24/May/23 ]

"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51126
Subject: LU-13805 llite: Implement unaligned DIO connect flag
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 2b7ee4aec08c34f966e35d7b41fee13854f2c9d3

Comment by Gerrit Updater [ 31/May/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49915/
Subject: LU-13805 llite: Fix return for non-queued aio
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 8a5bb81f774b9d41f1009b07010372fa9cd03a62

Comment by Gerrit Updater [ 01/Jun/23 ]

"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51194
Subject: LU-13805 tests: eviction debugging
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 445c1aaca6d910fbc51bece70f3be5fd8d8defb4

Comment by Gerrit Updater [ 01/Jun/23 ]

"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51195
Subject: LU-13805 llite: Fake change to force testing
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 763c6925f2b0f7f6f9d8f531e17ef1318a28c0b7

Comment by Gerrit Updater [ 28/Jun/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50067/
Subject: LU-13805 osc: Don't include lock for srvlock
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 9a34ec2b09864a93339fd8c90d358debea9703f9

Comment by Gerrit Updater [ 28/Jun/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49987/
Subject: LU-13805 clio: Trivial DIO cleanups
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: a667cfac0e70cd4da7c70ef28ffdcc25136996b2

Comment by Gerrit Updater [ 28/Jun/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49988/
Subject: LU-13805 osc: Add debug
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 6d1045d63a5bf33a1f26e0fd939305bdd49767f3

Comment by Gerrit Updater [ 14/Jul/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49989/
Subject: LU-13805 tests: add debug to aiocp
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: bd30b49355c071183ed29d15e2d2c6b78f92bda7

Comment by Gerrit Updater [ 14/Jul/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50168/
Subject: LU-13805 llite: Convert allocate/get to use pvec
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: dfdfda0ba495d8b3e27601f43ac298fea43c4194

Comment by Gerrit Updater [ 14/Jul/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50170/
Subject: LU-13805 llite: Rename ldp_aio to sdio
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 7c416273ad1a8f55990d430221c16e6dd43ea7e0

Comment by Gerrit Updater [ 27/Jul/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50167/
Subject: LU-13805 llite: Improve sync_io comments
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 8fa28e42ce38cb3b6834284f7793e3e74e776c75

Comment by Gerrit Updater [ 31/Jul/23 ]

"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51815
Subject: LU-13805 tests: patch for testing
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 32f9fef79f68a31ce1063b027c562b9c7a3b1106

Comment by Gerrit Updater [ 07/Aug/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49990/
Subject: LU-13805 tests: add unaligned io to multiop
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 84376bf67446d7b061d032719f335edc12a932ff

Comment by Gerrit Updater [ 22/Aug/23 ]

"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52044
Subject: LU-13805 tests: janitor testing for csi_complete (2)
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 722198b9e8fae940b253f9beba612b682fb6b3b0

Comment by Gerrit Updater [ 22/Aug/23 ]

"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52045
Subject: LU-13805 tests: janitor testing for csi_complete (3)
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: b5da30c10e4625e333f146a0bcca62a31ff402bd

Comment by Gerrit Updater [ 22/Aug/23 ]

"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52046
Subject: LU-13805 tests: janitor testing for csi_complete (4)
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: b42a70c79b8e343f2af830bc6b5d0b770f8f0453

Comment by Gerrit Updater [ 22/Aug/23 ]

"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52047
Subject: LU-13805 tests: janitor testing for csi_complete (5)
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 5d7cb36db4692d80b785b53c4bb72e3de891c583

Comment by Gerrit Updater [ 22/Aug/23 ]

"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52048
Subject: LU-13805 tests: janitor testing for csi_complete (6)
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 3cd760fe020f0ce001bdadfea4a5c04eaeb474ca

Comment by Gerrit Updater [ 22/Aug/23 ]

"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52049
Subject: LU-13805 tests: janitor testing for csi_complete (7)
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 268a58e8485a51569ea3faef8ef4b1f2826096f0

Comment by Gerrit Updater [ 22/Aug/23 ]

"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52050
Subject: LU-13805 tests: janitor testing for csi_complete (8)
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 8a5804fcc69fbd34a6b9e92e8287cdbcf004ed94

Comment by Gerrit Updater [ 23/Aug/23 ]

"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52057
Subject: LU-13805 llite: make page_list_{add,del} symmetric
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 7c5eb60a6c9260b3816c760e449c7f14a35ef688

Comment by Gerrit Updater [ 24/Aug/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50529/
Subject: LU-13805 tests: Add racing tests of BIO, DIO
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 43c3a804fe23c96de4f76ae9f1f8ba909558433d

Comment by Gerrit Updater [ 24/Aug/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49991/
Subject: LU-13805 clio: Add write to sdio
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 89f489ef9a18b5fea0571e2606dfdf064777626b

Comment by Gerrit Updater [ 24/Aug/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49940/
Subject: LU-13805 llite: Add copy of iovec to sub-dio
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: d011e65a8936bc105e19b3fad64bfdece455de61

Comment by Gerrit Updater [ 24/Aug/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51075/
Subject: LU-13805 obd: Reserve unaligned DIO connect flag
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 4c96cbf89dba5e4bf8ddf98a18b72142c22a4289

Comment by Gerrit Updater [ 06/Sep/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49913/
Subject: LU-13805 clio: Add csi_complete
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: a2b722214a99c8c65fb764a67770b47a1195691a

Comment by Gerrit Updater [ 06/Sep/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49947/
Subject: LU-13805 llite: add mm to dio struct
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 7df8bd69fbe59afba0a43fe19e7a5b1d2c3fd115

Comment by Gerrit Updater [ 15/Sep/23 ]

"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52391
Subject: LU-13805 llite: fail unaligned DIO for RDMA pages
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 4f72f0b4b5277eca792bad33b0e6df4df9d32ce1

Comment by Gerrit Updater [ 28/Sep/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49993/
Subject: LU-13805 llite: unaligned direct_rw_pages
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 40bb067da1f40714af4f77d24e840fdb6285b08e

Comment by Gerrit Updater [ 28/Sep/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/45616/
Subject: LU-13805 clio: bounce buffer for unaligned DIO
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 7194eb6431d2ef7245ef3b13394b60e220145187

Comment by Gerrit Updater [ 12/Oct/23 ]

"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52668
Subject: LU-13805 llite: udio and encryption
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 5ea5a9c13a81d0168137230b74b4450ce66de54a

Comment by Gerrit Updater [ 03/Nov/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51125/
Subject: LU-13805 llite: add flag to disable unaligned DIO
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 33ed79ba61a3275519c897557407619d576a9dc2

Comment by Gerrit Updater [ 18/Nov/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51126/
Subject: LU-13805 llite: Implement unaligned DIO connect flag
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 0e6e60b1233b08952c338b2c4f121ef749a99f8b

Comment by Andreas Dilger [ 22/Jan/24 ]

Patrick, a number of subtests added as part of this series are failing during sanity interop testing - 119h, 119i:
https://testing.whamcloud.com/test_sets/dc77145c-b7d3-4010-a7a2-f8435f9353ff

Could you please push a patch for master to add a version check to those subtests with:

Fixes: 7194eb6431 ("LU-13805 clio: bounce buffer for unaligned DIO")

Comment by Patrick Farrell [ 26/Jan/24 ]

I opened LU-17473 to take these two patches, which aren't actually related to unaligned DIO:
llite: wait for partially successful aio 
https://review.whamcloud.com/50966/
tests: add racing tests of aio 
https://review.whamcloud.com/50577/

Comment by Patrick Farrell [ 26/Jan/24 ]

Yes, sure - I'll add those checks.
