
LU-9409: Lustre small IO write performance improvement

Details

    • Type: Improvement
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.11.0

    Description

      This task addresses the problem of poor small I/O write performance in Lustre.

      We only cover cached (buffered) I/O here, because LU-4198 already provides a decent solution for improving small AIO + DIO.

      Also, for small-I/O workloads, this work assumes the I/O pattern is highly predictable. In other words, it will not help workloads of small random I/O.

      The small I/O doesn't have to be page aligned.
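
      For concreteness, the sketch below shows the kind of workload being targeted: many tiny, sequential, buffered writes that need not be page aligned. It is only an illustrative example; the mount point, write size, and iteration count are arbitrary assumptions, not values from this ticket.

      /* Illustrative tiny-write workload: small sequential buffered writes,
       * not page aligned. Path, write size, and count are arbitrary examples. */
      #include <fcntl.h>
      #include <stdio.h>
      #include <string.h>
      #include <unistd.h>

      int main(void)
      {
          const char *path = "/mnt/lustre/tiny_write_test";  /* example mount point */
          char buf[100];                                      /* 100-byte records: not page aligned */
          int fd, i;

          memset(buf, 'x', sizeof(buf));

          fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
          if (fd < 0) {
              perror("open");
              return 1;
          }

          /* Sequential small writes; each one lands in the client page cache. */
          for (i = 0; i < 100000; i++) {
              if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
                  perror("write");
                  close(fd);
                  return 1;
              }
          }

          close(fd);
          return 0;
      }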

      (detailed HLD in progress)


          Activity


            gerrit Gerrit Updater added a comment -

            Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/31306
            Subject: LU-9409 llite: Add tiny write support
            Project: fs/lustre-release
            Branch: b2_10
            Current Patch Set: 1
            Commit: 3480f68c6497dd5fc359284ba13fb816415f2f5f

            pjones Peter Jones added a comment -

            It looks like all outstanding work tracked under this ticket has landed but we can reopen this ticket if more work is still to come


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27903/
            Subject: LU-9409 llite: Add tiny write support
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 94470f7eeab5fde0648a14dda36941402c6a3e10


            paf Patrick Farrell (Inactive) added a comment (edited) -

            Perf record of dd with count=8, relevant parts expanded.
            We actually spend (slightly) more time in cl_env_get than in cl_page_touch:

            See attached screenshot, I can't get the text to format correctly.


            paf Patrick Farrell (Inactive) added a comment -

            Full conversation, a few bits of proprietary information redacted. Apologies for the messy presentation.

            -------
            Re: local file systems, it seemed to leave us off by about a factor of 2 (for very tiny) vs what I tested. I should probably rerun those so I have exact results, and I don't recall the 1k results well. But those were much closer between Lustre and local. The 10-20x was for a few bytes at a time.

            That 30% is entirely separate, because the short I/O (LU-1757) doesn't apply for this - we're just dropping data in the page cache and then sending it out later. The actual RPCs are going to be large.

            I'll plan to copy the discussion over and attach my patch.

            Jinshan, may I post the PDF of the containers design doc? Or would you?

            • Patrick

            From: Dilger, Andreas <andreas.dilger@intel.com>
            Sent: Friday, June 30, 2017 7:50:20 PM
            To: Patrick Farrell
            Cc: Xiong, Jinshan; Cory Spitz
            Subject: Re: Small writes, LU-1757, containers, etc.

            On Jun 30, 2017, at 13:35, Patrick Farrell <paf@cray.com> wrote:
            > Sure. That's what my patch does, essentially, and it's working at least as far as that goes.
            >
            > On real hardware, it seemed to offer a roughly 10x improvement for writes in the 8-32 byte range, 30% for writes in the 1k range. A bit less on my VMs.
            >
            > I've posted it here, but haven't opened an LU yet:
            > https://review.whamcloud.com/#/c/27903/1
            >
            > If we can fix the problems I noted in my comments, it might even be worth putting in by itself, since even if we go forward with containers, that will presumably take a while, and this patch is pretty small and could possibly land soon-ish.

            Interesting. So with this patch (improving very small sequential writes by 10x), does this bring Lustre into the realm of local filesystems, where you previously reported Lustre was 10-20x worse? I'm not sure I'm comparing the same workloads, but 10x would be a very interesting improvement, and even 30% faster is very useful. Is that 30% for this patch itself, in addition to the ~30% for the short IO changes?

            I think this whole discussion should be moved to LU-9409 so that it isn't lost.

            Cheers, Andreas

            >
            >> From: Dilger, Andreas <andreas.dilger@intel.com>
            >> Sent: Friday, June 30, 2017 12:05:21 PM
            >> To: Xiong, Jinshan
            >> Cc: Patrick Farrell; Cory Spitz
            >> Subject: Re: Small writes, LU-1757, containers, etc.
            >>
            >> So, I'm not against the container idea if it really shows benefits.
            >>
            >> Is there some way that we could see a prototype of that functionality that would allow us to estimate how much performance improvement we might get, without spending the full time to implement it and get it working?
            >>
            >> It doesn't actually have to work correctly, just a hack to give us an idea of the performance gains. For example, if there were a check in the llite layer to see whether the page is already dirty and return immediately, then testing with 1KB or 512-byte writes should short-circuit the "slow" write path for 3/4 or 7/8 of the writes.
            >>
            >> Cheers, Andreas
            >>
            >>
            >> > On Jun 30, 2017, at 09:45, Xiong, Jinshan <jinshan.xiong@intel.com> wrote:
            >> >
            >> >
            >> >> On Jun 28, 2017, at 8:00 AM, Patrick Farrell <paf@cray.com> wrote:
            >> >>
            >> >> Andreas,
            >> >>
            >> >> Not at all, I just didn't want to impose. I've copied him in.
            >> >>
            >> >> Outlook makes this ugly, but I'm going to do responses inline as well. (Marked out by ---- on either side)
            >> >>
            >> >>
            >> >>
            >> >> From: Dilger, Andreas <andreas.dilger@intel.com>
            >> >> Sent: Tuesday, June 27, 2017 4:37 PM
            >> >> To: Patrick Farrell
            >> >> Cc: Cory Spitz
            >> >> Subject: Re: Small writes, LU-1757, containers, etc.
            >> >>
            >> >> Patrick, do you mind if I include Jinshan in this discussion?
            >> >> More comments inline.
            >> >>
            >> >> On Jun 27, 2017, at 12:14, Patrick Farrell <paf@cray.com> wrote:
            >> >> > Andreas,
            >> >> >
            >> >> > This doesn't quite fit in LU-1757, but I think it fits a bit in to the larger conversation around small write performance that you and Cory touched on at LUG, that also includes Jinshan's proposed work around write containers (the buckets of reserved pages and grant he wrote the design doc for, aimed at small write performance). Cory mentioned you were more in favor of pushing small writes out as fast as possible (perhaps not exactly what you said, just what he passed on), and less inclined to the containers idea.
            >> >>
            >> >> Yes, that was my general thought.
            >> >>
            >> >> > I've been giving this some thought, and I've also been doing some related experimentation (LU-1757, other work w/flash and other filesystems, etc), and I thought I'd offer a vote in favor of containers.
            >> >> >
            >> >> > Small writes are slow for a few reasons, but I think central among them is client side overhead (cl_io creation, page allocation, locking, file size & attributes) for each i/o. This is better with dio than with buffered i/o, but it's still significant. There are other issues with direct i/o, too. (As I understand it, the "push small writes out fast" thing you're suggesting in LU-1757 would essentially amount to doing dio for those writes.)
            >> >> >
            >> >> > One notable issue with dio vs buffered i/o is what the client has to wait for server side. As I understand it, for direct i/o server needs to have the write truly synced/down before it can reply to the client, otherwise we risk data loss since the client can't replay once we return to userspace.
            >> >> >
            >> >> > This hurts performance a lot, but I don't think it can be avoided. So we could force out small writes from the client buffer, and those could be sorted/aggregated server side, at least some. But now we're back to that very large per i/o overhead. (For small writes, just to the page cache, Lustre is ~10-20x worse than local file systems.)
            >> >> >
            >> >> > That's probably not news, but there's one last nasty thing with small writes in the dio style, where the server usually is also forced to do small writes.
            >> >> >
            >> >> > This is a surprise I found when testing another filesystem that never uses the page cache (so, always small network i/o if small writes from userspace), and manages much better write latencies than Lustre (I've been asked not to share the exact numbers, but it's a gap of several times for this vs Lustre dio, even with LU-1757.). So this is sort of an extreme case of "push small writes out fast".
            >> >>
            >> >> So, in some respects isn't that "several times faster than Lustre" what we should be targeting? I think there is performance to be gained from reducing the per-I/O overhead of Lustre on the client.
            >> >> —
            >> >>
            >> >> Yes, I think that's interesting, but I'm not sure how achievable it is. There are certainly improvements to be had, but really dropping the latency looks (to me) quite hard.
            >> >>
            >> >> —
            >> >>
            >> >> > For small (4k and less) random writes from one client, the performance is about the same as Lustre (Lustre handily wins at sequential writes).
            >> >> >
            >> >> > But with random or sequential small i/o, Lustre absolutely blows it away in terms of extracting all of the available server performance.
            >> >>
            >> >> At this point I'm not totally clear what workload you are describing? Is this with multiple clients, or how does it differ from your earlier comparison that "the other filesystem manages much better write latencies than Lustre"? Is this about bandwidth vs. latency?
            >> >> —
            >> >>
            >> >> Sure, sorry - this is with many clients. The idea is a job spread across many nodes where all of them are doing small writes I/O. In that case, the total bandwidth delivered by the server is much worse if we really do small writes as individual network ops, rather than aggregating them. (I've got data showing that while 1K (non-sync) writes on one Lustre client aren't great (other fs is better), 1K writes from many Lustre clients can fully drive the server, whereas the other system - which is really doing 1k i/o on the network - is limited to a small fraction of theoretical server bandwidth.)
            >> >>
            >> >> —
            >> >>
            >> >> > It can't, but Lustre can, because its server can't actually process the influx of 4k RPCs fast enough. On the other hand, Lustre aggregates even the random i/o into large RPCs, and the server can take full advantage of its (flash) drives.
            >> >>
            >> >> That is true if the random writes are async and can be aggregated. For sync writes we can't aggregate the writes on the client.
            >> >> —
            >> >>
            >> >> This is definitely true. I suppose I'd say we should try to handle that case better, but that it's still going to be a lot worse.
            >> >>
            >> >> —
            >> >>
            >> >> > So, I guess I think that's another point for the containers ideas from Jinshan. I think we should try to only do it for sequential i/o, but that's not that hard to recognize (something as simple as "did this i/o make a new extent" can serve as a proxy for random i/o, I've got a patch that does that for something else).
            >> >>
            >> >> My concern about the containers idea is that this adds even more complexity and overhead to the client IO path, but only handles one specific case - small contiguous writes from a single client, but doesn't help at all (and may hurt) small discontiguous writes from one or more clients due to the existing and extra overhead. The code is already very complex and hard to understand, and the containers wouldn't improve that. My hope was to find a solution that simplified and/or bypassed the client-side locking and write aggregation mechanisms and their corresponding overhead for synchronous small disjoint writes, which can't be optimized at the client.
            >> >> —
            >> >> To the first: This is certainly true. While this does just add more complexity, I was hoping we could limit using the containers case to where we observed small sequential i/o, as opposed to random i/o (and probably also a check to make sure the ldlm locks are not being repeatedly lost - could probably still be an osc layer check, counting cancellation flushes vs i/o operations done).
            >> >
            >> > The container idea will help not only small writes but also large writes, because it can reduce the overhead of the grant reservation routine in the current I/O submit path. I would expect roughly a 10% improvement on large writes and something like 10x better on small writes with it.
            >> >
            >> > Some existing code can be removed after container is implemented so it will simplify the code a little bit initially. But it could become more complex because there could exist lots of corner cases to be solved.
            >> >
            >> >>
            >> >> To the second part: I think that's an interesting and useful goal, and I think the combination of LU-247, LU-1757, and LU-4198 would get us a lot of what we need to achieve the best we can do for small disjoint writes... But I'm not sure how good that best is. In particular, the resiliency requirement that we must sync non-cached writes (whether or not we call them direct i/o) server side before returning to userspace I think may really kill us. This is probably the best we can do (or at least, it's the best I can think of). But I think we have the option to really do dramatically better for small sequential writing. (More thoughts on this after the next jump.)
            >> >
            >> > I tend to agree. Sync small write is really difficult due to I/O overhead. Right now the doable ways are to optimize the performance case by case. For example, if the application can afford to do AIO and DIO, then LU-4198 will be really helpful. On the other hand, if the apps will do sequential small write, then my proposal of container will help deliver much better results.
            >> >
            >> > In other words, the current direction of merging RPCs on the client or server will not reduce the per-I/O overhead, so I don't see room for significant performance improvement there. Most likely the improvement would be from the level of 'super suck' to 'very suck'.
            >> >
            >> >>
            >> >> —
            >> >>
            >> >> I think there are things we could do to improve smaller writes, for example your readahead-for-write patch, maybe combined with a "fast write" path (akin to the fast read path) that checks if a page is already dirty in cache and avoids taking any locks under the assumption that the page must already be under a lock. This would help for sub-4KB writes, which is where the maximum overhead would be seen.
            >> >
            >> > Container is actually an implementation of ‘fast write’ as it caches whatever resources it will need to dirty pages at LLITE layer.
            >> >
            >> > Thanks,
            >> > Jinshan
            >> >
            >> >>
            >> >> —
            >> >>
            >> >> Yes, I've actually got exactly that patch written up, but I ran into some snags. For really small (single byte to single word range) writes, it improves performance by a factor of 10. (A local file system is normally 20x faster than Lustre on this workload, so we're still off by 2x. But it's a huge improvement.) Specifically, it seems a little tricky to guarantee that Lustre decides to write out the necessary part of the page (my tentative solution is to just note in the page that it has had untracked writes and then write out the full page, since it's not easy to know which bytes are dirty, and I think with overhead, writing out all of the page will usually be comparable to writing out part of it), and I'm also struggling to get the file size to update correctly. That should be fixable.
            >> >>
            >> >> Anyway, I will post that patch to gerrit and link you and Jinshan. Perhaps good solutions to those snags will be clear to other eyes.
            >> >>
            >> >> —
            >> >>
            >> >> > Sorry to just butt in, I hope this doesn't come amiss.
            >> >>
            >> >> I'm happy to get your input, and I wouldn't object to this discussion going to the lustre-devel mailing list, just to get some more activity going in public on this and other issues. It is definitely great to see you are interested in this area, as it is one of the achilles heels of Lustre. While I have some opinions on this issue, the proof is in the pudding, and I'd be thrilled to be proven wrong with changes that improve performance in some other manner.
            >> >> —
            >> >>

            Cheers, Andreas

            Andreas Dilger
            Lustre Principal Architect
            Intel Corporation


            paf Patrick Farrell (Inactive) added a comment -

            This is the patch (from me) referenced in the email conversation:
            https://review.whamcloud.com/#/c/27903/


            paf Patrick Farrell (Inactive) added a comment -

            Attached design doc from Jinshan. This doc and some other items formed the basis of an email conversation between Jinshan, Andreas, and myself.

            I'll provide the full conversation, but I thought I'd also give a brief summary first.

            The issue of concern here is write performance, particularly small buffered write performance. ("Small" here could mean anything much under 1 MiB, with performance getting worse as I/O size drops.) The main problem with small writes is per-I/O overhead: there is a lot of work involved in doing each I/O in Lustre, much more than on a local file system. This is without even sending the data out on the wire; it is just the overhead required to get the data into the page cache and do all of the other things Lustre does to guarantee reasonable POSIX semantics.

            Andreas suggested that we should try to reduce per i/o overhead directly, since this would benefit all cases. Jinshan contends (and I agree) that while this is a good idea, there is limited benefit available here, because it is hard to reduce that overhead.

            The basic idea of I/O containers is to reduce overhead for sequential I/O by reserving the required resources in advance, so writes can go directly into the page cache, similarly to how fast_read reads directly from the page cache. This should improve performance hugely for very small writes and somewhat for all writes.
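
            To make the "write directly into an already-cached page" idea concrete, here is a minimal sketch written against generic Linux page-cache helpers. It is an illustration of the concept only, not the actual Lustre llite code or the landed tiny-write patch; the function name and the fallback conditions (single page, no file-size extension, page already dirty) are assumptions. The caller is expected to fall back to the normal cl_io write path whenever this returns 0.

            #include <linux/fs.h>
            #include <linux/highmem.h>
            #include <linux/mm.h>
            #include <linux/pagemap.h>
            #include <linux/uaccess.h>

            /*
             * Hypothetical fast-write helper: if the target page is already cached,
             * up to date, and dirty, copy the user data into it and skip the full
             * cl_io/DLM setup (the page lock is still taken). Returns bytes written,
             * 0 to request the regular slow path, or a negative errno.
             */
            static ssize_t fast_small_write(struct file *file, const char __user *buf,
                                            size_t count, loff_t pos)
            {
                struct inode *inode = file_inode(file);
                struct address_space *mapping = inode->i_mapping;
                pgoff_t index = pos >> PAGE_SHIFT;
                size_t offset = pos & ~PAGE_MASK;
                struct page *page;
                char *kaddr;

                /* Only handle writes inside one page that do not grow the file;
                 * size updates are one of the tricky parts noted above. */
                if (offset + count > PAGE_SIZE || pos + count > i_size_read(inode))
                    return 0;

                page = find_lock_page(mapping, index);
                if (!page)
                    return 0;                        /* not cached: slow path */

                if (!PageUptodate(page) || !PageDirty(page)) {
                    unlock_page(page);
                    put_page(page);
                    return 0;                        /* first touch: slow path */
                }

                kaddr = kmap(page);
                if (copy_from_user(kaddr + offset, buf, count)) {
                    kunmap(page);
                    unlock_page(page);
                    put_page(page);
                    return -EFAULT;
                }
                kunmap(page);

                set_page_dirty(page);                /* flushed later with the rest */
                unlock_page(page);
                put_page(page);
                return count;
            }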

            Email conversation to follow.


            paf Patrick Farrell (Inactive) added a comment -

            Very curious about this one, I'd be happy to help with testing, review, etc.


            jay Jinshan Xiong (Inactive) added a comment -

            Absolutely - I'll write a one-page description for this work. I hope I can finish it this week.


            adilger Andreas Dilger added a comment -

            Can you provide some ideas on how you plan to address this problem, before spending a lot of time to implement anything?

            There is already patch https://review.whamcloud.com/3690 "LU-1757 brw: add short io osc/ost transfer" that implements the network protocol for short write RPCs with inline data, but that didn't show any performance improvement for 4KB random writes.

            It may be that with different test workloads (e.g. 7KB non-synchronous writes), or with further optimizations of the client IO path, this change would show actual improvements.


            People

              bobijam Zhenyu Xu
              jay Jinshan Xiong (Inactive)
              Votes: 0
              Watchers: 11
