[LU-6658] single stream write performance improvement with worker threads in llite Created: 28/May/15 Updated: 12/Jan/18 Resolved: 12/Jan/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.7.0, Lustre 2.8.0 |
| Fix Version/s: | None |
| Type: | New Feature | Priority: | Major |
| Reporter: | Hiroya Nozaki | Assignee: | Dmitry Eremin (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | patch | ||
| Attachments: | |
| Issue Links: | |
| Epic/Theme: | Performance |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This patch provides a single-stream write performance improvement with multiple worker threads in the llite layer. Its operation overview is the following: part of each write is handled in system call context and the rest in worker thread context. I attached performance results comparing the original Lustre-2.7.52 and this custom Lustre-2.7.52. |
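A rough sketch of the system-call/worker-thread split described above is shown below. This is not the patch's code (http://review.whamcloud.com/14990 is the reference); the structure name ll_wr_work, the helper names, and the use of a Linux workqueue plus kernel_write() are assumptions made only for illustration.

```c
/*
 * Hypothetical sketch only -- not the code from http://review.whamcloud.com/14990.
 * System call context: copy the user buffer into a private kernel buffer and
 * queue the remaining work, so the syscall returns without paying the cost of
 * the layers below vvp.  Worker context: write the kernel buffer into the
 * page cache and submit it as a normal buffered write.
 */
#include <linux/workqueue.h>
#include <linux/slab.h>
#include <linux/uaccess.h>
#include <linux/fs.h>

struct ll_wr_work {			/* assumed name */
	struct work_struct	 lww_work;
	struct file		*lww_file;
	void			*lww_kbuf;	/* private copy of the user data */
	size_t			 lww_count;
	loff_t			 lww_pos;
};

static void ll_wr_worker(struct work_struct *work)
{
	struct ll_wr_work *w = container_of(work, struct ll_wr_work, lww_work);

	/* Worker-thread context: the second memcpy into the page cache and
	 * everything below the vvp layer happens here, off the syscall path. */
	kernel_write(w->lww_file, w->lww_kbuf, w->lww_count, &w->lww_pos);
	kfree(w->lww_kbuf);
	kfree(w);
}

static ssize_t ll_async_write(struct file *file, const char __user *buf,
			      size_t count, loff_t pos,
			      struct workqueue_struct *wq)
{
	struct ll_wr_work *w;

	w = kmalloc(sizeof(*w), GFP_KERNEL);
	if (w == NULL)
		return -ENOMEM;
	w->lww_kbuf = kmalloc(count, GFP_KERNEL);
	if (w->lww_kbuf == NULL) {
		kfree(w);
		return -ENOMEM;
	}
	/* First memcpy: the extra copy discussed in the comments below. */
	if (copy_from_user(w->lww_kbuf, buf, count)) {
		kfree(w->lww_kbuf);
		kfree(w);
		return -EFAULT;
	}
	w->lww_file = file;
	w->lww_count = count;
	w->lww_pos = pos;
	INIT_WORK(&w->lww_work, ll_wr_worker);
	queue_work(wq, &w->lww_work);	/* syscall context returns here */
	return count;
}
```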
| Comments |
| Comment by Gerrit Updater [ 28/May/15 ] |
|
Hiroya Nozaki (nozaki.hiroya@jp.fujitsu.com) uploaded a new patch: http://review.whamcloud.com/14990 |
| Comment by Jinshan Xiong (Inactive) [ 28/May/15 ] |
|
That's wonderful, you guys have done parallel IO for Lustre. |
| Comment by Jinshan Xiong (Inactive) [ 28/May/15 ] |
|
I haven't read the patch yet. I would appreciate it if you could answer some quick questions: 1. Does the system call thread wait for the worker thread to complete? 2. In your performance benchmark sheet, what's the stripe count and size you used? 3. It looks like you have done parallel I/O on the LLITE layer; how much effort would it be to move it to the LOV layer? |
| Comment by Hiroya Nozaki [ 29/May/15 ] |
|
Hi, Jinshan.
> 1. Does the system call thread wait for the worker thread to complete?
> 2. In your performance benchmark sheet, what's the stripe count and size you used?
> 3. It looks like you have done parallel I/O on the LLITE layer; how much effort would it be to move it to the LOV layer? |
| Comment by Andreas Dilger [ 05/Jun/15 ] |
|
This looks like very interesting work. My understanding is that the benchmark results are for a single iozone thread running on one client? Are there any benchmark results from many threads (e.g. one per core on a 16+ core client) to see what the overhead is from running the extra worker threads? There is definitely a performance improvement seen during writes (about 50%) with the patch compared to the unpatched client, though the read performance appears to be a bit lower (maybe 5%?). Since there is already parallelism in I/O submission at the ptlrpcd layer, it would be interesting to figure out where the performance improvement is coming from. |
| Comment by Hiroya Nozaki [ 08/Jun/15 ] |
That's right. This was done by a single iozone thread running on one client.
There is, but it was measured on FEFS v2.0, which is based on Lustre 2.6.0. I will attach a simple graph, though I doubt whether it is suitable to include here.
Strictly speaking, this patch does not work as parallel I/O but rather like async-async I/O. Looking at write latency, the page cache cost and the cost of waiting for requests, i.e. the cost below the vvp layer, dominate the write system call. So I tried to cut that cost out of the write system call by using worker threads, although I had to add one more memcpy. That is why the performance improvement is roughly (non-patched write cost) - (below-vvp-layer cost) + (an extra memcpy cost). And there was good news for me: Lustre started to support range_lock, so a write does not have to wait for the previous write even if the target file is the same; of course, a write still has to wait if the extents overlap. Consequently, this works like parallel I/O, but I want to call it async-async I/O. Anyway, my English is not great, so please feel free to ask me anything if my comment is unclear. Thank you. |
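As a rough illustration of the range_lock point, assuming the range_lock interface llite gained around this time (range_lock_init()/range_lock()/range_unlock(); names may differ between versions) and reusing the hypothetical ll_wr_work structure from the sketch in the description, each worker would lock only its own extent so that non-overlapping writes to the same file can proceed concurrently:

```c
/*
 * Sketch only: the range_lock_* names follow llite's range lock as of
 * Lustre 2.7 and are an assumption; ll_wr_work is the hypothetical
 * structure from the sketch in the description.
 */
static ssize_t ll_wr_worker_locked(struct range_lock_tree *write_tree,
				   struct ll_wr_work *w)
{
	struct range_lock range;
	ssize_t rc;

	/* Lock only [pos, pos + count - 1]: two queued writes to the same
	 * file run concurrently in different workers as long as their
	 * extents do not overlap; overlapping extents still serialize. */
	range_lock_init(&range, w->lww_pos, w->lww_pos + w->lww_count - 1);
	rc = range_lock(write_tree, &range);
	if (rc < 0)
		return rc;

	rc = kernel_write(w->lww_file, w->lww_kbuf, w->lww_count, &w->lww_pos);

	range_unlock(write_tree, &range);
	return rc;
}
```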
| Comment by Jinshan Xiong (Inactive) [ 08/Jun/15 ] |
|
Hi Hiroya, I understood you really well. Today I had a chance to read your patch and I have a few questions: 2. The patch copies the user space buffer into a worker thread's kernel buffer, and then this kernel buffer has to be copied once again into the inode's page cache. This is not good, and I think you will see significant performance degradation in the multi-threaded write case. Yes, I think you've realized this is not the parallel I/O that was intended. For a parallel I/O implementation, I'd like the I/O to be split at the LOV layer. In the first phase, we could support only files with multiple stripes, and different threads would work on different stripes, from requesting the lock to copying data. The writer thread must wait for all sub-I/Os to complete before it can return to the application. |
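A minimal userspace analogue of the stripe-split proposal is sketched below; the stripe size, stripe count, and pwrite target are placeholders, not values from this ticket. One thread handles each stripe's chunk end to end, and the caller joins them all before returning, mirroring "the writer thread must wait for all sub-I/Os to complete".

```c
/*
 * Userspace analogue only -- placeholder stripe geometry, not values from
 * this ticket.  For simplicity it handles at most one stripe width
 * (STRIPE_COUNT * STRIPE_SIZE) of data per call.
 */
#include <pthread.h>
#include <sys/types.h>
#include <unistd.h>

#define STRIPE_SIZE	(1UL << 20)	/* 1 MiB, placeholder */
#define STRIPE_COUNT	4		/* placeholder */

struct sub_io {
	int		 fd;
	const char	*buf;
	size_t		 count;
	off_t		 off;
	ssize_t		 rc;
};

static void *sub_io_run(void *arg)
{
	struct sub_io *s = arg;

	/* Each thread owns one stripe's chunk end to end. */
	s->rc = pwrite(s->fd, s->buf, s->count, s->off);
	return NULL;
}

static ssize_t striped_write(int fd, const char *buf, size_t count, off_t off)
{
	pthread_t tid[STRIPE_COUNT];
	struct sub_io sub[STRIPE_COUNT];
	ssize_t done = 0;
	int i, n = 0;

	/* Fan out: one worker per stripe-sized chunk. */
	for (i = 0; i < STRIPE_COUNT && count > 0; i++, n++) {
		size_t chunk = count < STRIPE_SIZE ? count : STRIPE_SIZE;

		sub[i] = (struct sub_io){ fd, buf, chunk, off, -1 };
		pthread_create(&tid[i], NULL, sub_io_run, &sub[i]);
		buf += chunk;
		off += chunk;
		count -= chunk;
	}
	/* Join: the caller waits for every sub-I/O before returning. */
	for (i = 0; i < n; i++) {
		pthread_join(tid[i], NULL);
		if (sub[i].rc < 0)
			return -1;
		done += sub[i].rc;
	}
	return done;
}
```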
| Comment by Hiroya Nozaki [ 09/Jun/15 ] |
|
Hi, Jinshan! 1> 2> |
| Comment by Jinshan Xiong (Inactive) [ 12/Jun/15 ] |
Not every application calls fsync() at the end of I/O. Also, out-of-space and over-quota errors should be returned at the time of writing; otherwise it is useless to cache I/O that is certain to fail.
In my view, the bigger the buffer size, the more likely it can take advantage of parallel data copying; a small buffer cannot tolerate the cost of thread switching. Of course, the policy can be way more complex than this. An obvious fact is that modern CPUs have large caches, and most likely the data being written is in a warm cache, so the worker thread should be limited to the same core or socket. |
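A hypothetical sketch of such a policy, assuming a Linux workqueue-based worker as in the earlier sketches; the size threshold is a placeholder, not a measured value:

```c
/* Hypothetical policy sketch; the threshold is a placeholder. */
#include <linux/workqueue.h>
#include <linux/smp.h>

#define LL_ASYNC_WRITE_MIN_BYTES	(256 * 1024)	/* placeholder */

static bool ll_should_offload(size_t count)
{
	/* Small buffers cannot amortize the thread-switch cost. */
	return count >= LL_ASYNC_WRITE_MIN_BYTES;
}

static void ll_queue_write_work(struct workqueue_struct *wq,
				struct work_struct *work)
{
	/* Queue on the submitting CPU so the just-copied data stays in that
	 * core's (or at least that socket's) warm cache. */
	queue_work_on(raw_smp_processor_id(), wq, work);
}
```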
| Comment by Hiroya Nozaki [ 16/Jun/15 ] |
I intended to implement this feature as close-to-sync because, strictly speaking, every application should call close(2) and check its return value. If applications follow that rule, the implementation may not be an issue, right? ... But it seems that my implementation does not work as I expected in close(2).
I agree with you.
I bind each thread to the same core when the threads are created. Anyway, I cannot avoid revising my implementation; hmmm, time is always an issue ... |
| Comment by Li Xi (Inactive) [ 03/May/16 ] |
|
I agree that we need to first fully understand why this patch brings performance improvement, i.e. why the original single thread write performance is not as good as it can be. Hiroya, do you have any profiling results or detailed comparison that we can look at? |
| Comment by Jinshan Xiong (Inactive) [ 09/May/16 ] |
|
Parallel I/O is definitely helpful for performance. However, we need to do this in a different way. From my point of view, it should be done stripe by stripe, because doing it within a single stripe would only add contention on a single OSC. That being said, the current I/O architecture should be changed as follows: |
| Comment by Dmitry Eremin (Inactive) [ 12/Jan/18 ] |
|
Closing this ticket because a partially similar approach landed from |