[LU-6658] single stream write performance improvement with worker threads in llite Created: 28/May/15  Updated: 12/Jan/18  Resolved: 12/Jan/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0, Lustre 2.8.0
Fix Version/s: None

Type: New Feature Priority: Major
Reporter: Hiroya Nozaki Assignee: Dmitry Eremin (Inactive)
Resolution: Fixed Votes: 0
Labels: patch

Attachments: N-process one client.xlsx, iosvc.performance.iozone.xlsx
Issue Links:
Related
is related to LU-8964 use parallel I/O to improve performan... Resolved
is related to LU-1056 Single-client, single-thread and sing... Resolved
Epic/Theme: Performance
Rank (Obsolete): 9223372036854775807

 Description   

This patch improves single-stream write performance by using multiple worker threads in the llite layer. Its operation overview is the following (a sketch of the flow appears after the steps):

In system call context
1) get a worker thread's lu_env
2) assemble and set parameters
2-1) copy the user buffer to a kernel buffer
2-2) copy the parameters the worker thread needs to resume the I/O
2-3) set the parameters in the lu_env obtained in (1)
2-4) set extra parameters in an I/O item, iosvc_item
3) notify the worker thread, ll_iosvc, that the work is ready
4) return immediately

In worker thread context
1) wake up
2) gather information
2-1) read its own lu_env to obtain the parameters set by the system call
2-2) read the item prepared in (2-3)
3) resume I/O
4) sleep
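
A minimal, hypothetical sketch of the flow above, assuming a simple work queue handed from the system-call side to a worker thread. All names here (iosvc_item, iosvc_submit_write, iosvc_worker and the fields) are illustrative, not the patch's actual code, and kernel_write() merely stands in for the patch's real cl_io-based write resume.

{code:c}
#include <linux/file.h>
#include <linux/fs.h>
#include <linux/kthread.h>
#include <linux/list.h>
#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/uaccess.h>
#include <linux/wait.h>

struct iosvc_item {
	struct list_head	ii_link;	/* queued for a worker thread */
	void			*ii_kbuf;	/* copy of the user buffer */
	size_t			ii_count;	/* bytes to write */
	loff_t			ii_pos;		/* file offset */
	struct file		*ii_file;	/* target file */
};

static LIST_HEAD(iosvc_queue);
static DEFINE_SPINLOCK(iosvc_lock);
static DECLARE_WAIT_QUEUE_HEAD(iosvc_waitq);

/* System-call side: stage the write and return without waiting. */
static ssize_t iosvc_submit_write(struct file *file, const char __user *ubuf,
				  size_t count, loff_t pos)
{
	struct iosvc_item *item;

	item = kzalloc(sizeof(*item), GFP_KERNEL);
	if (!item)
		return -ENOMEM;
	item->ii_kbuf = kvmalloc(count, GFP_KERNEL);
	if (!item->ii_kbuf) {
		kfree(item);
		return -ENOMEM;
	}
	/* 2-1) copy the user buffer into a kernel buffer */
	if (copy_from_user(item->ii_kbuf, ubuf, count)) {
		kvfree(item->ii_kbuf);
		kfree(item);
		return -EFAULT;
	}
	/* 2-2..2-4) record what the worker needs to resume the I/O */
	item->ii_count = count;
	item->ii_pos = pos;
	item->ii_file = get_file(file);

	spin_lock(&iosvc_lock);
	list_add_tail(&item->ii_link, &iosvc_queue);
	spin_unlock(&iosvc_lock);

	wake_up(&iosvc_waitq);	/* 3) notify the worker */
	return count;		/* 4) return immediately */
}

/* Worker-thread side: pick up a staged item and resume the write. */
static int iosvc_worker(void *arg)
{
	while (!kthread_should_stop()) {
		struct iosvc_item *item = NULL;

		/* 1)/4) sleep until the system-call side queues work */
		wait_event_interruptible(iosvc_waitq,
					 !list_empty(&iosvc_queue) ||
					 kthread_should_stop());

		spin_lock(&iosvc_lock);
		if (!list_empty(&iosvc_queue)) {
			item = list_first_entry(&iosvc_queue,
						struct iosvc_item, ii_link);
			list_del(&item->ii_link);
		}
		spin_unlock(&iosvc_lock);
		if (!item)
			continue;

		/* 2)-3) resume the buffered write from the kernel copy */
		kernel_write(item->ii_file, item->ii_kbuf, item->ii_count,
			     &item->ii_pos);

		fput(item->ii_file);
		kvfree(item->ii_kbuf);
		kfree(item);
	}
	return 0;
}
{code}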

I attached performance results comparing the original Lustre 2.7.52 and this patched Lustre 2.7.52.



 Comments   
Comment by Gerrit Updater [ 28/May/15 ]

Hiroya Nozaki (nozaki.hiroya@jp.fujitsu.com) uploaded a new patch: http://review.whamcloud.com/14990
Subject: LU-6658 llite: single stream write performance improvement
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 3e0da7583c369e213e795e22367cbbab9d33b7d7

Comment by Jinshan Xiong (Inactive) [ 28/May/15 ]

That's wonderful, you guys have done parallel IO for Lustre.

Comment by Jinshan Xiong (Inactive) [ 28/May/15 ]

I haven't read the patch yet. I would appreciate it if you could answer some quick questions:

1. Does the system call thread wait for the worker thread to complete?
2. In your performance benchmark sheet, what's the stripe count and size you used?
3. It looks like you have done parallel IO in the LLITE layer; how much effort would it be to move it to the LOV layer so that worker threads can do IO on stripe boundaries?

Comment by Hiroya Nozaki [ 29/May/15 ]

Hi, Jinshan.

> 1. Does the system call thread wait for the worker thread to complete?
No, the system call thread returns immediately after registering the I/O information with a worker thread.

> 2. In your performance benchmark sheet, what's the stripe count and size you used?
Ah, yes, I forgot to write it down in the sheet.
It was the following:

  $ lfs getstripe /mnt/lustre
    /mnt/lustre
    stripe_count: -1 stripe_size: 4194304 stripe_offset: -1
    /mnt/lustre/test
    lmm_stripe_count: 2
    lmm_stripe_size: 4194304
    lmm_pattern: 1
    lmm_layout_gen: 0
    lmm_stripe_offset: 2
    obdidx objid objid group
    2 2 0x2 0
    1 2 0x2 0

> 3. it looks like you have done parallel IO on the LLITE layer, how much effort would it be to move it to LOV layer
> so that worker thread can do IO on stripe boundary?
Ah ... yeah, actually it looks like it would work well in the LOV layer too. Hmm ... it would probably take me a couple of months to complete. But, you know, I had to deal with tons of issues when implementing this feature, which is how I finally ended up with this approach, so I'm not very confident about the estimate or the difficulty.

Comment by Andreas Dilger [ 05/Jun/15 ]

This looks like very interesting work. My understanding is that the benchmark results are for a single iozone thread running on one client? Are there any benchmark results from many threads (e.g. one per core on a 16+ core client) to see what the overhead is from running the extra worker threads?

There is definitely a performance improvement seen during writes (about 50%) with the patch compared to the unpatched client, though the read performance appears to be a bit lower (maybe 5%?). Since there is already parallelism in IO submission at the ptlrpcd layer, it would be interesting to figure out where the performance improvement is coming from?

Comment by Hiroya Nozaki [ 08/Jun/15 ]

> My understanding is that the benchmark results are for a single iozone thread running on one client?

That's right. This was done by a single iozone thread running on one client.

> Are there any benchmark results from many threads (e.g. one per core on a 16+ core client) to see what the overhead is from running the extra worker threads?

There is, but it was measured on FEFS v2.0, which is based on Lustre 2.6.0. I will attach a simple graph, though I'm not sure whether it's suitable to include here.

> There is definitely a performance improvement seen during writes (about 50%) with the patch compared to the unpatched client, though the read performance appears to be a bit lower (maybe 5%?). Since there is already parallelism in IO submission at the ptlrpcd layer, it would be interesting to figure out where the performance improvement is coming from?

Strictly speaking, this patch doesn't work as parallel I/O but more like async-async I/O. Looking at write latency, the page cache cost and the request-wait cost below the vvp layer are the largest parts of the write system call. So I tried to cut them out of the write system call using worker threads, though I had to add one more memcpy.

That's where the performance improvement comes from; the patched write cost is roughly:

(non-patched write cost) - (cost below the vvp layer) + (one extra memcpy cost)

And there was good news for me: Lustre started to support range_lock, so a write doesn't have to wait for the previous write even if the target file is the same; of course, a write still has to wait if the extents overlap.

Consequently, this behaves somewhat like parallel I/O, but I prefer to call it async-async I/O (see the small overlap sketch below).
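
For illustration only, a tiny sketch of the overlap rule mentioned above; it is purely conceptual and not Lustre's actual range_lock implementation, which is more involved. Two writes on the same file may proceed concurrently only when their byte ranges do not overlap.

{code:c}
#include <stdbool.h>
#include <stdint.h>

/* Conceptual only: [start, end] byte ranges, inclusive on both ends. */
struct byte_range {
	uint64_t start;
	uint64_t end;
};

/* Two writes conflict (must serialize) only when their extents overlap. */
static bool ranges_overlap(const struct byte_range *a,
			   const struct byte_range *b)
{
	return a->start <= b->end && b->start <= a->end;
}
{code}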

Anyway, my English isn't great, so I'm not sure whether readers will fully understand what I want to say. Please feel free to ask me anything. Thank you.

Comment by Jinshan Xiong (Inactive) [ 08/Jun/15 ]

Hi Hiroya,

I understood you really well.

Today I have a chance to read your patch and I have a few questions to ask:
1. As you mentioned in your previous comment, the writer doesn't wait for the async thread to finish the I/O, which means the I/O may fail without the application knowing about it. This can potentially cause data corruption, because otherwise the application could retry the write or just stop running;

2. The patch copies the user space buffer into a worker thread's kernel buffer, and then this kernel buffer has to be copied once again into the inode's page cache. This is not good, and I think you will see significant performance degradation in the multi-threaded write case.

Yes, I think you've realized this is not parallel I/O as it was intended to be. For a parallel I/O implementation, I'd like the I/O to be split at the LOV layer. In the first phase, we can support only files with multiple stripes, and different threads will work on different stripes, from requesting the lock to copying data. The writer thread must wait for all sub-I/Os to complete before it can return to the application.

Comment by Hiroya Nozaki [ 09/Jun/15 ]

Hi, Jinshan !
Let me answer your questions inline.

1>
When an error or something else causes a short write, the error is detected by iosvc_detect_swrite(), and the errno is set in ll_inode_info.lli_iosvc_rc.
That member is then picked up by fsync or fflush, so a user can retrieve the error through them. Considering that fflush is called via close(), the user is able to notice the error there as well. (A sketch of this error propagation follows.)
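
A minimal sketch of the deferred-error idea described above, using hypothetical names; lli_iosvc_rc here is just an atomic standing in for the field the patch adds to ll_inode_info, and the helpers are illustrative, not the patch's code.

{code:c}
#include <linux/atomic.h>

/* Hypothetical stand-in for the per-inode state added by the patch. */
struct iosvc_inode_state {
	atomic_t lli_iosvc_rc;	/* first deferred write error, 0 if none */
};

/* Worker side: remember the first failure of a detached write. */
static void iosvc_record_error(struct iosvc_inode_state *st, int rc)
{
	if (rc < 0)
		atomic_cmpxchg(&st->lli_iosvc_rc, 0, rc);
}

/* fsync()/flush-on-close side: report the deferred error exactly once. */
static int iosvc_sweep_error(struct iosvc_inode_state *st)
{
	return atomic_xchg(&st->lli_iosvc_rc, 0);
}
{code}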

2>
Yes, the total cost increases, but I restrict the amount of duplicated memory with a module parameter, iosvc_max_iovec_mb. This means iosvc can use only (iosvc_max_iovec_mb * thread_num) MiB, and when a write exceeds that limit it goes through the ordinary route, so we can easily avoid awful performance degradation. You can see the logic in iosvc_check_and_get_iovec()
(ll_file_io_generic() -> iosvc_duplicate_env() -> iosvc_setup_iovec() -> iosvc_check_and_get_iovec()); a sketch of the cap follows.
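
A hedged sketch of the memory cap described above: iosvc_max_iovec_mb follows the parameter name in the comment, but the accounting helpers below are hypothetical and are not the patch's iosvc_check_and_get_iovec().

{code:c}
#include <linux/atomic.h>
#include <linux/module.h>

/* Illustrative default; the cap is (iosvc_max_iovec_mb * thread_num) MiB. */
static unsigned int iosvc_max_iovec_mb = 32;
module_param(iosvc_max_iovec_mb, uint, 0644);

static atomic64_t iosvc_iovec_bytes;	/* bytes of write data currently duplicated */

/* Reserve space for one duplicated write; on failure the caller falls
 * back to the ordinary synchronous write path. */
static bool iosvc_try_reserve(size_t count, unsigned int thread_num)
{
	u64 limit = ((u64)iosvc_max_iovec_mb * thread_num) << 20;

	if ((u64)atomic64_add_return(count, &iosvc_iovec_bytes) > limit) {
		atomic64_sub(count, &iosvc_iovec_bytes);
		return false;	/* over budget: use the ordinary route */
	}
	return true;
}

/* Release the reservation once the worker has flushed the copy. */
static void iosvc_release(size_t count)
{
	atomic64_sub(count, &iosvc_iovec_bytes);
}
{code}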

Comment by Jinshan Xiong (Inactive) [ 12/Jun/15 ]

> When an error or something else causes a short write, the error is detected by iosvc_detect_swrite(), and the errno is set in ll_inode_info.lli_iosvc_rc.
> That member is then picked up by fsync or fflush, so a user can retrieve the error through them. Considering that fflush is called via close(), the user is able to notice the error there as well.

Not every application calls fsync() at the end of its I/O. Also, no-space and over-quota errors should be returned at the time of the write; otherwise it is pointless to cache an I/O that is certain to fail.

> Yes, the total cost increases, but I restrict the amount of duplicated memory with a module parameter, iosvc_max_iovec_mb. This means iosvc can use only (iosvc_max_iovec_mb * thread_num) MiB, and when a write exceeds that limit it goes through the ordinary route, so we can easily avoid awful performance degradation. You can see the logic in iosvc_check_and_get_iovec()
> (ll_file_io_generic() -> iosvc_duplicate_env() -> iosvc_setup_iovec() -> iosvc_check_and_get_iovec())

In my view, the bigger the buffer size, the more likely parallel data copying is to pay off; a small buffer can't tolerate the cost of thread switching. Of course, the policy can be much more complex than this. An obvious fact is that modern CPUs have huge caches, and most likely the data being written is still warm in the cache, so the worker thread should be limited to the same core or socket.

Comment by Hiroya Nozaki [ 16/Jun/15 ]

> Not every application calls fsync() at the end of its I/O. Also, no-space and over-quota errors should be returned at the time of the write; otherwise it is pointless to cache an I/O that is certain to fail.

I intended to implement this feature as close-to-sync because, strictly speaking, every application should call close(2) and check its return value. If applications follow that rule, the implementation may not be an issue, right? ... But it seems that my implementation doesn't work as I expected in close(2).

> In my view, the bigger the buffer size, the more likely parallel data copying is to pay off; a small buffer can't tolerate the cost of thread switching.

I agree with you.

> An obvious fact is that modern CPUs have huge caches, and most likely the data being written is still warm in the cache, so the worker thread should be limited to the same core or socket.

I bind each thread to the same core when the threads are created; a sketch of such binding follows.
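
For illustration, a minimal sketch of per-CPU binding at thread creation, reusing the hypothetical iosvc_worker() from the earlier sketch; this is not the patch's actual startup code.

{code:c}
#include <linux/err.h>
#include <linux/kthread.h>
#include <linux/sched.h>

/* Hypothetical worker loop from the earlier sketch. */
static int iosvc_worker(void *arg);

/* Pin one worker to a given CPU so the duplicated data stays in a warm cache. */
static struct task_struct *iosvc_start_worker(int cpu)
{
	struct task_struct *task;

	task = kthread_create(iosvc_worker, NULL, "ll_iosvc/%d", cpu);
	if (IS_ERR(task))
		return task;

	kthread_bind(task, cpu);	/* bind before the first wakeup */
	wake_up_process(task);
	return task;
}
{code}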

Anyway, it seems I have no choice but to revise my implementation, hmmm. Time is always an issue ...

Comment by Li Xi (Inactive) [ 03/May/16 ]

I agree that we first need to fully understand why this patch brings a performance improvement, i.e. why the original single-thread write performance is not as good as it could be. Hiroya, do you have any profiling results or a detailed comparison that we can look at?

Comment by Jinshan Xiong (Inactive) [ 09/May/16 ]

Parallel I/O is definitely helpful for performance. However, we need to do it in a different way. From my point of view, it should be done stripe by stripe, because doing it within a single stripe would only add contention on a single OSC. That being said, the current I/O architecture should be changed as follows (a stripe-split sketch appears after the list):
1. I/O initialization;
2. in I/O iter_init(), split the I/O into multiple sub-tasks at the LOV layer, and schedule the sub-I/Os to worker threads by stripe;
3. at the end of the I/O, the master thread should wait for all sub-I/Os to finish or error out;
4. error handling: decide what value should be returned to the application;
5. special I/Os like append should go through the original I/O path.
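
A small, hypothetical user-space sketch of the per-stripe split described above: a write is cut at the stripe boundaries of a RAID0 layout so that each chunk maps to exactly one stripe (and hence one OSC/worker). The names and structures are illustrative, not the LOV-layer API.

{code:c}
#include <stdint.h>
#include <stdio.h>

struct sub_io {
	uint64_t off;		/* file offset of this chunk */
	uint64_t len;		/* chunk length */
	unsigned int stripe;	/* stripe index (maps to one OSC) */
};

/* Cut [pos, pos + count) at stripe boundaries; returns the chunk count. */
static int split_by_stripe(uint64_t pos, uint64_t count,
			   uint64_t stripe_size, unsigned int stripe_count,
			   struct sub_io *out, int max)
{
	int n = 0;

	while (count > 0 && n < max) {
		uint64_t in_stripe = pos % stripe_size;
		uint64_t len = stripe_size - in_stripe;

		if (len > count)
			len = count;
		out[n].off = pos;
		out[n].len = len;
		out[n].stripe = (unsigned int)((pos / stripe_size) % stripe_count);
		n++;
		pos += len;
		count -= len;
	}
	return n;
}

int main(void)
{
	struct sub_io chunks[8];
	/* 10 MiB write at offset 1 MiB on a 4 MiB x 2-stripe layout */
	int n = split_by_stripe(1 << 20, 10 << 20, 4 << 20, 2, chunks, 8);

	for (int i = 0; i < n; i++)
		printf("chunk %d: off=%llu len=%llu stripe=%u\n", i,
		       (unsigned long long)chunks[i].off,
		       (unsigned long long)chunks[i].len, chunks[i].stripe);
	return 0;
}
{code}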

Comment by Dmitry Eremin (Inactive) [ 12/Jan/18 ]

Closing this ticket because a partially similar approach landed from LU-8964.
