Details
- Type: New Feature
- Resolution: Fixed
- Priority: Major
- Labels: None
- Fix Version/s: Lustre 2.7.0, Lustre 2.8.0
Description
This patch improves single-stream write performance by using multiple worker threads in the llite layer. An overview of its operation follows.
In the system call context:
1) get a worker thread's lu_env
2) assemble and set parameters
2-1) copy the user buffer to a kernel buffer
2-2) copy the parameters the worker thread needs to resume the I/O
2-3) store those parameters in the lu_env obtained in (1)
2-4) store extra parameters in an I/O item, iosvc_item
3) notify the worker thread, ll_iosvc, that the request is ready
4) return immediately
In the worker thread context:
1) wake up
2) gather information
2-1) look up its own lu_env to read the parameters set by the system call
2-2) look up the iosvc_item made in (2-4)
3) resume the I/O
4) sleep
I have attached performance results comparing the original Lustre 2.7.52 with this patched Lustre 2.7.52.
Hi Hiroya,
I understand your approach well now.
Today I had a chance to read your patch, and I have a few questions:
1. As you mentioned in your previous comment, the writer doesn't wait for the async thread to finish the I/O, which means the I/O may fail without the application ever knowing. This can potentially cause data corruption, because otherwise the application could retry the write or simply stop running;
2. The patch copies the user space buffer into a worker thread's kernel buffer, and then this kernel buffer has to be copied once again into the inode's page cache. This is not good, and I think you will see significant performance degradation in the multi-threaded write case.
Yes, I think you've realized this is not the parallel I/O that was intended. For a parallel I/O implementation, I'd like the I/O to be split at the LOV layer. In the first phase, we could support only files with multiple stripes, and different threads would work on different stripes, from requesting the lock to copying the data. The writer thread must wait for all sub-I/Os to complete before it can return to the application.