  Lustre / LU-8964

use parallel I/O to improve performance on machines with slow single thread performance

Details

    • Type: New Feature
    • Resolution: Duplicate
    • Priority: Major

    Description

      On machines with slow single-thread performance, such as KNL, the I/O performance bottleneck has moved into the code that simply copies memory from one buffer to another (from user space to the kernel, or vice versa). In the current Lustre implementation all I/O is performed in a single thread, which has become a problem on KNL. Performance can be improved significantly by a solution that performs the memory transfer of large buffers in parallel.
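
      To make the idea concrete, here is a minimal sketch of a parallel user-to-kernel copy built on plain kernel workqueues. This is not the actual PIO patch and is not taken from the ticket; the names (pio_copy_chunk, pio_copy_from_user) and the workqueue-based approach are illustrative assumptions only.

      /*
       * Hypothetical sketch, not Lustre code: split one large copy_from_user()
       * into per-chunk work items so several cores share the memory-copy cost
       * of a big buffer.
       */
      #include <linux/kernel.h>
      #include <linux/kthread.h>
      #include <linux/sched.h>
      #include <linux/slab.h>
      #include <linux/uaccess.h>
      #include <linux/workqueue.h>

      struct pio_copy_chunk {
              struct work_struct      work;
              struct mm_struct        *mm;    /* submitter's address space */
              void                    *dst;   /* kernel destination */
              const void __user       *src;   /* user-space source */
              size_t                  len;
              int                     rc;
      };

      static void pio_copy_worker(struct work_struct *work)
      {
              struct pio_copy_chunk *c = container_of(work, struct pio_copy_chunk, work);

              /*
               * Workqueue workers are kernel threads, so borrow the submitter's
               * mm for the copy (use_mm()/unuse_mm() on kernels before 5.8).
               */
              kthread_use_mm(c->mm);
              c->rc = copy_from_user(c->dst, c->src, c->len) ? -EFAULT : 0;
              kthread_unuse_mm(c->mm);
      }

      /* Split one large user-to-kernel copy into up to nchunks parallel pieces. */
      static int pio_copy_from_user(void *dst, const void __user *src,
                                    size_t len, int nchunks)
      {
              struct pio_copy_chunk *chunks;
              size_t stride;
              int i, n = 0, rc = 0;

              if (nchunks < 1)
                      nchunks = 1;
              stride = DIV_ROUND_UP(len, nchunks);

              chunks = kcalloc(nchunks, sizeof(*chunks), GFP_KERNEL);
              if (!chunks)
                      return -ENOMEM;

              while (len) {
                      struct pio_copy_chunk *c = &chunks[n++];

                      c->mm = current->mm;    /* stays valid: we wait below */
                      c->dst = dst;
                      c->src = src;
                      c->len = min(stride, len);
                      INIT_WORK(&c->work, pio_copy_worker);
                      /* Unbound queue: any idle CPU may pick the work up. */
                      queue_work(system_unbound_wq, &c->work);

                      dst += c->len;
                      src += c->len;
                      len -= c->len;
              }

              for (i = 0; i < n; i++) {
                      flush_work(&chunks[i].work);
                      if (!rc)
                              rc = chunks[i].rc;
              }

              kfree(chunks);
              return rc;
      }

      Whether such a split actually wins depends on the chunk size: for small buffers the cost of queueing and waking workers outweighs the copy itself, which is why the ticket is specifically about large buffers.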

       


          Activity

            [LU-8964] use parallel I/O to improve performance on machines with slow single thread performance
            spitzcor Cory Spitz added a comment -

            simmonsja, for the record, you mean LU-12043. LU-12403 is "add e2fsprog support for RHEL-8".


            simmonsja James A Simmons added a comment -

            LU-12403 will do this work correctly.

            simmonsja James A Simmons added a comment -

            Thanks Patrick for the heads up on ktask. I will be watching it closely and give it a spin under this ticket.

            dmiter Dmitry Eremin (Inactive) added a comment -

            Thanks for the slides. I will look at them carefully. For now, though, I disagree that the padata API has a big overhead. It is mostly negligible compared with the other overhead of passing work to a different thread. Having many threads, on the other hand, will lead to scheduler delays when switching under heavy load. So I think padata will work more stably and predictably in this case.

            paf Patrick Farrell (Inactive) added a comment -

            Also, apologies for not posting these last year.

            paf Patrick Farrell (Inactive) added a comment -

            https://www.eofs.eu/_media/events/devsummit17/patrick_farrell_laddevsummit_pio.pdf

            This is old and out of date, but I wanted to make sure these slides were seen.  I think the performance of the readahead code would probably be helped a lot by changes to the parallelization framework (as would the performance of pio itself).

            So slides 8, 9, and 10 would probably be of particular interest here.  There are significant performance improvements available for PIO just by going from padata to something simpler.  Also, the CPU binding behavior of padata is pretty bad - Binding explicitly to one CPU is problematic.  Padata seems to assume the whole machine is dedicated, which is not a friendly assumption.  (I discovered its CPU binding behavior because I saw performance problems - A particular CPU would be busy and the work assigned to that CPU would be delayed, which delays the completion of the whole i/o.  At this time, other CPUs were idle, and not binding to a specific CPU would have allowed one of them to be used.)

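            To make the binding point concrete, here is a hypothetical contrast using plain workqueue calls (this is not padata's internal code): bound dispatch pins each work item to one CPU, while unbound dispatch lets any idle CPU take it.

            #include <linux/workqueue.h>

            /*
             * Bound dispatch (the behavior criticized above): the item is
             * pinned to the chosen CPU, so if that CPU is busy the whole I/O
             * waits behind it even though other CPUs are idle.
             */
            static void dispatch_bound(struct work_struct *work, int cpu)
            {
                    queue_work_on(cpu, system_wq, work);
            }

            /*
             * Unbound dispatch: the workqueue may run the item on any
             * available CPU, avoiding the stall described above.
             */
            static void dispatch_unbound(struct work_struct *work)
            {
                    queue_work(system_unbound_wq, work);
            }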

            dmiter Dmitry Eremin (Inactive) added a comment -

            The last version of the patch doesn't have an issue with RPC splitting. For reads on my VM I see the following:

            with PIO disabled:

                                    read                    write
            pages per rpc         rpcs   % cum % |       rpcs   % cum %
            1:                       3   4   4   |          0   0   0
            2:                       0   0   4   |          0   0   0
            4:                       0   0   4   |          0   0   0
            8:                       0   0   4   |          0   0   0
            16:                      0   0   4   |          0   0   0
            32:                      0   0   4   |          0   0   0
            64:                      0   0   4   |          0   0   0
            128:                     0   0   4   |          0   0   0
            256:                     0   0   4   |          0   0   0
            512:                     1   1   6   |          0   0   0
            1024:                   62  93 100   |          0   0   0
            

            with PIO enabled:

                                    read                    write
            pages per rpc         rpcs   % cum % |       rpcs   % cum %
            1:                       2   2   2   |          0   0   0
            2:                       0   0   2   |          0   0   0
            4:                       0   0   2   |          0   0   0
            8:                       0   0   2   |          0   0   0
            16:                      0   0   2   |          0   0   0
            32:                      0   0   2   |          0   0   0
            64:                      0   0   2   |          0   0   0
            128:                     0   0   2   |          0   0   0
            256:                     1   1   4   |          0   0   0
            512:                     4   5  10   |          0   0   0
            1024:                   61  89 100   |          0   0   0
            

            dmiter Dmitry Eremin (Inactive) added a comment -

            This is a regression in the last version of my patch, where I turned off async readahead when the PIO flag is not enabled. I'm going to fix this. But in any case I cannot avoid a few single-page requests, and I don't think we should try to fix those, because they keep request latency low. They happen when a user read misses a page and async readahead is initiated, but meanwhile only the single requested page is read so that the user application can be unblocked. On the next read iteration the next page is already available from async readahead.

            paf Patrick Farrell (Inactive) added a comment - edited

            Looking at when these smaller reads happen, they're clustered in the middle of the job, to the point where there are no 1024-page reads for a while (I used the D_INODE debug in osc_build_rpc for this).

            This is for the 4000 MiB case.

            This is the first RPC, which is naturally enough 1 page:
            00000008:00000002:0.0F:1512684358.524801:0:5975:0:(osc_request.c:2073:osc_build_rpc()) @@@ 1 pages, aa ffff880034753170. now 1r/0w in flight req@ffff8

            That's followed by a large # of 1024 page RPCs, though with 1 page RPCs mixed in every so often, which seems weird. It looks like there is no point at which we hit a steady state of only 1024 page RPCs.

            Then, here's the first of the set of only 1 page RPCs:
            00000008:00000002:2.0:1512684366.335774:0:5973:0:(osc_request.c:2073:osc_build_rpc()) @@@ 1 pages, aa ffff88023602c170. now 1r/0w in flight req@ffff88

            There are then NO 1024 page RPCs for some thousands of RPCs. Weirdly, at the end it seems to recover and we do some 1024 page RPCs again. Here's the first of those:
            00000008:00000002:0.0:1512684370.524090:0:5974:0:(osc_request.c:2073:osc_build_rpc()) @@@ 1024 pages, aa ffff8800a506c770. now 1r/0w in flight req@fff

            And here's the last RPC period:
            00000008:00000002:1.0:1512684371.129494:0:5972:0:(osc_request.c:2073:osc_build_rpc()) @@@ 1022 pages, aa ffff88021f8c1c70. now 2r/0w in flight req@fff

            So we spend 8 seconds sending mostly large RPCs, then 4 seconds sending only 4 KiB RPCs, then another ~1 second sending large RPCs again.

            That means that, as you said, we're only sending a few % of the data in those RPCs - about 3% in this case.
            But they're taking about 30% of the total time, and it's all in one big lump.

            Something's wrong.

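            Rough arithmetic behind those percentages, assuming 4 KiB pages and taking the ~13 s spanned by the timestamps above:

            3% of 4000 MiB ≈ 120 MiB, i.e. roughly 120 MiB / 4 KiB ≈ 30,000 single-page RPCs
            4 s / (8 s + 4 s + 1 s) ≈ 31% of the elapsed time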

            People

              Assignee: simmonsja James A Simmons
              Reporter: dmiter Dmitry Eremin (Inactive)
              Votes: 0
              Watchers: 26
