LU-6370: Read performance degrades with increasing read block size.

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.8.0
    • Affects Version/s: Lustre 2.6.0, Lustre 2.7.0, Lustre 2.8.0
    • Environment: Clients running Cray's 2.5 client, which contains a backport of the CLIO changes from 2.6. The problem has also been observed with vanilla 2.6 and 2.7 clients.
    • Severity: 3

    Description

      We're finding substantial read performance degradation with increasing read block sizes. This has been observed on eslogin nodes as well as on internal login nodes. The data provided below was gathered on an external node.

      ext7:/lustre # dd if=/dev/urandom of=3gurandomdata bs=4M count=$((256*3))
      ext7:/lustre # for i in 4K 1M 4M 16M 32M 64M 128M 256M 512M 1G 2G 3G ; do echo -en "$i\t" ; dd if=3gurandomdata bs=${i} of=/dev/null 2>&1 | egrep copied ; done

      4K 3221225472 bytes (3.2 GB) copied, 13.9569 s, 231 MB/s
      1M 3221225472 bytes (3.2 GB) copied, 4.94163 s, 652 MB/s
      4M 3221225472 bytes (3.2 GB) copied, 6.24378 s, 516 MB/s
      16M 3221225472 bytes (3.2 GB) copied, 5.24595 s, 614 MB/s
      32M 3221225472 bytes (3.2 GB) copied, 5.48208 s, 588 MB/s
      64M 3221225472 bytes (3.2 GB) copied, 5.36964 s, 600 MB/s
      128M 3221225472 bytes (3.2 GB) copied, 5.12867 s, 628 MB/s
      256M 3221225472 bytes (3.2 GB) copied, 5.1467 s, 626 MB/s
      512M 3221225472 bytes (3.2 GB) copied, 5.31232 s, 606 MB/s
      1G 3221225472 bytes (3.2 GB) copied, 12.4088 s, 260 MB/s
      2G 3221225472 bytes (3.2 GB) copied, 339.646 s, 9.5 MB/s
      3G 3221225472 bytes (3.2 GB) copied, 350.071 s, 9.2 MB/s

      This shows up on the 1008-OST file system, but on smaller systems the impact is not nearly as substantial. On our 56-OST system we get:
      3G 3221225472 bytes (3.2 GB) copied, 4.77246 s, 675 MB/s

      Another test case, using C code with a single fread() call rather than dd, produced similar results:

      int read_size = 256*1024*1024*2;
      fread(buffer, sizeof(float), read_size, fp_in);
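
      For reference, a minimal, self-contained sketch of that single-fread() test might look like the following. The buffer allocation, error handling, and reuse of the 3gurandomdata file name from the dd test above are assumptions for illustration, not the exact reproducer that was run.

      #include <stdio.h>
      #include <stdlib.h>

      int main(void)
      {
              /* 512 Mi floats * 4 bytes each = one 2 GiB fread(), matching bs=2G */
              size_t read_size = 256UL * 1024 * 1024 * 2;
              float *buffer = malloc(read_size * sizeof(float));
              FILE *fp_in = fopen("3gurandomdata", "r");  /* assumed file name */

              if (buffer == NULL || fp_in == NULL) {
                      perror("setup");
                      return 1;
              }

              size_t nread = fread(buffer, sizeof(float), read_size, fp_in);
              printf("read %zu of %zu elements\n", nread, read_size);

              fclose(fp_in);
              free(buffer);
              return 0;
      }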

      Also, file striping information on the production and TDS filesystems:
      ext8:/lustre # lfs getstripe 3gurandomdata
      3gurandomdata
      lmm_stripe_count: 4
      lmm_stripe_size: 1048576
      lmm_pattern: 1
      lmm_layout_gen: 0
      lmm_stripe_offset: 833
      obdidx objid objid group
      833 5978755 0x5b3a83 0
      834 5953949 0x5ad99d 0
      835 5958818 0x5aeca2 0
      836 5966400 0x5b0a40 0

      ext8:/lustretds # lfs getstripe 3gurandomdata
      3gurandomdata
      lmm_stripe_count: 4
      lmm_stripe_size: 1048576
      lmm_pattern: 1
      lmm_layout_gen: 0
      lmm_stripe_offset: 51
      obdidx objid objid group
      51 1451231 0x1624df 0
      52 1452258 0x1628e2 0
      53 1450278 0x162126 0
      54 1444772 0x160ba4 0

      So this appears to be happening only on wide-stripe file systems. Here's the output from 'perf top' while a 'bad' dd is running:

      • 8.74% [kernel] [k] _spin_lock - _spin_lock
      • 22.23% osc_ap_completion
        osc_extent_finish
        brw_interpret
        ptlrpc_check_set
        ptlrpcd_check
        ptlrpcd
        kthread
        child_rip
        + 13.76% cl_env_put
        + 12.37% cl_env_get
        + 7.10% vvp_write_complete
        + 6.51% kfree
        + 4.62% osc_teardown_async_page
        + 3.96% osc_page_delete
        + 3.89% osc_lru_add_batch
        + 2.69% kmem_cache_free
        + 2.23% osc_page_init
        + 1.71% sptlrpc_import_sec_ref
        + 1.64% osc_page_transfer_add
        + 1.57% osc_io_submit
        + 1.43% cfs_percpt_lock

      Attachments

        1. LU6370_1GB_BS.lctldk.out.gz
          0.2 kB
        2. LU6370_2GB_BS.lctldk.out.gz
          0.2 kB
        3. LU6370_cpu_log_20150402.out.gz
          20 kB
        4. LU6370_max_cached_mb_20150402.out.gz
          521 kB
        5. lu-6370_perf_data.tgz
          5 kB
        6. lu-6370-perf.tgz
          4.03 MB

        Issue Links

          Activity

            Hi Dustin,

            Thanks for the debug information. From the 2G output, I can see some strange things were happening.

            The reading process is 11235.

            At time 1427805640.741591, this process had just finished a read and was then preempted and rescheduled to processor 15.

            From time 1427805640.741714 to 1427805644.468771, kiblnd_cq_completion(), which runs in softirq context, occupied the CPU for over 3 seconds.

            00000080:00000001:13.0:1427805640.741591:0:11235:0:(rw.c:1140:ll_readpage()) Process leaving (rc=0 : 0 : 0)
            00000800:00000200:15.2:1427805640.741714:0:11235:0:(o2iblnd_cb.c:3325:kiblnd_cq_completion()) conn[ffff88081ed6e000] (130)++
            ....
            00000800:00000200:15.2:1427805643.543220:0:11235:0:(o2iblnd_cb.c:3325:kiblnd_cq_completion()) conn[ffff88081ed6e000] (130)++
            00000800:00000200:15.2:1427805643.543271:0:11235:0:(o2iblnd_cb.c:3325:kiblnd_cq_completion()) conn[ffff88081ed6e000] (131)++
            00000800:00000200:15.2:1427805643.543458:0:11235:0:(o2iblnd_cb.c:3325:kiblnd_cq_completion()) conn[ffff88081ed6e000] (131)++
            00000800:00000200:15.2:1427805643.543499:0:11235:0:(o2iblnd_cb.c:3325:kiblnd_cq_completion()) conn[ffff88081ed6e000] (131)++
            00000800:00000200:15.2:1427805643.543588:0:11235:0:(o2iblnd_cb.c:3325:kiblnd_cq_completion()) conn[ffff88081ed6e000] (130)++
            00000800:00000200:15.2:1427805644.468771:0:11235:0:(o2iblnd_cb.c:3325:kiblnd_cq_completion()) conn[ffff88081ed6ea00] (131)++
            00000020:00000001:15.0:1427805644.468787:0:11235:0:(cl_io.c:930:cl_page_list_del()) Process leaving
            

            This happened quite often: from time 1427805640 to 1427805644, and then from 1427805644 to 1427805647. Roughly speaking, the pattern was to run for 1 second and then stall for 3 seconds. I didn't see this happening for the 1G read, but I don't know why.

            Have you ever run this kind of 2G-block-size read job on a pre-2.6 client before?

            I'm still reading the log and trying to dig up more. Stay tuned.

            jay Jinshan Xiong (Inactive) added a comment

            I have gathered the requested debug data for LU6370.

            I used "-1" for LNET debugging, and ran the dd test with 1G and 2G blocksizes on a lustre-2.7 client for 180 seconds.

            Each output file is roughly 30MB, compressed.

            Thanks,
            Dustin

            dustb100 Dustin Leverman added a comment
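
            For reference, the capture procedure Dustin describes above could look roughly like the following sketch. The debug buffer size is an assumption; the 180-second window and the output file name (matching the LU6370_2GB_BS.lctldk.out.gz attachment) are taken from the comment and attachment list above, and these are not necessarily the exact commands that were run.

            # enable full Lustre/LNET debugging and clear the ring buffer
            lctl set_param debug=-1
            lctl set_param debug_mb=1024    # buffer size here is an assumption
            lctl clear

            # start the 2G-block-size read in the background
            dd if=3gurandomdata bs=2G of=/dev/null &

            # let it run for 180 seconds, then dump and compress the debug log
            sleep 180
            lctl dk > LU6370_2GB_BS.lctldk.out
            gzip LU6370_2GB_BS.lctldk.out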
            yujian Jian Yu added a comment -

            Thank you John and Dustin.


            My hosts won't be quiet and available until tomorrow; I'm coordinating with Dustin to hopefully get these uploaded today.

            lewisj John Lewis (Inactive) added a comment

            In that case, I just need the log after it has been running for a few minutes.

            jay Jinshan Xiong (Inactive) added a comment

            Jinshan, what tracing options would you like to see, and for how many seconds? With "-1", I expect a bs=2G count=1 to take approximately 1 hour and to overflow any rational buffer many times over.

            lewisj John Lewis (Inactive) added a comment
            yujian Jian Yu added a comment -

            Can you read just 1 block and collect Lustre log with block size as 1G and 2G specifically?

            Hi John,
            With a stripe count of 4 on a 1008-OST filesystem, read performance degraded substantially when the read block size was increased from 1G to 2G. You've reproduced the problem with only 1 block and collected the performance data with 1G and 2G block sizes separately. Could you please reproduce the issue again and gather Lustre debug logs per Jinshan's suggestion above, to help him investigate further? Thank you!


            Can you read just 1 block and collect Lustre log with block size as 1G and 2G specifically?

            jay Jinshan Xiong (Inactive) added a comment
            lewisj John Lewis (Inactive) added a comment (edited)

            We've found some interesting results while running permutations on file parameters. While the problem is very repeatable in a given configuration, it's not at all stable as we change parameters. Specifically, changing the stripe count from 4 to {1,16,56} alleviates the huge performance degradation.

            The following tables list the time (in seconds, rounded to the nearest second) to copy 3 GB of data from Lustre to /dev/null using dd. Column headers indicate the block size used; row headers indicate the stripe count of the file. Tests were performed on two separate filesystems: a dedicated 56-OST TDS filesystem and a shared (not quiet) 1008-OST filesystem. Timings across multiple runs vary, but are stable to within approximately 10%.

            1008 OST fs
                            bs=512M bs=1G   bs=2G
            Count 1         15      16      23
            Count 4         5       12      384
            Count 16        4       2       2
            Count 56        4       2       2
            
            56 OST fs
                            bs=512M bs=1G   bs=2G
            Count 1         2       2       3
            Count 4         2       2       2
            Count 16        2       2       2
            Count 56        2       2       4
            

            Results above were with stripe sizes of 1M. 4M stripe sizes were also tested but are not listed, as they tracked the 1M results. This testing was performed on a whitebox IB Lustre client running the Cray 2.5 client build. Tests with other client builds are planned.
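
            For reference, a sweep over stripe counts like the one described above could be scripted roughly as follows; the directory and file names are placeholders, and this is a sketch of the methodology rather than the exact commands that were used.

            # create test files with the desired stripe counts (1M stripe size), then fill them
            for count in 1 4 16 56 ; do
                lfs setstripe -c ${count} -S 1M /lustre/stripetest.${count}
                dd if=/dev/urandom of=/lustre/stripetest.${count} bs=4M count=$((256*3))
            done

            # time a full read of each file at each block size
            for count in 1 4 16 56 ; do
                for bs in 512M 1G 2G ; do
                    echo -en "count=${count} bs=${bs}\t"
                    dd if=/lustre/stripetest.${count} bs=${bs} of=/dev/null 2>&1 | egrep copied
                done
            done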


            Jinshan, please see attached reports. count=1 was used for both of these files.

            lewisj John Lewis (Inactive) added a comment

            I can't parse the results on my node:

            [root@mercury ~]# perf report -v -i perf.1G 
            legacy perf.data format
            magic/endian check failed
            incompatible file format (rerun with -v to learn more)[root@mercury ~]# 
            

            Can you do it on your side and post the result here?

            Obviously there are many more perf samples collected for the 2G block size. Please try to read only 1 block with block sizes of 1G and 2G specifically and see what we can get. If we can reproduce the problem with only 1 block, then please turn on debug, run it again, and post the Lustre log here. Thanks,

            jay Jinshan Xiong (Inactive) added a comment
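
            For reference, one way around the perf.data format mismatch, as suggested above, is to render the report to plain text on the node that recorded the samples and attach that instead; the output file name here is illustrative.

            # run on the collecting node so the perf build matches the data format
            perf report --stdio -i perf.1G > perf.1G.report.txt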

            People

              Assignee: jay Jinshan Xiong (Inactive)
              Reporter: simmonsja James A Simmons
              Votes: 0
              Watchers: 15

              Dates

                Created:
                Updated:
                Resolved: