
[LU-6370] Read performance degrades with increasing read block size.

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version: Lustre 2.8.0
    • Affects Versions: Lustre 2.6.0, Lustre 2.7.0, Lustre 2.8.0
    • Environment: Clients running Cray 2.5, which contains a backport of the CLIO changes for 2.6. Problem also observed with vanilla 2.6 and 2.7 clients.
    • Severity: 3

    Description

      We're finding substantial read performance degradation with increasing read block sizes. This has been observed on eslogin nodes as well as on internal login nodes. The data provided below was gathered on an external node.

      ext7:/lustre # dd if=/dev/urandom of=3gurandomdata bs=4M count=$((256*3))
      ext7:/lustre # for i in 4K 1M 4M 16M 32M 64M 128M 256M 512M 1G 2G 3G ; do echo -en "$i\t" ; dd if=3gurandomdata bs=${i} of=/dev/null 2>&1 | egrep copied ; done

      4K 3221225472 bytes (3.2 GB) copied, 13.9569 s, 231 MB/s
      1M 3221225472 bytes (3.2 GB) copied, 4.94163 s, 652 MB/s
      4M 3221225472 bytes (3.2 GB) copied, 6.24378 s, 516 MB/s
      16M 3221225472 bytes (3.2 GB) copied, 5.24595 s, 614 MB/s
      32M 3221225472 bytes (3.2 GB) copied, 5.48208 s, 588 MB/s
      64M 3221225472 bytes (3.2 GB) copied, 5.36964 s, 600 MB/s
      128M 3221225472 bytes (3.2 GB) copied, 5.12867 s, 628 MB/s
      256M 3221225472 bytes (3.2 GB) copied, 5.1467 s, 626 MB/s
      512M 3221225472 bytes (3.2 GB) copied, 5.31232 s, 606 MB/s
      1G 3221225472 bytes (3.2 GB) copied, 12.4088 s, 260 MB/s
      2G 3221225472 bytes (3.2 GB) copied, 339.646 s, 9.5 MB/s
      3G 3221225472 bytes (3.2 GB) copied, 350.071 s, 9.2 MB/s

      This shows up on the 1008-OST striped file system, but on smaller systems the impact is not nearly so substantial. On our 56-OST system we get:
      3G 3221225472 bytes (3.2 GB) copied, 4.77246 s, 675 MB/s

      Another test case, using C code rather than dd, provided similar results based on an fread call:

      int read_size = 256*1024*1024*2;
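      /* 2^29 float elements * 4 bytes each: a single ~2 GiB fread() request */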
      fread(buffer, sizeof(float), read_size, fp_in);

      Also, file striping information on the production and TDS filesystems:
      ext8:/lustre # lfs getstripe 3gurandomdata
      3gurandomdata
      lmm_stripe_count: 4
      lmm_stripe_size: 1048576
      lmm_pattern: 1
      lmm_layout_gen: 0
      lmm_stripe_offset: 833
      obdidx objid objid group
      833 5978755 0x5b3a83 0
      834 5953949 0x5ad99d 0
      835 5958818 0x5aeca2 0
      836 5966400 0x5b0a40 0

      ext8:/lustretds # lfs getstripe 3gurandomdata
      3gurandomdata
      lmm_stripe_count: 4
      lmm_stripe_size: 1048576
      lmm_pattern: 1
      lmm_layout_gen: 0
      lmm_stripe_offset: 51
      obdidx objid objid group
      51 1451231 0x1624df 0
      52 1452258 0x1628e2 0
      53 1450278 0x162126 0
      54 1444772 0x160ba4 0

      So this appears to only be happening on wide-stripe file systems. Here's the output from 'perf top' while a 'bad' dd is running:

      8.74%  [kernel]  [k] _spin_lock
        - _spin_lock
           - 22.23% osc_ap_completion
                     osc_extent_finish
                     brw_interpret
                     ptlrpc_check_set
                     ptlrpcd_check
                     ptlrpcd
                     kthread
                     child_rip
           + 13.76% cl_env_put
           + 12.37% cl_env_get
           + 7.10% vvp_write_complete
           + 6.51% kfree
           + 4.62% osc_teardown_async_page
           + 3.96% osc_page_delete
           + 3.89% osc_lru_add_batch
           + 2.69% kmem_cache_free
           + 2.23% osc_page_init
           + 1.71% sptlrpc_import_sec_ref
           + 1.64% osc_page_transfer_add
           + 1.57% osc_io_submit
           + 1.43% cfs_percpt_lock

      Attachments

        1. LU6370_1GB_BS.lctldk.out.gz
          0.2 kB
        2. LU6370_2GB_BS.lctldk.out.gz
          0.2 kB
        3. LU6370_cpu_log_20150402.out.gz
          20 kB
        4. LU6370_max_cached_mb_20150402.out.gz
          521 kB
        5. lu-6370_perf_data.tgz
          5 kB
        6. lu-6370-perf.tgz
          4.03 MB


          Activity


            jay Jinshan Xiong (Inactive) added a comment -

            In that case, I just need the log after it has been running for a few minutes.

            lewisj John Lewis (Inactive) added a comment -

            Jinshan, what tracing options would you like to see, and for how many seconds? With "-1", I expect a bs=2G count=1 to take approximately 1 hour and to overflow any rational buffer many times over.
            yujian Jian Yu added a comment -

            "Can you read just 1 block and collect Lustre log with block size as 1G and 2G specifically?"

            Hi John,
            With a stripe count of 4 on a 1008-OST filesystem, read performance degraded substantially when the read block size was increased from 1G to 2G. You’ve reproduced the problem with only 1 block, and collected the performance data with 1G and 2G block sizes separately. Could you please reproduce the issue again and gather Lustre debug logs as per Jinshan’s suggestion above, to help him investigate further? Thank you!
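            For reference, a minimal sketch of how such a log might be captured on the client (this procedure is an assumption for illustration, not a transcript from this ticket; the debug mask, buffer size, and output file names are only examples):

            lctl set_param debug=-1          # enable all debug flags (very verbose)
            lctl set_param debug_mb=1024     # enlarge the kernel debug buffer so it wraps less
            lctl clear                       # discard anything already buffered
            dd if=3gurandomdata of=/dev/null bs=1G count=1
            lctl dk > LU6370_1GB_BS.lctldk.out
            lctl clear
            dd if=3gurandomdata of=/dev/null bs=2G count=1
            lctl dk > LU6370_2GB_BS.lctldk.out
            lctl set_param debug=0           # reset; restore the site's usual debug mask afterwards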


            jay Jinshan Xiong (Inactive) added a comment -

            Can you read just 1 block and collect Lustre log with block size as 1G and 2G specifically?
            lewisj John Lewis (Inactive) added a comment - - edited

            We've found some interesting results while running permutations on file parameters. While the problem is very repeatable in a given configuration, it's not at all stable as we change parameters. Specifically, changing the stripe count from 4 to {1,16,56} alleviates the huge performance degradation.

            The following tables list the time (in seconds, rounded to the nearest second) to copy 3 GB of data from Lustre to /dev/null using dd. Column headers indicate the block size used; row headers indicate the stripe count of the file. Tests were performed on two separate filesystems: a dedicated 56-OST TDS filesystem and a shared (not quiet) 1008-OST filesystem. Timings across multiple runs vary, but are stable to within approximately 10%.

            1008 OST fs
                            bs=512M bs=1G   bs=2G
            Count 1         15      16      23
            Count 4         5       12      384
            Count 16        4       2       2
            Count 56        4       2       2
            
            56 OST fs
                            bs=512M bs=1G   bs=2G
            Count 1         2       2       3
            Count 4         2       2       2
            Count 16        2       2       2
            Count 56        2       2       4
            

            Results above were with stripe widths of 1M. 4M widths were also tested, but are not listed as they tracked the 1M results. This testing was performed on a whitebox IB Lustre client using a Cray 2.5 client. Tests with other client builds are planned.
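            For illustration, the matrix above could be produced with a loop along these lines (a sketch under assumptions, not the reporter's actual script; the file names are arbitrary and the lfs setstripe option spelling may vary by release):

            # Pre-create a 3 GB test file at each stripe count (1M stripe size),
            # then time dd reads at each block size.
            for count in 1 4 16 56 ; do
                lfs setstripe -c ${count} -S 1M stripetest.c${count}
                dd if=/dev/urandom of=stripetest.c${count} bs=4M count=$((256*3))
            done
            for count in 1 4 16 56 ; do
                for bs in 512M 1G 2G ; do
                    echo -en "count=${count} bs=${bs}\t"
                    dd if=stripetest.c${count} of=/dev/null bs=${bs} 2>&1 | egrep copied
                done
            done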


            lewisj John Lewis (Inactive) added a comment -

            Jinshan, please see attached reports. count=1 was used for both of these files.

            jay Jinshan Xiong (Inactive) added a comment -

            I can't parse the results on my node:

            [root@mercury ~]# perf report -v -i perf.1G 
            legacy perf.data format
            magic/endian check failed
            incompatible file format (rerun with -v to learn more)[root@mercury ~]# 
            

            Can you do it on your side and post the result here?

            Obviously there are many more perf samples collected for the 2G block size. Please try to read only 1 block with block sizes of 1G and 2G specifically and see what we can get. If we can reproduce the problem with only 1 block, then please turn on debug, run it again, and post the Lustre log here. Thanks.


            lewisj John Lewis (Inactive) added a comment -

            perf results from bs={1G,2G} testing:

            ext8:/lustre # (perf record dd if=3gurandomdata of=/dev/null bs=1G ; rm perf.data) >/dev/null 2>&1
            ext8:/lustre # perf record dd if=3gurandomdata of=/dev/null bs=1G ; mv perf.data perf.1G
            3+0 records in
            3+0 records out
            3221225472 bytes (3.2 GB) copied, 10.8859 s, 296 MB/s
            [ perf record: Woken up 1 times to write data ]
            [ perf record: Captured and wrote 0.423 MB perf.data (~18500 samples) ]
            titan-ext8:/lustre/atlas/scratch/lewisj/ven004/read_test_dd # ls -lrt
            total 3145745
            -rw-r--r-- 1 root root 3221225472 Mar 16 14:21 3gurandomdata
            -rw------- 1 root root     445452 Mar 17 13:05 perf.1G
            ext8:/lustre # perf record dd if=3gurandomdata of=/dev/null bs=2G ; mv perf.data perf.2G
            0+2 records in
            0+2 records out
            3221225472 bytes (3.2 GB) copied, 416.124 s, 7.7 MB/s
            [ perf record: Woken up 61 times to write data ]
            [ perf record: Captured and wrote 15.869 MB perf.data (~693322 samples) ]


            paf Patrick Farrell (Inactive) added a comment -

            Jinshan - They're referring to LU-3321. I wouldn't call it CLIO work either.

            jay Jinshan Xiong (Inactive) added a comment -

            Since the problem can be seen on vanilla 2.6, it's not related to the CLIO simplification work. Maybe some changes between 2.5 and 2.6 caused this problem.

            From the test results, there is a huge performance degradation when the read size increases from 1G to 2G. Will you run those two tests again and use perf to collect performance data?

            paf Patrick Farrell (Inactive) added a comment -

            This definitely seems to be file system size dependent. I've tested various Intel client versions and an almost identical Cray client in a VM environment, and while I (sometimes) saw a slight drop-off at higher read sizes, it's on the order of 5-20%; nothing like that reported by ORNL.

            I've found similar results on our in house hardware. Not surprisingly, we don't have anything in house that approaches the size of the file systems connected to Titan.

            I'd be very curious to see actual test results from a 2.x client version that does not have this issue, so we can confirm this as a regression from a particular point. I've been told that's the case, but I have yet to see data.


            People

              Assignee: jay Jinshan Xiong (Inactive)
              Reporter: simmonsja James A Simmons
              Votes: 0
              Watchers: 15

              Dates

                Created:
                Updated:
                Resolved: