
LU-6370: Read performance degrades with increasing read block size.

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.8.0
    • Affects Version/s: Lustre 2.6.0, Lustre 2.7.0, Lustre 2.8.0
    • Environment: Clients running Cray 2.5, which contains a backport of the CLIO changes from 2.6. The problem has also been observed with vanilla 2.6 and 2.7 clients.
    • Severity: 3

    Description

      We're seeing substantial read performance degradation as the read block size increases. This has been observed on esLogin nodes as well as on internal login nodes. The data below was gathered on an external node.

      ext7:/lustre # dd if=/dev/urandom of=3gurandomdata bs=4M count=$((256*3))
      ext7:/lustre # for i in 4K 1M 4M 16M 32M 64M 128M 256M 512M 1G 2G 3G ; do echo -en "$i\t" ; dd if=3gurandomdata bs=${i} of=/dev/null 2>&1 | egrep copied ; done

      4K 3221225472 bytes (3.2 GB) copied, 13.9569 s, 231 MB/s
      1M 3221225472 bytes (3.2 GB) copied, 4.94163 s, 652 MB/s
      4M 3221225472 bytes (3.2 GB) copied, 6.24378 s, 516 MB/s
      16M 3221225472 bytes (3.2 GB) copied, 5.24595 s, 614 MB/s
      32M 3221225472 bytes (3.2 GB) copied, 5.48208 s, 588 MB/s
      64M 3221225472 bytes (3.2 GB) copied, 5.36964 s, 600 MB/s
      128M 3221225472 bytes (3.2 GB) copied, 5.12867 s, 628 MB/s
      256M 3221225472 bytes (3.2 GB) copied, 5.1467 s, 626 MB/s
      512M 3221225472 bytes (3.2 GB) copied, 5.31232 s, 606 MB/s
      1G 3221225472 bytes (3.2 GB) copied, 12.4088 s, 260 MB/s
      2G 3221225472 bytes (3.2 GB) copied, 339.646 s, 9.5 MB/s
      3G 3221225472 bytes (3.2 GB) copied, 350.071 s, 9.2 MB/s

      This shows up on a file system with 1008 OSTs, but on smaller systems the impact is not nearly so substantial. On our 56-OST system we get:
      3G 3221225472 bytes (3.2 GB) copied, 4.77246 s, 675 MB/s

      Another test case, using C code rather than dd, produced similar results based on a single large fread call:

      /* 256*1024*1024*2 float elements * sizeof(float) (4 bytes) = 2 GiB in a single fread() */
      int read_size = 256*1024*1024*2;
      fread(buffer, sizeof(float), read_size, fp_in);

      Also, here is the file striping information on the production and TDS filesystems:
      ext8:/lustre # lfs getstripe 3gurandomdata
      3gurandomdata
      lmm_stripe_count: 4
      lmm_stripe_size: 1048576
      lmm_pattern: 1
      lmm_layout_gen: 0
      lmm_stripe_offset: 833
      obdidx objid objid group
      833 5978755 0x5b3a83 0
      834 5953949 0x5ad99d 0
      835 5958818 0x5aeca2 0
      836 5966400 0x5b0a40 0

      ext8:/lustretds # lfs getstripe 3gurandomdata
      3gurandomdata
      lmm_stripe_count: 4
      lmm_stripe_size: 1048576
      lmm_pattern: 1
      lmm_layout_gen: 0
      lmm_stripe_offset: 51
      obdidx objid objid group
      51 1451231 0x1624df 0
      52 1452258 0x1628e2 0
      53 1450278 0x162126 0
      54 1444772 0x160ba4 0

      So this appears to only be happening on wide-stripe file systems. Here's the output from 'perf top' while a 'bad' dd is running:

      • 8.74% [kernel] [k] _spin_lock - _spin_lock
      • 22.23% osc_ap_completion
        osc_extent_finish
        brw_interpret
        ptlrpc_check_set
        ptlrpcd_check
        ptlrpcd
        kthread
        child_rip
        + 13.76% cl_env_put
        + 12.37% cl_env_get
        + 7.10% vvp_write_complete
        + 6.51% kfree
        + 4.62% osc_teardown_async_page
        + 3.96% osc_page_delete
        + 3.89% osc_lru_add_batch
        + 2.69% kmem_cache_free
        + 2.23% osc_page_init
        + 1.71% sptlrpc_import_sec_ref
        + 1.64% osc_page_transfer_add
        + 1.57% osc_io_submit
        + 1.43% cfs_percpt_lock
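
      A comparable call-graph profile can be captured non-interactively while the slow dd runs; the sketch below uses stock perf (the 30-second window and report depth are arbitrary):

      # Record call graphs system-wide for 30 s while the 'bad' dd is running,
      # then print a flat summary of the hottest symbols.
      perf record -a -g -- sleep 30
      perf report --stdio --sort symbol | head -n 50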

      Attachments

        1. LU6370_1GB_BS.lctldk.out.gz
          0.2 kB
        2. LU6370_2GB_BS.lctldk.out.gz
          0.2 kB
        3. LU6370_cpu_log_20150402.out.gz
          20 kB
        4. LU6370_max_cached_mb_20150402.out.gz
          521 kB
        5. lu-6370_perf_data.tgz
          5 kB
        6. lu-6370-perf.tgz
          4.03 MB


          Activity

            ezell Matt Ezell added a comment -

            Jinshan - good news that you understand the root cause of the issue.

            Is readahead not bounded by:

            # lctl get_param llite.*.max_read_ahead_mb
            llite.atlas1-ffff8817e6a85000.max_read_ahead_mb=40
            llite.atlas2-ffff880fef58fc00.max_read_ahead_mb=40
            llite.linkfarm-ffff8827ef28c000.max_read_ahead_mb=40
            

            Is there a different way that we can tune readahead?
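
            For reference, a sketch of the readahead-related llite tunables on the client (the set_param value below is illustrative, not a recommendation):

            # Readahead-related client tunables
            lctl get_param llite.*.max_read_ahead_mb
            lctl get_param llite.*.max_read_ahead_per_file_mb
            lctl get_param llite.*.max_read_ahead_whole_mb
            # e.g. widen the global readahead window on one file system
            lctl set_param llite.atlas1-*.max_read_ahead_mb=256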

            Anyway, we would appreciate a "hot fix" against master. We can probably trivially backport it to b2_7 for our non-Cray clients. James Simmons and Cray can figure out how easy it will be to backport to Cray's 2.5 branch.

            Thanks!


            jay Jinshan Xiong (Inactive) added a comment -

            I've found the root cause of this issue - and at last I can reproduce it by adding some tricks to the code. The problem boils down to an inconsistency between LRU management and the readahead algorithm. The end result is that readahead brings a lot of pages into memory, but the LRU drops them because the per-OSC LRU budget is too tight.

            It would take a huge effort to make a general policy that fits every I/O case. However, I can make a hot fix to solve the problem you're experiencing. If that's okay with you, I'll make a patch for master.
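
            That interaction can be watched from the outside while the large-block-size read runs; a rough sketch (it polls the same max_cached_mb counters that are attached to this ticket):

            # Poll the client page-cache accounting once a second during the read;
            # the per-OSC LRU slots are carved out of this shared pool.
            while true; do
                date
                lctl get_param llite.*.max_cached_mb
                sleep 1
            done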


            paf Patrick Farrell (Inactive) added a comment -

            Jinshan -

            Since the problem is in 2.6/2.7, Cray should be able to handle backporting a patch done against a newer version. In the end, our client is our responsibility. If it does prove problematic, and ORNL would like to ask for your help in getting that fix onto Cray's 2.5 (and you feel that's covered by ORNL's arrangements with Intel), I assume we'd be able to provide the code base (and we would be happy to have the help if it's needed).

            - Patrick

            jay Jinshan Xiong (Inactive) added a comment -

            Sooner or later I will need the code base for a fix.

            Dustin, can you please help me do one more thing - while the read is going on, run `echo t > /proc/sysrq-trigger; dmesg > dmesg-$(date +%s).log` on the console so that it generates stack traces for all processes running on the system. You may do it a couple of times at one-second intervals to make sure it collects a sufficient log. Once this is done, please send me all of the dmesg-xyz.log files.
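
            Something like this loop would do it (a sketch of the above; the count of three snapshots is arbitrary):

            # Dump stack traces of all tasks a few times, one second apart,
            # saving each dmesg snapshot to its own timestamped file.
            for i in 1 2 3; do
                echo t > /proc/sysrq-trigger
                dmesg > dmesg-$(date +%s).log
                sleep 1
            done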

            Thanks,


            dustb100 Dustin Leverman added a comment -

            Patrick: I did not have a lustre-2.6 client lying around and had to build one and test it. The answer to your question, though, is that the problem is reproducible in lustre-2.6.

            Thanks,
            Dustin


            paf Patrick Farrell (Inactive) added a comment - edited

            We can certainly do that (the GPL requires we do it), but I can't personally without permission from my boss (who's not available for a few days). As our customer, ORNL has the legal right to have the source, but I think Cray would want some sort of formal request...? If the ORNL staff feel it's important, I can look into getting permission.

            At the least, I can't do it right away, so for this bug it's likely going to be easier to proceed with examining it from a 2.7 perspective.

            Matt/Dustin - are you able to confirm this happens in 2.6? (Sorry if you already have and I missed it.) If it doesn't happen in 2.6 (but does in Cray's 2.5 and Intel's 2.7), then LU-3321 is unlikely to be the culprit.

            ezell Matt Ezell added a comment -

            > You mentioned that you've backported LU-3321, is it possible for me to get access to the code?

            Cray (Patrick?) would need to provide this. We can either track this down with Cray's client that includes LU-3321, or with b2_7/master that also includes this patch (plus many more things).


            jay Jinshan Xiong (Inactive) added a comment -

            Hi Dustin,

            Thanks for the debug info. You mentioned that you've backported LU-3321; is it possible for me to get access to the code?


            dustb100 Dustin Leverman added a comment -

            Jinshan,
            I have gathered the CPU data during the run using sar. The output is kinda messy, but it is all there. I also gathered the max_cached_mb data you requested.

            Please let me know if you need additional info.

            Thanks,
            Dustin


            paf Patrick Farrell (Inactive) added a comment -

            Sorry if I've missed this in the shuffle, but have you tested 2.6, Dustin?

            The latest Cray clients on your mainframe have the performance improvements from LU-3321 backported into their version of 2.5, so if you don't see the problem in Intel's 2.5 but do see it in Cray's 2.5 and in Intel's 2.6 and 2.7, that would suggest a place to look.


            dustb100 Dustin Leverman added a comment -

            I do not see the same issue with a block size of 4G on the 2.5.3 client (it takes ~6 seconds), but the problem still exists on the 2.7 client (I let it run for about 30 minutes and then killed it).

            I am working on gathering the requested CPU and LRU info. I will have it for you soon.

            -Dustin


            People

              Assignee: jay Jinshan Xiong (Inactive)
              Reporter: simmonsja James A Simmons
              Votes: 0
              Watchers: 15

              Dates

                Created:
                Updated:
                Resolved: