[LU-6370] Read performance degrades with increasing read block size. - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Major
Fix Version/s: Lustre 2.8.0
Affects Version/s: Lustre 2.6.0, Lustre 2.7.0, Lustre 2.8.0
Labels:
- clio
Environment:
Clients running a Cray 2.5 which contains backport of CLIO changes for 2.6. Problem also observed with vanilla 2.6 and 2.7 clients,

Epic/Theme:
- Performance
Severity:
3
Rank (Obsolete):
9223372036854775807

Description

We're finding substantial read performance degradations with increasing read block sizes. This has been observed in eslogin nodes as well as in internal login nodes. Data provided below was gathered on an external node.

ext7:/lustre # dd if=/dev/urandom of=3gurandomdata bs=4M count=$((256*3))
ext7:/lustre # for i in 4K 1M 4M 16M 32M 64M 128M 256M 512M 1G 2G 3G ; do echo -en "$i\t" ; dd if=3gurandomdata bs=${i} of=/dev/null 2>&1 | egrep copied ; done

4K 3221225472 bytes (3.2 GB) copied, 13.9569 s, 231 MB/s
1M 3221225472 bytes (3.2 GB) copied, 4.94163 s, 652 MB/s
4M 3221225472 bytes (3.2 GB) copied, 6.24378 s, 516 MB/s
16M 3221225472 bytes (3.2 GB) copied, 5.24595 s, 614 MB/s
32M 3221225472 bytes (3.2 GB) copied, 5.48208 s, 588 MB/s
64M 3221225472 bytes (3.2 GB) copied, 5.36964 s, 600 MB/s
128M 3221225472 bytes (3.2 GB) copied, 5.12867 s, 628 MB/s
256M 3221225472 bytes (3.2 GB) copied, 5.1467 s, 626 MB/s
512M 3221225472 bytes (3.2 GB) copied, 5.31232 s, 606 MB/s
1G 3221225472 bytes (3.2 GB) copied, 12.4088 s, 260 MB/s
2G 3221225472 bytes (3.2 GB) copied, 339.646 s, 9.5 MB/s
3G 3221225472 bytes (3.2 GB) copied, 350.071 s, 9.2 MB/s

This shows up on 1008 striped file system but on smaller systems the impact is not nearly so substantial. On our 56 OST system we get
3G 3221225472 bytes (3.2 GB) copied, 4.77246 s, 675 MB/s

Another test case was used with C code rather than dd that provided similar results based on an fread call

int read_size = 256*1024*1024*2;
fread(buffer, sizeof(float), read_size, fp_in);

Also, file striping information on production and tds filesystems:
ext8:/lustre # lfs getstripe 3gurandomdata
3gurandomdata
lmm_stripe_count: 4
lmm_stripe_size: 1048576
lmm_pattern: 1
lmm_layout_gen: 0
lmm_stripe_offset: 833
obdidx objid objid group
833 5978755 0x5b3a83 0
834 5953949 0x5ad99d 0
835 5958818 0x5aeca2 0
836 5966400 0x5b0a40 0

ext8:/lustretds # lfs getstripe 3gurandomdata
3gurandomdata
lmm_stripe_count: 4
lmm_stripe_size: 1048576
lmm_pattern: 1
lmm_layout_gen: 0
lmm_stripe_offset: 51
obdidx objid objid group
51 1451231 0x1624df 0
52 1452258 0x1628e2 0
53 1450278 0x162126 0
54 1444772 0x160ba4 0

So this appears to only be happening on wide-stripe file systems. Here's the output from 'perf top' while a 'bad' dd is running:

8.74% [kernel] [k] _spin_lock - _spin_lock
22.23% osc_ap_completion
osc_extent_finish
brw_interpret
ptlrpc_check_set
ptlrpcd_check
ptlrpcd
kthread
child_rip
+ 13.76% cl_env_put
+ 12.37% cl_env_get
+ 7.10% vvp_write_complete
+ 6.51% kfree
+ 4.62% osc_teardown_async_page
+ 3.96% osc_page_delete
+ 3.89% osc_lru_add_batch
+ 2.69% kmem_cache_free
+ 2.23% osc_page_init
+ 1.71% sptlrpc_import_sec_ref
+ 1.64% osc_page_transfer_add
+ 1.57% osc_io_submit
+ 1.43% cfs_percpt_lock

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

lu-6370-perf.tgz
4.03 MB
17/Mar/15 6:03 PM
lu-6370_perf_data.tgz
5 kB
18/Mar/15 3:37 PM
LU6370_1GB_BS.lctldk.out.gz
0.2 kB
31/Mar/15 1:26 PM
LU6370_2GB_BS.lctldk.out.gz
0.2 kB
31/Mar/15 1:26 PM
LU6370_cpu_log_20150402.out.gz
20 kB
02/Apr/15 2:55 PM
LU6370_max_cached_mb_20150402.out.gz
521 kB
02/Apr/15 2:55 PM

Issue Links

mentioned in: Page No Confluence page found with the given URL.

Activity

[LU-6370] Read performance degrades with increasing read block size.

Peter Jones added a comment - 26/May/15 11:32 PM

Landed for 2.8

Peter Jones added a comment - 26/May/15 11:32 PM Landed for 2.8

Gerrit Updater added a comment - 09/Apr/15 3:23 AM

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/14347/
Subject: ~~LU-6370~~ osc: disable to control per-OSC LRU budget
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 642dd7e50b4ff39e91c5fd0a771a26c59b5b6637

Gerrit Updater added a comment - 09/Apr/15 3:23 AM Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/14347/ Subject: LU-6370 osc: disable to control per-OSC LRU budget Project: fs/lustre-release Branch: master Current Patch Set: Commit: 642dd7e50b4ff39e91c5fd0a771a26c59b5b6637

Dustin Leverman added a comment - 08/Apr/15 8:00 PM

I was able to test this on our TDS this afternoon and it appears to have fixed the problem. Our next step is to test this on a Cray client. We will send out the results.

Below is the output from my run:

[root@atlas-tds-mds2 leverman]# dd if=10GB.out of=10GB.out.test bs=2G count=1
0+1 records in
0+1 records out
2147479552 bytes (2.1 GB) copied, 6.37421 s, 337 MB/s
[root@atlas-tds-mds2 leverman]# rpm -qa| grep lustre
lustre-client-2.7.0-2.6.322.6.322.6.322.6.32_431.17.1.el6.wc.x86_64.x86_64
lustre-client-modules-2.7.0-2.6.322.6.322.6.322.6.32_431.17.1.el6.wc.x86_64.x86_64

Dustin Leverman added a comment - 08/Apr/15 8:00 PM I was able to test this on our TDS this afternoon and it appears to have fixed the problem. Our next step is to test this on a Cray client. We will send out the results. Below is the output from my run: [root@atlas-tds-mds2 leverman] # dd if=10GB.out of=10GB.out.test bs=2G count=1 0+1 records in 0+1 records out 2147479552 bytes (2.1 GB) copied, 6.37421 s, 337 MB/s [root@atlas-tds-mds2 leverman] # rpm -qa| grep lustre lustre-client-2.7.0-2.6.322.6.322.6.322.6.32_431.17.1.el6.wc.x86_64.x86_64 lustre-client-modules-2.7.0-2.6.322.6.322.6.322.6.32_431.17.1.el6.wc.x86_64.x86_64

Patrick Farrell (Inactive) added a comment - 03/Apr/15 7:33 PM

Jinshan - Ah, OK. Makes sense. I guess doing a % would only help a little in certain edge cases.

Matt - Cray can make sure the patch passes sanity testing before sending it your way, but it's tricky for us to verify it solves the problem with a large OST/OSC count.

It would be very helpful if ORNL can verify it clears up the performance problem on your non-Cray clients before we proceed with the patch - And it would be more convincing than attempting to replicate it here, since we don't have physical hardware on the scale of Titan available for testing. We can use various tricks to get OST count up on limited hardware, but it's not nearly the same.

Patrick Farrell (Inactive) added a comment - 03/Apr/15 7:33 PM Jinshan - Ah, OK. Makes sense. I guess doing a % would only help a little in certain edge cases. Matt - Cray can make sure the patch passes sanity testing before sending it your way, but it's tricky for us to verify it solves the problem with a large OST/OSC count. It would be very helpful if ORNL can verify it clears up the performance problem on your non-Cray clients before we proceed with the patch - And it would be more convincing than attempting to replicate it here, since we don't have physical hardware on the scale of Titan available for testing. We can use various tricks to get OST count up on limited hardware, but it's not nearly the same.

Jinshan Xiong (Inactive) added a comment - 03/Apr/15 6:20 PM

the ultimate solution would be a self adaptive policy that an OSC can use as many LRU slots as it wants, if there is no competition from other OSCs. However, once other OSCs are starting consuming LRU slots, the over budget OSC should release slots with a faster speed to maintain fairness.

There is no difference to use percentage of slots an OSC can use in maximum - from my point of view.

Jinshan Xiong (Inactive) added a comment - 03/Apr/15 6:20 PM the ultimate solution would be a self adaptive policy that an OSC can use as many LRU slots as it wants, if there is no competition from other OSCs. However, once other OSCs are starting consuming LRU slots, the over budget OSC should release slots with a faster speed to maintain fairness. There is no difference to use percentage of slots an OSC can use in maximum - from my point of view.

Patrick Farrell (Inactive) added a comment - 03/Apr/15 4:08 PM

Matt -

The code in question is effectively identical in Cray's 2.5, so porting the patch is trivial.

Jinshan -

I've been thinking about this and had a suggestion.

First, I'll explain my understanding in case I've missed something.
The current code first checks to see if we're running out of LRU slots. If so, decides to free pages (more or less of them depending on whether or not the OSC is over budget).

Also, separately, it checks to see if the OSC in question is at more than 2*budget, and if so, decides to free pages.

The problem is that these large reads are overflowing that 2*budget limit for a particular OSC, so your patch comments out that limit, which would allow a particular OSC to consume any amount of cache, as long as LRU slots are available.

The reason this is a particular issue for ORNL is the per OSC cache budget is calculated by taking the total budget and dividing by the number of OSCs. Since ORNL has a very large number of OSTs, this means the budget for each OSC could be quite small.

In general, freeing down to max_cache/[number_of_OSCs] when low on LRU pages seems correct, but we'd like to let a particular OSC use a larger portion of cache if it's available, but... Probably not ALL of it.

And without your patch, the limit on that larger OSC is 2*budget. How about instead making it a % of total cache? That would cover the case when budget is small due to the number of OSCs without letting a single OSC totally dominate the cache (which is the downside to your quick fix). It could perhaps be made a tunable - "max_single_osc_cache_percent" or similar?

Patrick Farrell (Inactive) added a comment - 03/Apr/15 4:08 PM Matt - The code in question is effectively identical in Cray's 2.5, so porting the patch is trivial. Jinshan - I've been thinking about this and had a suggestion. First, I'll explain my understanding in case I've missed something. The current code first checks to see if we're running out of LRU slots. If so, decides to free pages (more or less of them depending on whether or not the OSC is over budget). Also, separately, it checks to see if the OSC in question is at more than 2*budget, and if so, decides to free pages. The problem is that these large reads are overflowing that 2*budget limit for a particular OSC, so your patch comments out that limit, which would allow a particular OSC to consume any amount of cache, as long as LRU slots are available. The reason this is a particular issue for ORNL is the per OSC cache budget is calculated by taking the total budget and dividing by the number of OSCs. Since ORNL has a very large number of OSTs, this means the budget for each OSC could be quite small. In general, freeing down to max_cache/ [number_of_OSCs] when low on LRU pages seems correct, but we'd like to let a particular OSC use a larger portion of cache if it's available, but... Probably not ALL of it. And without your patch, the limit on that larger OSC is 2*budget. How about instead making it a % of total cache? That would cover the case when budget is small due to the number of OSCs without letting a single OSC totally dominate the cache (which is the downside to your quick fix). It could perhaps be made a tunable - "max_single_osc_cache_percent" or similar?

Gerrit Updater added a comment - 03/Apr/15 5:57 AM

Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: http://review.whamcloud.com/14347
Subject: ~~LU-6370~~ osc: disable to control per-OSC LRU budget
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 19ac637dc7256ba64544df5ab3d2c176c364a27e

Gerrit Updater added a comment - 03/Apr/15 5:57 AM Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: http://review.whamcloud.com/14347 Subject: LU-6370 osc: disable to control per-OSC LRU budget Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 19ac637dc7256ba64544df5ab3d2c176c364a27e

Matt Ezell added a comment - 03/Apr/15 1:19 AM

Jinshan - good news that you understand the root cause of the issue.

Is readahead not bounded by:

# lctl get_param llite.*.max_read_ahead_mb
llite.atlas1-ffff8817e6a85000.max_read_ahead_mb=40
llite.atlas2-ffff880fef58fc00.max_read_ahead_mb=40
llite.linkfarm-ffff8827ef28c000.max_read_ahead_mb=40

Is there a different way that we can tune readahead?

Anyway, we would appreciate a "hot fix" against master. We can probably trivially backport it to b2_7 for our non-Cray clients. James Simmons and Cray can figure out how easy it will be to backport to Cray's 2.5 branch.

Thanks!

Matt Ezell added a comment - 03/Apr/15 1:19 AM Jinshan - good news that you understand the root cause of the issue. Is readahead not bounded by: # lctl get_param llite.*.max_read_ahead_mb llite.atlas1-ffff8817e6a85000.max_read_ahead_mb=40 llite.atlas2-ffff880fef58fc00.max_read_ahead_mb=40 llite.linkfarm-ffff8827ef28c000.max_read_ahead_mb=40 Is there a different way that we can tune readahead? Anyway, we would appreciate a "hot fix" against master. We can probably trivially backport it to b2_7 for our non-Cray clients. James Simmons and Cray can figure out how easy it will be to backport to Cray's 2.5 branch. Thanks!

Jinshan Xiong (Inactive) added a comment - 03/Apr/15 12:06 AM

I've known the root cause of this issue - and at last I can reproduce it by adding some tricks in the code. The problem boils down to the inconsistency of LRU management and read ahead algorithm. The end result is that read ahead brings a lot of pages into the memory but LRU drops them due to tightness of per-OSC LRU budget.

It will take huge effort to make a general policy to fit every I/O cases. However, I can make a hot fix to solve the problem you're experiencing. If that's okay for you, I can make a patch for master.

Jinshan Xiong (Inactive) added a comment - 03/Apr/15 12:06 AM I've known the root cause of this issue - and at last I can reproduce it by adding some tricks in the code. The problem boils down to the inconsistency of LRU management and read ahead algorithm. The end result is that read ahead brings a lot of pages into the memory but LRU drops them due to tightness of per-OSC LRU budget. It will take huge effort to make a general policy to fit every I/O cases. However, I can make a hot fix to solve the problem you're experiencing. If that's okay for you, I can make a patch for master.

Patrick Farrell (Inactive) added a comment - 02/Apr/15 6:45 PM

Jinshan -

Since the problem is in 2.6/2.7, Cray should be able to handle back porting a patch done against a newer version. In the end, our client is our responsibility. If it does prove problematic and ORNL would like to ask for your help in getting that fix on Cray's 2.5 (and you feel that's covered by ORNL's arrangements with Intel), I assume we'd be able to provide the code base (and would be happy to have the help if it's needed).

Patrick

Patrick Farrell (Inactive) added a comment - 02/Apr/15 6:45 PM Jinshan - Since the problem is in 2.6/2.7, Cray should be able to handle back porting a patch done against a newer version. In the end, our client is our responsibility. If it does prove problematic and ORNL would like to ask for your help in getting that fix on Cray's 2.5 (and you feel that's covered by ORNL's arrangements with Intel), I assume we'd be able to provide the code base (and would be happy to have the help if it's needed). Patrick

People

Assignee:: Jinshan Xiong (Inactive)

Reporter:: James A Simmons

Votes:: 0 Vote for this issue

Watchers:: 15 Start watching this issue

Dates

Created:: 16/Mar/15 8:45 PM

Updated:: 12/Aug/15 10:07 PM

Resolved:: 26/May/15 11:32 PM