[LU-6370] Read performance degrades with increasing read block size. Created: 16/Mar/15  Updated: 12/Aug/15  Resolved: 26/May/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0, Lustre 2.7.0, Lustre 2.8.0
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Major
Reporter: James A Simmons Assignee: Jinshan Xiong (Inactive)
Resolution: Fixed Votes: 0
Labels: clio
Environment:

Clients running a Cray 2.5 release which contains a backport of the CLIO changes from 2.6. The problem is also observed with vanilla 2.6 and 2.7 clients.


Attachments: Text File LU6370_1GB_BS.lctldk.out.gz     Text File LU6370_2GB_BS.lctldk.out.gz     File LU6370_cpu_log_20150402.out.gz     File LU6370_max_cached_mb_20150402.out.gz     File lu-6370-perf.tgz     File lu-6370_perf_data.tgz    
Issue Links:
Related
Epic/Theme: Performance
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

We're finding substantial read performance degradation with increasing read block sizes. This has been observed on eslogin nodes as well as on internal login nodes. The data provided below was gathered on an external node.

ext7:/lustre # dd if=/dev/urandom of=3gurandomdata bs=4M count=$((256*3))
ext7:/lustre # for i in 4K 1M 4M 16M 32M 64M 128M 256M 512M 1G 2G 3G ; do echo -en "$i\t" ; dd if=3gurandomdata bs=${i} of=/dev/null 2>&1 | egrep copied ; done

4K 3221225472 bytes (3.2 GB) copied, 13.9569 s, 231 MB/s
1M 3221225472 bytes (3.2 GB) copied, 4.94163 s, 652 MB/s
4M 3221225472 bytes (3.2 GB) copied, 6.24378 s, 516 MB/s
16M 3221225472 bytes (3.2 GB) copied, 5.24595 s, 614 MB/s
32M 3221225472 bytes (3.2 GB) copied, 5.48208 s, 588 MB/s
64M 3221225472 bytes (3.2 GB) copied, 5.36964 s, 600 MB/s
128M 3221225472 bytes (3.2 GB) copied, 5.12867 s, 628 MB/s
256M 3221225472 bytes (3.2 GB) copied, 5.1467 s, 626 MB/s
512M 3221225472 bytes (3.2 GB) copied, 5.31232 s, 606 MB/s
1G 3221225472 bytes (3.2 GB) copied, 12.4088 s, 260 MB/s
2G 3221225472 bytes (3.2 GB) copied, 339.646 s, 9.5 MB/s
3G 3221225472 bytes (3.2 GB) copied, 350.071 s, 9.2 MB/s

This shows up on a 1008-OST striped file system, but on smaller systems the impact is not nearly so substantial. On our 56-OST system we get:
3G 3221225472 bytes (3.2 GB) copied, 4.77246 s, 675 MB/s

Another test case, using C code with an fread call instead of dd, produced similar results:

int read_size = 256*1024*1024*2;
fread(buffer, sizeof(float), read_size, fp_in);
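
For reference, a self-contained sketch of that reproducer (the file name, buffer allocation, and error handling are assumptions added here; only the two lines above come from the original report):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* ~512M floats == ~2 GB, matching the single large fread in the report */
    size_t read_size = (size_t)256 * 1024 * 1024 * 2;
    float *buffer = malloc(read_size * sizeof(float));
    if (buffer == NULL)
        return 1;

    /* "3gurandomdata" is the test file created with dd above */
    FILE *fp_in = fopen("3gurandomdata", "r");
    if (fp_in == NULL)
        return 1;

    size_t nread = fread(buffer, sizeof(float), read_size, fp_in);
    printf("read %zu items\n", nread);

    fclose(fp_in);
    free(buffer);
    return 0;
}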

Also, file striping information on the production and TDS filesystems:
ext8:/lustre # lfs getstripe 3gurandomdata
3gurandomdata
lmm_stripe_count: 4
lmm_stripe_size: 1048576
lmm_pattern: 1
lmm_layout_gen: 0
lmm_stripe_offset: 833
obdidx objid objid group
833 5978755 0x5b3a83 0
834 5953949 0x5ad99d 0
835 5958818 0x5aeca2 0
836 5966400 0x5b0a40 0

ext8:/lustretds # lfs getstripe 3gurandomdata
3gurandomdata
lmm_stripe_count: 4
lmm_stripe_size: 1048576
lmm_pattern: 1
lmm_layout_gen: 0
lmm_stripe_offset: 51
obdidx objid objid group
51 1451231 0x1624df 0
52 1452258 0x1628e2 0
53 1450278 0x162126 0
54 1444772 0x160ba4 0

So this appears to only be happening on wide-stripe file systems. Here's the output from 'perf top' while a 'bad' dd is running:

  • 8.74% [kernel] [k] _spin_lock - _spin_lock
  • 22.23% osc_ap_completion
    osc_extent_finish
    brw_interpret
    ptlrpc_check_set
    ptlrpcd_check
    ptlrpcd
    kthread
    child_rip
    + 13.76% cl_env_put
    + 12.37% cl_env_get
    + 7.10% vvp_write_complete
    + 6.51% kfree
    + 4.62% osc_teardown_async_page
    + 3.96% osc_page_delete
    + 3.89% osc_lru_add_batch
    + 2.69% kmem_cache_free
    + 2.23% osc_page_init
    + 1.71% sptlrpc_import_sec_ref
    + 1.64% osc_page_transfer_add
    + 1.57% osc_io_submit
    + 1.43% cfs_percpt_lock


 Comments   
Comment by Matt Ezell [ 16/Mar/15 ]

While the dd process is running, 'top' shows 100% CPU for a core.

Comment by Jian Yu [ 16/Mar/15 ]

Hi Jinshan,

Could you please take a look at this performance degradation issue? Thank you.

Comment by Andreas Dilger [ 17/Mar/15 ]

It seems to me that the performance only goes bad when the read() syscall size is very large (e.g. 1GB or more). This causes extra overhead to allocate and double-buffer the pages in both the kernel and userspace. How much RAM is on this client?

Comment by Matt Ezell [ 17/Mar/15 ]

How much RAM is on this client?

The 2.7 GA client was a dedicated spare node with 64 GB of RAM.
Our external login nodes (running Cray 2.5 + CLIO changes) have 256 GB. We've seen the issue when only root is logged into the machine (so there should be no memory pressure).
Our production compute nodes on Titan (where the problem was originally reported, running Cray 2.5 + CLIO changes) have 32 GB of RAM.

I agree this might be a bit of a pathological case. A workaround would be to tell users to split up read calls (a fleshed-out sketch appears at the end of this comment):

while (bytes_read < bytes_needed)
  bytes_read += read(...)

but it seems that we have existing applications that do not do this. From a programmer's point of view, if they need to load 500 million floats from a file, it's straightforward to do

fread(buffer, sizeof(float), 500000000, fp_in);

This works OK in vanilla 2.5, so it seems like a regression from the CLIO simplification re-work.
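
For illustration, a minimal sketch of the chunked-read workaround described above (the 64 MB chunk size and helper name are arbitrary assumptions, not anything from the original report):

#include <unistd.h>
#include <errno.h>
#include <stddef.h>

/* Read nbytes into buf using bounded-size read() calls.
 * Returns the number of bytes actually read, or -1 on error. */
static ssize_t chunked_read(int fd, char *buf, size_t nbytes)
{
    const size_t chunk = 64UL << 20;   /* assumed 64 MB per syscall */
    size_t done = 0;

    while (done < nbytes) {
        size_t want = nbytes - done;
        if (want > chunk)
            want = chunk;
        ssize_t rc = read(fd, buf + done, want);
        if (rc < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (rc == 0)            /* EOF */
            break;
        done += (size_t)rc;
    }
    return (ssize_t)done;
}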

Comment by Patrick Farrell (Inactive) [ 17/Mar/15 ]

This definitely seems to be file system size dependent. I've tested various Intel client versions and an almost identical Cray client in a VM environment, and while I (sometimes) saw a slight drop-off at higher read sizes, it's on the order of 5-20%, nothing like what ORNL reported.

I've found similar results on our in house hardware. Not surprisingly, we don't have anything in house that approaches the size of the file systems connected to Titan.

I'd be very curious to see actual test results from a 2.x client version that does not have this issue so we can confirm this as a regression from a particular point. I've been told that's the case, but I have yet to see data.

Comment by Jinshan Xiong (Inactive) [ 17/Mar/15 ]

Since the problem can be seen on vanilla 2.6, it's not related to the CLIO simplification work. Maybe some change between 2.5 and 2.6 caused this problem.

From the test results, there is a huge performance degradation when the read size increases from 1G to 2G. Can you run those two tests again and use perf to collect performance data?

Comment by Patrick Farrell (Inactive) [ 17/Mar/15 ]

Jinshan - They're referring to LU-3321. I wouldn't call it CLIO work either.

Comment by John Lewis (Inactive) [ 17/Mar/15 ]

perf results from bs={1G,2G} testing:

ext8:/lustre # (perf record dd if=3gurandomdata of=/dev/null bs=1G ; rm perf.data) >dev/null 2>&1
ext8:/lustre # perf record dd if=3gurandomdata of=/dev/null bs=1G ; mv perf.data perf.1G
3+0 records in
3+0 records out
3221225472 bytes (3.2 GB) copied, 10.8859 s, 296 MB/s
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.423 MB perf.data (~18500 samples) ]
titan-ext8:/lustre/atlas/scratch/lewisj/ven004/read_test_dd # ls -lrt
total 3145745
-rw-r--r-- 1 root root 3221225472 Mar 16 14:21 3gurandomdata
-rw------- 1 root root 445452 Mar 17 13:05 perf.1G
ext8:/lustre # perf record dd if=3gurandomdata of=/dev/null bs=2G ; mv perf.data perf.2G
0+2 records in
0+2 records out
3221225472 bytes (3.2 GB) copied, 416.124 s, 7.7 MB/s
[ perf record: Woken up 61 times to write data ]
[ perf record: Captured and wrote 15.869 MB perf.data (~693322 samples) ]

Comment by Jinshan Xiong (Inactive) [ 17/Mar/15 ]

I can't parse the results on my node:

[root@mercury ~]# perf report -v -i perf.1G 
legacy perf.data format
magic/endian check failed
incompatible file format (rerun with -v to learn more)[root@mercury ~]# 

Can you do it on your side and post the result here?

Obviously there are many more perf samples collected for the 2G block size. Please try to read only 1 block with block sizes of 1G and 2G specifically and see what we can get. If we can reproduce the problem with only 1 block, then please turn on debug, run it again, and post the Lustre log here. Thanks,

Comment by John Lewis (Inactive) [ 18/Mar/15 ]

Jinshan, please see attached reports. count=1 was used for both of these files.

Comment by John Lewis (Inactive) [ 19/Mar/15 ]

We've found some interesting results while running permutations on file parameters. While the problem is very repeatable in a given configuration, it's not at all stable as we change parameters. Specifically, changing the stripe count from 4 to {1, 16, 56} alleviates the huge performance degradation.

The following table lists the time (in seconds, rounded to the nearest second) to copy 3 GB of data from Lustre to /dev/null using dd. Column headers indicate the block size used; row headers indicate the stripe count of the file. Tests were performed on two separate filesystems: a dedicated 56-OST TDS filesystem and a shared (not quiet) 1008-OST filesystem. Timings across multiple runs vary, but are stable to within approximately 10%.

1008 OST fs
                bs=512M bs=1G   bs=2G
Count 1         15      16      23
Count 4         5       12      384
Count 16        4       2       2
Count 56        4       2       2

56 OST fs
                bs=512M bs=1G   bs=2G
Count 1         2       2       3
Count 4         2       2       2
Count 16        2       2       2
Count 56        2       2       4

Results above were with a stripe size of 1M. 4M stripes were also tested, but have not been listed as they tracked the 1M results. This testing was performed on a whitebox IB Lustre client node running the Cray 2.5 client. Tests with other client builds are planned.

Comment by Jinshan Xiong (Inactive) [ 20/Mar/15 ]

Can you read just 1 block and collect Lustre log with block size as 1G and 2G specifically?

Comment by Jian Yu [ 26/Mar/15 ]

Can you read just 1 block and collect Lustre log with block size as 1G and 2G specifically?

Hi John,
With a stripe count of 4 on a 1008-OST filesystem, read performance degraded substantially when the read block size was increased from 1G to 2G. You've reproduced the problem with only 1 block and collected the performance data with the 1G and 2G block sizes separately. Could you please reproduce the issue again and gather Lustre debug logs as per Jinshan's suggestion above, to help him investigate further? Thank you!

Comment by John Lewis (Inactive) [ 26/Mar/15 ]

Jinshan, what tracing options would you like to see, and for how many seconds? With "-1", I expect a bs=2G count=1 run to take approximately 1 hour and to overflow any reasonably sized buffer many times over.

Comment by Jinshan Xiong (Inactive) [ 26/Mar/15 ]

In that case, I just need the log after it has been running for a few minutes.

Comment by John Lewis (Inactive) [ 27/Mar/15 ]

My hosts won't be quiet and available until tomorrow; I'm coordinating with Dustin to hopefully get these uploaded today.

Comment by Jian Yu [ 27/Mar/15 ]

Thank you John and Dustin.

Comment by Dustin Leverman [ 31/Mar/15 ]

I have gathered the requested debug data for LU6370.

I used "-1" for LNET debugging and ran the dd test with 1G and 2G block sizes on a lustre-2.7 client for 180 seconds.

Each output file is roughly 30MB, compressed.

Thanks,
Dustin

Comment by Jinshan Xiong (Inactive) [ 01/Apr/15 ]

Hi Dustin,

Thanks for the debug information. From the 2G output, I can see some strange things were happening.

The reading process is 11235.

At time 1427805640.741591, this process had just finished a read, and then it was preempted and rescheduled to processor 15.

From time 1427805640.741714 to 1427805644.468771, kiblnd_cq_completion(), which runs in softirq context, took over the CPU for over 3 seconds.

00000080:00000001:13.0:1427805640.741591:0:11235:0:(rw.c:1140:ll_readpage()) Process leaving (rc=0 : 0 : 0)
00000800:00000200:15.2:1427805640.741714:0:11235:0:(o2iblnd_cb.c:3325:kiblnd_cq_completion()) conn[ffff88081ed6e000] (130)++
....
00000800:00000200:15.2:1427805643.543220:0:11235:0:(o2iblnd_cb.c:3325:kiblnd_cq_completion()) conn[ffff88081ed6e000] (130)++
00000800:00000200:15.2:1427805643.543271:0:11235:0:(o2iblnd_cb.c:3325:kiblnd_cq_completion()) conn[ffff88081ed6e000] (131)++
00000800:00000200:15.2:1427805643.543458:0:11235:0:(o2iblnd_cb.c:3325:kiblnd_cq_completion()) conn[ffff88081ed6e000] (131)++
00000800:00000200:15.2:1427805643.543499:0:11235:0:(o2iblnd_cb.c:3325:kiblnd_cq_completion()) conn[ffff88081ed6e000] (131)++
00000800:00000200:15.2:1427805643.543588:0:11235:0:(o2iblnd_cb.c:3325:kiblnd_cq_completion()) conn[ffff88081ed6e000] (130)++
00000800:00000200:15.2:1427805644.468771:0:11235:0:(o2iblnd_cb.c:3325:kiblnd_cq_completion()) conn[ffff88081ed6ea00] (131)++
00000020:00000001:15.0:1427805644.468787:0:11235:0:(cl_io.c:930:cl_page_list_del()) Process leaving

This happened quite often, from time 1427805640 to 1427805644, and then from 1427805644 to 1427805647. Roughly speaking, the pattern was to run for 1 second and then stall for 3 seconds. I didn't see this happening for the 1G read, but I don't know why.

Have you ever run this kind of 2G-block-size read job on a pre-2.6 client before?

I'm still reading the log and trying to dig up more. Stay tuned.

Comment by Dustin Leverman [ 01/Apr/15 ]

Jinshan,
We did the same testing on the 2.5.3 client and do not see the issue.

The data I gave you was from a client that was IB attached, but we are seeing this on our Cray systems as well (GNI attached). I wanted to give you a heads up in case that helps you with tracking down the issue.

-Dustin

Comment by Jinshan Xiong (Inactive) [ 01/Apr/15 ]

After taking a further look, kiblnd_cq_completion() had been running on CPU 15 for an even longer time before the reading process was rescheduled onto that CPU. It had been using the idle process's stack, so the kernel scheduler considered that CPU idle; this is why the reading process was scheduled there. If a task can take that long, it should really be running in process context.

It's still unclear why this is not happening with the 1G block size. I'm not sure if this is related to LRU management.

Hi Dustin,

If you have another chance to run the test, can you collect a log of CPU usage and the Lustre LRU stats (the output of `lctl get_param llite.*.max_cached_mb` and `osc.*.osc_cached_mb`)?

Thanks,

Comment by Jinshan Xiong (Inactive) [ 01/Apr/15 ]

By the way, Dustin, you don't need to wait for the test to finish; just clear the log beforehand, let it run for a few minutes, then collect the log. Thanks again.

Comment by Jinshan Xiong (Inactive) [ 01/Apr/15 ]

Can you see the same problem with an even bigger block size, for example 4G?

Comment by Dustin Leverman [ 02/Apr/15 ]

I do not see the same issue with a block size of 4G on the 2.5.3 client (it takes ~6 seconds), but the problem still exists on the 2.7 client (I let it run for about 30 minutes and then killed it).

I am working on gathering the requested CPU and LRU info. I will have it for you soon.

-Dustin

Comment by Patrick Farrell (Inactive) [ 02/Apr/15 ]

Sorry if I've missed this in the shuffle, but have you tested 2.6, Dustin?

The latest Cray clients on your mainframe have the performance improvements from LU-3321 backported into their version of 2.5, so if you see it not in Intel's 2.5 but in Cray's 2.5 and Intel's 2.6 and 2.7, that would suggest a place to look.

Comment by Dustin Leverman [ 02/Apr/15 ]

Jinshan,
I have gathered the CPU data during the run using sar. The output is kinda messy, but it is all there. I also gathered the max_cached_mb data you requested.

Please let me know if you need additional info.

Thanks,
Dustin

Comment by Jinshan Xiong (Inactive) [ 02/Apr/15 ]

Hi Dustin,

Thanks for the debug info. You mentioned that you've backported LU-3321; is it possible for me to get access to the code?

Comment by Matt Ezell [ 02/Apr/15 ]

> You mentioned that you've back ported LU-3321, is it possible for me to get access to the code?

Cray (Patrick?) would need to provide this. We can either track this down with Cray's client that includes LU-3321, or with b2_7/master that also includes this patch (plus many more things).

Comment by Patrick Farrell (Inactive) [ 02/Apr/15 ]

We can certainly do that (the GPL requires we do it), but I can't personally without permission from my boss (who's not available for a few days). As our customer, ORNL has the legal right to have the source, but I think Cray would want some sort of formal request...? If the ORNL staff feel it's important, I can look into getting permission.

At the least, I can't do it right away, so for this bug it's likely going to be easier to proceed by examining it from a 2.7 perspective.

Matt/Dustin - Are you able to confirm this happens in 2.6? (Sorry if you already have and I missed it) If it doesn't happen in 2.6 (but does in Cray's 2.5 and Intel's 2.7), then LU-3321 is unlikely to be the culprit.

Comment by Dustin Leverman [ 02/Apr/15 ]

Patrick: I did not have a lustre-2.6 client lying around and had to build one and test it. The answer to your question, though, is that the problem is reproducible in lustre-2.6.

Thanks,
Dustin

Comment by Jinshan Xiong (Inactive) [ 02/Apr/15 ]

Sooner or later I will need the code base for a fix.

Dustin, can you please help me do one more thing - while the read is going on, run `echo t > /proc/sysrq-trigger; dmesg > dmesg-$(date +%s).log` on the console so that it generates stack traces for all processes currently running on the system. You may do it a couple of times with a 1-second interval to make sure it collects sufficient logs. Once this is done, please send me all the dmesg-xyz.log files.

Thanks,

Comment by Patrick Farrell (Inactive) [ 02/Apr/15 ]

Jinshan -

Since the problem is in 2.6/2.7, Cray should be able to handle back porting a patch done against a newer version. In the end, our client is our responsibility. If it does prove problematic and ORNL would like to ask for your help in getting that fix on Cray's 2.5 (and you feel that's covered by ORNL's arrangements with Intel), I assume we'd be able to provide the code base (and would be happy to have the help if it's needed).

  • Patrick
Comment by Jinshan Xiong (Inactive) [ 03/Apr/15 ]

I've found the root cause of this issue, and at last I can reproduce it by adding some tricks to the code. The problem boils down to an inconsistency between LRU management and the readahead algorithm. The end result is that readahead brings a lot of pages into memory, but the LRU drops them due to the tightness of the per-OSC LRU budget.

It would take a huge effort to make a general policy that fits every I/O case. However, I can make a hot fix to solve the problem you're experiencing. If that's okay with you, I can make a patch for master.

Comment by Matt Ezell [ 03/Apr/15 ]

Jinshan - good news that you understand the root cause of the issue.

Is readahead not bounded by:

# lctl get_param llite.*.max_read_ahead_mb
llite.atlas1-ffff8817e6a85000.max_read_ahead_mb=40
llite.atlas2-ffff880fef58fc00.max_read_ahead_mb=40
llite.linkfarm-ffff8827ef28c000.max_read_ahead_mb=40

Is there a different way that we can tune readahead?

Anyway, we would appreciate a "hot fix" against master. We can probably trivially backport it to b2_7 for our non-Cray clients. James Simmons and Cray can figure out how easy it will be to backport to Cray's 2.5 branch.

Thanks!

Comment by Gerrit Updater [ 03/Apr/15 ]

Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: http://review.whamcloud.com/14347
Subject: LU-6370 osc: disable to control per-OSC LRU budget
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 19ac637dc7256ba64544df5ab3d2c176c364a27e

Comment by Patrick Farrell (Inactive) [ 03/Apr/15 ]

Matt -

The code in question is effectively identical in Cray's 2.5, so porting the patch is trivial.

Jinshan -

I've been thinking about this and had a suggestion.

First, I'll explain my understanding in case I've missed something.
The current code first checks to see if we're running out of LRU slots. If so, it decides to free pages (more or less of them depending on whether or not the OSC is over budget).

Also, separately, it checks to see if the OSC in question is at more than 2*budget, and if so, it decides to free pages.

The problem is that these large reads are overflowing that 2*budget limit for a particular OSC, so your patch comments out that limit, which would allow a particular OSC to consume any amount of cache, as long as LRU slots are available.

The reason this is a particular issue for ORNL is that the per-OSC cache budget is calculated by taking the total budget and dividing it by the number of OSCs. Since ORNL has a very large number of OSTs, the budget for each OSC can be quite small (rough numbers are sketched at the end of this comment).

In general, freeing down to max_cache/[number_of_OSCs] when low on LRU pages seems correct, but we'd like to let a particular OSC use a larger portion of cache if it's available, but... Probably not ALL of it.

And without your patch, the limit on that larger OSC is 2*budget. How about instead making it a % of total cache? That would cover the case when budget is small due to the number of OSCs without letting a single OSC totally dominate the cache (which is the downside to your quick fix). It could perhaps be made a tunable - "max_single_osc_cache_percent" or similar?
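
To put rough numbers on the per-OSC budget calculation described above (purely illustrative values; the actual max_cached_mb setting on these clients is not shown in this ticket):

#include <stdio.h>

/* Illustrative only: assume a client-wide LRU budget (max_cached_mb) of
 * 128 GB on a 256 GB login node, spread across the 1008 OSCs of the
 * wide-striped filesystem. */
int main(void)
{
    long max_cached_mb = 128L * 1024;               /* assumed total budget, MB  */
    long osc_count     = 1008;                      /* one OSC per OST           */
    long per_osc_mb    = max_cached_mb / osc_count; /* ~130 MB per-OSC budget    */
    long hard_cap_mb   = 2 * per_osc_mb;            /* ~260 MB: the 2*budget cap */
    long need_per_osc  = 2048 / 4;                  /* 2 GB read over 4 stripes  */

    /* ~512 MB of cached pages per OSC are needed, well over the ~260 MB cap,
     * so pages brought in by readahead are evicted before the copy to
     * userspace can use them. */
    printf("per-OSC budget %ld MB, cap %ld MB, read needs ~%ld MB per OSC\n",
           per_osc_mb, hard_cap_mb, need_per_osc);
    return 0;
}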

Comment by Jinshan Xiong (Inactive) [ 03/Apr/15 ]

The ultimate solution would be a self-adaptive policy where an OSC can use as many LRU slots as it wants if there is no competition from other OSCs. However, once other OSCs start consuming LRU slots, the over-budget OSC should release slots at a faster rate to maintain fairness (a rough sketch follows below).

From my point of view, there is no real difference in using a percentage of the slots as the maximum an OSC can use.
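
A rough sketch of the kind of adaptive policy being described (hypothetical code, not the actual osc LRU implementation; the function name and shrink heuristic are invented for illustration):

/* Hypothetical: an OSC keeps whatever pages it has while no other OSC wants
 * slots, but sheds pages faster the further it is over budget once there is
 * contention from other OSCs. */
static long lru_shrink_target(long in_use, long budget, int waiters)
{
    if (in_use <= budget || waiters == 0)
        return 0;                       /* no competition: keep the pages */
    long over = in_use - budget;
    /* release at least half of the excess, more as contention grows */
    return over / 2 + (over / 8) * waiters;
}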

Comment by Patrick Farrell (Inactive) [ 03/Apr/15 ]

Jinshan - Ah, OK. Makes sense. I guess doing a % would only help a little in certain edge cases.

Matt - Cray can make sure the patch passes sanity testing before sending it your way, but it's tricky for us to verify it solves the problem with a large OST/OSC count.

It would be very helpful if ORNL could verify it clears up the performance problem on your non-Cray clients before we proceed with the patch, and it would be more convincing than attempting to replicate it here, since we don't have physical hardware on the scale of Titan available for testing. We can use various tricks to get the OST count up on limited hardware, but it's not nearly the same.

Comment by Dustin Leverman [ 08/Apr/15 ]

I was able to test this on our TDS this afternoon and it appears to have fixed the problem. Our next step is to test this on a Cray client. We will send out the results.

Below is the output from my run:

[root@atlas-tds-mds2 leverman]# dd if=10GB.out of=10GB.out.test bs=2G count=1
0+1 records in
0+1 records out
2147479552 bytes (2.1 GB) copied, 6.37421 s, 337 MB/s
[root@atlas-tds-mds2 leverman]# rpm -qa| grep lustre
lustre-client-2.7.0-2.6.32_431.17.1.el6.wc.x86_64.x86_64
lustre-client-modules-2.7.0-2.6.32_431.17.1.el6.wc.x86_64.x86_64

Comment by Gerrit Updater [ 09/Apr/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/14347/
Subject: LU-6370 osc: disable to control per-OSC LRU budget
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 642dd7e50b4ff39e91c5fd0a771a26c59b5b6637

Comment by Peter Jones [ 26/May/15 ]

Landed for 2.8
