[LU-6370] Read performance degrades with increasing read block size. Created: 16/Mar/15 Updated: 12/Aug/15 Resolved: 26/May/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.6.0, Lustre 2.7.0, Lustre 2.8.0 |
| Fix Version/s: | Lustre 2.8.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | James A Simmons | Assignee: | Jinshan Xiong (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | clio | ||
| Environment: |
Clients running Cray's 2.5, which contains a backport of the CLIO changes from 2.6. The problem has also been observed with vanilla 2.6 and 2.7 clients. |
||
| Attachments: |
|
||||
| Issue Links: |
|
||||
| Epic/Theme: | Performance | ||||
| Severity: | 3 | ||||
| Rank (Obsolete): | 9223372036854775807 | ||||
| Description |
|
We're finding substantial read performance degradation with increasing read block sizes. This has been observed on eslogin nodes as well as on internal login nodes. The data below was gathered on an external node.

ext7:/lustre # dd if=/dev/urandom of=3gurandomdata bs=4M count=$((256*3))
4K   3221225472 bytes (3.2 GB) copied, 13.9569 s, 231 MB/s

This shows up on the 1008-OST striped file system, but on smaller systems the impact is not nearly so substantial. On our 56 OST system we get:

Another test case used C code rather than dd and gave similar results, based on an fread() call with

int read_size = 256*1024*1024*2;

File striping information on the production and TDS filesystems:

ext8:/lustretds # lfs getstripe 3gurandomdata

So this appears to only be happening on wide-stripe file systems. Here's the output from 'perf top' while a 'bad' dd is running:
|
| Comments |
| Comment by Matt Ezell [ 16/Mar/15 ] |
|
While the dd process is running, 'top' shows 100% CPU for a core. |
| Comment by Jian Yu [ 16/Mar/15 ] |
|
Hi Jinshan, Could you please take a look at this performance degradation issue? Thank you. |
| Comment by Andreas Dilger [ 17/Mar/15 ] |
|
It seems to me that the performance only degrades when the read() syscall size is very large (e.g. 1GB or more). This causes extra overhead to allocate and double-buffer the pages in both the kernel and userspace. How much RAM is on this client? |
| Comment by Matt Ezell [ 17/Mar/15 ] |
The 2.7 GA client was a dedicated spare node with 64GB of RAM. I agree this might be a bit of a pathological case. A workaround would be to tell users to split up read calls:

while (bytes_read < bytes_needed)
    bytes_read += read(...);

but it seems that we have existing applications that do not do this. From a programmer's point of view, if they need to load 500 million floats from a file, it's straightforward to do

fread(buffer, sizeof(float), 500000000, fp_in);

This works OK in vanilla 2.5, so it seems like a regression from the CLIO simplification re-work. |
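A minimal C sketch of the chunked-read workaround described in the comment above; the chunk size, function name, and signature are arbitrary examples for illustration, not code from this ticket.

#include <stddef.h>
#include <unistd.h>

/* Read up to 'bytes_needed' bytes in fixed-size chunks instead of one huge
 * read() syscall; returns the byte count actually read, which is short only
 * on EOF or error. CHUNK_SIZE is an arbitrary example value. */
static ssize_t read_in_chunks(int fd, char *buffer, size_t bytes_needed)
{
        const size_t CHUNK_SIZE = 64UL << 20;    /* 64 MiB per syscall */
        size_t bytes_read = 0;

        while (bytes_read < bytes_needed) {
                size_t want = bytes_needed - bytes_read;
                ssize_t got;

                if (want > CHUNK_SIZE)
                        want = CHUNK_SIZE;
                got = read(fd, buffer + bytes_read, want);
                if (got <= 0)                    /* EOF or error */
                        break;
                bytes_read += (size_t)got;
        }
        return (ssize_t)bytes_read;
}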
| Comment by Patrick Farrell (Inactive) [ 17/Mar/15 ] |
|
This definitely seems to be file system size dependent. I've tested various Intel client versions and an almost identical Cray client in a VM environment, and while I (sometimes) saw a slight drop-off at higher read sizes, it was on the order of 5-20%, nothing like what ORNL reported. I've found similar results on our in-house hardware. Not surprisingly, we don't have anything in house that approaches the size of the file systems connected to Titan. I'd be very curious to see actual test results from a 2.x client version that does not have this issue so we can confirm this as a regression from a particular point. I've been told that's the case, but I have yet to see the data. |
| Comment by Jinshan Xiong (Inactive) [ 17/Mar/15 ] |
|
Since the problem can be seen on vanilla 2.6, it's not related to the CLIO simplification work. Maybe some change between 2.5 and 2.6 caused this problem. From the test results, there is a huge performance degradation when the read size increases from 1G to 2G. Will you run those two tests again and use perf to collect performance data? |
| Comment by Patrick Farrell (Inactive) [ 17/Mar/15 ] |
|
Jinshan - They're referring to |
| Comment by John Lewis (Inactive) [ 17/Mar/15 ] |
|
perf results from bs={1G,2G} testing:

ext8:/lustre # (perf record dd if=3gurandomdata of=/dev/null bs=1G ; rm perf.data) >/dev/null 2>&1 |
| Comment by Jinshan Xiong (Inactive) [ 17/Mar/15 ] |
|
I can't parse the results on my node:

[root@mercury ~]# perf report -v -i perf.1G
legacy perf.data format
magic/endian check failed
incompatible file format (rerun with -v to learn more)
[root@mercury ~]#

Can you do it on your side and post the result here? Obviously there are many more perf samples collected for the 2G block size. Please try to read only 1 block with block size 1G and 2G specifically and see what we can get. If we can reproduce the problem with only 1 block, then please turn on debug, run it again, and post the Lustre log here. Thanks, |
| Comment by John Lewis (Inactive) [ 18/Mar/15 ] |
|
Jinshan, please see attached reports. count=1 was used for both of these files. |
| Comment by John Lewis (Inactive) [ 19/Mar/15 ] |
|
We've found some interesting results while running permutations on file parameters. While the problem is very repeatable in a given configuration, it's not at all stable as we change parameters. Specifically, changing the stripe count from 4 to {1,16,56} alleviates the huge performance degradation. The following tables list the time (in seconds, rounded to the nearest second) to copy 3 GB of data from Lustre to /dev/null using dd. Column headers indicate the block size used; row headers indicate the stripe count of the file. Tests were performed on two separate filesystems: a dedicated 56 OST TDS filesystem and a shared (not quiet) 1008 OST filesystem. Timings across multiple runs vary, but are stable to within approximately 10%.

1008 OST fs
          bs=512M  bs=1G  bs=2G
Count 1        15     16     23
Count 4         5     12    384
Count 16        4      2      2
Count 56        4      2      2

56 OST fs
          bs=512M  bs=1G  bs=2G
Count 1         2      2      3
Count 4         2      2      2
Count 16        2      2      2
Count 56        2      2      4

Results above were with stripe widths of 1M. 4M widths were also tested, but are not listed as they tracked the 1M results. This testing was performed on a whitebox IB Lustre client running the Cray 2.5 client. Tests with other client builds are planned. |
| Comment by Jinshan Xiong (Inactive) [ 20/Mar/15 ] |
|
Can you read just 1 block and collect the Lustre log, with block sizes of 1G and 2G specifically? |
| Comment by Jian Yu [ 26/Mar/15 ] |
Hi John, |
| Comment by John Lewis (Inactive) [ 26/Mar/15 ] |
|
Jinshan, what tracing options would you like to see, and for how many seconds? With "-1", I expect a bs=2G count=1 run to take approximately 1 hour and to overflow any rational buffer many times over. |
| Comment by Jinshan Xiong (Inactive) [ 26/Mar/15 ] |
|
In that case, I just need the log after it has been running for a few minutes. |
| Comment by John Lewis (Inactive) [ 27/Mar/15 ] |
|
My hosts won't be quiet and available until tomorrow; I'm coordinating with Dustin to hopefully get these uploaded today. |
| Comment by Jian Yu [ 27/Mar/15 ] |
|
Thank you John and Dustin. |
| Comment by Dustin Leverman [ 31/Mar/15 ] |
|
I have gathered the requested debug data for LU-6370. I used "-1" for LNET debugging and ran the dd test with 1G and 2G block sizes on a lustre-2.7 client for 180 seconds. Each output file is roughly 30MB compressed. Thanks, |
| Comment by Jinshan Xiong (Inactive) [ 01/Apr/15 ] |
|
Hi Dustin, thanks for the debug information. From the 2G output, I can see some strange things happening. The reading process is 11235. At time 1427805640.741591, this process had just finished a read, and then it was preempted and rescheduled to processor 15. From time 1427805640.741714 to 1427805644.468771, kiblnd_cq_completion(), which runs inside softirq context, took over the CPU for over 3 seconds.

00000080:00000001:13.0:1427805640.741591:0:11235:0:(rw.c:1140:ll_readpage()) Process leaving (rc=0 : 0 : 0)
00000800:00000200:15.2:1427805640.741714:0:11235:0:(o2iblnd_cb.c:3325:kiblnd_cq_completion()) conn[ffff88081ed6e000] (130)++
....
00000800:00000200:15.2:1427805643.543220:0:11235:0:(o2iblnd_cb.c:3325:kiblnd_cq_completion()) conn[ffff88081ed6e000] (130)++
00000800:00000200:15.2:1427805643.543271:0:11235:0:(o2iblnd_cb.c:3325:kiblnd_cq_completion()) conn[ffff88081ed6e000] (131)++
00000800:00000200:15.2:1427805643.543458:0:11235:0:(o2iblnd_cb.c:3325:kiblnd_cq_completion()) conn[ffff88081ed6e000] (131)++
00000800:00000200:15.2:1427805643.543499:0:11235:0:(o2iblnd_cb.c:3325:kiblnd_cq_completion()) conn[ffff88081ed6e000] (131)++
00000800:00000200:15.2:1427805643.543588:0:11235:0:(o2iblnd_cb.c:3325:kiblnd_cq_completion()) conn[ffff88081ed6e000] (130)++
00000800:00000200:15.2:1427805644.468771:0:11235:0:(o2iblnd_cb.c:3325:kiblnd_cq_completion()) conn[ffff88081ed6ea00] (131)++
00000020:00000001:15.0:1427805644.468787:0:11235:0:(cl_io.c:930:cl_page_list_del()) Process leaving

This happened quite often: from time 1427805640 to 1427805644, and then from 1427805644 to 1427805647. Roughly speaking, the pattern was to run for 1 second and then stall for 3 seconds. I didn't see this happening for the 1G read, but I don't know why. Have you ever run this kind of 2G block size read job on a pre-2.6 client before? I'm still reading the log and trying to dig up more. Stay tuned. |
| Comment by Dustin Leverman [ 01/Apr/15 ] |
|
Jinshan, The data I gave you was from a client that was IB attached, but we are seeing this on our Cray systems as well (GNI attached). I wanted to give you a heads up in case that helps you with tracking down the issue. -Dustin |
| Comment by Jinshan Xiong (Inactive) [ 01/Apr/15 ] |
|
After taking a further look, kiblnd_cq_completion() had been running on CPU 15 for an even longer time before the reading process was rescheduled to that CPU. It had been using the idle process's stack, so the kernel scheduler considered that CPU idle; this is why the reading process was scheduled there. If a task can take this long, it should really be running in process context. It's still unclear why this is not happening with the 1G block size. I'm not sure if this is related to LRU management. Hi Dustin, if you have another chance to run the test, can you collect some CPU usage info and the Lustre LRU stats (the output of `lctl get_param llite.*.max_cached_mb` and `osc.*.osc_cached_mb`)? Thanks, |
| Comment by Jinshan Xiong (Inactive) [ 01/Apr/15 ] |
|
btw, Dustin, you don't need to wait for the test to finish; just clear the log beforehand, let it run for a few minutes, then collect the log. Thanks again. |
| Comment by Jinshan Xiong (Inactive) [ 01/Apr/15 ] |
|
Can you see the same problem with an even bigger block size, for example 4G? |
| Comment by Dustin Leverman [ 02/Apr/15 ] |
|
I do not see the same issue with a block size of 4G on the 2.5.3 client (it takes ~6 seconds), but the problem still exists on the 2.7 client (I let it run for about 30 minutes and then killed it). I am working on gathering the requested CPU and LRU info. I will have it for you soon. -Dustin |
| Comment by Patrick Farrell (Inactive) [ 02/Apr/15 ] |
|
Sorry if I've missed this in the shuffle, but have you tested 2.6, Dustin? The latest Cray clients on your mainframe have the performance improvements from |
| Comment by Dustin Leverman [ 02/Apr/15 ] |
|
Jinshan, Please let me know if you need additional info. Thanks, |
| Comment by Jinshan Xiong (Inactive) [ 02/Apr/15 ] |
|
Hi Dustin, Thanks for the debug info. You mentioned that you've back ported |
| Comment by Matt Ezell [ 02/Apr/15 ] |
|
> You mentioned that you've back ported

Cray (Patrick?) would need to provide this. We can either track this down with Cray's client that includes |
| Comment by Patrick Farrell (Inactive) [ 02/Apr/15 ] |
|
We can certainly do that (the GPL requires we do it). So, at least, I can't do it right away. So for this bug, it's likely going to be easier to proceed with examining it from a 2.7 perspective. Matt/Dustin - are you able to confirm this happens in 2.6? (Sorry if you already have and I missed it.) If it doesn't happen in 2.6 (but does in Cray's 2.5 and Intel's 2.7), then |
| Comment by Dustin Leverman [ 02/Apr/15 ] |
|
Patrick: I did not have a lustre-2.6 client lying around and had to build and test one. The answer to your question, though, is that the problem is reproducible in lustre-2.6. Thanks, |
| Comment by Jinshan Xiong (Inactive) [ 02/Apr/15 ] |
|
Sooner or later I will need the code base for a fix. Dustin, can you please help me with one more thing: while the read is going on, run `echo t > /proc/sysrq-trigger; dmesg > dmesg-$(date +%s).log` on the console so that it generates stack traces for all processes currently running on the system. You may do it a couple of times with a 1-second interval to make sure it collects sufficient log. Once this is done, please send me all the dmesg-xyz.log files. Thanks, |
| Comment by Patrick Farrell (Inactive) [ 02/Apr/15 ] |
|
Jinshan - Since the problem is in 2.6/2.7, Cray should be able to handle back porting a patch done against a newer version. In the end, our client is our responsibility. If it does prove problematic and ORNL would like to ask for your help in getting that fix on Cray's 2.5 (and you feel that's covered by ORNL's arrangements with Intel), I assume we'd be able to provide the code base (and would be happy to have the help if it's needed).
|
| Comment by Jinshan Xiong (Inactive) [ 03/Apr/15 ] |
|
I've found the root cause of this issue, and I can at last reproduce it by adding some tricks to the code. The problem boils down to an inconsistency between LRU management and the read-ahead algorithm. The end result is that read-ahead brings a lot of pages into memory, but the LRU drops them due to the tightness of the per-OSC LRU budget. It would take a huge effort to make a general policy that fits every I/O case. However, I can make a hot fix to solve the problem you're experiencing. If that's okay with you, I'll make a patch for master. |
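The toy C program below is a hypothetical illustration of the budget mismatch described in the comment above, not Lustre source. It assumes a client-wide cache of roughly 32 GB of 4 KB pages and a single 2 GB read of a stripe-count-4 file, and shows how dividing that cache evenly across 56 vs. 1008 OSCs changes the per-OSC LRU budget relative to what read-ahead needs; all numbers are made up for illustration.

#include <stdio.h>

int main(void)
{
        /* Hypothetical numbers, for illustration only. */
        const long cache_pages  = 8L * 1024 * 1024;  /* ~32 GB of 4 KB pages client-wide */
        const long read_pages   = 524288;            /* one 2 GB read() in 4 KB pages    */
        const int  stripe_count = 4;                 /* the read lands on only 4 OSCs    */
        const int  osc_counts[] = { 56, 1008 };

        for (int i = 0; i < 2; i++) {
                long budget = cache_pages / osc_counts[i];  /* per-OSC share of the LRU   */
                long need   = read_pages / stripe_count;    /* pages read-ahead wants per OSC */
                printf("%4d OSCs: per-OSC budget %7ld pages, read-ahead needs %6ld pages (%.1fx budget)\n",
                       osc_counts[i], budget, need, (double)need / (double)budget);
        }
        return 0;
}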
| Comment by Matt Ezell [ 03/Apr/15 ] |
|
Jinshan - good news that you understand the root cause of the issue. Is readahead not bounded by:

# lctl get_param llite.*.max_read_ahead_mb
llite.atlas1-ffff8817e6a85000.max_read_ahead_mb=40
llite.atlas2-ffff880fef58fc00.max_read_ahead_mb=40
llite.linkfarm-ffff8827ef28c000.max_read_ahead_mb=40

Is there a different way that we can tune readahead? Anyway, we would appreciate a "hot fix" against master. We can probably trivially backport it to b2_7 for our non-Cray clients. James Simmons and Cray can figure out how easy it will be to backport to Cray's 2.5 branch. Thanks! |
| Comment by Gerrit Updater [ 03/Apr/15 ] |
|
Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: http://review.whamcloud.com/14347 |
| Comment by Patrick Farrell (Inactive) [ 03/Apr/15 ] |
|
Matt - The code in question is effectively identical in Cray's 2.5, so porting the patch is trivial.

Jinshan - I've been thinking about this and have a suggestion. First, I'll explain my understanding in case I've missed something. Also, separately, it checks to see whether the OSC in question is at more than 2*budget, and if so, decides to free pages. The problem is that these large reads overflow that 2*budget limit for a particular OSC, so your patch comments out that limit, which would allow a particular OSC to consume any amount of cache as long as LRU slots are available. The reason this is a particular issue for ORNL is that the per-OSC cache budget is calculated by taking the total budget and dividing it by the number of OSCs. Since ORNL has a very large number of OSTs, the budget for each OSC can be quite small. In general, freeing down to max_cache/[number_of_OSCs] when low on LRU pages seems correct, but we'd like to let a particular OSC use a larger portion of the cache if it's available - though probably not ALL of it. And without your patch, the limit on that larger OSC is 2*budget. How about instead making it a % of the total cache? That would cover the case where the budget is small due to the number of OSCs, without letting a single OSC totally dominate the cache (which is the downside of your quick fix). It could perhaps be made a tunable - "max_single_osc_cache_percent" or similar? |
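A small hypothetical C sketch (not the actual patch) contrasting the two caps discussed above: the existing 2*budget limit, which shrinks as the OSC count grows, versus a tunable percentage of the total cache such as the suggested "max_single_osc_cache_percent", which does not. All numbers are made up for illustration.

#include <stdio.h>

int main(void)
{
        /* Hypothetical numbers, for illustration only. */
        const long cache_pages = 8L * 1024 * 1024;    /* client-wide LRU slots */
        const int  num_oscs    = 1008;
        const int  max_single_osc_cache_percent = 25; /* hypothetical tunable  */

        long budget      = cache_pages / num_oscs;
        long cap_2budget = 2 * budget;                /* current cap: shrinks as OSC count grows */
        long cap_percent = cache_pages / 100 * max_single_osc_cache_percent;
                                                      /* proposed cap: independent of OSC count  */

        printf("per-OSC budget:           %ld pages\n", budget);
        printf("2*budget cap:             %ld pages\n", cap_2budget);
        printf("%d%% of total cache cap:  %ld pages\n",
               max_single_osc_cache_percent, cap_percent);
        return 0;
}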
| Comment by Jinshan Xiong (Inactive) [ 03/Apr/15 ] |
|
The ultimate solution would be a self-adaptive policy: an OSC can use as many LRU slots as it wants if there is no competition from other OSCs; however, once other OSCs start consuming LRU slots, the over-budget OSC should release slots faster to maintain fairness. From my point of view, there is no real difference in using a maximum percentage of slots an OSC can use. |
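A hypothetical sketch of the self-adaptive policy outlined above, not an implementation from this ticket or from the Lustre tree. The assumed rule: an OSC releases nothing while the LRU is uncontended or it is within its fair share, and releases pages faster the tighter the contention; the function name, parameters, and release fractions are invented for illustration.

/* How many pages should this OSC give back right now? (toy policy sketch) */
static long lru_pages_to_release(long osc_in_use, long budget,
                                 long free_slots, int other_oscs_active)
{
        if (!other_oscs_active)
                return 0;                          /* no competition: grow freely   */
        if (osc_in_use <= budget)
                return 0;                          /* within fair share: keep pages */
        if (free_slots > 0)
                return (osc_in_use - budget) / 8;  /* mild pressure: trickle back   */
        return (osc_in_use - budget) / 2;          /* LRU exhausted: shrink quickly */
}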
| Comment by Patrick Farrell (Inactive) [ 03/Apr/15 ] |
|
Jinshan - Ah, OK. Makes sense. I guess doing a % would only help a little in certain edge cases.

Matt - Cray can make sure the patch passes sanity testing before sending it your way, but it's tricky for us to verify that it solves the problem with a large OST/OSC count. It would be very helpful if ORNL could verify it clears up the performance problem on your non-Cray clients before we proceed with the patch - and it would be more convincing than attempting to replicate it here, since we don't have physical hardware on the scale of Titan available for testing. We can use various tricks to get the OST count up on limited hardware, but it's not nearly the same. |
| Comment by Dustin Leverman [ 08/Apr/15 ] |
|
I was able to test this on our TDS this afternoon and it appears to have fixed the problem. Our next step is to test it on a Cray client. We will send out the results. Below is the output from my run:

[root@atlas-tds-mds2 leverman]# dd if=10GB.out of=10GB.out.test bs=2G count=1 |
| Comment by Gerrit Updater [ 09/Apr/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/14347/ |
| Comment by Peter Jones [ 26/May/15 ] |
|
Landed for 2.8 |