[LU-5212] poor stat performance after upgrade from zfs-0.6.2-1/lustre-2.4.0-1 to zfs-0.6.3-1 Created: 17/Jun/14 Updated: 05/Apr/18 Resolved: 05/Apr/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.2 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Scott Nolin | Assignee: | WC Triage |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | llnl, prz, zfs |
| Environment: | centos 6.5 |
| Attachments: | |
| Issue Links: | |
| Epic/Theme: | Performance, zfs |
| Severity: | 3 |
| Rank (Obsolete): | 14543 |
| Description |
|
After upgrading a system from zfs-0.6.2-1/lustre-2.4.0-1 to zfs-0.6.3-1, mdtest shows significantly lower stat performance - about 8,000 IOPS vs 14,400. File reads and file removals are a bit worse, but not as severely. See the attached graph.

We do see other marked improvements with the upgrade, for example with system processes waiting on the MDS. I wonder if this is some kind of expected performance tradeoff for the new version? I'm guessing the absolute numbers for stat are still acceptable for our workload, but it is quite a large relative difference.

Scott |
| Comments |
| Comment by Gabriele Paciucci (Inactive) [ 17/Jun/14 ] |
|
Hi Scott, Have you seen this patch https://jira.hpdd.intel.com/browse/LU-4944 ? |
| Comment by Gabriele Paciucci (Inactive) [ 17/Jun/14 ] |
|
Could you please collect /proc/spl/kstat/zfs/arcstats during the benchmark and upload a graph of:

thanks |
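For reference, a minimal way to capture these counters over the course of a run could look like the loop below; the 5-second interval and the output file are arbitrary choices for illustration, not anything specified in this ticket.

# Append a timestamp and a full arcstats snapshot every 5 seconds until
# interrupted; interval and output path are placeholders.
while true; do
    date +%s >> /tmp/arcstats.log
    cat /proc/spl/kstat/zfs/arcstats >> /tmp/arcstats.log
    sleep 5
done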
| Comment by Scott Nolin [ 17/Jun/14 ] |
|
This is Lustre 2.4.2 and ZFS 0.6.3 from the ZFS on Linux EPEL repository.

I'm not certain I am following this correctly, but the comments on LU-4944 ...

I'm not sure when I can run the benchmark again to collect stats while it is running. Here are the limit values at least:

arc_meta_limit 10000000000 |
| Comment by Scott Nolin [ 17/Jun/14 ] |
|
When I do get to run this, it will be easier if I can just use arcstat.py to collect the data. If I do that, which of these fields do you want to see on a graph? The field definitions are as follows: |
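As a rough sketch (not from this ticket - field names and options vary between ZFS releases), arcstat.py is typically given a comma-separated field list and a sampling interval in seconds:

# Print selected ARC fields once per second; adjust the field list as needed.
arcstat.py -f time,arcsz,c,miss%,mm% 1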
| Comment by Gabriele Paciucci (Inactive) [ 17/Jun/14 ] |
|
Hi Scott, |
| Comment by Gabriele Paciucci (Inactive) [ 17/Jun/14 ] |
|
Have you compiled lustre 2.4.2 against ZFS 0.6.3? |
| Comment by Scott Nolin [ 17/Jun/14 ] |
|
I didn't compile it; Brian Behlendorf or whoever built it for the zfsonlinux EPEL repository did. But yes, this is Lustre 2.4.2 and ZFS 0.6.3.

I'll make graphs shortly.

Scott |
| Comment by Scott Nolin [ 17/Jun/14 ] |
|
Attached are the requested stats. I broke out the non-constant stats and adjusted a bit to make it a little more interesting. |
| Comment by Gabriele Paciucci (Inactive) [ 17/Jun/14 ] |
|
I can't make anything out of these... could you please use MB for the y-axis and make a graph only for arc_meta_limit, arc_meta_size, and size?

thanks |
| Comment by Scott Nolin [ 18/Jun/14 ] |
|
The y-axis is simply the raw data from /proc/spl/kstat/zfs/arcstats - I just put it into a graph quickly, so it must be bytes. Note that "arc_meta_size" doesn't exist in arcstats, but there is "meta_size" - I assume that is the same thing. I will make those graphs in MB for you tomorrow, but here is a quick description with some approximate math which should explain it:

1) arc_meta_limit is a constant, so why graph it? Our value was 10000000000 bytes (i.e. 1E10) = 9536 MB.

2) arc_meta_size - does not exist; assuming we want meta_size, this is in 'arcstat-2.png'.

3) size - also in 'arcstat-2.png'.

I ran 2 iterations of the mdtest, and I think you can see the test finish and restart in that graph. I captured all the stats, so whatever graphs in whatever format might help, I'll make them tomorrow.

Thanks, |
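For re-plotting, a quick conversion of the raw byte counters to MB can be done along these lines (a sketch assuming the usual three-column arcstats layout, with the value in column 3):

# Print size, meta_size, and arc_meta_limit in MiB from the live arcstats file.
awk '$1 ~ /^(size|meta_size|arc_meta_limit)$/ { printf "%-16s %10.1f MB\n", $1, $3/1048576 }' /proc/spl/kstat/zfs/arcstats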
| Comment by Isaac Huang (Inactive) [ 18/Jun/14 ] |
|
Scott, In the 1st graph, p (i.e. ARC adaptation parameter) almost never changed, which was weird. Can you verify from your data that p never really changed or its changes were too small to be seen on the graph? Also, can you make sure you have this patch http://review.whamcloud.com/#/c/10237/ on the server? It was just landed and in my opinion it makes a lot of sense to apply that patch before trying to tune the ARC. |
| Comment by Scott Nolin [ 18/Jun/14 ] |
|
Isaac,

Regarding 'p' in the first graph - you just can't see it due to the scale. See the 'arcstat-3.png' graph, which shows 'p' on its own, changing.

Regarding the patch (I see it's now also here - ...

Scott |
| Comment by Scott Nolin [ 18/Jun/14 ] |
|
arcstat-MB.png shows meta-size, size, and arc-meta-limit in MB. |
| Comment by Scott Nolin [ 18/Jun/14 ] |
|
We have upgraded a second filesystem with similar resources on the MDS/MDT, and see pretty much the same performance difference for stat in mdtest. Scott |
| Comment by Gabriele Paciucci (Inactive) [ 18/Jun/14 ] |
|
Do you have 32GB of RAM? If yes, could you set these parameters:
reboot your system and collect data? |
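The parameter list Gabriele had in mind is not preserved above. As a generic illustration only, ZFS module parameters of this kind are normally set in /etc/modprobe.d/zfs.conf and take effect at the next module load or reboot; the values below are placeholders, not a recommendation:

# /etc/modprobe.d/zfs.conf -- placeholder example values only
options zfs zfs_arc_max=17179869184         # 16 GiB
options zfs zfs_arc_meta_limit=12884901888  # 12 GiB

# After rebooting (or reloading the zfs module), confirm the running values:
cat /sys/module/zfs/parameters/zfs_arc_max
cat /sys/module/zfs/parameters/zfs_arc_meta_limit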
| Comment by Scott Nolin [ 18/Jun/14 ] |
|
We have 256GB of RAM.

modinfo zfs output:

filename: /lib/modules/2.6.32-358.6.2.el6.x86_64/extra/zfs.ko |
| Comment by Andrew Wagner [ 18/Jun/14 ] |
|
Gabriele, I'm working on this filesystem with Scott.

Speaking to the ARC cache settings, we are currently using:

options zfs zfs_arc_meta_limit=10000000000

However, we're nowhere near filling that up, so we're not seeing any excessive cache pressure right now. |
| Comment by Gabriele Paciucci (Inactive) [ 18/Jun/14 ] |
|
Have you set these values? In arcstat-MB it seems to be only 10GB. Could you increase these values, and also zfs_arc_max? |
| Comment by Andrew Wagner [ 18/Jun/14 ] |
|
Yes, we set these values based on observations with different values on ZFS 0.6.2. The larger meta_limit helped us avoid running into ugly cache issues. Either way, the cache is only using about 10GB right now, as the filesystem has only been up for two days and is relatively quiet. We can't force it to use more of the cache without more activity. |
| Comment by Gabriele Paciucci (Inactive) [ 18/Jun/14 ] |
|
Okay, but yesterday Scott captured these values:

... and to me it looks like you run out of ARC memory for metadata during your mdtest. |
| Comment by Andrew Wagner [ 18/Jun/14 ] |
|
Sorry about that, we were missing a 0 on the arc_meta_limit. We'll retest with the new values. |
| Comment by Scott Nolin [ 18/Jun/14 ] |
|
I've completed an initial run with zfs_arc_meta_limit and zfs_arc_max set more appropriately, to 100G and 150G. The mdtest data doesn't look any better; it's actually a bit worse. Complicating things, the filesystem is now in use by other jobs - not heavy use, but it looks like a few hundred IOPS on various tasks (just watching jobstats in general). I'll post actual data if I can get another test done.

Regardless of that one, our other filesystem had more appropriate numbers to start with (it was left at the defaults) and sees a very similar difference in stat performance, so I don't really expect much...

Scott |
| Comment by Scott Nolin [ 18/Jun/14 ] |
|
Here are the results with more appropriate limits. I included the graph with arc_meta_limit to show we're not exceeding it, and also a version with a better scale. Notice how things certainly aren't better; they're a little worse. These graphs all have "95G" in the title. I wish I could embed images within comments to make it flow better. |
| Comment by Prakash Surya (Inactive) [ 08/Jul/14 ] |
|
I don't mean to barge in so late to the party, but I'm curious what the status of this issue is? It's a little hard to follow the comments and the attached graphs. What's the observed performance degradation, and what configuration changes/experiments have been tried (and what were the results)? |
| Comment by Scott Nolin [ 08/Jul/14 ] |
|
The observed performance degradation is ~45% lower stat IOPS in mdtest (8,000 vs 14,400) after the zfs/lustre upgrade. We adjusted the arc_meta_limit (as it was set poorly), but it made no difference. The easiest graph to look at is this one: https://jira.hpdd.intel.com/secure/attachment/15192/mdtest-zfs063-95G-arc_meta_limit.png

The graphs are all for one particular filesystem, but we have a second filesystem with similar hardware and the same software versions and saw a similar degradation in stat performance with mdtest. If anyone runs mdtest just prior to upgrading lustre/zfs I'd be interested in their results; I suspect they will be similar. Our software is from the zfsonlinux repo with no additional patches.

Scott |
| Comment by Prakash Surya (Inactive) [ 09/Jul/14 ] |
|
Thanks Scott. That definitely doesn't sit well with me. Can you post the command you used as a test? Do you have the exact mdtest command/options you used? How many nodes? If I can get some time, I might try and reproduce this and see if I can better understand what's going on here. It's definitely not expected nor desired for the performance to drop like that; I want to get to the bottom of this. Also, what's the Y axis label in the graph you linked to? I saw that earlier, but I can't make sense of it without labels. My initial interpretation was the Y axis is seconds, but that would mean lower is better, which doesn't agree with the claim of a performance decrease. |
| Comment by Prakash Surya (Inactive) [ 09/Jul/14 ] |
Actually, I think I got it now. The Y axis must be the rate of operations per second, which lines up with your claim of 14400 stat/s prior and 8000 stat/s now. When you get a chance, please update us with the command used to generate the workload. |
| Comment by Scott Nolin [ 09/Jul/14 ] |
|
The y-axis is IOPS. The command info:

mdtest-1.9.1 was launched with 64 total task(s) on 4 node(s)
64 tasks, 256000 files

Scott |
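The full command line is not preserved in this export. Purely as an illustration, an invocation consistent with those numbers (64 tasks over 4 nodes, 4000 files per task = 256000 files) might look like the following; the hostfile, iteration count, and target directory are placeholders:

# Hypothetical reconstruction: 64 MPI tasks x 4000 files per task = 256000 files.
mpirun -np 64 -hostfile ./hosts \
    mdtest -i 3 -F -n 4000 -d /lustre/scratch/mdtest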
| Comment by Scott Nolin [ 09/Jul/14 ] |
|
I would also add that while this absolute number from mdtest is worse, in actual use so far the upgrade has been an improvement. Performance doesn't seem to degrade so quickly with file creates, and things like interactive 'ls -l' are much better.

Scott |
| Comment by Prakash Surya (Inactive) [ 09/Jul/14 ] |
Glad to hear it! I'm still a bit puzzled regarding the stats though. I'm going to try and reproduce this using our test cluster; stay tuned. |
| Comment by Prakash Surya (Inactive) [ 09/Jul/14 ] |
|
Interesting.. I think I see similar reduced performance with stats as well.. Hm.. So here's the mdtest output with releases based on lustre 2.4.2 and zfs 0.6.3 on the servers:

hype355@root:srun -- mdtest -i 8 -F -n 4000 -d /p/lcratery/surya1/LU-5212/mdtest-1
-- started at 07/09/2014 13:09:02 --

mdtest-1.8.3 was launched with 64 total task(s) on 64 nodes
Command line used: /opt/mdtest-1.8.3/bin/mdtest -i 8 -F -n 4000 -d /p/lcratery/surya1/LU-5212/mdtest-1
Path: /p/lcratery/surya1/LU-5212
FS: 1019.6 TiB   Used FS: 50.3%   Inodes: 866.3 Mi   Used Inodes: 60.2%

64 tasks, 256000 files

SUMMARY: (of 8 iterations)
   Operation            Max        Min        Mean     Std Dev
   ---------            ---        ---        ----     -------
   File creation :   2046.438    838.703    1534.565    375.574
   File stat     :  65205.403  23577.494   57837.499  13089.055
   File removal  :   4780.471   4647.670    4719.076     45.088
   Tree creation :    505.051     34.332     221.404    196.950
   Tree removal  :     12.423     10.049      11.123      0.763

-- finished at 07/09/2014 13:40:52 --

And here's the mdtest output with releases based on lustre 2.4.0 and zfs 0.6.2 on the servers:

hype355@root:srun -- mdtest -i 8 -F -n 4000 -d /p/lcratery/surya1/LU-5212/mdtest-1
-- started at 07/09/2014 14:43:06 --

mdtest-1.8.3 was launched with 64 total task(s) on 64 nodes
Command line used: /opt/mdtest-1.8.3/bin/mdtest -i 8 -F -n 4000 -d /p/lcratery/surya1/LU-5212/mdtest-1
Path: /p/lcratery/surya1/LU-5212
FS: 1019.6 TiB   Used FS: 50.3%   Inodes: 861.8 Mi   Used Inodes: 60.5%

64 tasks, 256000 files

SUMMARY: (of 8 iterations)
   Operation            Max        Min        Mean     Std Dev
   ---------            ---        ---        ----     -------
   File creation :   1627.029    810.017    1320.848    239.655
   File stat     :  99560.417  69839.184   88798.194   9632.641
   File removal  :   4352.713   3279.728    4029.607    413.213
   Tree creation :    348.675     33.174     194.944    141.913
   Tree removal  :     15.176     10.103      12.088      1.386

-- finished at 07/09/2014 15:19:02 --

Which shows about a 34% decrease in the mean "File stat" performance with the lustre 2.4.2 and zfs 0.6.3 release (I'm assuming the number reported is operations per second). That's no good. |
| Comment by Prakash Surya (Inactive) [ 15/Jul/14 ] |
|
Scott, can you try increasing the `lu_cache_nr` module option and re-running the test?

# zwicky-lcy-mds1 /root > cat /sys/module/obdclass/parameters/lu_cache_nr
256

Try increasing it to something much larger, maybe 1M. I'd try that myself, but our testing resource is busy with other work at the moment. |
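A sketch of how that could be applied; whether the value in /sys can simply be written at runtime or needs to be set as a module option before mounting depends on the Lustre version, so treat this as an assumption rather than a verified procedure:

# Check the current value on the MDS and OSS nodes.
cat /sys/module/obdclass/parameters/lu_cache_nr

# Make the larger value (1M = 1048576) persistent across module reloads.
echo "options obdclass lu_cache_nr=1048576" >> /etc/modprobe.d/lustre.conf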
| Comment by Scott Nolin [ 16/Jul/14 ] |
|
Prakash, we will give this a try soon. Scott |
| Comment by Prakash Surya (Inactive) [ 16/Jul/14 ] |
|
Scott, I was able to squeeze in a test run with lu_cache_nr=1048576 on the MDS and all OSS nodes in the filesystem. I didn't see any significant difference:

hype355@root:srun -- mdtest -v -i 8 -F -n 4000 -d /p/lcratery/surya1/LU-5212/mdtest-5
-- started at 07/16/2014 09:23:46 --

mdtest-1.8.3 was launched with 64 total task(s) on 64 nodes
Command line used: /opt/mdtest-1.8.3/bin/mdtest -v -i 8 -F -n 4000 -d /p/lcratery/surya1/LU-5212/mdtest-5
Path: /p/lcratery/surya1/LU-5212
FS: 1019.6 TiB   Used FS: 50.3%   Inodes: 834.8 Mi   Used Inodes: 62.5%

64 tasks, 256000 files

SUMMARY: (of 8 iterations)
   Operation            Max        Min        Mean     Std Dev
   ---------            ---        ---        ----     -------
   File creation :   3060.802   2525.669    2719.003    161.410
   File stat     :  72310.501  32382.555   57016.755  11440.553
   File removal  :   4344.489   4043.991    4224.141     97.727
   Tree creation :    377.644     32.784     147.864    126.958
   Tree removal  :     11.800      9.356      10.626      0.884

-- finished at 07/16/2014 09:45:06 -- |
| Comment by Scott Nolin [ 16/Jul/14 ] |
|
Prakash, thanks for letting me know. We won't bother running it then. Ours is a production cluster; we can typically run these tests since it's not heavily used all the time, but it's not easy.

Scott |
| Comment by Peter Jones [ 05/Apr/18 ] |
|
I imagine the performance is quite different on more current versions of Lustre and ZFS |