[LU-2476] poor OST file creation rate performance with zfs backend Created: 29/Nov/11 Updated: 14/Aug/16 Resolved: 14/Aug/16 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Blocker |
| Reporter: | Ned Bass | Assignee: | Alex Zhuravlev |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | JL, performance, prz, zfs | ||
| Environment: |
RHEL 6.2 |
||
| Attachments: |
|
||||||||||||
| Issue Links: |
|
||||||||||||
| Sub-Tasks: |
|
||||||||||||
| Rank (Obsolete): | 2828 | ||||||||||||
| Description |
|
We observe poor file creation rate performance with a ZFS backend. The attached jpeg was generated using the createmany benchmark creating 1 million files in a single directory. Files created per second were reported for every 10,000 files created. The green line shows results for Lustre 2.1 with an ldiskfs backend. The red line is ZFS on Orion. The benchmark was run on the same hardware for both ldiskfs and ZFS: MDT: 24 GB RAM, 8 x 200GB SSD drives in two external SAS-2 enclosures (LUNs were partitioned for ldiskfs for compatibility reasons). |
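For reference, a minimal sketch of the kind of createmany invocation described above (the client mount point and test directory are hypothetical, and the exact flags used for the original run may have differed):

createmany -o /mnt/lustre/testdir/f%d 1000000

createmany prints the running create rate as it goes, which is presumably the source of the per-10,000-file rates plotted in the attachment.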
| Comments |
| Comment by Ned Bass [ 29/Nov/11 ] |
|
For further comparison, the second attachment shows the same test run locally on the MDS with a mounted ZFS filesystem configured identically to the MDT. |
| Comment by Andreas Dilger [ 29/Nov/11 ] |
|
Ned, thanks for running these tests. As yet we haven't done much performance testing with orion, so information like this is very valuable.

Was this with or without Brian's ZFS xattr-in-SA patch? Since Lustre will write 2 xattrs for every file created, if these cannot be stored in the dnode the performance will be quite bad. Without xattr-in-SA, there will be 4 dnodes + 3 data blocks written for every file. I think if you multiply the ZFS performance by 4x it will be very similar to the ldiskfs performance.

Second question is whether the createmany was run with "-o" (open and create objects), or with "-m" (mknod without objects)? It may well be that the OST object creation performance is slowing down the MDS performance, especially in the ZFS case where there is only a single OST. It is worthwhile to test the MDS in isolation, to see where the performance problem lies. In production there would be many more OSTs, which would hide the latency from object creation on a single OST.

Finally, since the metadata stack has changed noticeably between 2.1 and orion to accommodate ZFS, it is also worthwhile to include a test from ldiskfs on orion. |
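A minimal illustration of the two createmany modes referred to above, assuming a hypothetical client mount at /mnt/lustre (paths and counts are placeholders):

# full create path: MDS inode plus OST object
createmany -o /mnt/lustre/testdir/f%d 100000
# MDS in isolation: mknod only, no OST objects
createmany -m /mnt/lustre/testdir/f%d 100000

Comparing the two rates indicates whether OST object creation or the MDS itself is the bottleneck.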
| Comment by Ned Bass [ 29/Nov/11 ] |
It was with. The patch made a significant difference for a locally mounted ZFS filesystem, but little difference for Lustre.
-o. I'll give -m a try.
I did attempt to test ldiskfs on orion, but ran into stability issues. I'll open separate bugs for those. |
| Comment by Andreas Dilger [ 29/Nov/11 ] |
Is it possibly because mkfs.lustre is not setting the required pool property? According to Brian's comments:
Probably we need a small patch to mkfs_lustre_zfs.c (and/or manual tuning before the test) to set this property on all datasets that Lustre creates. |
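Assuming the property in question is the xattr=sa dataset property that goes along with the xattr-in-SA patch, the manual tuning would look roughly like this (pool and dataset names are hypothetical):

zfs set xattr=sa lustre-mdt0/mdt0
zfs set xattr=sa lustre-ost0/ost0
zfs get xattr lustre-mdt0/mdt0    # verify the setting took effect

A mkfs.lustre change would simply set the same property on each dataset it creates.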
| Comment by Ned Bass [ 29/Nov/11 ] |
Oops, sorry, I was confusing the xattr-in-SA patch with the recent "improve metadata performance" commit. In fact we don't currently have the SA based xattrs. I'll roll a new package with that patch when I get a chance and do some more testing. |
| Comment by Prakash Surya (Inactive) [ 07/Feb/12 ] |
|
Alex, We're seeing creates take up to 10 ms on our Lustre-ZFS filesystem. I've attached the Lustre log of one such case. It appears we're spending a lot of time (~1ms) in functions like osd_index_lookup, osd_declare_index_insert, and osd_declare_object_create, which then adds up to several-millisecond creates. Do you have any ideas about how we might speed this up? |
| Comment by Alex Zhuravlev [ 09/Feb/12 ] |
|
All those functions do is call the corresponding DMU functions to declare the updates. |
| Comment by Prakash Surya (Inactive) [ 10/Feb/12 ] |
|
I wanted to push these graphs up to you guys. We are seeing the number of creates we can sustain degrade over time. One graph plots four runs of createmany, plotting the create rate vs. time as reported by the client. The other two graphs are from another createmany test but plot the RPC service times for each create as reported in the Lustre logs. A sample of 'lctl dk' was taken at 5min intervals on the MDS, and these logs were mined to get the RPC service times. Each "dot" represents a serviced RPC with an opcode of 101. As you can see, the rates reported by the client and the RPC service times reported by the MDS both degrade. RPC service time grows, thus the client sees fewer creates per second. We are still investigating exactly why this degradation is occurring. |
| Comment by Andreas Dilger [ 10/Feb/12 ] |
|
It would be useful to run "seekwatcher" on the ZFS disks to see if the disk IO behaviour is changing over time as well (in particular the seek count), or if the slowdown is in core due to locks or some continual increase in the size of some array/list/tree. |
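If seekwatcher itself is inconvenient to set up, the raw data it needs can be captured with blktrace and graphed afterwards; a minimal sketch, assuming the MDT pool sits on /dev/sdb and /dev/sdc (device names are hypothetical):

# capture 10 minutes of block-level IO on the MDT devices while createmany runs
blktrace -d /dev/sdb -d /dev/sdc -o mdt-trace -w 600
# inspect the trace directly, or feed it to seekwatcher to plot seeks over time
blkparse -i mdt-trace | less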
| Comment by Prakash Surya (Inactive) [ 11/Feb/12 ] |
|
Well, we've run a few tests over the ZPL interface and haven't seen this type of behavior. In that case (ZPL/ZFS) I believe we saw two distinct create rates: one high(er) rate when we were getting mainly cache hits, and a low(er) rate when we were getting cache misses and became IOPS limited, with both rates being relatively constant. In the graphs I just posted, I believe we're getting mainly ARC hits, so I don't know how useful seekwatcher would prove. For example, in the "dmu-api-patch-createmany-results.ps" graph, I was sporadically watching iostat and zpool iostat to keep an eye on the IOPS, and I only started to see sustained reads at the very end, which I believe corresponds to the dramatic drop around 1,700K files. My next course of action was to try and mine the logs to see if any one (or multiple) Lustre function's execution time was increasing as the test progressed, to try and see if the slowdown was due to lock contention, poor scaling, etc. as you mentioned. Although, if data from seekwatcher would be valuable I can definitely run that. |
| Comment by Prakash Surya (Inactive) [ 14/Feb/12 ] |
|
It appears the osp_object_create function's execution time is increasing as the createmany test progresses. So as more files are created, this function takes longer to execute, going from tens of microseconds to 1-2 milliseconds. |
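A rough sketch of one way to mine the debug log for this, pairing the ENTRY/EXIT markers for a single function. It assumes the +trace debug mask is enabled and that, in the 'lctl dk' output, the fourth colon-separated field is the seconds.microseconds timestamp and the sixth is the pid (field positions may differ by version):

lctl dk > /tmp/dk.log
awk -F: '/osp_object_create/ && /Process entered/ { start[$6] = $4 }
         /osp_object_create/ && /Process leaving/ && ($6 in start) {
             printf "%.6f\n", $4 - start[$6]; delete start[$6] }' /tmp/dk.log

Each printed number is one call's wall time in seconds, which can then be bucketed by file count to see the growth.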
| Comment by Alex Zhuravlev [ 09/Jul/12 ] |
|
any updates here? |
| Comment by Prakash Surya (Inactive) [ 26/Jul/12 ] |
|
Alex, IIRC, we chalked the create rate degrading over time to a deficiency in the lu_site hash function implementation. This has since been fixed and landed: commit 23d5254c8d607a0ae1065853011ecfc82f0543a2
Author: Liang Zhen <liang@whamcloud.com>
Date: Tue Mar 29 16:44:41 2011 +0800
LU-143 fid hash improvements
Current hash function of fid is not good enough for lu_site and
ldlm_namespace.
We have to use two totally different hash functions to hash fid into
hash table with millions of entries.
As far as the current state of our MDS create rate performance, we think there are two distinct states which the MDS gets in: the state where the ARC has not hit its metadata limit, and the state where it has hit the metadata limit. Create rates are much better prior to hitting the limit. Once the limit is hit, we believe our biggest bottleneck lies within ZFS. Once this happens, the ARC will be forced to pitch data from its cache. In the process it calls the arc_evict function, which is pretty heavy handed with respect to the locking it does. It takes the state mutex and the evicted state mutex and holds those locks for the majority of the function:
1671 static void *
1672 arc_evict(arc_state_t *state, uint64_t spa, int64_t bytes, boolean_t recycle,
1673 arc_buf_contents_t type)
1674 {
...
1687 mutex_enter(&state->arcs_mtx);
1688 mutex_enter(&evicted_state->arcs_mtx);
...
1772 mutex_exit(&evicted_state->arcs_mtx);
1773 mutex_exit(&state->arcs_mtx);
...
1807 }
The profiling we've done does not show the MDS being CPU bound. Specifically, top shows the arc_adapt and txg_sync threads as the top CPU offenders (but 90% of the cpu is still idle), and oprofile shows arc_evict very high in the list of CPU offenders but only at about 5% of the cpu:

samples  %       image name   app name     symbol name
469692   5.7936  zfs.ko       zfs.ko       lzjb_decompress
409723   5.0538  zfs.ko       zfs.ko       arc_evict
394668   4.8681  vmlinux      vmlinux      intel_idle
393170   4.8497  vmlinux      vmlinux      mutex_spin_on_owner
375095   4.6267  spl.ko       spl.ko       taskq_thread
213346   2.6316  vmlinux      vmlinux      mutex_lock
190386   2.3484  zfs.ko       zfs.ko       lzjb_compress
120682   1.4886  spl.ko       spl.ko       spl_kmem_cache_alloc
117026   1.4435  vmlinux      vmlinux      poll_idle
116807   1.4408  vmlinux      vmlinux      mutex_unlock
110453   1.3624  spl.ko       spl.ko       __taskq_dispatch_ent
100957   1.2453  vmlinux      vmlinux      find_busiest_group
99320    1.2251  vmlinux      vmlinux      kfree
96255    1.1873  obdclass.ko  obdclass.ko  lu_site_purge
81324    1.0031  libcfs.ko    libcfs.ko    cfs_hash_spin_lock
78966    0.9740  vmlinux      vmlinux      schedule

Brian has also mentioned that he has seen similar reports of the metadata performance degrading once the ARC's metadata limit is reached on workloads going entirely through ZFS/ZPL alone. So I think the work that needs to be done next in order to increase our sustained and/or burst create rate performance will be in the ZFS code. As a follow-up comment, I'll post the data from the createmany test I ran earlier today, along with some graphs of the performance numbers. |
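As a side note, the transition between the two states described above can be watched for directly; a minimal sketch that polls the relevant ARC counters while the create workload runs (the kstat path is the standard ZFS-on-Linux location):

# print arc_meta_used and arc_meta_limit every 10 seconds; the slow state
# begins once arc_meta_used stays pinned at arc_meta_limit
while sleep 10; do
    awk '/^arc_meta_used|^arc_meta_limit/ { printf "%s %s  ", $1, $3 } END { print "" }' \
        /proc/spl/kstat/zfs/arcstats
done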
| Comment by Prakash Surya (Inactive) [ 26/Jul/12 ] |
|
I ran the "createmany" test on 10 client nodes all creating files in separate directories. The exact test run was essentially: pdsh -w grove[390-399] '/tmp/createmany -o -g -i 1000 /p/lstest/surya1/ori434/run1/`hostname`/%d 1000000'
|
| Comment by Alex Zhuravlev [ 27/Jul/12 ] |
|
a lot of interesting information! thanks Prakash. are you saying this should not be our focus at the moment? |
| Comment by Prakash Surya (Inactive) [ 30/Jul/12 ] |
|
Well, we definitely will need to do something about the file create performance. But, Lustre might be off the hook for the time being. From the information we have currently, we'll need to make some modifications on the ZFS backend in order to increase the create rate. The issue arises when the ZFS ARC is "full" (or hit the metadata limit in this case) and thus needs to start evicting buffers. Since the evict function takes giant locks on the state list mutexes, all other traffic needing to use those state mutexes obviously will block. Lustre doesn't have much say in this matter, so the "fix" will need to come from within ZFS. One possible solution would be to proactively pitch metadata buffers from the cache when the system is idle. This would give enough free ARC space to allow quick bursts of creates. For example, an application would be able to create ~1M files at the ~6K/s rate. Then in the background, the ARC would be pruned so it doesn't hit the metadata limit (and drop to the slower rate) when the next ~1M burst happens. Of course this doesn't fix the sustained rate. The fix for that would be much more invasive and would require finer grained locking over the lists. |
| Comment by Brian Behlendorf [ 30/Jul/12 ] |
|
Right, since we have a pretty good idea what's going on here I'd rather see us fix some of the remaining stability issues we've observed through real usage. I'll troll our logs from the last two weeks and email a list which should help you guys prioritize. |
| Comment by Alex Zhuravlev [ 11/Dec/12 ] |
|
Brian, Ned, do you think we can close the ticket? another option is to make it LU-XXXX |
| Comment by Christopher Morrone [ 11/Dec/12 ] |
|
There is still work needed in this area. Moving to an LU ticket would be fine. |
| Comment by Alex Zhuravlev [ 11/Dec/12 ] |
|
Christopher, could you describe current status please? any numbers to share? |
| Comment by Christopher Morrone [ 11/Dec/12 ] |
|
Roughly, I see about 6500 creates per second from ior when all 768 OSTs are available on Sequoia. Rates are lower when not all 768 are available. |
| Comment by Andreas Dilger [ 11/Jan/13 ] |
|
This bug is for OST object creation performance, |
| Comment by Prakash Surya (Inactive) [ 25/Jul/13 ] |
|
I just posted this to the lustre-devel mailing list, and I'll copy/paste it here since I think it's relevant:
|
| Comment by Alex Zhuravlev [ 01/Aug/13 ] |
|
this is not directly related to the problem being discussed, but probably some improvement can be achieved with: |
| Comment by Andreas Dilger [ 01/Oct/13 ] |
|
Alex's patch shows good early test results, but isn't yet ready to be landed for the 2.5.0 release. Moving this to 2.5.1 so that we can track the patch for landing there if it is ready. |
| Comment by Alex Zhuravlev [ 31/Jan/14 ] |
|
News from Brian B.: |
| Comment by Scott Nolin [ 19/Feb/14 ] |
|
I'd like to add that at SSEC we're affected by this with moderate severity, and have been following it both here and on the ZFS side at https://github.com/zfsonlinux/zfs/pull/2110. We plan on updating ZFS when this is ready for release. Our system is in production so we'd prefer to wait for the release, but if this is likely to be a matter of months instead of weeks, we may be willing to test the patch. We are seeing this problem in particular because we are populating a new file system with existing data (lots of writes), so every few days it becomes an issue. Scott |
| Comment by Prakash Surya (Inactive) [ 19/Feb/14 ] |
|
Scott, I'm glad you stumbled upon that pull request. My hope is it will get merged very soon (i.e. this week), and will get some more testing from the ZFS community. Here at LLNL we're still running a 0.6.2 ZFS release in production, although I'm very anxious to get 0.6.3 tagged and put in production because I think it will alleviate a lot of suffering we see on our ZFS servers.

With that said, I'd love to leverage some more testing if you're willing. With a "small" test cluster (1 MDS, 8 OSS, 14 clients) running https://github.com/zfsonlinux/zfs/pull/2110, I've managed to create over 800 million files in ~8 days at an average rate of about 1.2K creates per second. Without that pull request, the creates per second drop by a couple orders of magnitude within an hour or so. The clients are running a simple fdtree workload (each within a unique directory): "fdtree -d 10 -l 5 -s 1 -f 10000 -C"

If it's useful, I could try and explain why these ZFS patches help Lustre's particular create workload, but it would take a non-trivial amount of effort to write that up and I'm not sure anybody is really interested. And honestly, although both will benefit, that set of patches will probably help the MDS more so than the OSS.

There's definitely more work that needs to be done in this area, though, so I don't want people to think that landing https://github.com/zfsonlinux/zfs/pull/2110 is the end of the road. Specifically, once the OSTs' OIs become large enough, I'm constantly seeing ARC misses on the OSS nodes. Those are likely due to fzap operations hitting the disk because the OIs are too large to fit in RAM, but I haven't done enough investigation to know for sure. |
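For anyone wanting to reproduce a comparable load, a rough sketch of launching that per-client workload (host names and the mount point are hypothetical, and it assumes fdtree operates in its current working directory), mirroring the pdsh pattern used earlier in this ticket:

pdsh -w client[01-14] 'mkdir -p /mnt/lustre/fdtree/`hostname` && cd /mnt/lustre/fdtree/`hostname` && fdtree -d 10 -l 5 -s 1 -f 10000 -C'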
| Comment by Alex Zhuravlev [ 20/Feb/14 ] |
|
Prakash, those ARC misses (on the OSS nodes) .. can you tell whether they happen in the context of object creation or of normal read/write/punch? |
| Comment by Prakash Surya (Inactive) [ 20/Feb/14 ] |
|
Alex, I probably could. The test is still running, so I could poke around tomorrow when I get in to the office. I'm also running with the latest ZFS code, so there's a really useful proc file I can use to determine exactly which object(s) it is missing on. Why do you want to know? Understanding the angle you're coming from will help me ensure I gather all the info you need the first time around.

With that said, my guess is it's missing on the precreates. I started the test with an empty filesystem, and at first I saw around 5K creates per second and almost no misses on the MDS or the OSS nodes. After a certain amount of time I began to see a constant ~1% miss rate on the MDS, which actually had a significant effect on the create rate; it brought it down by about 50%, to around 2.5K creates per second. After some more time had elapsed, the OSS nodes started to get misses and I began to see the "slow precreate" messages on the OSS nodes' consoles. Also, the MDS reads as a whole have become bursty; I see no ARC reads at all on the MDS for seconds at a time. My guess is that this is due to the OSS precreates not being able to keep up with the MDS's requests; so while the OSS nodes chug along trying to precreate objects, the MDS sits idle.

I think what's happening now at the tail end of the test is that the OSS nodes' OIs are too large to completely sit in RAM. So when a create happens, it must do an FZAP operation, which will hash to a "random" offset into one of the 32 OIs. Since only a portion of the OI can stay resident in RAM, some of these operations will read in a leaf from disk, slowing down the precreate rate and the overall create rate of the test. Once the precreate brings in the leaf, any follow-up operations dealing with the same leaf will be a cache hit, until that leaf rolls off the end of the MRU or MFU lists in the ARC. At least that's my theory, based on how I understand all this machinery to work. Unfortunately I haven't gathered any evidence to back this claim up, yet. |
| Comment by Scott Nolin [ 20/Feb/14 ] |
|
Prakash, We can't really do any aggressive testing as this system is in use. However, right now it only has 1 primary user who is populating the file system. Especially once the creates grind down to a trickle, we can have the work halt, do some minor things, and restart; naturally they're OK with that as the throughput makes it worthwhile. If there's anything simple we could do or report we will, but probably not patch right now as we're working around the issue. Regarding why the patches help Lustre's workload: while it's interesting, don't take the effort to write that up on my account.

Just for reference, or if it helps in any way, here's what we see and are doing. We have a system with 1 MDS, 12 OSS. Data is being written to this from existing NFS mounted filesystems (NAS basically). There are, I think, 6 NAS machines on 1GB or channeled dual 1GB ethernet. These flow through 3 lrouter machines with 10GB ethernet to Infiniband and the servers. We can also test on infiniband connected cluster nodes if needed, but when we see the create problem it of course doesn't matter.

What we saw: once our user had everything set and was writing at a reasonable rate, after a week or two file creation went down by a few orders of magnitude, to about 2 per second - really bad, as noted in this ticket. On the MDS we noticed txg_sync and arc_adapt particularly busy, but the system cpu mostly idle, as also noted above. To try to at least put the problem off we increased our arc cache settings and restarted, and performance was back to normal. For reference, with mdtest in initial testing we saw around 2000 creates/second. We can't run mdtest right now so we've just been doing simple tests from one node (fdtree is nice by the way), and comparing the same on another zfs backed file system (though not nearly so busy), and a 2.3 ldiskfs based system. This normally gets perhaps 200 to 300 creates per second - all of these file systems are in use, so we can't be too accurate. After a while performance degraded again, so we stop and restart lustre/zfs on the mds.

I haven't been looking at the OSS's, but just checked, and while these do look to have full arc caches, I only see one report in the logs for all servers of slow creates:

Lustre: arcdata-OST001c: Slow creates, 384/512 objects created at a rate of 7/s

So my guess is if these patches help MDS issues more so than OSS, that's at least addressing the more immediate issue for us. If we had a way to simply completely reset the arc without restarting the lustre and zfs services, that would be useful. Scott |
| Comment by Prakash Surya (Inactive) [ 20/Feb/14 ] |
|
Since the performance you see appears to degrade, and restarting ZFS appears to help things, my guess is you're hitting the same or a very similar issue to the one we see on our production ZFS filesystems. To double check this, can you run this command on the MDS and post the output (when the system is in the "slow" state):

grep -E "^c|_size|arc_meta_used|arc_meta_limit" /proc/spl/kstat/zfs/arcstats

My guess is the size of the MRU and MFU lists will be very small, while the size of the MRU ghost and MFU ghost lists will be large; but I want to check and make sure.
The very, very ugly hack we have in place at the moment is to drop caches on the MDS once a certain threshold is reached. You can try to run something like this, to see if things go back to the rate you were seeing after a fresh reboot:

for i in `seq 1 10000`; do
    echo 3 > /proc/sys/vm/drop_caches
done
|
| Comment by Prakash Surya (Inactive) [ 20/Feb/14 ] |
|
Alex, Does this get you what you want to see?

# zeno1 /root > cat /proc/spl/kstat/zfs/zeno1/reads
8 0 0x01 30 3360 242325718628 766164277255963
UID   start           objset object level blkid aflags origin         pid  process
15928 766164156579412 0x2f   178    0     79114 0x64   dbuf_read_impl 3803 ll_ost03_000
15929 766164159823356 0x2f   179    0     84980 0x64   dbuf_read_impl 3803 ll_ost03_000
15930 766164162520769 0x2f   150    0     58855 0x64   dbuf_read_impl 3803 ll_ost03_000
15931 766164163555034 0x2f   169    0     48748 0x64   dbuf_read_impl 6828 ldlm_cn03_007
15932 766164174364152 0x2f   151    0     83592 0x64   dbuf_read_impl 3803 ll_ost03_000
15933 766164177093706 0x2f   148    0     416   0x64   dbuf_read_impl 6828 ldlm_cn03_007
15934 766164184477506 0x2f   152    0     6507  0x64   dbuf_read_impl 3803 ll_ost03_000
15935 766164187822369 0x2f   153    0     24148 0x64   dbuf_read_impl 3803 ll_ost03_000
15936 766164188694642 0x2f   167    0     13795 0x64   dbuf_read_impl 6828 ldlm_cn03_007
15937 766164197669859 0x2f   154    0     36067 0x64   dbuf_read_impl 3803 ll_ost03_000
15938 766164199348418 0x2f   162    0     84625 0x64   dbuf_read_impl 6828 ldlm_cn03_007
15939 766164210502738 0x2f   156    0     63494 0x64   dbuf_read_impl 3803 ll_ost03_000
15940 766164210929591 0x2f   163    0     57995 0x64   dbuf_read_impl 6828 ldlm_cn03_007
15941 766164222820463 0x2f   154    0     10070 0x64   dbuf_read_impl 6828 ldlm_cn03_007
15942 766164222864079 0x2f   157    0     53684 0x64   dbuf_read_impl 3803 ll_ost03_000
15943 766164225929360 0x2f   160    0     34939 0x64   dbuf_read_impl 3803 ll_ost03_000
15944 766164228828963 0x2f   161    0     5729  0x64   dbuf_read_impl 3803 ll_ost03_000
15945 766164231472223 0x2f   162    0     11351 0x64   dbuf_read_impl 3803 ll_ost03_000
15946 766164234109879 0x2f   170    0     59220 0x64   dbuf_read_impl 6828 ldlm_cn03_007
15947 766164237596524 0x2f   163    0     94437 0x64   dbuf_read_impl 3803 ll_ost03_000
15948 766164240488786 0x2f   164    0     57877 0x64   dbuf_read_impl 3803 ll_ost03_000
15949 766164240829980 0x2f   165    0     87378 0x64   dbuf_read_impl 3803 ll_ost03_000
15950 766164243497409 0x2f   166    0     42756 0x64   dbuf_read_impl 3803 ll_ost03_000
15951 766164246480267 0x2f   169    0     48773 0x64   dbuf_read_impl 6828 ldlm_cn03_007
15952 766164249795764 0x2f   167    0     36727 0x64   dbuf_read_impl 3803 ll_ost03_000
15953 766164253590137 0x2f   163    0     66820 0x64   dbuf_read_impl 6828 ldlm_cn03_007
15954 766164261417886 0x2f   168    0     53317 0x64   dbuf_read_impl 3803 ll_ost03_000
15955 766164265303986 0x2f   160    0     8178  0x64   dbuf_read_impl 6828 ldlm_cn03_007
15956 766164272799763 0x2f   171    0     1701  0x64   dbuf_read_impl 6828 ldlm_cn03_007
15957 766164273028298 0x2f   169    0     36867 0x64   dbuf_read_impl 3803 ll_ost03_000 |
| Comment by Scott Nolin [ 20/Feb/14 ] |
|
Prakash, Our user asked us to reset things, so I captured the arcstats info and beat down the cache size with that hack. However, the system wasn't really in an extremely degraded state. The arc was full but I think performance wasn't that bad - lower than his peak, but maybe 50%. Since he's the only user I went ahead and did it anyhow. So I don't know how useful these numbers are for you:

c                        4  59624418824
c_min                    4  33000000000
c_max                    4  96000000000
hdr_size                 4  10851742848
data_size                4  25850144256
other_size               4  22922398496
anon_size                4  409600
mru_size                 4  7485866496
mru_ghost_size           4  33932813824
mfu_size                 4  18363868160
mfu_ghost_size           4  1910375424
l2_size                  4  0
l2_hdr_size              4  0
duplicate_buffers_size   4  0
arc_meta_used            4  56346426784
arc_meta_limit           4  66000000000

Reading your previous posts, I think we did see the cache warm-up effect after our restart last time. Performance wasn't great and then got much better as data flowed in over time, then got worse. Scott |
| Comment by Prakash Surya (Inactive) [ 20/Feb/14 ] |
|
Yea, those numbers look "normal" to me. I'd expect the ARC to be performing reasonably well at that point in time. I see you're manually tuning a few of the ARC parameters (i.e. arc_meta_limit, c_max, c_min, maybe others?), care to fill me in on why that is? I have a feeling your tuning of arc_meta_limit is helping you, but I'm curious what drove you to do that in the first place and what (if any) perceived benefits you're seeing. Have you been hitting any OOMs with a higher than normal arc_meta_limit? How much RAM is on the system? |
| Comment by Scott Nolin [ 21/Feb/14 ] |
|
We were just guessing that increasing the arc size would delay the performance degradation. We haven't had any OOMs; we have 128GB of RAM. I don't know if it's really helped, but it hasn't hurt. My guess is it's either pointless or a minor help. Scott |
| Comment by Scott Nolin [ 22/Feb/14 ] |
|
Here's some additional information about our current workload, in case it's pertinent. I think with every file written the mtime is changed and permissions are changed by the user's process. In aggregate according to robinhood there are an average of 1 mtime change and about 1.5 permission change operations per file. Earlier we ran into the issue with synchronous permission changes, because so many are performed - https://jira.hpdd.intel.com/browse/LU-3671 - making them asynchronous helped significantly. Scott |
| Comment by Prakash Surya (Inactive) [ 24/Feb/14 ] |
|
FYI, the ARC changes landed last Friday to the ZoL master tree. So if you get a chance to update and run with those patches, I'd be interested to know if you still have to "reset" things due to performance degradation. |
| Comment by Scott Nolin [ 07/Mar/14 ] |
|
Just an update on changing the arc parameters. It did take a while to happen, but now we have seen some negative effects. No OOM, but lustre threads are taking a long time to complete, indicating an overloaded system. So I assume there is not enough memory for the number of service threads. Scott |
| Comment by Scott Nolin [ 13/Jun/14 ] |
|
For the record, here is a graph of performance degradation - mdtest-arcdata.png. After upgrading RAM (and moving to SSDs) we now see it after about 30 million creates. This is on lustre 2.4.0-1 with zfs 0.6.2, no additional patches. We have 258 GB of RAM, and the zfs arc options were doubled from the default:

options zfs zfs_arc_meta_limit=10000000000

I know the new version of zfs increases this further by default. Ideally we'd be able to run this test with the new version too. Interestingly, now I see all other mdtest file operations go bad at about the same time. Previously, with fewer resources and default zfs arc options, the file creates tanked very quickly, but the other operations stayed the same. In that case though we never managed to test past maybe a couple million files, as it was so slow. |
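For reference, a sketch of where that tuning typically lives and how to confirm what the loaded module is actually using (the limit shown is the one quoted above; the file path is the conventional one for ZFS on Linux):

# /etc/modprobe.d/zfs.conf
options zfs zfs_arc_meta_limit=10000000000

# check the live value after the module is loaded
cat /sys/module/zfs/parameters/zfs_arc_meta_limit
grep arc_meta_limit /proc/spl/kstat/zfs/arcstats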
| Comment by Scott Nolin [ 17/Jun/14 ] |
|
We have upgraded to zfs-0.6.3-1 and while it looks promising that this issue is solved (or greatly improved) I haven't been able to verify this. I managed to run about 20 Million file writes, and need more than 30M to go past the point where things degraded last time. I'd like to verify it's improved, but can't do it with that system now. While testing we did unfortunately see decreased numbers for stat, see Scott |
| Comment by James A Simmons [ 14/Aug/16 ] |
|
Old blocker for unsupported version |