[LU-2476] poor OST file creation rate performance with zfs backend Created: 29/Nov/11  Updated: 14/Aug/16  Resolved: 14/Aug/16

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: None

Type: Improvement Priority: Blocker
Reporter: Ned Bass Assignee: Alex Zhuravlev
Resolution: Won't Fix Votes: 0
Labels: JL, performance, prz, zfs
Environment:

RHEL 6.2


Attachments: File aplot.eps     File bplot.eps     PNG File clients.png     JPEG File creates.jpg     File dmu-api-patch-createmany-results.ps     File llog-mds-zwicky-1-slow     PNG File mdstats.png     PNG File mdtest-arcdata.png     File ori434-run1.tar.bz2     PNG File osp_object_create.png     JPEG File zfs.jpg    
Issue Links:
Duplicate: duplicates LU-2600 lustre metadata performance is very s... (Resolved)
Sub-Tasks:
Key | Summary | Type | Status | Assignee
LU-2477 | Poor MDS create performance due to AR... | Technical task | Resolved | Alex Zhuravlev
Rank (Obsolete): 2828

 Description   

We observe poor file creation rate performance with a zfs backend. The attached jpeg was generated using the createmany benchmark creating 1 million files in a single directory. Files created per second were reported for every 10,000 files created. The green line shows results for Lustre 2.1 and ldiskfs backend. The red line was with ZFS on Orion.

The benchmark was run on the same hardware for both ldiskfs and ZFS:

MDT: 24 GB RAM, 8 200GB SSD drives in two external SAS-2 enclosures
(Linux MD-RAID10 for ldiskfs, 1 zpool with 8 mirrored pairs for zfs)
OSS: 2 OSS nodes (3 8TB* OSTs each for ldiskfs, 1 72TB OST each for ZFS)
OST: Netapp 60 drive enclosure, 6 24 TB RAID6 LUNs, 3 TB SAS drives, dual 4X QDR IB connections to each OST
Network: 20 Gb/sec 4X DDR IB

*LUNs were partitioned for ldiskfs for compatibility reasons



 Comments   
Comment by Ned Bass [ 29/Nov/11 ]

For further comparison, the second attachment shows the same test run locally on the MDS with a mounted ZFS filesystem configured identically to the MDT.

Comment by Andreas Dilger [ 29/Nov/11 ]

Ned, thanks for running these tests. As yet we haven't done much performance testing with Orion, so information like this is very valuable.

Was this with or without Brian's ZFS xattr-in-SA patch? Since Lustre will write 2 xattrs for every file created, if these cannot be stored in the dnode the performance will be quite bad. Without xattr-in-SA, there will be 4 dnodes + 3 data blocks written for every file. I think if you multiply the ZFS performance by 4x it will be very similar to the ldiskfs performance.

Second question is whether the createmany was run with "-o" (open and create objects), or with "-m" (mknod without objects)? It may well be that the OST object creation performance is slowing down the MDS performance, especially in the ZFS case where there is only a single OST. It is worthwhile to test the MDS in isolation, to see where the performance problem lies. In production there would be many more OSTs, which would hide the latency from object creation on a single OST.
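For reference, a minimal sketch of the two modes being compared (mount point, directory, and file count are placeholders; createmany takes a printf-style name format):

# mknod only (-m): no OST objects are allocated, so this isolates MDS create cost
./createmany -m /mnt/lustre/testdir/f%d 100000

# open/create (-o): OST objects are allocated, so slow OST precreates show up here
./createmany -o /mnt/lustre/testdir/f%d 100000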

Finally, since the metadata stack has changed noticeably between 2.1 and Orion to accommodate ZFS, it is also worthwhile to include a test of ldiskfs on Orion.

Comment by Ned Bass [ 29/Nov/11 ]

Was this with or without Brian's ZFS xattr-in-SA patch?

It was with. The patch made a significant difference for a locally mounted ZFS filesystem, but little difference for Lustre.

Second question is whether the createmany was run with "-o" (open and create objects), or with "-m"

-o. I'll give -m a try.

Finally, since the metadata stack has changed noticeably between 2.1 and Orion to accommodate ZFS, it is also worthwhile to include a test of ldiskfs on Orion.

I did attempt to test ldiskfs on Orion, but ran into stability issues. I'll open separate bugs for those.

Comment by Andreas Dilger [ 29/Nov/11 ]

The (xattr-in-SA) patch made a significant difference for a locally mounted ZFS filesystem, but little difference for Lustre.

Is it possibly because mkfs.lustre is not setting the required pool property? According to Brian's comments:

By default with the patch applied the feature will be disabled. To enable it you will need to set the property 'xattr=sa' on your dataset.

Probably we need a small patch to mkfs_lustre_zfs.c (and/or manual tuning before the test) to set this property on all datasets that Lustre creates.
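Until mkfs.lustre sets it automatically, a hedged sketch of the manual tuning (pool/dataset names below are placeholders; the property only affects files created after it is set):

# Enable SA-based xattrs on the Lustre backend datasets
zfs set xattr=sa mdtpool/mdt0
zfs set xattr=sa ostpool/ost0

# Verify the property took effect
zfs get xattr mdtpool/mdt0 ostpool/ost0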

Comment by Ned Bass [ 29/Nov/11 ]

Is it possibly because mkfs.lustre is not setting the required pool property?

Oops, sorry, I was confusing the xattr-in-SA patch with the recent "improve metadata performance" commit. In fact we don't currently have the SA based xattrs. I'll roll a new package with that patch when I get a chance and do some more testing.

Comment by Prakash Surya (Inactive) [ 07/Feb/12 ]

Alex,

We're seeing creates take up to 10 ms on our Lustre-ZFS filesystem. I've attached the Lustre log of one such case. It appears we're spending a lot of time (~1ms) in functions like osd_index_lookup, osd_declare_index_insert, and osd_declare_object_create, which then adds up to several-millisecond creates. Do you have any ideas about how we might speed this up?

Comment by Alex Zhuravlev [ 09/Feb/12 ]

All those functions do is call the corresponding DMU functions to declare updates.
I'm not familiar with the expected numbers, but 10ms is way too much ... given all the buffers
are prepared in dmu_tx_hold_*(), it might be something within the DMU. Are you running with
debug compiled into ZFS? Could you profile the case with all the debugging turned off, please?

Comment by Prakash Surya (Inactive) [ 10/Feb/12 ]

I wanted to push these graphs up to you guys.

We are seeing the number of creates we can sustain degrade over time.

One graph shows four runs of createmany, plotting the create rate vs. time as reported by the client.

The other two graphs are from another createmany test but plot the RPC service times for each create as reported in the Lustre logs. A sample of 'lctl dk' was taken at 5min intervals on the MDS, and these logs were mined to get the RPC service times. Each "dot" represents a serviced RPC with an opcode of 101.

As you can see, the rates reported by the client and the RPC service times reported by the MDS both degrade: the RPC service time grows, so the client sees fewer creates per second.

We are still investigating exactly why this degradation is occurring.

Comment by Andreas Dilger [ 10/Feb/12 ]

It would be useful to run "seekwatcher" on the ZFS disks to see if the disk IO behaviour is changing over time as well (in particular the seek count), or if the slowdown is in-core, due to locks or some continual increase in the size of some array/list/tree.
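For reference, one hedged way to gather that data (device name, output names, and run length are placeholders; seekwatcher post-processes blktrace output into a seek/throughput-over-time graph):

# Trace the MDT device while the create workload runs, then graph it
blktrace -d /dev/sdX -o mdt-create -w 600
seekwatcher -t mdt-create -o mdt-create.png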

Comment by Prakash Surya (Inactive) [ 11/Feb/12 ]

Well, we've run a few tests over the ZPL interface and haven't seen this type of behavior. In that case (ZPL/ZFS) I believe we saw two distinct create rates: a higher rate when we were getting mainly cache hits, and a lower rate when we were getting cache misses and became IOPS limited, with both rates relatively constant.

In the graphs I just posted, I believe we're getting mainly ARC hits, so I don't know how useful seekwatcher would prove. For example, in the "dmu-api-patch-createmany-results.ps" graph, I was sporadically watching iostat and zpool iostat to keep an eye on the IOPS, and I only started to see sustained reads at the very end, which I believe corresponds to the dramatic drop around 1,700K files.

My next course of action was to try to mine the logs to see if any one (or several) Lustre function's execution time was increasing as the test progressed, to see if the slowdown was due to lock contention, poor scaling, etc. as you mentioned. Although, if data from seekwatcher would be valuable, I can definitely run that.

Comment by Prakash Surya (Inactive) [ 14/Feb/12 ]

It appears the osp_object_create function's execution time is increasing as the createmany test progresses. So as more files are created, this function takes longer to execute, going from tens of microseconds to 1-2 milliseconds.

Comment by Alex Zhuravlev [ 09/Jul/12 ]

Any updates here?

Comment by Prakash Surya (Inactive) [ 26/Jul/12 ]

Alex,

IIRC, we chalked the create rate degrading over time up to a deficiency in the lu_site hash function implementation. This has since been fixed and landed:

commit 23d5254c8d607a0ae1065853011ecfc82f0543a2
Author: Liang Zhen <liang@whamcloud.com>
Date:   Tue Mar 29 16:44:41 2011 +0800

    LU-143 fid hash improvements

    Current hash function of fid is not good enough for lu_site and 
    ldlm_namespace.
    We have to use two totally different hash functions to hash fid into
    hash table with millions of entries.

As for the current state of our MDS create rate performance, we think there are two distinct states the MDS gets into: one where the ARC has not yet hit its metadata limit, and one where it has. Create rates are much better prior to hitting the limit.

Once the limit is hit, we believe our biggest bottleneck lies within ZFS. Once this happens, the ARC is forced to pitch data from its cache. In the process it calls the arc_evict function, which is pretty heavy-handed with respect to the locking it does. It takes the state mutex and the evicted state mutex and holds those locks for the majority of the function:

1671 static void *
1672 arc_evict(arc_state_t *state, uint64_t spa, int64_t bytes, boolean_t recycle,
1673     arc_buf_contents_t type)
1674 {

...

1687         mutex_enter(&state->arcs_mtx);
1688         mutex_enter(&evicted_state->arcs_mtx);

...

1772         mutex_exit(&evicted_state->arcs_mtx);
1773         mutex_exit(&state->arcs_mtx);

...

1807 }

The profiling we've done does not show the MDS being CPU bound. Specifically, top shows the arc_adapt and txg_sync threads as the top CPU offenders (but 90% of the CPU is still idle), and oprofile shows arc_evict as very high in the list of CPU offenders but only at about 5% of the CPU:

samples  %        image name               app name                 symbol name
469692    5.7936  zfs.ko                   zfs.ko                   lzjb_decompress
409723    5.0538  zfs.ko                   zfs.ko                   arc_evict
394668    4.8681  vmlinux                  vmlinux                  intel_idle
393170    4.8497  vmlinux                  vmlinux                  mutex_spin_on_owner
375095    4.6267  spl.ko                   spl.ko                   taskq_thread
213346    2.6316  vmlinux                  vmlinux                  mutex_lock
190386    2.3484  zfs.ko                   zfs.ko                   lzjb_compress
120682    1.4886  spl.ko                   spl.ko                   spl_kmem_cache_alloc
117026    1.4435  vmlinux                  vmlinux                  poll_idle
116807    1.4408  vmlinux                  vmlinux                  mutex_unlock
110453    1.3624  spl.ko                   spl.ko                   __taskq_dispatch_ent
100957    1.2453  vmlinux                  vmlinux                  find_busiest_group
99320     1.2251  vmlinux                  vmlinux                  kfree
96255     1.1873  obdclass.ko              obdclass.ko              lu_site_purge
81324     1.0031  libcfs.ko                libcfs.ko                cfs_hash_spin_lock
78966     0.9740  vmlinux                  vmlinux                  schedule

Brian has also mentioned that he has seen similar reports of metadata performance degrading once the ARC's metadata limit is reached, on workloads run entirely through ZFS/ZPL alone.

So I think the work that needs to be done next in order to increase our sustained and/or burst create rate performance will be in the ZFS code.

As a follow-up comment, I'll post the data from the createmany test I ran earlier today, along with some graphs of the performance numbers.

Comment by Prakash Surya (Inactive) [ 26/Jul/12 ]

I ran the "createmany" test on 10 client nodes all creating files in separate directories. The exact test run was essentially:

pdsh -w grove[390-399] '/tmp/createmany -o -g -i 1000 /p/lstest/surya1/ori434/run1/`hostname`/%d 1000000'
  • clients.png - A graph of the create rate as reported by each client "createmany" test vs. the total number of files that individual client has created
  • mdstats.png - A graph of the aggregate number of creates per second as reported by the MDS vs. time. This value was scraped from the "md_stats" /proc entry on the MDS (a sampling sketch follows this list).
  • ori434-run1.tar.bz2 - A tarball containing all the results and graphs.
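A rough sketch of how that aggregate rate can be sampled on the MDS (the exact md_stats counter to difference, e.g. open vs. mknod, depends on the workload; creates/sec is the delta between successive samples divided by the interval):

while true; do
    # print the cumulative "open" counter from the MDT's md_stats
    lctl get_param -n mdt.*.md_stats | awk '/^open /{print $2}'
    sleep 10
done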
Comment by Alex Zhuravlev [ 27/Jul/12 ]

A lot of interesting information! Thanks, Prakash. Are you saying this should not be our focus at the moment?

Comment by Prakash Surya (Inactive) [ 30/Jul/12 ]

Well, we definitely will need to do something about the file create performance. But, Lustre might be off the hook for the time being. From the information we have currently, we'll need to make some modifications on the ZFS backend in order to increase the create rate.

The issue arises when the ZFS ARC is "full" (or has hit the metadata limit in this case) and thus needs to start evicting buffers. Since the evict function takes giant locks on the state list mutexes, all other traffic needing those state mutexes obviously will block. Lustre doesn't have much say in this matter, so the "fix" will need to come from within ZFS.

One possible solution would be to proactively pitch metadata buffers from the cache when the system is idle. This would give enough free ARC space to allow quick bursts of creates. For example, an application would be able to create ~1M files at the ~6K/s rate. Then in the background, the ARC would be pruned so it doesn't hit the metadata limit (and drop to the slower rate) when the next ~1M burst happens.

Of course this doesn't fix the sustained rate. The fix for that would be much more invasive and would require finer grained locking over the lists.
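A minimal sketch of how the transition into the eviction-heavy regime described above can be watched for (arcstats field names as shown later in this ticket):

# When arc_meta_used approaches arc_meta_limit, arc_evict() starts taking the
# state mutexes heavily and the create rate drops
watch -n 10 'grep -E "arc_meta_used|arc_meta_limit" /proc/spl/kstat/zfs/arcstats'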

Comment by Brian Behlendorf [ 30/Jul/12 ]

Right, since we have a pretty good idea what's going on here I'd rather see us fix some of the remaining stability issues we've observed through real usage. I'll troll our logs from the last two weeks and email a list which should help you guys prioritize.

Comment by Alex Zhuravlev [ 11/Dec/12 ]

Brian, Ned, do you think we can close the ticket? Another option is to make it an LU-XXXX ticket.

Comment by Christopher Morrone [ 11/Dec/12 ]

There is still work needed in this area. Moving to an LU ticket would be fine.

Comment by Alex Zhuravlev [ 11/Dec/12 ]

Christopher, could you describe the current status, please? Any numbers to share?

Comment by Christopher Morrone [ 11/Dec/12 ]

Roughly, I see about 6500 creates per second from ior when all 768 OSTs are available on Sequoia. Rates are lower when not all 768 are available.

Comment by Andreas Dilger [ 11/Jan/13 ]

This bug is for OST object creation performance; LU-2600 is for MDT object create performance. They will likely be fixed together, but may be different in some way.

Comment by Prakash Surya (Inactive) [ 25/Jul/13 ]

I just posted this to the lustre-devel mailing list, and I'll copy/paste it here since I think it's relevant:

I've been looking into improving create rates on one of our ZFS backed
Lustre 2.4 file systems. Currently, when using mds-survey, I can achieve
around a maximum of 10,000 creates per second when using a specially
crafted workload (i.e. using a "small enough" working set and preheating
the cache). If the cache is not hot or the working set is "too large",
the create rates plummet down to around 100 creates per second (that's a
100x difference!).

The past week or so I've spent some time trying to dig into exactly what
is going on under the hood, so I can then try to optimize things better
and maintain the 10,000 create per second number.

We often see this message in logs:

2013-07-22 11:31:26 Lustre: lcz-OST0000: Slow creates, 640/768 objects created at a rate of 12/s

so I started by trying to figure out where the time was being spent in
the precreate call path. On one of the OSS nodes I used ftrace to
profile ofd_precreate_objects() and saw that it took a whopping 22
seconds to complete for the specific call I was looking at! Digging a
little deeper I saw many calls to dt_object_find() which would
degenerate into fzap_lookup()'s and those into dbuf_read()
calls. Some of these dbuf_read() calls took over 600ms to complete
in the trace I was looking at. If we do a dt_object_find() for each
object we're precreating, and each of those causes a dbuf_read()
from disk which could take over half a second to complete, that's a
recipe for terrible create rates.
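For reference, a latency profile like the one described above might be gathered with the ftrace function_graph tracer (a sketch; the tracefs path may differ, and the function must be traceable on the running kernel):

cd /sys/kernel/debug/tracing
echo ofd_precreate_objects > set_graph_function
echo function_graph > current_tracer
echo 1 > tracing_on
sleep 60                 # let a few precreate batches run
echo 0 > tracing_on
less trace               # per-call durations, including nested calls such as dbuf_read()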

Thus, the two solutions seem to be to either not do the
fzap_lookup() at all, or to make the fzap_lookup() faster. One
way to make it faster is to ensure the call is an ARC hit, rather
than a read from disk.

If I run mds-survey in a loop, with it creating and destroying 10,000
files, each iteration will tend to speed up compared to the previous
one. For example, iteration 1 will get 100 creates per second,
iteration 2 will get 200 cr/s, iteration 3 400 cr/s, etc. This trend
will continue until the iterations hit about 10,000 creates per
second, and will plateau there for a while, but then drop back down
to 100 cr/s again.

The "warm up" period where the rates are increasing during each
iterations I think can be attributed to the cache being cold at the
start of the test. I generally would start after rebooting the OSS
nodes, so they first have to fetch all of the ZAP leaf blocks from disk.
Once they get cached in the ARC, the dbuf_read() calls slowly start
turning into cache hits, speeding up the overall create rates.

Now, once the cache is warm, the max create rate is maintained up until
the ARC fills up and begins pitching things from the cache. When this
happens, the create rate goes to shit. I believe this is because we then
start going back to disk to fetch the leaf blocks we just pitched.

So why does the ARC fill up in the first place? Looking at the OI ZAP
objects using zdb, they are all only a max size of 32M. Even with 32
of them, that's only 1G of data, which can certainly fit in the ARC
(even with a meta_limit of 3G). What I think is happening is that as
the creates progress, we not only cache the ZAP buffers but also
things like dnodes for the newly created files. The ZAP buffers plus
the dnodes (and other data) do not all fit, causing the ARC to evict
our "important" ZAP buffers. If we didn't evict the ZAP buffers, I
think we could maintain the 10,000 cr/s rate.

So how do we ensure our "important" data stays in cache? That's an open
question at the moment. Ideally the ARC should be doing this for us. As
we access the same leaf blocks multiple times, it should move that data
to the ARC's MFU and keep it around longer than the dnodes and other
data, but that doesn't seem to be happening.

Another place for investigation is how the OI ZAPs are getting
packed. I think we can do a much better job of ensuring each ZAP leaf
block is more full before expanding the ZAP object hash. If you look
at the stats for one of the OI objects on our OSS:

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
       150    4     4K     4K  17.1M  32.2M   99.62  ZFS directory
                                        144   bonus  System attributes
        dnode flags: USED_BYTES USERUSED_ACCOUNTED!
        dnode maxblkid: 8238
        path    /O/0/d2
        uid     0
        gid     0
        atime   Wed Dec 31 16:00:00 1969
        mtime   Wed Dec 31 16:00:00 1969
        ctime   Wed Dec 31 16:00:00 1969
        crtime  Thu Apr 25 15:01:44 2013
        gen     80
        mode    40755
        size    2
        parent  147
        links   1
        pflags  0
        rdev    0x0000000000000000
        Fat ZAP stats:
                Pointer table:
                        16384 elements
                        zt_blk: 5322
                        zt_numblks: 32
                        zt_shift: 14
                        zt_blks_copied: 0
                        zt_nextblk: 0
                ZAP entries: 31349
                Leaf blocks: 8175
                Total blocks: 8239
                zap_block_type: 0x8000000000000001
                zap_magic: 0x2f52ab2ab
                zap_salt: 0x53463a46d
                Leafs with 2^n pointers:
                        1:   8158 ****************************************
                        2:     17 *
                Blocks with n*5 entries:
                        0:   5534 ****************************************
                        1:   2575 *******************
                        2:     66 *
                Blocks n/10 full:
                        1:   3799 ************************************
                        2:   4310 ****************************************
                        3:     66 *
                Entries with n chunks:
                        3:  31349 ****************************************
                Buckets with n entries:
                        0: 1015374 ****************************************
                        1:  30704 **
                        2:    321 *
                        3:      1 *

most ZAP leaf blocks have 2 pointers to them and some even have 4!
Ideally, each ZAP leaf would only have a single pointer to it. Also,
none of the leaf blocks are over 40% full, with most being under 30%
full. I think if we did a better job of hashing our creates into the
ZAP leaf blocks (i.e. more entries per leaf), this would increase our
chances of keeping the "important" leaf blocks in cache for a longer
time.

There are still areas I want to investigate (l2arc, MRU vs MFU usage,
ZAP hash uniformity with our inputs), but in the meantime I wanted to
get some of this information out to try and garner some comments.
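For reference, a per-object dump like the one quoted above is typically produced with zdb (a sketch; the dataset name is a placeholder, and 150 is the object number from the dump above):

# -dddd prints detailed object info, including the Fat ZAP statistics
zdb -dddd ostpool/ost0 150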

Comment by Alex Zhuravlev [ 01/Aug/13 ]

This is not directly related to the problem being discussed, but probably some improvement can be achieved with:
http://review.whamcloud.com/#/c/7157/

Comment by Andreas Dilger [ 01/Oct/13 ]

Alex's patch shows good early test results, but isn't yet ready to be landed for the 2.5.0 release. Moving this to 2.5.1 so that we can track the patch for landing there if it is ready.

Comment by Alex Zhuravlev [ 31/Jan/14 ]

News from Brian B.:
> It turns out that Lustre's IO pattern on the MDS is very different than what you would normally see through a Posix filesystem. It has exposed some flaws in the way the ZFS ARC manages its cache. The result is a significant decrease in the expected cache hit rate. You can see this effect pretty clearly if you run 'arcstats.py'.
>
> We've been working to resolve this issue for the last few months, and there are proposed patches in the following pull request which should help. The changes themselves are small, but the ARC code is subtle, so we're in the process of ensuring we didn't hurt other unrelated workloads. Once we're sure the patches are working as expected they'll be merged. This is one of the major things preventing me from making another ZFS tag. This needs to be addressed.
>
> https://github.com/zfsonlinux/zfs/pull/1967

Comment by Scott Nolin [ 19/Feb/14 ]

I'd like to add that at SSEC we're affected by this with moderate severity. We have been following here and on the ZFS side at https://github.com/zfsonlinux/zfs/pull/2110 - we plan on updating ZFS when this is ready for release.

Our system is in production so we'd prefer to wait for the release, but if this is likely to be a matter of months instead of weeks, we may be willing to test the patch.

We are seeing this problem in particular because we are populating a new file system with existing data, lots of writes, so every few days it becomes an issue.

Scott

Comment by Prakash Surya (Inactive) [ 19/Feb/14 ]

Scott, I'm glad you stumbled upon that pull request. My hope is it will get merged very soon (i.e. this week), and will get some more testing from the ZFS community.

Here at LLNL we're still running a 0.6.2 ZFS release in production, although I'm very anxious to get 0.6.3 tagged and put in production because I think it will alleviate a lot of suffering we see on our ZFS servers.

With that said, I'd love to get some more testing done if you're willing.

With a "small" test cluster (1 MDS, 8 OSS, 14 clients) running https://github.com/zfsonlinux/zfs/pull/2110, I've managed to create over 800 million files in ~8 days at an average rate of about 1.2K creates per second. Without that pull request, the creates per second drop by a couple orders of magnitude within an hour or so. The clients are running a simple fdtree workload (each within a unique directory): "fdtree -d 10 -l 5 -s 1 -f 10000 -C"

If it's useful, I could try and explain why these ZFS patches help Lustre's particular create workload, but it would take a non-trivial amount of effort to write that up and I'm not sure anybody is really interested. And honestly, although both will benefit, that set of patches will probably help the MDS more so than the OSS.

There's definitely more work that needs to be done in this area, though, so I don't want people to think that landing https://github.com/zfsonlinux/zfs/pull/2110 is the end of the road. Specifically, once the OSTs' OIs become large enough, I'm constantly seeing ARC misses on the OSS nodes. Those are likely due to fzap operations hitting the disk because the OIs are too large to fit in RAM, but I haven't done enough investigation to know for sure.

Comment by Alex Zhuravlev [ 20/Feb/14 ]

Prakash, those ARC misses (on the OSS nodes) ... can you tell whether they happen in the context of object creation or normal read/write/punch?

Comment by Prakash Surya (Inactive) [ 20/Feb/14 ]

Alex, I probably could. The test is still running, so I could poke around tomorrow when I get in to the office. I'm also running with the latest ZFS code, so there's a really useful proc file I can use to determine exactly which object(s) it is missing on. Why do you want to know? Understanding the angle you're coming from will help me ensure I gather all the info you need the first time around.

With that said, my guess is it's missing on the precreates. I started the test with an empty filesystem, and at first I saw around 5K creates per second and almost no misses on the MDS or the OSS nodes. After a certain amount of time I began to see a constant ~1% miss rate on the MDS, which actually had a significant effect on the create rate; it brought it down by about 50%, to around 2.5K creates per second. After some more time had elapsed, the OSS nodes started to get misses and I began to see the "slow precreate" messages on the OSS nodes' consoles. Also, the MDS reads as a whole have become bursty; I see no ARC reads at all on the MDS for seconds at a time. My guess is this is due to the OSS precreates not being able to keep up with the MDS's requests; so while the OSS nodes chug along trying to precreate objects, the MDS sits idle.

I think what's happening now, at the tail end of the test, is that the OSS nodes' OIs are too large to sit completely in RAM. So when a create happens, it must do an FZAP operation which will hash to a "random" offset into one of the 32 OIs. Since only a portion of the OI can stay resident in RAM, some of these operations will read in a leaf from disk, slowing down the precreate rate and the overall create rate of the test. Once the precreate brings in the leaf, any follow-up operations dealing with the same leaf will be a cache hit, until that leaf rolls off the end of the MRU or MFU lists in the ARC.

At least that's my theory based on how I understand all this machinery to work. Unfortunately I haven't gathered any evidence to back this claim up, yet.

Comment by Scott Nolin [ 20/Feb/14 ]

Prakash,

We can't really do any aggressive testing as this system is in use. However, right now it only has one primary user, who is populating the file system. Especially once the creates grind down to a trickle, we can have the work halt, do some minor things, and restart - naturally they're OK with that as the throughput makes it worthwhile. If there's anything simple we could do or report we will, but we probably won't patch right now as we're working around the issue.

Regarding why the patches help Lustre's workload – while it's interesting, don't take the effort to write that up on my account.

Just for reference or if it helps in any way, here's what we see and are doing -

We have a system with 1 MDS, 12 OSS. Data is being written to this from existing NFS-mounted filesystems (NAS, basically). There are, I think, 6 NAS machines on 1 Gb or channeled dual 1 Gb Ethernet. These flow through 3 LNet router (lrouter) machines with 10 Gb Ethernet to InfiniBand and the servers. We can also test on InfiniBand-connected cluster nodes if needed, but when we see the create problem it of course doesn't matter.

What we saw:

Once our user had everything set up and was writing at a reasonable rate, after a week or two the file creation rate went down by a few orders of magnitude, to about 2 per second - really bad, as noted in this ticket.

On the MDS we noticed txg_sync and arc_adapt were particularly busy, but the system CPU was mostly idle, as also noted above.

To try to at least put the problem off, we increased our ARC cache settings and restarted, and performance was back to normal.

For reference, with mdtest in initial testing we saw around 2000 creates/second. We can't run mdtest right now, so we've just been doing simple tests from one node (fdtree is nice, by the way), and comparing the same on another ZFS-backed file system (though not nearly so busy) and a 2.3 ldiskfs-based system. This normally gets perhaps 200 to 300 creates per second - all of these file systems are in use, so we can't be too accurate.

After a while performance degraded again, so we stop and restart Lustre/ZFS on the MDS.

I haven't been looking at the OSSs, but I just checked, and while these do look to have full ARC caches, I only see one report in the logs across all servers of slow creates:

Lustre: arcdata-OST001c: Slow creates, 384/512 objects created at a rate of 7/s

So my guess is that if these patches help MDS issues more than OSS issues, that at least addresses the more immediate problem for us.

If we had a way to completely reset the ARC without restarting the Lustre and ZFS services, that would be useful.

Scott

Comment by Prakash Surya (Inactive) [ 20/Feb/14 ]

Since the performance you see appears to degrade, and restarting ZFS appears to help things, my guess is you're hitting the same or a very similar issue to the one we see on our production ZFS filesystems. To double-check this, can you run this command on the MDS and post the output (when the system is in the "slow" state):

grep -E "^c|_size|arc_meta_used|arc_meta_limit" /proc/spl/kstat/zfs/arcstats

My guess is the sizes of the MRU and MFU lists will be very small, while the sizes of the MRU ghost and MFU ghost lists will be large; but I want to check and make sure.

If we had a way to completely reset the ARC without restarting the Lustre and ZFS services, that would be useful.

The very, very ugly hack we have in place at the moment is to drop caches on the MDS once a certain threshold is reached. You can try to run something like this, to see if things go back to the rate you were seeing after a fresh reboot:

for i in `seq 1 10000`; do
    echo 3 > /proc/sys/vm/drop_caches
done
Comment by Prakash Surya (Inactive) [ 20/Feb/14 ]

Alex, does this get you what you want to see?

# zeno1 /root > cat /proc/spl/kstat/zfs/zeno1/reads
8 0 0x01 30 3360 242325718628 766164277255963
UID      start            objset   object   level    blkid    aflags   origin                   pid      process         
15928    766164156579412  0x2f     178      0        79114    0x64     dbuf_read_impl           3803     ll_ost03_000    
15929    766164159823356  0x2f     179      0        84980    0x64     dbuf_read_impl           3803     ll_ost03_000    
15930    766164162520769  0x2f     150      0        58855    0x64     dbuf_read_impl           3803     ll_ost03_000    
15931    766164163555034  0x2f     169      0        48748    0x64     dbuf_read_impl           6828     ldlm_cn03_007   
15932    766164174364152  0x2f     151      0        83592    0x64     dbuf_read_impl           3803     ll_ost03_000    
15933    766164177093706  0x2f     148      0        416      0x64     dbuf_read_impl           6828     ldlm_cn03_007   
15934    766164184477506  0x2f     152      0        6507     0x64     dbuf_read_impl           3803     ll_ost03_000    
15935    766164187822369  0x2f     153      0        24148    0x64     dbuf_read_impl           3803     ll_ost03_000    
15936    766164188694642  0x2f     167      0        13795    0x64     dbuf_read_impl           6828     ldlm_cn03_007   
15937    766164197669859  0x2f     154      0        36067    0x64     dbuf_read_impl           3803     ll_ost03_000    
15938    766164199348418  0x2f     162      0        84625    0x64     dbuf_read_impl           6828     ldlm_cn03_007   
15939    766164210502738  0x2f     156      0        63494    0x64     dbuf_read_impl           3803     ll_ost03_000    
15940    766164210929591  0x2f     163      0        57995    0x64     dbuf_read_impl           6828     ldlm_cn03_007   
15941    766164222820463  0x2f     154      0        10070    0x64     dbuf_read_impl           6828     ldlm_cn03_007   
15942    766164222864079  0x2f     157      0        53684    0x64     dbuf_read_impl           3803     ll_ost03_000    
15943    766164225929360  0x2f     160      0        34939    0x64     dbuf_read_impl           3803     ll_ost03_000    
15944    766164228828963  0x2f     161      0        5729     0x64     dbuf_read_impl           3803     ll_ost03_000    
15945    766164231472223  0x2f     162      0        11351    0x64     dbuf_read_impl           3803     ll_ost03_000    
15946    766164234109879  0x2f     170      0        59220    0x64     dbuf_read_impl           6828     ldlm_cn03_007   
15947    766164237596524  0x2f     163      0        94437    0x64     dbuf_read_impl           3803     ll_ost03_000    
15948    766164240488786  0x2f     164      0        57877    0x64     dbuf_read_impl           3803     ll_ost03_000    
15949    766164240829980  0x2f     165      0        87378    0x64     dbuf_read_impl           3803     ll_ost03_000    
15950    766164243497409  0x2f     166      0        42756    0x64     dbuf_read_impl           3803     ll_ost03_000    
15951    766164246480267  0x2f     169      0        48773    0x64     dbuf_read_impl           6828     ldlm_cn03_007   
15952    766164249795764  0x2f     167      0        36727    0x64     dbuf_read_impl           3803     ll_ost03_000    
15953    766164253590137  0x2f     163      0        66820    0x64     dbuf_read_impl           6828     ldlm_cn03_007   
15954    766164261417886  0x2f     168      0        53317    0x64     dbuf_read_impl           3803     ll_ost03_000    
15955    766164265303986  0x2f     160      0        8178     0x64     dbuf_read_impl           6828     ldlm_cn03_007   
15956    766164272799763  0x2f     171      0        1701     0x64     dbuf_read_impl           6828     ldlm_cn03_007   
15957    766164273028298  0x2f     169      0        36867    0x64     dbuf_read_impl           3803     ll_ost03_000
Comment by Scott Nolin [ 20/Feb/14 ]

Prakash,

Our user asked us to reset things, so we captured the arcstats info and beat down the cache size with that hack.

However, the system wasn't really in an extremely degraded state. The ARC was full but I think performance wasn't that bad - lower than his peak, but maybe 50%. Since he's the only user I went ahead and did it anyhow. So I don't know how useful these numbers are for you:

c                               4    59624418824
c_min                           4    33000000000
c_max                           4    96000000000
hdr_size                        4    10851742848
data_size                       4    25850144256
other_size                      4    22922398496
anon_size                       4    409600
mru_size                        4    7485866496
mru_ghost_size                  4    33932813824
mfu_size                        4    18363868160
mfu_ghost_size                  4    1910375424
l2_size                         4    0
l2_hdr_size                     4    0
duplicate_buffers_size          4    0
arc_meta_used                   4    56346426784
arc_meta_limit                  4    66000000000

Reading your previous posts, I think we did see the cache warm-up effect after our restart last time. Performance wasn't great and then got much better as data flowed in over time, then got worse.

Scott

Comment by Prakash Surya (Inactive) [ 20/Feb/14 ]

Yeah, those numbers look "normal" to me. I'd expect the ARC to be performing reasonably well at that point in time.

I see you're manually tuning a few of the ARC parameters (i.e. arc_meta_limit, c_max, c_min, maybe others?). Care to fill me in on why that is? I have a feeling your tuning of arc_meta_limit is helping you, but I'm curious what drove you to do that in the first place and what (if any) perceived benefits you're seeing. Have you been hitting any OOMs with a higher-than-normal arc_meta_limit? How much RAM is on the system?

Comment by Scott Nolin [ 21/Feb/14 ]

We were just guessing that if we increased the ARC size it would delay the performance degradation. We haven't had any OOMs; we have 128GB of RAM.

I don't know if it's really helped, but it hasn't hurt. My guess is it's either pointless or a minor help.

Scott

Comment by Scott Nolin [ 22/Feb/14 ]

Here's some additional information about our current workload, in case it's pertinent.

I think with every file written, the mtime is changed and permissions are changed by the user's process. In aggregate, according to Robinhood, there are on average 1 mtime change and about 1.5 permission-change operations per file.

Earlier we ran into the issue with synchronous permission changes, because so many are performed - https://jira.hpdd.intel.com/browse/LU-3671 - making them asynchronous helped significantly.

Scott

Comment by Prakash Surya (Inactive) [ 24/Feb/14 ]

FYI, the ARC changes landed last Friday to the ZoL master tree. So if you get a chance to update and run with those patches, I'd be interested to know if you still have to "reset" things due to performance degradation.

Comment by Scott Nolin [ 07/Mar/14 ]

Just an update on changing the ARC parameters. It did take a while to happen, but now we have seen some negative effects. No OOM, but Lustre threads taking a long time to complete, indicating an overloaded system. So I assume there is not enough memory for the number of service threads.

Scott

Comment by Scott Nolin [ 13/Jun/14 ]

For the record, here is a graph of performance degradation. - mdtest-arcdata.png

After upgrading RAM (and moving to SSDs), we now see it after about 30 million creates.

This is on Lustre 2.4.0-1 with ZFS 0.6.2, no additional patches.

We have 258 GB of RAM - the ZFS ARC options were doubled from their defaults:

options zfs zfs_arc_meta_limit=10000000000
options zfs zfs_arc_max=150000000000

I know the new version of ZFS increases this further by default. Ideally we'd be able to run this test with the new version too.

Interestingly, now I see all other mdtest file operations go bad at about the same time. Previously, with fewer resources and the default ZFS ARC options, the file creates tanked very quickly, but the other operations stayed the same. In that case, though, we never managed to test past maybe a couple million files, as it was so slow.

Comment by Scott Nolin [ 17/Jun/14 ]

We have upgraded to zfs-0.6.3-1, and while it looks promising that this issue is solved (or greatly improved), I haven't been able to verify this. I managed to run about 20 million file creates, and need more than 30M to go past the point where things degraded last time.

I'd like to verify it's improved, but can't do it with that system now.

While testing we did unfortunately see decreased numbers for stat; see LU-5212.

Scott

Comment by James A Simmons [ 14/Aug/16 ]

Old blocker for unsupported version
