My suggestion would be to start with the latest source from Github if you're doing any sort of performance work. We've made some major performance improvements in the last 6 months that you'll definitely benefit from. We try very hard to keep what's on the master branch stable, so I would track it for performance testing.
https://github.com/zfsonlinux/zfs/
One of the major improvements made post-0.6.2 is that the ZFS write throttle code has been completely reworked. The previous design was causing considerable I/O starvation/contention just like you've described above. The updated code smooths things out considerably; we're seeing more consistent I/O times and improved throughput. Here are some additional links describing this work.
http://open-zfs.org/wiki/Features#Smoother_Write_Throttle
http://dtrace.org/blogs/ahl/2013/12/27/zfs-fundamentals-the-write-throttle/
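If you want to poke at the reworked throttle yourself, its behavior is driven by module parameters you can read (and tune) under /sys/module/zfs/parameters. Here's a minimal sketch; the parameter names (zfs_dirty_data_max, zfs_delay_scale, etc.) are what I'd expect on current master, so double check what your build actually exposes:

#!/usr/bin/env python
# Minimal sketch: print the write throttle module parameters.
# The parameter names below are my assumption of what master exposes;
# adjust them to whatever your /sys/module/zfs/parameters contains.
import os

PARAM_DIR = "/sys/module/zfs/parameters"
THROTTLE_PARAMS = [
    "zfs_dirty_data_max",
    "zfs_dirty_data_sync",
    "zfs_delay_min_dirty_percent",
    "zfs_delay_scale",
]

for name in THROTTLE_PARAMS:
    path = os.path.join(PARAM_DIR, name)
    try:
        with open(path) as f:
            print("%-30s %s" % (name, f.read().strip()))
    except IOError:
        print("%-30s (not present on this build)" % name)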
Another thing we've been working on is improving the ARC hit rate. We've observed that, particularly with metadata-heavy workloads (which is all the MDS does), the ARC performance degrades over time and we end up needing to read from disk more. You can see this behavior pretty easily by running the arcstat.py script, which among other things can show you the current cache hit rate. Prakash has been investigating this and has proposed some promising patches which help a lot. But we're still reviewing and testing them to ensure they work as expected and don't introduce regressions for other workloads. We'd love for you to give them a spin and see how much they help your testing.
https://github.com/zfsonlinux/zfs/pull/1967
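For reference, arcstat.py is sampling the arcstats kstat; if you'd rather log the hit rate from your own harness, a rough sketch like this does the same basic calculation (the 'hits'/'misses' field names are an assumption, so check your own arcstats file):

#!/usr/bin/env python
# Rough sketch: sample the overall ARC hit rate once a second from the
# arcstats kstat.  arcstat.py gives you much more detail; this only
# shows where the numbers come from.
import time

def read_arcstats(path="/proc/spl/kstat/zfs/arcstats"):
    stats = {}
    with open(path) as f:
        for line in f.readlines()[2:]:   # skip the two kstat header lines
            fields = line.split()
            if len(fields) == 3:
                stats[fields[0]] = int(fields[2])
    return stats

prev = read_arcstats()
while True:
    time.sleep(1)
    cur = read_arcstats()
    hits = cur["hits"] - prev["hits"]
    misses = cur["misses"] - prev["misses"]
    total = hits + misses
    rate = (100.0 * hits / total) if total else 0.0
    print("hits=%d misses=%d hit%%=%.1f" % (hits, misses, rate))
    prev = cur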
It's also worth running the master branch because it adds some useful tools and more entries to proc to improve visibility. My favorite new tool is dbufstat.py. It allows you to dump all the cached dbufs and show which pool, dataset, and object they belong to. You can also see extended information about each buffer, which often lets you infer why it's being kept in the cache. For example, for Lustre it clearly shows all the spill blocks we're forced to use because of the 512 byte dnode size. That makes it quite clear that increasing the dnode size to 1k could halve the number of I/Os we need to do for lookups. It's nice to be able to see that so easily.
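If you want to fold that information into your own analysis scripts rather than parsing dbufstat.py's output, something like this rough sketch works against the underlying dbufs kstat. I'm assuming a whitespace-delimited table with a header row naming 'pool' and 'objset' columns, so sanity check it against your build:

#!/usr/bin/env python
# Sketch: tally cached dbufs per pool/objset straight from the dbufs
# kstat that dbufstat.py reads.  The table layout (header row with
# 'pool' and 'objset' columns) is an assumption; adjust if your build
# lays the file out differently.
from collections import defaultdict

def dbuf_summary(path="/proc/spl/kstat/zfs/dbufs"):
    counts = defaultdict(int)
    header = None
    with open(path) as f:
        for line in f:
            fields = line.split()
            if header is None:
                # Wait for the header row before counting data rows.
                if "pool" in fields and "objset" in fields:
                    header = fields
                    pool_idx = fields.index("pool")
                    objset_idx = fields.index("objset")
                continue
            if len(fields) >= len(header):
                counts[(fields[pool_idx], fields[objset_idx])] += 1
    return counts

for (pool, objset), count in sorted(dbuf_summary().items()):
    print("%-20s objset %-8s %8d cached dbufs" % (pool, objset, count))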
There are also some new entries in /proc/spl/kstat/zfs/. They let you get a handle on how long it's taking to assign a TXG, or exactly what I/O we're issuing to disk when we get a cache miss; there's a quick sketch for dumping them after the list.
- dbufs - Stats for all dbufs in the dbuf_hash
- <pool>/txgs - Stats for the last N txgs synced to disk
- <pool>/reads - Stats for the last N reads issued by the ARC
- <pool>/dmu_tx_assign - Histogram of tx assign times
- <pool>/io - Total I/O issued for the pool
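Here's the quick sketch I mentioned for dumping them. It only reads and prints the files named above ('tank' is a placeholder for your pool name); the layouts differ (txgs/reads are tables, dmu_tx_assign is a histogram), so interpretation is left to you:

#!/usr/bin/env python
# Quick sketch: dump the new kstat entries for one pool.  This just
# reads and prints the files listed above; it does not try to parse
# them, since the formats vary between entries.
import os
import sys

pool = sys.argv[1] if len(sys.argv) > 1 else "tank"   # example pool name
base = "/proc/spl/kstat/zfs"
entries = ["dbufs"] + [os.path.join(pool, n) for n in
                       ("txgs", "reads", "dmu_tx_assign", "io")]

for entry in entries:
    path = os.path.join(base, entry)
    print("==> %s <==" % path)
    try:
        with open(path) as f:
            print(f.read())
    except IOError as err:
        print("  (unavailable: %s)" % err)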
Basically, we've been thinking about performance with ZFS too. And now that things are running well, we've been getting the tools in place so we can clearly understand exactly what needs to be improved. I'd hoped to get an 0.6.3 tag out with all these improvements in January, but that's slipped. One of the two major blockers is convincing ourselves that Prakash's ARC changes work as designed and help the expected workloads. Once again, if you guys could help test them that would be very helpful!