[LU-4507] Server hangs and terrible performance - ZFS IOR Created: 17/Jan/14 Updated: 09/Oct/21 Resolved: 09/Oct/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.6.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Cliff White (Inactive) | Assignee: | Isaac Huang (Inactive) |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Environment: | Hyperion/LLNL |
| Attachments: | |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 12333 |
| Description |
|
For some time now we have been observing terrible read performance when running IOR file-per-process on ZFS. With ldiskfs the system sees ~7 GB/s on reads; at higher client counts the ZFS read performance on this test drops to ~400 MB/s, which is roughly single-client level. |
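For reference, a file-per-process IOR run of the kind described above would look roughly like the sketch below. The task count, transfer size, block size, and mount point are illustrative assumptions, not the actual Hyperion parameters.

    # Hypothetical IOR file-per-process run (placeholders, not the exact
    # Hyperion configuration):
    #   -F    file per process         -t  transfer size per I/O
    #   -w/-r write phase, read phase  -b  data written/read per process
    #   -e    fsync after each write phase
    mpirun -np 256 ior -F -w -r -e -t 1m -b 4g -o /mnt/lustre/ior-test/file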
| Comments |
| Comment by Cliff White (Inactive) [ 21/Jan/14 ] |
|
ZFS is also demonstrating terrible mdtest performance. I compared ZFS and ldiskfs, increasing the number of files. Increasing the file count on ZFS from 500k to 750k increased the test run time 4-6x. No such impact was observed on the ldiskfs side. Spreadsheet attached. |
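For comparison, mdtest runs scaled the way described above would look something like the following; the task count, per-task file counts, iteration count, and target directory are assumptions, not the exact invocation behind the attached spreadsheet.

    # Hypothetical mdtest runs comparing total file counts (placeholders):
    #   -n  files/directories per task   -i  iterations   -d  target directory
    mpirun -np 64 mdtest -n 7813  -i 3 -d /mnt/lustre/mdtest   # ~500k files total
    mpirun -np 64 mdtest -n 11719 -i 3 -d /mnt/lustre/mdtest   # ~750k files total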
| Comment by Andreas Dilger [ 23/Jan/14 ] |
|
Brian, Isaac is starting to dig into ZFS debugging and development. The stack traces that Cliff attached here look pretty clearly like a lockup (or at least heavy contention) in ZFS:

ll_ost_io02_0 D 0000000000000008 0 10402 2 0x00000000
Call Trace:
[<ffffffff81510ad5>] rwsem_down_failed_common+0x95/0x1d0
[<ffffffff81510c33>] rwsem_down_write_failed+0x23/0x30
[<ffffffff81283cb3>] call_rwsem_down_write_failed+0x13/0x20
[<ffffffff81510132>] ? down_write+0x32/0x40
[<ffffffffa04e970a>] dmu_zfetch+0x6fa/0xd70 [zfs]
[<ffffffffa04d3281>] dbuf_read+0x6a1/0x750 [zfs]
[<ffffffffa04db122>] dmu_buf_hold_array_by_dnode+0x162/0x560 [zfs]
[<ffffffffa04dc167>] dmu_read+0x97/0x180 [zfs]
[<ffffffffa0d0891f>] osd_read+0x18f/0x240 [osd_zfs]
[<ffffffffa0d08b53>] osd_read_prep+0x183/0x240 [osd_zfs]
[<ffffffffa0dc3ba3>] ofd_preprw_read+0x253/0x7f0 [ofd]
[<ffffffffa0dc48ba>] ofd_preprw+0x77a/0x1480 [ofd]
[<ffffffffa0d58e11>] obd_preprw+0x121/0x390 [ost]
[<ffffffffa0d60569>] ost_brw_read+0xd29/0x1350 [ost]
[<ffffffffa0d676d8>] ost_handle+0x24a8/0x44d0 [ost]
[<ffffffffa08a00e5>] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
[<ffffffffa08a144d>] ptlrpc_main+0xaed/0x1740 [ptlrpc]

Any pointers for how to proceed on this? It looks like the threads are blocked for a long time reading a dnode from disk, or there is some sort of lock deadlock. |
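For anyone reproducing this, traces like the one above can be collected on the OSS while the stall is in progress. A minimal sketch using standard kernel facilities; the PID shown is taken from the trace above and any grep pattern is illustrative:

    # Dump stacks of all uninterruptible (D-state) tasks to the console/dmesg:
    echo w > /proc/sysrq-trigger
    dmesg | grep -A 20 'll_ost_io'

    # Or sample a single blocked OST I/O thread directly
    # (10402 appears to be the PID in the trace above):
    cat /proc/10402/stack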
| Comment by Brian Behlendorf [ 23/Jan/14 ] |
|
My suggestion would be to start with the latest source from GitHub if you're doing any sort of performance work. We've made some major performance improvements in the last 6 months that you'll definitely benefit from. We try very hard to keep what's on the master branch stable, so I would track it for performance testing.

https://github.com/zfsonlinux/zfs/

One of the major improvements made post-0.6.2 is that the ZFS write throttle code has been completely reworked. The previous design was causing considerable I/O starvation/contention just like you've described above. The updated code smooths things out considerably; we're seeing more consistent I/O times and improved throughput. Here are some additional links describing this work.

http://open-zfs.org/wiki/Features#Smoother_Write_Throttle

Another thing we've been working on is improving the ARC hit rate. We've observed that, particularly with metadata-heavy workloads (which is all the MDS does), the ARC performance degrades over time and we end up needing to read from disk more. You can see this behavior pretty easily by running the arcstat.py script, which among other things can show you the current cache hit rate. Prakash has been investigating this and has proposed some promising patches which help a lot. But we're still reviewing and testing them to ensure they work as expected and don't introduce regressions for other workloads. We'd love for you to give them a spin and see how much they help your testing.

https://github.com/zfsonlinux/zfs/pull/1967

It's also worth running the master branch because it adds some useful tools and more entries in /proc to improve visibility. My favorite new tool is dbufstat.py. It allows you to dump all the cached dbufs and show which pool, dataset, and object they belong to. You can also see extended information about each buffer, which often lets you infer why it's being kept in the cache. For example, for Lustre it clearly shows all the spill blocks we're forced to use because of the 512-byte dnode size. That makes it quite clear that increasing the dnode size to 1k could halve the number of I/Os we need to do for lookups. It's nice to be able to easily see that. There are also some new entries in /proc/spl/kstat/zfs/. They let you get a handle on how long it's taking to assign a TXG, or exactly what I/O we are issuing to disk when we get a cache miss.
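To make the suggestions above concrete, here is a rough sketch of pulling master and poking at the observability tools mentioned; the pool name ostpool, the build steps, and the exact kstat file names are assumptions and may differ by ZFS version:

    # Build ZFS from the master branch (on 0.6.x the SPL tree must be built
    # and installed first; paths and configure options are illustrative):
    git clone https://github.com/zfsonlinux/zfs.git
    cd zfs && ./autogen.sh && ./configure && make && sudo make install

    # Watch the ARC hit rate over time (1-second interval):
    arcstat.py 1

    # Dump cached dbufs, including which pool/dataset/object each belongs to:
    dbufstat.py

    # New kstat entries ('ostpool' is a placeholder pool name):
    cat /proc/spl/kstat/zfs/arcstats
    cat /proc/spl/kstat/zfs/ostpool/txgs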
Basically, we've been thinking about performance with ZFS too. And now that things are running well, we've been getting the tools in place so we can clearly understand exactly what needs to be improved. I'd hoped to get a 0.6.3 tag out with all these improvements in January, but that's slipped. One of the two major blockers is convincing ourselves that Prakash's ARC changes work as designed and help the expected workloads. Once again, if you guys could help test them, that would be very helpful! |