[LU-5278] ZFS - many OST watchdogs with IOR   Created: 01/Jul/14   Updated: 23/Dec/15   Resolved: 09/Jun/15

| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.6.0 |
| Fix Version/s: | Lustre 2.8.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Cliff White (Inactive) | Assignee: | Alex Zhuravlev |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | MB, llnl, prz, zfs |
| Environment: | Hyperion/LLNL |
| Attachments: | |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 14730 |
| Description |
|
Running IOR with 100 clients. Performance is terrible. OSTs are wedging and dropping watchdogs.

2014-07-01 08:22:47 LNet: Service thread pid 8308 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
2014-07-01 08:22:47 Pid: 8308, comm: ll_ost_io00_014
2014-07-01 08:22:47 Call Trace:
2014-07-01 08:22:47 [<ffffffffa05b34ba>] ? dmu_zfetch+0x51a/0xd70 [zfs]
2014-07-01 08:22:47 [<ffffffff810a6d01>] ? ktime_get_ts+0xb1/0xf0
2014-07-01 08:22:47 [<ffffffff815287f3>] io_schedule+0x73/0xc0
2014-07-01 08:22:47 [<ffffffffa04f841c>] cv_wait_common+0x8c/0x100 [spl]
2014-07-01 08:22:47 [<ffffffff8109af00>] ? autoremove_wake_function+0x0/0x40
2014-07-01 08:22:47 [<ffffffffa04f84a8>] __cv_wait_io+0x18/0x20 [spl]
2014-07-01 08:22:47 [<ffffffffa062f0ab>] zio_wait+0xfb/0x1b0 [zfs]
2014-07-01 08:22:47 [<ffffffffa05a503d>] dmu_buf_hold_array_by_dnode+0x19d/0x4c0 [zfs]
2014-07-01 08:22:47 [<ffffffffa05a5e68>] dmu_buf_hold_array_by_bonus+0x68/0x90 [zfs]
2014-07-01 08:22:47 [<ffffffffa0e3f1a3>] osd_bufs_get+0x493/0xb00 [osd_zfs]
2014-07-01 08:22:47 [<ffffffffa03be488>] ? libcfs_log_return+0x28/0x40 [libcfs]
2014-07-01 08:22:47 [<ffffffffa0f2e00b>] ofd_preprw_read+0x15b/0x890 [ofd]
2014-07-01 08:22:47 [<ffffffffa0f30709>] ofd_preprw+0x749/0x1650 [ofd]
2014-07-01 08:22:47 [<ffffffffa09d71b1>] obd_preprw.clone.3+0x121/0x390 [ptlrpc]
2014-07-01 08:22:47 [<ffffffffa09deb03>] tgt_brw_read+0x2d3/0x1150 [ptlrpc]
2014-07-01 08:22:47 [<ffffffffa03be488>] ? libcfs_log_return+0x28/0x40 [libcfs]
2014-07-01 08:22:47 [<ffffffffa097ab36>] ? lustre_pack_reply_v2+0x216/0x280 [ptlrpc]
2014-07-01 08:22:47 [<ffffffffa097ac4e>] ? lustre_pack_reply_flags+0xae/0x1f0 [ptlrpc]
2014-07-01 08:22:47 [<ffffffffa09dca7c>] tgt_request_handle+0x23c/0xac0 [ptlrpc]
2014-07-01 08:22:47 [<ffffffffa098c29a>] ptlrpc_main+0xd1a/0x1980 [ptlrpc]
2014-07-01 08:22:47 [<ffffffffa098b580>] ? ptlrpc_main+0x0/0x1980 [ptlrpc]
2014-07-01 08:22:47 [<ffffffff8109ab56>] kthread+0x96/0xa0
2014-07-01 08:22:47 [<ffffffff8100c20a>] child_rip+0xa/0x20
2014-07-01 08:22:47 [<ffffffff8109aac0>] ? kthread+0x0/0xa0
2014-07-01 08:22:47 [<ffffffff8100c200>] ? child_rip+0x0/0x20

Lustre dump attached.

Second example:

2014-07-01 09:38:41 Pid: 9299, comm: ll_ost_io00_070
2014-07-01 09:38:41 Call Trace:
2014-07-01 09:38:41 [<ffffffffa05b02f7>] ? dmu_zfetch+0x357/0xd70 [zfs]
2014-07-01 09:38:41 [<ffffffffa05957f2>] ? arc_read+0x572/0x8d0 [zfs]
2014-07-01 09:38:41 [<ffffffff810a6d01>] ? ktime_get_ts+0xb1/0xf0
2014-07-01 09:38:41 [<ffffffff815287f3>] io_schedule+0x73/0xc0
2014-07-01 09:38:41 [<ffffffffa04f841c>] cv_wait_common+0x8c/0x100 [spl]
2014-07-01 09:38:41 [<ffffffff8109af00>] ? autoremove_wake_function+0x0/0x40
2014-07-01 09:38:41 [<ffffffffa04f84a8>] __cv_wait_io+0x18/0x20 [spl]
2014-07-01 09:38:41 [<ffffffffa062c0ab>] zio_wait+0xfb/0x1b0 [zfs]
2014-07-01 09:38:41 [<ffffffffa05a203d>] dmu_buf_hold_array_by_dnode+0x19d/0x4c0 [zfs]
2014-07-01 09:38:41 [<ffffffffa05a2e68>] dmu_buf_hold_array_by_bonus+0x68/0x90 [zfs]
2014-07-01 09:38:41 [<ffffffffa0e441a3>] osd_bufs_get+0x493/0xb00 [osd_zfs]
2014-07-01 09:38:41 [<ffffffffa03be488>] ? libcfs_log_return+0x28/0x40 [libcfs]
2014-07-01 09:38:41 [<ffffffffa0f3700b>] ofd_preprw_read+0x15b/0x890 [ofd]
2014-07-01 09:38:41 [<ffffffffa0f39709>] ofd_preprw+0x749/0x1650 [ofd]
2014-07-01 09:38:41 [<ffffffffa09d41b1>] obd_preprw.clone.3+0x121/0x390 [ptlrpc]
2014-07-01 09:38:41 [<ffffffffa09dbb03>] tgt_brw_read+0x2d3/0x1150 [ptlrpc]
2014-07-01 09:38:41 [<ffffffffa03be488>] ? libcfs_log_return+0x28/0x40 [libcfs]
2014-07-01 09:38:41 [<ffffffffa0977b36>] ? lustre_pack_reply_v2+0x216/0x280 [ptlrpc]
2014-07-01 09:38:41 [<ffffffffa0977c4e>] ? lustre_pack_reply_flags+0xae/0x1f0 [ptlrpc]
2014-07-01 09:38:41 [<ffffffffa09d9a7c>] tgt_request_handle+0x23c/0xac0 [ptlrpc]
2014-07-01 09:38:41 [<ffffffffa098929a>] ptlrpc_main+0xd1a/0x1980 [ptlrpc]
2014-07-01 09:38:41 [<ffffffffa0988580>] ? ptlrpc_main+0x0/0x1980 [ptlrpc]
2014-07-01 09:38:41 [<ffffffff8109ab56>] kthread+0x96/0xa0
2014-07-01 09:38:41 [<ffffffff8100c20a>] child_rip+0xa/0x20
2014-07-01 09:38:41 [<ffffffff8109aac0>] ? kthread+0x0/0xa0
2014-07-01 09:38:41 [<ffffffff8100c200>] ? child_rip+0x0/0x20
| Comments |
| Comment by Cliff White (Inactive) [ 01/Jul/14 ] |
|
Lustre log dumped by the first watchdog.
| Comment by Andreas Dilger [ 02/Jul/14 ] |
|
Cliff, is this running ZFS 0.6.2 or 0.6.3? Also, did the IOR eventually complete, or were the same threads stuck forever?
| Comment by Cliff White (Inactive) [ 02/Jul/14 ] |
|
The ZFS version was 0.6.3. The IOR did eventually complete, but the results were terrible.
| Comment by Andreas Dilger [ 04/Jul/14 ] |
|
Isaac, could you please take a look at this to see if there is something obvious? We are trying to decide whether this needs to be a blocker for 2.6.0.
| Comment by Isaac Huang (Inactive) [ 24/Jul/14 ] |
|
Sorry, I missed this bug while travelling. Cliff: 1. Please disable ZFS prefetching on both the OSS and MDS:
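For reference, ZFS-on-Linux exposes prefetching through the zfs_prefetch_disable module parameter; a minimal sketch of disabling it (assuming a stock ZoL install - these are not necessarily the exact commands Isaac had in mind):

$ echo 1 > /sys/module/zfs/parameters/zfs_prefetch_disable                 # immediate, lasts until the module is reloaded
$ echo "options zfs zfs_prefetch_disable=1" >> /etc/modprobe.d/zfs.conf    # persistent across reboots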
| Comment by Andreas Dilger [ 12/Nov/14 ] |
|
Cliff, Isaac, could you please give an update on the status of this ticket? Cliff, did you implement Isaac's suggestion? Are watchdogs still being hit?
| Comment by Cliff White (Inactive) [ 12/Nov/14 ] |
|
I can run some ZFS-specific tests. Due to LLNL's cfengine, I am not sure I can set that option, but I should be able to do this in the next few days.
| Comment by Cliff White (Inactive) [ 14/Nov/14 ] |
|
We are still dumping watchdogs; I have not yet been able to set the option. Here's an example:

2014-11-14 10:11:14 LNet: Service thread pid 8361 was inactive for 222.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
2014-11-14 10:11:14 Pid: 8361, comm: ll_ost_io03_081
2014-11-14 10:11:14 Call Trace:
2014-11-14 10:11:14 [<ffffffffa059a4ba>] ? dmu_zfetch+0x51a/0xd70 [zfs]
2014-11-14 10:11:14 [<ffffffff81529e83>] io_schedule+0x73/0xc0
2014-11-14 10:11:14 [<ffffffffa04b141c>] cv_wait_common+0x8c/0x100 [spl]
2014-11-14 10:11:14 [<ffffffff8109afa0>] ? autoremove_wake_function+0x0/0x40
2014-11-14 10:11:14 [<ffffffffa04b14a8>] __cv_wait_io+0x18/0x20 [spl]
2014-11-14 10:11:14 [<ffffffffa061611b>] zio_wait+0xfb/0x1b0 [zfs]
2014-11-14 10:11:14 [<ffffffffa058c03d>] dmu_buf_hold_array_by_dnode+0x19d/0x4c0 [zfs]
2014-11-14 10:11:14 [<ffffffffa058ce68>] dmu_buf_hold_array_by_bonus+0x68/0x90 [zfs]
2014-11-14 10:11:14 [<ffffffffa110f4b3>] osd_bufs_get+0x493/0xb00 [osd_zfs]
2014-11-14 10:11:14 [<ffffffffa03c71a1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
2014-11-14 10:11:14 [<ffffffffa03c17d8>] ? libcfs_log_return+0x28/0x40 [libcfs]
2014-11-14 10:11:14 [<ffffffffa1050f7b>] ofd_preprw_read+0x15b/0x930 [ofd]
2014-11-14 10:11:14 [<ffffffffa1051f99>] ofd_preprw+0x849/0x1680 [ofd]
2014-11-14 10:11:14 [<ffffffffa09a24b1>] obd_preprw.clone.3+0x121/0x390 [ptlrpc]
2014-11-14 10:11:14 [<ffffffffa09aa24e>] tgt_brw_read+0x67e/0x1160 [ptlrpc]
2014-11-14 10:11:14 [<ffffffffa03c17d8>] ? libcfs_log_return+0x28/0x40 [libcfs]
2014-11-14 10:11:14 [<ffffffffa0946446>] ? lustre_pack_reply_v2+0x226/0x290 [ptlrpc]
2014-11-14 10:11:14 [<ffffffffa094655e>] ? lustre_pack_reply_flags+0xae/0x1f0 [ptlrpc]
2014-11-14 10:11:14 [<ffffffffa09a82ae>] tgt_request_handle+0x71e/0xb10 [ptlrpc]
2014-11-14 10:11:14 [<ffffffffa0957a94>] ptlrpc_main+0xe64/0x1990 [ptlrpc]
2014-11-14 10:11:14 [<ffffffffa0956c30>] ? ptlrpc_main+0x0/0x1990 [ptlrpc]
2014-11-14 10:11:14 [<ffffffff8109abf6>] kthread+0x96/0xa0
2014-11-14 10:11:14 [<ffffffff8100c20a>] child_rip+0xa/0x20
2014-11-14 10:11:14 [<ffffffff8109ab60>] ? kthread+0x0/0xa0
2014-11-14 10:11:14 [<ffffffff8100c200>] ? child_rip+0x0/0x20
| Comment by Cliff White (Inactive) [ 14/Nov/14 ] |
|
Log dump from the OSS.
| Comment by Cliff White (Inactive) [ 14/Nov/14 ] |
|
I am making a second run now with the options set.
| Comment by Cliff White (Inactive) [ 17/Nov/14 ] |
|
Disabling prefetch seems to fix the watchdogs. Performance is about the same, or perhaps a bit worse; spreadsheet attached.
| Comment by Isaac Huang (Inactive) [ 09/Jan/15 ] |
|
For some workloads, zprefetch actually helps a lot. But if performance got only a bit worse with it disabled, then zprefetch was mostly just incurring unnecessary I/Os and likely exercising some flaky parts of the code. So I'd suggest keeping it disabled as long as performance isn't adversely affected.
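One way to judge whether prefetch is helping a given workload is the zfetch kstat that ZoL exposes under /proc/spl (a sketch; the exact counter names vary between ZFS releases):

$ cat /proc/spl/kstat/zfs/zfetchstats     # a high miss-to-hit ratio suggests prefetch is mostly issuing wasted I/O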
| Comment by Cliff White (Inactive) [ 14/Jan/15 ] |
|
Okay, for now we will leave prefetch off. It was left on by mistake in the last IEEL runs.
| Comment by Rick Wagner (Inactive) [ 01/Feb/15 ] |
|
I believe we are seeing the cause of this while testing streaming read I/O using Lustre backed by ZFS: osd_bufs_get (and therefore osd_bufs_get_read) is called inside a for loop via ofd_preprw_read, and osd_bufs_get_read hits storage on each call through dmu_buf_hold_array_by_bonus. My guess is that I/Os are not being queued up to the storage as intended by the dt_object design when using ZFS.
| Comment by Alex Zhuravlev [ 02/Feb/15 ] |
|
Specifically with streaming reads this shouldn't be a problem - it would be a single call to dmu_buf_hold_array_by_bonus(), as the whole range is described by a single niobuf.
| Comment by Rick Wagner (Inactive) [ 02/Feb/15 ] |
|
Thanks, Alex, that's reassuring. We're trying to figure out why we're getting 2.4 GB/s of read bandwidth out of OSTs that will deliver over 9 GB/s when mounted as ZFS and read with dd. During our profiling, I saw that dmu_buf_hold_array_by_bonus() was going to the drives, and I didn't know that niocount in ofd_preprw_read() would be set to 1 for streaming reads.
| Comment by Alex Zhuravlev [ 02/Feb/15 ] |
|
I think it makes sense to verify that it's really 1 - I'm not too optimistic.
| Comment by Rick Wagner (Inactive) [ 02/Feb/15 ] |
|
Alex, without having attached to a process to check, I don't think it is 1. We're testing using obdfilter-survey, which uses the echo client. Inside obdecho/echo_client.c:echo_client_prep_commit() there's a loop containing the following:

ioo.ioo_bufcnt = npages;
lpages = npages;
ret = obd_preprw(env, rw, exp, oa, 1, &ioo, rnb, &lpages,
                 lnb, oti, NULL);

where npages is the number of 4K pages in the request. Later, ofd_io.c:ofd_preprw() calls ofd_preprw_read(), where ioo is now obj:

rc = ofd_preprw_read(env, exp, ofd, fid, &info->fti_attr, oa,
                     obj->ioo_bufcnt, rnb, nr_local, lnb, jobid);

That's where ioo_bufcnt becomes niocount, and the loop is executed, which eventually gets to osd-zfs:osd_io.c:osd_bufs_get_read() and dmu_buf_hold_array_by_bonus(). This points to dmu_buf_hold_array_by_bonus() being called for 4K buffers. I don't know if this is by design, but depending on how ZFS handles these requests, it could explain why the streaming bandwidth is so much lower than what we see with native ZFS file systems. The performance is similarly low when the Lustre file system is mounted, so I'm guessing the regular code path has logic similar to what's in the echo client.
| Comment by Alex Zhuravlev [ 03/Feb/15 ] |
|
I'd agree that echo_client_prep_commit() isn't taking a good approach; it should be filling the rnbs and then making a single call to obd_preprw(). Please try regular I/O with dd - I'm quite sure you'll see mostly single-niobuf requests.
| Comment by Andreas Dilger [ 03/Feb/15 ] |
|
Obdfilter-survey should behave the same as our network I/O submission, so that it simulates real-world performance reasonably accurately. It makes sense to fix echo_client_prep_commit() to build large niobufs for submission.
| Comment by Rick Wagner (Inactive) [ 03/Feb/15 ] |
|
Indeed, running single dds from 4 clients was sufficient to see 1.3 GB/s (occasional peaks from zpool iostat at 1.5 GB/s) from a single OST. I'll need more clients to see how that scales over more OSTs, but that's far better read performance than we were getting from a single OST via obdfilter-survey, and is much closer to the native ZFS results. If there's a method for me to verify the len passed to osd_bufs_get_read, I'd be glad to try it. And, Andreas, I completely agree with your comment about the echo client. This has led to significant confusion on our side about the capabilities of Lustre with ZFS. A simple tool that can simulate dd-like workloads on the server is what we expected; what we got behaved more like fio.
| Comment by Gerrit Updater [ 03/Feb/15 ] |
|
Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: http://review.whamcloud.com/13612
| Comment by Rick Wagner (Inactive) [ 03/Feb/15 ] |
|
Alex, thanks for the patch. I will test this today and see how closely it compares to native ZFS performance. It now looks like we're bottlenecking on the socknal_sd* threads when reading from more than one OST across the network - a different topic, I expect.
| Comment by Alex Zhuravlev [ 03/Feb/15 ] |
|
Thanks in advance, Rick. I tested the patch locally and it seems to be OK, but testing has only just started on Maloo, so I can't yet promise it's perfect.
| Comment by Rick Wagner (Inactive) [ 04/Feb/15 ] |
|
I'm optimistic about your patch doing what you expect, Alex. I'm seeing up to 1.5 GB/s reads from a single OST, which is what I got from a client using dd last night, and very close to the 1.6 GB/s I measure on native ZFS. The real test will be how much it helps in performance tuning.

$ targets=edragon-OST0000 rszlo=4096 rszhi=4096 size=$((64*1024)) nobjlo=2 nobjhi=4 thrlo=2 thrhi=4 obdfilter-survey
Tue Feb 3 21:00:43 PST 2015 Obdfilter-survey for case=disk from seahorse-oss-19-5.local
ost  1 sz 67108864K rsz 4096K obj 2 thr 2 write  842.72 [ 547.97, 947.91] read 1520.84 [1087.92,1835.91]
ost  1 sz 67108864K rsz 4096K obj 2 thr 4 write  810.22 [  31.99,1303.90] read  846.02 [ 203.99,1691.92]
ost  1 sz 67108864K rsz 4096K obj 4 thr 4 write  833.51 [ 307.98,1215.91] read 1518.76 [1187.94,2011.89]
done!

$ zpool iostat 5
...
----------  -----  -----  -----  -----  -----  -----
mdt         27.2M  3.62T      0      0      0      0
ost0         141G  36.1T  1.51K     30  1.51G  30.0M
ost1        42.7G  36.2T      0      0      0      0
ost2        41.8G  36.2T      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
mdt         27.2M  3.62T      0      0      0      0
ost0         141G  36.1T  1.48K     34  1.48G  34.8M
ost1        42.7G  36.2T      0      0      0      0
ost2        41.8G  36.2T      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
mdt         27.2M  3.62T      0      0      0      0
ost0         141G  36.1T  1.49K     37  1.49G  37.0M
ost1        42.7G  36.2T      0      0      0      0
ost2        41.8G  36.2T      0      0      0      0

| Comment by Alex Zhuravlev [ 04/Feb/15 ] |
|
Thanks for the feedback, Rick.
| Comment by Andreas Dilger [ 04/Feb/15 ] |
|
A significant amount of overhead has been observed in the socknal threads due to TCP overhead from assembling packets on read. In the past, the CPU usage of this data copying was on the order of 1 GHz of CPU per 1 GB/s of network bandwidth, which is one reason why IB with RDMA is more attractive. Another possibility (which we haven't tested much ourselves, but have heard some reports about) is to try RoCE via o2iblnd, if your Ethernet hardware supports it. That would likely reduce your CPU usage.
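For reference, moving an Ethernet fabric from ksocklnd to o2iblnd for RoCE is done through the LNet networks module option; a sketch only (the interface name and an installed RDMA stack are assumptions, nothing verified on this cluster):

# /etc/modprobe.d/lustre.conf
# TCP via ksocklnd:            options lnet networks="tcp0(eth2)"
# RoCE via o2iblnd instead:    options lnet networks="o2ib0(eth2)"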
| Comment by Rick Wagner (Inactive) [ 07/Feb/15 ] |
|
Andreas, I'll hold off on blaming the socknal threads for the performance, since I think I've found another place where 4K reads are being made. First, I can confirm that Alex's patch to the echo client was a huge improvement: the streaming-bandwidth comparison between dd on remote clients and a server running obdfilter-survey is much more consistent. You can see this in the two flame graphs: before, 25% of the time was in osd_bufs_get; after, 1.5%.

Since the performance is now closer between client and server (4 to 5 GB/s for reads), but still much less than native ZFS (over 9 GB/s, scaling linearly from 1.5 GB/s on a single zpool), I looked at the portion of code still taking up a lot of time, osd_read, and its call to dmu_read. If I trace back the calls correctly, dmu_read is being called with a size of 4K. As in the earlier discussion, it starts in ofd_preprw_read with its call to dt_bufs_get, except this time, after the call to dmu_buf_hold_array_by_bonus, osd_bufs_get_read loops over the lnb mapping (copying?) data:

while (tocpy > 0) {
        thispage = PAGE_CACHE_SIZE;
        thispage -= bufoff & (PAGE_CACHE_SIZE - 1);
        thispage = min(tocpy, thispage);

        lnb->lnb_rc = 0;
        lnb->lnb_file_offset = off;
        lnb->lnb_page_offset = bufoff & ~CFS_PAGE_MASK;
        lnb->lnb_len = thispage;
        lnb->lnb_page = kmem_to_page(dbp[i]->db_data + bufoff);
        /* mark just a single slot: we need this
         * reference to dbuf to be release once */
        lnb->lnb_data = dbf;
        dbf = NULL;

        tocpy -= thispage;
        len -= thispage;
        bufoff += thispage;
        off += thispage;
        npages++;
        lnb++;
}

This is where lnb->lnb_len gets set to 4K, and npages is incremented before being returned to ofd_preprw_read. The page count is also incremented via the nr_local variable. Both lnb and nr_local are then passed to dt_read_prep. From there, osd_read_prep loops over the number of pages calling osd_read, which makes a single call to dmu_read. Would someone confirm that I've got this correct? If so, it may be a more involved patch, unless the logic can stay in osd-zfs/osd_io.c. I do think this is consistent with the reduced performance with additional OSTs and the time spent in dmu_read.
| Comment by Alex Zhuravlev [ 07/Feb/15 ] |
|
Rick, you should try this patch - http://review.whamcloud.com/#/c/12991/
| Comment by Rick Wagner (Inactive) [ 07/Feb/15 ] |
|
Hi Alex, I applied the patch, but things went backwards. My impression is that relying on the ARC only works when the data being streamed fits below zfs_arc_max. For this test, I set zfs_arc_max to 64 GB on a machine with 128 GB of RAM. The 2 GB/s writes are also much better than native ZFS, which makes me suspect caching. The numbers reported, and what zpool iostat shows, look like data is being pulled from a single drive as needed. Would it be better to call dmu_read with a size argument similar to the one used for dmu_buf_hold_array_by_bonus?

$ targets="ddragon-OST0000 ddragon-OST0001" rszlo=4096 rszhi=4096 size=$((64*1024)) nobjlo=1 nobjhi=1 thrlo=1 thrhi=1 obdfilter-survey
Fri Feb 6 20:58:47 PST 2015 Obdfilter-survey for case=disk from seahorse-oss-19-2.local
ost  2 sz 134217728K rsz 4096K obj 2 thr 2 write 2015.86 [ 387.91,5523.86] read  436.14 [  32.00,1051.92]
done!

$ zpool iostat 5
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
...wait a while...
mdt 36.6M 59.5G 0 0 202 0
ost0 965G 35.3T 242 0 242M 0
ost1 918G 35.4T 203 0 203M 0
ost2 829G 35.4T 0 0 201 0
ost3 849G 35.4T 0 0 201 0
ost4 831G 35.4T 0 0 201 0
ost5 831G 35.4T 0 0 201 0
---------- ----- ----- ----- ----- ----- -----
mdt 36.6M 59.5G 0 0 201 0
ost0 965G 35.3T 235 0 235M 0
ost1 918G 35.4T 216 0 216M 0
ost2 829G 35.4T 0 0 203 0
ost3 849G 35.4T 0 0 203 0
ost4 831G 35.4T 0 0 203 0
ost5 831G 35.4T 0 0 203 0
| |||||||||||||||||||||
| Comment by Alex Zhuravlev [ 07/Feb/15 ] |
|
Rick, the ARC is always used. dmu_read() shouldn't be used at all; instead we call dmu_buf_hold_array_by_bonus() to fill the buffers with a size equal to the one supplied by the client (which should be 1M).
| Comment by Rick Wagner (Inactive) [ 07/Feb/15 ] |
|
OK, I'll do another perf analysis and see where the time went. At some point the data requested will exceed the ARC and will need to begin streaming off of disk efficiently. I do wonder (out of ignorance) whether the DMU layer is more related to the spl_kmem_cache than to the ARC.
| Comment by Alex Zhuravlev [ 07/Feb/15 ] |
|
In the meantime, I'll try to add more I/O stats to osd-zfs/.
| Comment by Rick Wagner (Inactive) [ 07/Feb/15 ] |
|
Any instructions on how to pull debugging information from the running tasks would be appreciated. I am willing to use gdb, but I don't know how much guidance I would need to introspect the correct function.
| Comment by Alex Zhuravlev [ 07/Feb/15 ] |
|
What kind of information? I'm planning to add I/O stats available in /proc/fs/lustre/..
| Comment by Rick Wagner (Inactive) [ 07/Feb/15 ] |
|
Like rpc_stats on the client, a report of the len and offset values in the calls to dmu_buf_hold_array_by_bonus would be good, since this is the interface to the ZFS layer - basically, summary statistics of the calls into ZFS from Lustre.
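Until such stats exist, one rough stopgap (a sketch only, assuming ftrace is usable on the OSS kernel and the symbol is traceable) is to at least count how often the OSD drops into that ZFS entry point:

$ cd /sys/kernel/debug/tracing
$ echo dmu_buf_hold_array_by_bonus > set_ftrace_filter
$ echo function > current_tracer
$ cat trace_pipe | head        # one event per call; capturing len/offset would need debuginfo-based probes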
| Comment by Rick Wagner (Inactive) [ 07/Feb/15 ] |
|
The patch for
| Comment by Alex Zhuravlev [ 07/Feb/15 ] |
|
Actually, all the patch does is remove the unnecessary dmu_read(). I'd think running swapper isn't a problem as long as it doesn't block reads; it should be releasing memory a bit ahead so that the ARC can always get memory for new buffers immediately.
| Comment by Rick Wagner (Inactive) [ 07/Feb/15 ] |
|
I see your point, Alex, and it shows that there must be another issue with the echo client. I went to 4 remote hosts, each pulling a single file from the same OST using dd, and the zpool is happily streaming over 1 GB/s. I will move on to multiple OSTs, and if I see the same sub-linear scaling with additional OSTs I will do my perf analysis using remote clients.
| Comment by Andreas Dilger [ 08/Feb/15 ] |
|
The /proc/fs/lustre/osd-zfs/*/brw_stats file was added to master for 2.7.0 via http://review.whamcloud.com/11467, so it may be useful for your testing. See also my comments in Alex's patch http://review.whamcloud.com/12991, which should be fixed up to move the brw_stats accounting around the dmu_buf_hold_array_by_bonus() loop so it accurately measures read times. Unfortunately, measuring write times is much more difficult for ZFS, since write completion only happens down in ZFS.
| Comment by Rick Wagner (Inactive) [ 08/Feb/15 ] |
|
Thanks, Andreas. Since I'm working from Git master with all the patches I've needed to get this far (including large block support in ZFS), it turns out that's already in place. I can see that all of the I/O requests are at least 1M in size - very useful. For both reads and writes, I find zpool iostat <seconds> to be a good check of what's going in and out of the disks, since it ignores reads from cache.

$ cat brw_stats
snapshot_time: 1423367149.544364 (secs.usecs)
read | write
pages per bulk r/w rpcs % cum % | rpcs % cum %
1: 4408 0 0 | 160 58 58
2: 10 0 0 | 114 41 100
4: 0 0 0 | 0 0 100
8: 0 0 0 | 0 0 100
16: 0 0 0 | 0 0 100
32: 0 0 0 | 0 0 100
64: 0 0 0 | 0 0 100
128: 0 0 0 | 0 0 100
256: 456654 89 89 | 0 0 100
512: 8732 1 91 | 0 0 100
1K: 42704 8 100 | 0 0 100
read | write
discontiguous pages rpcs % cum % | rpcs % cum %
0: 512508 100 100 | 274 100 100
read | write
disk I/Os in flight ios % cum % | ios % cum %
1: 511312 99 99 | 273 99 99
2: 1123 0 99 | 1 0 100
3: 62 0 99 | 0 0 100
4: 9 0 99 | 0 0 100
5: 1 0 99 | 0 0 100
6: 1 0 100 | 0 0 100
read | write
I/O time (1/1000s) ios % cum % | ios % cum %
1: 494 98 98 | 0 0 0
2: 0 0 98 | 0 0 0
4: 0 0 98 | 0 0 0
8: 1 0 99 | 0 0 0
16: 2 0 99 | 0 0 0
32: 0 0 99 | 0 0 0
64: 1 0 99 | 0 0 0
128: 0 0 99 | 0 0 0
256: 2 0 100 | 0 0 0
read | write
disk I/O size ios % cum % | ios % cum %
8: 4 0 0 | 9 3 3
16: 0 0 0 | 0 0 3
32: 1 0 0 | 0 0 3
64: 1 0 0 | 2 0 4
128: 17 0 0 | 83 30 34
256: 1 0 0 | 64 23 57
512: 3 0 0 | 2 0 58
1K: 2 0 0 | 0 0 58
2K: 0 0 0 | 0 0 58
4K: 4379 0 0 | 0 0 58
8K: 10 0 0 | 114 41 100
16K: 0 0 0 | 0 0 100
32K: 0 0 0 | 0 0 100
64K: 0 0 0 | 0 0 100
128K: 0 0 0 | 0 0 100
256K: 0 0 0 | 0 0 100
512K: 0 0 0 | 0 0 100
1M: 456654 89 89 | 0 0 100
2M: 8732 1 91 | 0 0 100
4M: 42704 8 100 | 0 0 100
At the moment, Alex's patch to eliminate the use of dmu_read appears to have been the last thing needed to get the osd-zfs layer working at speed. The final piece seems to be balancing client requests across the socknal_sd* tasks. If I understand what I've read correctly and compare it to what I see, portal_rotor using HASH_RT is mapping client NIDs to a particular socknal_sd task. I'd like to know how many clients each task is handling, so that I can tell how balanced the requests are. An LNet self-test showed 9 GB/s with 16 clients, but with data coming from disk that falls to between 5.5 and 6 GB/s. A snippet from top:

12461 root      20   0     0    0    0 R 89.2  0.0  16:52.32 socknal_sd00_01
12471 root      20   0     0    0    0 R 80.6  0.0  12:09.99 socknal_sd03_02
12467 root      20   0     0    0    0 R 66.8  0.0   9:00.63 socknal_sd02_01
12464 root      20   0     0    0    0 R 43.4  0.0   8:09.87 socknal_sd01_01
12466 root      20   0     0    0    0 R 32.5  0.0   6:40.28 socknal_sd02_00
12465 root      20   0     0    0    0 R 31.4  0.0   5:18.51 socknal_sd01_02
12468 root      20   0     0    0    0 S 20.2  0.0   4:06.25 socknal_sd02_02
12463 root      20   0     0    0    0 R 18.4  0.0   3:43.80 socknal_sd01_00
12460 root      20   0     0    0    0 S 16.6  0.0   4:10.38 socknal_sd00_00
12470 root      20   0     0    0    0 S 10.1  0.0   1:17.31 socknal_sd03_01

I'm also going to try setting the CPU pattern, since libcfs is breaking up a dual-socket system (Intel E5-2640v2) into 4 partitions:

cat /proc/sys/lnet/cpu_partition_table
0   : 0 1 2 3
1   : 4 5 6 7
2   : 8 9 10 11
3   : 12 13 14 15

Since this veers away from the original ticket by dealing with possible LNET and SMP tuning, I would be glad to open a new one.
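A sketch of forcing one CPU partition per socket via the libcfs module options (the core lists below are assumptions for this 2x8-core box, and the options must be set before the Lustre modules load):

# /etc/modprobe.d/libcfs.conf
options libcfs cpu_npartitions=2
# or spell the cores out explicitly:
# options libcfs cpu_pattern="0[0-7] 1[8-15]"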
| Comment by Andreas Dilger [ 08/Feb/15 ] |
|
I was misled by the version field of this issue - I thought you were running 2.6.0, but I now see this wasn't originally your ticket. Good to hear that Alex's latest patch is helping out. It would also be a trivial change to move the I/O stats as I suggested in my review, so that you get an accurate value for the I/O times in brw_stats; they are all currently reporting 1/1000s, which isn't very likely. I would definitely ask that you file a separate ticket for the socklnd and CPU partition tunings.
| Comment by Andreas Dilger [ 08/Feb/15 ] |
|
Rick, have you tried increasing the maximum RPC size from the clients?

lctl set_param osc.*.max_pages_per_rpc=4M

That may or may not increase your I/O performance, depending on where the bottleneck is. Have you tried iostat -x -k -z 1 to see what the utilization of the disks is? Presumably they are not hitting their peak, but I wonder whether the I/O load generated by Lustre RPCs is different from that of the local ZFS mount, or whether the I/O submission at the SCSI level is getting chopped up.
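A quick way to confirm what the clients are actually sending (run on a client; these are the standard osc tunables, shown only as a sketch):

$ lctl get_param osc.*.max_pages_per_rpc osc.*.max_rpcs_in_flight
$ lctl get_param osc.*.rpc_stats          # histogram of pages per RPC actually issued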
| Comment by Rick Wagner (Inactive) [ 08/Feb/15 ] |
|
Andreas, after tracking rpc_stats for a while, I have max_pages_per_rpc = 1024 and max_rpcs_in_flight = 32. I believe this is why the disk I/O size reported by brw_stats started showing 4M reads, since that was changed after several tests. I have been watching zpool iostat to see how much data is flowing from the pool. The number of operations agrees with a 1024K record size, and what I saw from collectl showed 128K reads from the individual drives (we're currently using raidz2 with 8+2 zpools). Below is a capture from zpool iostat during the test, after I changed the CPU partitioning:

$ zpool iostat 10
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
...wait for test...
mdt 36.6M 59.5G 0 0 0 0
ost0 937G 35.3T 1.15K 0 1.15G 0
ost1 888G 35.4T 1.07K 0 1.07G 0
ost2 860G 35.4T 1.09K 0 1.08G 0
ost3 881G 35.4T 1.12K 0 1.11G 0
ost4 910G 35.4T 1.21K 0 1.21G 0
ost5 854G 35.4T 1.01K 0 1.01G 0
---------- ----- ----- ----- ----- ----- -----
mdt 36.6M 59.5G 0 0 0 0
ost0 937G 35.3T 1.05K 0 1.05G 0
ost1 888G 35.4T 1.14K 0 1.13G 0
ost2 860G 35.4T 1.18K 0 1.17G 0
ost3 881G 35.4T 1.16K 0 1.16G 0
ost4 910G 35.4T 1.15K 0 1.14G 0
ost5 854G 35.4T 1023 0 1017M 0
---------- ----- ----- ----- ----- ----- -----
mdt 36.6M 59.5G 0 0 0 0
ost0 937G 35.3T 1.16K 0 1.15G 0
ost1 888G 35.4T 1.06K 0 1.05G 0
ost2 860G 35.4T 1.11K 0 1.10G 0
ost3 881G 35.4T 1.06K 0 1.05G 0
ost4 910G 35.4T 1.16K 0 1.15G 0
ost5 854G 35.4T 1.02K 0 1.02G 0
---------- ----- ----- ----- ----- ----- -----
mdt 36.6M 59.5G 0 0 0 0
ost0 937G 35.3T 1.09K 0 1.08G 0
ost1 888G 35.4T 1.03K 0 1.02G 0
ost2 860G 35.4T 1.10K 0 1.09G 0
ost3 881G 35.4T 1.12K 0 1.11G 0
ost4 910G 35.4T 1.29K 0 1.29G 0
ost5 854G 35.4T 1.15K 0 1.14G 0
This still leaves about 2 GB/s of performance on the table compared to native ZFS and the LNet self-test. I expect that if I generate another perf analysis, it will show the time being spent handling the RPCs via socknal_sd*.
| Comment by Gabriele Paciucci (Inactive) [ 08/Feb/15 ] |
|
Hi Rick, remember to apply this value across the whole cluster.
| Comment by Rick Wagner (Inactive) [ 08/Feb/15 ] |
|
Gabriele, thanks. There are negative numbers in /proc/sys/lnet/peers, and even bumping up the credits on the server alone gave a 10% or so improvement. I'll have to shift to another set of clients to test both sides, since I'm using a production system's nodes as clients and can't reload their kernel modules. This would help explain the remaining bottleneck.
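For reference, the credits in question are LNet/ksocklnd module parameters and only take effect at module load; a sketch with illustrative values (not the numbers actually used on these systems):

# /etc/modprobe.d/lustre.conf
options ksocklnd credits=256 peer_credits=16
# negative "min" values in /proc/sys/lnet/peers indicate messages queued waiting for credits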
| Comment by Gabriele Paciucci (Inactive) [ 09/Feb/15 ] |
|
If you are using Ethernet, you should also tune sysctl.conf. Please refer to your Ethernet vendor; this is a good starting point from Mellanox, but it can be applied to other vendors' hardware as well.
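The settings such vendor guides cover are mostly the standard TCP buffer sysctls; an illustrative sketch only (values are generic 10/40GbE starting points, not the Mellanox recommendations referenced above):

# /etc/sysctl.conf
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216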
| Comment by Rick Wagner (Inactive) [ 10/Feb/15 ] |
|
Andreas & Gabriele, I have moved my network tuning questions over to
| Comment by Cliff White (Inactive) [ 16/Apr/15 ] |
|
I am a bit confused by all the network tuning comments. Are there patches available that have not landed in 2.7.52?
| Comment by Cliff White (Inactive) [ 16/Apr/15 ] |
|
Watchdogs continue with prefetch disabled:

LNet: Service thread pid 64826 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Call Trace:
LustreError: dumping log to /tmp/lustre-log.1429199475.64826
| Comment by Cliff White (Inactive) [ 16/Apr/15 ] |
|
Lustre log dumped by the watchdog, plus all files under /proc/spl on the OST.
| Comment by Cliff White (Inactive) [ 16/Apr/15 ] |
|
/proc/spl from the MDS.
| Comment by Alex Zhuravlev [ 17/Apr/15 ] |
|
The following isn't exactly the same, but looks very similar:

13:06:17:INFO: task txg_sync:16276 blocked for more than 120 seconds.

https://testing.hpdd.intel.com/test_logs/c128d706-e305-11e4-a348-5254006e85c2/show_text
| Comment by Gerrit Updater [ 09/Jun/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13612/
| Comment by Peter Jones [ 09/Jun/15 ] |
|
Landed for 2.8 |