[LU-5278] ZFS - many OST watchdogs with IOR Created: 01/Jul/14  Updated: 23/Dec/15  Resolved: 09/Jun/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Critical
Reporter: Cliff White (Inactive) Assignee: Alex Zhuravlev
Resolution: Fixed Votes: 0
Labels: MB, llnl, prz, zfs
Environment:

Hyperion/LLNL


Attachments: Microsoft Word Hyperion Performance 17 Nov 2014.xlsx     Text File ior.iws28.txt.gz     Text File iws24.dump.txt.gz     Text File iws28.dump.txt.gz     Text File lustre-log.1429199475.64826.txt.gz     File proc_spl.tgz     File proc_spl_MDS.tgz    
Issue Links:
Duplicate
is duplicated by LU-3109 ZFS - very slow reads, OST watchdogs Resolved
is duplicated by LU-5775 obdfilter-survey test 1c: lctl hung o... Resolved
Related
is related to LU-6254 Fix OFD/OSD prefetch for osd-ldiskfs ... Open
is related to LU-4820 extra memcpy in read path Closed
is related to LU-6228 How to balance network connections ac... Resolved
Severity: 3
Rank (Obsolete): 14730

 Description   

Running IOR with 100 clients. Performance is terrible. OSTs are wedging and dropping watchdogs.
Example:

2014-07-01 08:22:47 LNet: Service thread pid 8308 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
2014-07-01 08:22:47 Pid: 8308, comm: ll_ost_io00_014
2014-07-01 08:22:47
2014-07-01 08:22:47 Call Trace:
2014-07-01 08:22:47  [<ffffffffa05b34ba>] ? dmu_zfetch+0x51a/0xd70 [zfs]
2014-07-01 08:22:47  [<ffffffff810a6d01>] ? ktime_get_ts+0xb1/0xf0
2014-07-01 08:22:47  [<ffffffff815287f3>] io_schedule+0x73/0xc0
2014-07-01 08:22:47  [<ffffffffa04f841c>] cv_wait_common+0x8c/0x100 [spl]
2014-07-01 08:22:47  [<ffffffff8109af00>] ? autoremove_wake_function+0x0/0x40
2014-07-01 08:22:47  [<ffffffffa04f84a8>] __cv_wait_io+0x18/0x20 [spl]
2014-07-01 08:22:47  [<ffffffffa062f0ab>] zio_wait+0xfb/0x1b0 [zfs]
2014-07-01 08:22:47  [<ffffffffa05a503d>] dmu_buf_hold_array_by_dnode+0x19d/0x4c0 [zfs]
2014-07-01 08:22:47  [<ffffffffa05a5e68>] dmu_buf_hold_array_by_bonus+0x68/0x90 [zfs]
2014-07-01 08:22:47  [<ffffffffa0e3f1a3>] osd_bufs_get+0x493/0xb00 [osd_zfs]
2014-07-01 08:22:47  [<ffffffffa03be488>] ? libcfs_log_return+0x28/0x40 [libcfs]
2014-07-01 08:22:47  [<ffffffffa0f2e00b>] ofd_preprw_read+0x15b/0x890 [ofd]
2014-07-01 08:22:47  [<ffffffffa0f30709>] ofd_preprw+0x749/0x1650 [ofd]
2014-07-01 08:22:47  [<ffffffffa09d71b1>] obd_preprw.clone.3+0x121/0x390 [ptlrpc]
2014-07-01 08:22:47  [<ffffffffa09deb03>] tgt_brw_read+0x2d3/0x1150 [ptlrpc]
2014-07-01 08:22:47  [<ffffffffa03be488>] ? libcfs_log_return+0x28/0x40 [libcfs]
2014-07-01 08:22:47  [<ffffffffa097ab36>] ? lustre_pack_reply_v2+0x216/0x280 [ptlrpc]
2014-07-01 08:22:47  [<ffffffffa097ac4e>] ? lustre_pack_reply_flags+0xae/0x1f0 [ptlrpc]
2014-07-01 08:22:47  [<ffffffffa09dca7c>] tgt_request_handle+0x23c/0xac0 [ptlrpc]
2014-07-01 08:22:47  [<ffffffffa098c29a>] ptlrpc_main+0xd1a/0x1980 [ptlrpc]
2014-07-01 08:22:47  [<ffffffffa098b580>] ? ptlrpc_main+0x0/0x1980 [ptlrpc]
2014-07-01 08:22:47  [<ffffffff8109ab56>] kthread+0x96/0xa0
2014-07-01 08:22:47  [<ffffffff8100c20a>] child_rip+0xa/0x20
2014-07-01 08:22:47  [<ffffffff8109aac0>] ? kthread+0x0/0xa0
2014-07-01 08:22:47  [<ffffffff8100c200>] ? child_rip+0x0/0x20

Lustre dump attached.

Second example:

2014-07-01 09:38:41 Pid: 9299, comm: ll_ost_io00_070
2014-07-01 09:38:41
2014-07-01 09:38:41 Call Trace:
2014-07-01 09:38:41  [<ffffffffa05b02f7>] ? dmu_zfetch+0x357/0xd70 [zfs]
2014-07-01 09:38:41  [<ffffffffa05957f2>] ? arc_read+0x572/0x8d0 [zfs]
2014-07-01 09:38:41  [<ffffffff810a6d01>] ? ktime_get_ts+0xb1/0xf0
2014-07-01 09:38:41  [<ffffffff815287f3>] io_schedule+0x73/0xc0
2014-07-01 09:38:41  [<ffffffffa04f841c>] cv_wait_common+0x8c/0x100 [spl]
2014-07-01 09:38:41  [<ffffffff8109af00>] ? autoremove_wake_function+0x0/0x40
2014-07-01 09:38:41  [<ffffffffa04f84a8>] __cv_wait_io+0x18/0x20 [spl]
2014-07-01 09:38:41  [<ffffffffa062c0ab>] zio_wait+0xfb/0x1b0 [zfs]
2014-07-01 09:38:41  [<ffffffffa05a203d>] dmu_buf_hold_array_by_dnode+0x19d/0x4c0 [zfs]
2014-07-01 09:38:41  [<ffffffffa05a2e68>] dmu_buf_hold_array_by_bonus+0x68/0x90 [zfs]
2014-07-01 09:38:41  [<ffffffffa0e441a3>] osd_bufs_get+0x493/0xb00 [osd_zfs]
2014-07-01 09:38:41  [<ffffffffa03be488>] ? libcfs_log_return+0x28/0x40 [libcfs]
2014-07-01 09:38:41  [<ffffffffa0f3700b>] ofd_preprw_read+0x15b/0x890 [ofd]
2014-07-01 09:38:41  [<ffffffffa0f39709>] ofd_preprw+0x749/0x1650 [ofd]
2014-07-01 09:38:41  [<ffffffffa09d41b1>] obd_preprw.clone.3+0x121/0x390 [ptlrpc]
2014-07-01 09:38:41  [<ffffffffa09dbb03>] tgt_brw_read+0x2d3/0x1150 [ptlrpc]
2014-07-01 09:38:41  [<ffffffffa03be488>] ? libcfs_log_return+0x28/0x40 [libcfs]
2014-07-01 09:38:41  [<ffffffffa0977b36>] ? lustre_pack_reply_v2+0x216/0x280 [ptlrpc]
2014-07-01 09:38:41  [<ffffffffa0977c4e>] ? lustre_pack_reply_flags+0xae/0x1f0 [ptlrpc]
2014-07-01 09:38:41  [<ffffffffa09d9a7c>] tgt_request_handle+0x23c/0xac0 [ptlrpc]
2014-07-01 09:38:41  [<ffffffffa098929a>] ptlrpc_main+0xd1a/0x1980 [ptlrpc]
2014-07-01 09:38:41  [<ffffffffa0988580>] ? ptlrpc_main+0x0/0x1980 [ptlrpc]
2014-07-01 09:38:41  [<ffffffff8109ab56>] kthread+0x96/0xa0
2014-07-01 09:38:41  [<ffffffff8100c20a>] child_rip+0xa/0x20
2014-07-01 09:38:41  [<ffffffff8109aac0>] ? kthread+0x0/0xa0
2014-07-01 09:38:41  [<ffffffff8100c200>] ? child_rip+0x0/0x20
2014-07-01 09:38:41


 Comments   
Comment by Cliff White (Inactive) [ 01/Jul/14 ]

Lustre log dumped by first watchdog

Comment by Andreas Dilger [ 02/Jul/14 ]

Cliff, is this running ZFS 0.6.2 or 0.6.3?

Also, did the IOR eventually complete or were the same threads stuck forever?

Comment by Cliff White (Inactive) [ 02/Jul/14 ]

ZFS version was 0.6.3. The IOR did eventually complete, but the results were terrible.

Comment by Andreas Dilger [ 04/Jul/14 ]

Isaac, could you please take a look at this to see if there is something obvious? We are trying to decide if this needs to be a blocker for 2.6.0.

Comment by Isaac Huang (Inactive) [ 24/Jul/14 ]

Sorry, I missed this bug while traveling.

Cliff,

1. Please disable ZFS prefetching on both the OSS and MDS:
options zfs zfs_prefetch_disable=1
2. If it happens again, please also gather all the files under /proc/spl/ on both the OSS and the MDS (see the sketch below).
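
For reference, a rough sketch of both steps; the /sys/module path is the usual runtime toggle for ZFS module parameters, and the copy/tar commands are just one way to capture the /proc/spl state (adjust paths as needed):

    # runtime toggle, takes effect immediately without reloading the module
    echo 1 > /sys/module/zfs/parameters/zfs_prefetch_disable

    # capture the SPL/ZFS kstats after the next watchdog fires
    mkdir -p /tmp/proc_spl
    cp -r /proc/spl/* /tmp/proc_spl/
    tar czf /tmp/proc_spl.tgz -C /tmp proc_spl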

Comment by Andreas Dilger [ 12/Nov/14 ]

Cliff, Isaac, could you please update us on the status of this ticket? Cliff, did you implement Isaac's suggestion? Are there still watchdogs being hit?

Comment by Cliff White (Inactive) [ 12/Nov/14 ]

I can run some ZFS-specific tests. Due to LLNL's cfengine, I am not sure I can set that option. I should be able to do this in the next few days.

Comment by Cliff White (Inactive) [ 14/Nov/14 ]

We are still dumping watchdogs; I have not yet been able to set the option. Here's an example:

2014-11-14 10:11:14 LNet: Service thread pid 8361 was inactive for 222.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
2014-11-14 10:11:14 Pid: 8361, comm: ll_ost_io03_081
2014-11-14 10:11:14
2014-11-14 10:11:14 Call Trace:
2014-11-14 10:11:14  [<ffffffffa059a4ba>] ? dmu_zfetch+0x51a/0xd70 [zfs]
2014-11-14 10:11:14  [<ffffffff81529e83>] io_schedule+0x73/0xc0
2014-11-14 10:11:14  [<ffffffffa04b141c>] cv_wait_common+0x8c/0x100 [spl]
2014-11-14 10:11:14  [<ffffffff8109afa0>] ? autoremove_wake_function+0x0/0x40
2014-11-14 10:11:14  [<ffffffffa04b14a8>] __cv_wait_io+0x18/0x20 [spl]
2014-11-14 10:11:14  [<ffffffffa061611b>] zio_wait+0xfb/0x1b0 [zfs]
2014-11-14 10:11:14  [<ffffffffa058c03d>] dmu_buf_hold_array_by_dnode+0x19d/0x4c0 [zfs]
2014-11-14 10:11:14  [<ffffffffa058ce68>] dmu_buf_hold_array_by_bonus+0x68/0x90 [zfs]
2014-11-14 10:11:14  [<ffffffffa110f4b3>] osd_bufs_get+0x493/0xb00 [osd_zfs]
2014-11-14 10:11:14  [<ffffffffa03c71a1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
2014-11-14 10:11:14  [<ffffffffa03c17d8>] ? libcfs_log_return+0x28/0x40 [libcfs]
2014-11-14 10:11:14  [<ffffffffa1050f7b>] ofd_preprw_read+0x15b/0x930 [ofd]
2014-11-14 10:11:14  [<ffffffffa1051f99>] ofd_preprw+0x849/0x1680 [ofd]
2014-11-14 10:11:14  [<ffffffffa09a24b1>] obd_preprw.clone.3+0x121/0x390 [ptlrpc]
2014-11-14 10:11:14  [<ffffffffa09aa24e>] tgt_brw_read+0x67e/0x1160 [ptlrpc]
2014-11-14 10:11:14  [<ffffffffa03c17d8>] ? libcfs_log_return+0x28/0x40 [libcfs]
2014-11-14 10:11:14  [<ffffffffa0946446>] ? lustre_pack_reply_v2+0x226/0x290 [ptlrpc]
2014-11-14 10:11:14  [<ffffffffa094655e>] ? lustre_pack_reply_flags+0xae/0x1f0 [ptlrpc]
2014-11-14 10:11:14  [<ffffffffa09a82ae>] tgt_request_handle+0x71e/0xb10 [ptlrpc]
2014-11-14 10:11:14  [<ffffffffa0957a94>] ptlrpc_main+0xe64/0x1990 [ptlrpc]
2014-11-14 10:11:14  [<ffffffffa0956c30>] ? ptlrpc_main+0x0/0x1990 [ptlrpc]
2014-11-14 10:11:14  [<ffffffff8109abf6>] kthread+0x96/0xa0
2014-11-14 10:11:14  [<ffffffff8100c20a>] child_rip+0xa/0x20
2014-11-14 10:11:14  [<ffffffff8109ab60>] ? kthread+0x0/0xa0
2014-11-14 10:11:14  [<ffffffff8100c200>] ? child_rip+0x0/0x20
2014-11-14 10:11:14
Comment by Cliff White (Inactive) [ 14/Nov/14 ]

Log dump from OSS

Comment by Cliff White (Inactive) [ 14/Nov/14 ]

I am making a second run now with the options set.

Comment by Cliff White (Inactive) [ 17/Nov/14 ]

Disabling prefetch seems to fix the watchdogs. Performance is about the same or perhaps a bit worse; spreadsheet attached.

Comment by Isaac Huang (Inactive) [ 09/Jan/15 ]

For some workloads, zprefetch actually helps a lot. But if performance got only slightly worse with it disabled, then zprefetch was mostly just incurring unnecessary I/Os and likely exercising some flaky parts of the code. So I'd suggest keeping it disabled as long as performance isn't adversely affected.

Comment by Cliff White (Inactive) [ 14/Jan/15 ]

Okay, for now we will leave prefetch off. It was left on by mistake in the last IEEL runs.

Comment by Rick Wagner (Inactive) [ 01/Feb/15 ]

I believe we are seeing the cause of this while testing streaming read I/O on Lustre backed by ZFS: osd_bufs_get (and therefore osd_bufs_get_read) is called inside a for loop via ofd_preprw_read, and osd_bufs_get_read hits storage on each call due to its call to dmu_buf_hold_array_by_bonus. My guess is that I/Os are not being queued up to the storage as intended by the dt_object design when using ZFS.

Comment by Alex Zhuravlev [ 02/Feb/15 ]

Specifically with streaming reads this shouldn't be a problem: it would be a single call to dmu_buf_hold_array_by_bonus(), as the whole range is described by a single niobuf.

Comment by Rick Wagner (Inactive) [ 02/Feb/15 ]

Thanks, Alex, that's reassuring. We're trying to figure out why we're getting 2.4 GB/s read bandwidth out of OSTs that will deliver over 9 GB/s when mounted as ZFS and using dd. During our profiling, I saw that dmu_buf_hold_array_by_bonus() was going to the drive, and didn't know that niocount in ofd_preprw_read() would be set to 1 for streaming reads.

Comment by Alex Zhuravlev [ 02/Feb/15 ]

I think it makes sense to verify it's really 1; I'm not too optimistic.

Comment by Rick Wagner (Inactive) [ 02/Feb/15 ]

Alex, without having attached to a process to check, I don't think it is 1. We're testing using obdfilter-survey, which uses the echo client. Inside of obdecho/echo_client.c:echo_client_prep_commit() there's a loop with the following:

                ioo.ioo_bufcnt = npages;

                lpages = npages;
		ret = obd_preprw(env, rw, exp, oa, 1, &ioo, rnb, &lpages,
                                 lnb, oti, NULL);

where npages is the number of 4K pages in the request.

Later, ofd_io.c:ofd_preprw() calls ofd_preprw_read(), where ioo is now obj:

			rc = ofd_preprw_read(env, exp, ofd, fid,
					     &info->fti_attr, oa,
					     obj->ioo_bufcnt, rnb, nr_local,
					     lnb, jobid);

and that's where ioo_bufcnt becomes niocount, and the loop is executed, which eventually gets to osd-zfs:osd_io.c:osd_bufs_get_read() and dmu_buf_hold_array_by_bonus(). This points to dmu_buf_hold_array_by_bonus() being called for 4K buffers. I don't know if this is by design, but depending on how ZFS handles these requests, it could explain why the streaming bandwidth is so much lower than what we see with native ZFS file systems.

The performance is similarly low when the Lustre file system is mounted, so I'm guessing the regular code path has similar logic to what's in the echo client.

Comment by Alex Zhuravlev [ 03/Feb/15 ]

I'd agree that echo_client_prep_commit() isn't taking a good approach: it should be filling the rnbs and then making a single call to obd_preprw(). Please try regular I/O with dd; I'm quite sure you'll be seeing mostly single-niobuf requests.

Comment by Andreas Dilger [ 03/Feb/15 ]

Obdfilter-survey should behave the same as our network I/O submission, so that it simulates real-world performance reasonably accurately. It makes sense to fix echo_client_prep_commit() to build large niobufs for submission.

Comment by Rick Wagner (Inactive) [ 03/Feb/15 ]

Indeed, running single dds from 4 clients was sufficient to see 1.3 GB/s (occasional peaks from zpool iostat at 1.5 GB/s) from a single OST. I'll need more clients to see how that scales over more OSTs, but that's far better read performance than we were getting from a single OST via obdfilter-survey, and is much closer to the native ZFS results. If there's a method for me to verify the len passed to osd_bufs_get_read, I'd be glad to try it.

And, Andreas, I completely agree with your comment about the echo client. This has led to significant confusion on our side about the capabilities of Lustre with ZFS. A simple tool which can simulate dd-like workloads on the server is what we expected; what we got behaved more like fio.

Comment by Gerrit Updater [ 03/Feb/15 ]

Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: http://review.whamcloud.com/13612
Subject: LU-5278 echo: request pages in batches
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 87737ebe6cc4bb45148e0c86e3b3e05007ab80fb

Comment by Rick Wagner (Inactive) [ 03/Feb/15 ]

Alex, thanks for the patch. I will test this today and see how closely it compares to the native ZFS performance. Now it looks like we're bottlenecking on socknal_sd* threads when reading from more than one OST across the network. Different topic, I expect.

Comment by Alex Zhuravlev [ 03/Feb/15 ]

Thanks in advance, Rick. I tested the patch locally and it seems to be OK, but the testing just started on Maloo, so I can't yet promise it's perfect.

Comment by Rick Wagner (Inactive) [ 04/Feb/15 ]

I'm optimistic about your patch doing what you expect, Alex. I'm seeing up to 1.5 GB/s reads from a single OST, which is what I got from a client using dd last night, and very close to the 1.6 GB/s I measure on native ZFS. The real test will be how much it helps in performance tuning.

$  targets=edragon-OST0000 rszlo=4096 rszhi=4096 size=$((64*1024)) nobjlo=2 nobjhi=4 thrlo=2 thrhi=4  obdfilter-survey
Tue Feb  3 21:00:43 PST 2015 Obdfilter-survey for case=disk from seahorse-oss-19-5.local
ost  1 sz 67108864K rsz 4096K obj    2 thr    2 write  842.72 [ 547.97, 947.91] read 1520.84 [1087.92,1835.91] 
ost  1 sz 67108864K rsz 4096K obj    2 thr    4 write  810.22 [  31.99,1303.90] read  846.02 [ 203.99,1691.92] 
ost  1 sz 67108864K rsz 4096K obj    4 thr    4 write  833.51 [ 307.98,1215.91] read 1518.76 [1187.94,2011.89] 
done!
$ zpool iostat 5
...
----------  -----  -----  -----  -----  -----  -----
mdt         27.2M  3.62T      0      0      0      0
ost0         141G  36.1T  1.51K     30  1.51G  30.0M
ost1        42.7G  36.2T      0      0      0      0
ost2        41.8G  36.2T      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
mdt         27.2M  3.62T      0      0      0      0
ost0         141G  36.1T  1.48K     34  1.48G  34.8M
ost1        42.7G  36.2T      0      0      0      0
ost2        41.8G  36.2T      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
mdt         27.2M  3.62T      0      0      0      0
ost0         141G  36.1T  1.49K     37  1.49G  37.0M
ost1        42.7G  36.2T      0      0      0      0
ost2        41.8G  36.2T      0      0      0      0

Comment by Alex Zhuravlev [ 04/Feb/15 ]

thanks for the feedback, Rick.

Comment by Andreas Dilger [ 04/Feb/15 ]

A significant amount of overhead has been observed in the socknal threads due to TCP packet assembly on reads. In the past, the CPU usage of this data copying was on the order of 1 GHz of CPU per 1 GB/s of network bandwidth, which is one reason why IB with RDMA is more attractive. Another possibility (which we haven't tested much ourselves, but have heard some reports about) is to try RoCE via o2iblnd, if your Ethernet hardware supports it. That would likely reduce your CPU usage.

Comment by Rick Wagner (Inactive) [ 07/Feb/15 ]

Andreas, I'll hold off on blaming the socknal threads for the performance, since I think I've found another place where 4K reads are being made. First, I can confirm that Alex's patch to the echo client was a huge improvement. The streaming bandwidth comparison between dd on remote clients and obdfilter-survey on the server is now much more consistent. You can see this in these two flame graphs: before, 25% of the time was in osd_bufs_get; after, 1.5%.

Since the performance is now closer between client and server (between 4 and 5 GB/s for reads), but still much less than native ZFS (over 9 GB/s, scaling linearly from 1.5 GB/s on a single zpool), I looked at the portion of code still taking up a lot of time, osd_read, and its call to dmu_read. If I trace the calls back correctly, dmu_read is being called with a size of 4K.

As in the earlier discussion, it starts in ofd_preprw_read with its call to dt_bufs_get, except this time, after the call to dmu_buf_hold_array_by_bonus, osd_bufs_get_read loops over the lnb array mapping (copying?) data:

			while (tocpy > 0) {
				thispage = PAGE_CACHE_SIZE;
				thispage -= bufoff & (PAGE_CACHE_SIZE - 1);
				thispage = min(tocpy, thispage);

				lnb->lnb_rc = 0;
				lnb->lnb_file_offset = off;
				lnb->lnb_page_offset = bufoff & ~CFS_PAGE_MASK;
				lnb->lnb_len = thispage;
				lnb->lnb_page = kmem_to_page(dbp[i]->db_data +
							     bufoff);
				/* mark just a single slot: we need this
				 * reference to dbuf to be release once */
				lnb->lnb_data = dbf;
				dbf = NULL;

				tocpy -= thispage;
				len -= thispage;
				bufoff += thispage;
				off += thispage;

				npages++;
				lnb++;
			}

This is where lnb->lnb_len gets set to 4K, and npages is incremented before being returned to ofd_preprw_read. The page count is also incremented via the nr_local variable. Both lnb and nr_local are then passed to dt_read_prep. From there, osd_read_prep loops over the number of pages calling osd_read, which makes a single call to dmu_read for each page.

Would someone confirm that I've got this correct? If so, it may be a more involved patch, unless the logic can stay in osd-zfs/osd_io.c. I do think this is consistent with reduced performance with additional OSTs, and the time spent in dmu_read.

OSTs   Read (MB/s)   Perf
1      1357          flame graph
2      2261          flame graph
3      3236          flame graph
4      3729          flame graph
5      4369          flame graph
6      5130          flame graph

Comment by Alex Zhuravlev [ 07/Feb/15 ]

Rick, you should try this patch: http://review.whamcloud.com/#/c/12991/
The actual I/O is done in osd_bufs_get_read(), and it's not done in 4K chunks unless the request was fragmented by the client. If a client requests, say, 1M, then there will be a single call to dmu_buf_hold_array_by_bonus(), and we just map the returned ARC buffers to the pages.

Comment by Rick Wagner (Inactive) [ 07/Feb/15 ]

Hi Alex,

I applied the patch, but things went backwards. My impression is that relying on the ARC only works when streaming data below zfs_arc_max. For this test, I set zfs_arc_max to 64 GB on a machine with 128 GB of RAM. The 2 GB/s writes are also much better than native ZFS, which makes me suspect caching.

The numbers reported, and the zpool iostat output, look like data is being pulled from a single drive as needed. Would it be better to call dmu_read with a size argument similar to the one used for dmu_buf_hold_array_by_bonus?

$ targets="ddragon-OST0000 ddragon-OST0001"  rszlo=4096 rszhi=4096 size=$((64*1024)) nobjlo=1 nobjhi=1 thrlo=1 thrhi=1 obdfilter-survey  
Fri Feb  6 20:58:47 PST 2015 Obdfilter-survey for case=disk from seahorse-oss-19-2.local
ost  2 sz 134217728K rsz 4096K obj    2 thr    2 write 2015.86 [ 387.91,5523.86] read  436.14 [  32.00,1051.92] 
done!
$ zpool iostat 5
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
...wait a while...
mdt         36.6M  59.5G      0      0    202      0
ost0         965G  35.3T    242      0   242M      0
ost1         918G  35.4T    203      0   203M      0
ost2         829G  35.4T      0      0    201      0
ost3         849G  35.4T      0      0    201      0
ost4         831G  35.4T      0      0    201      0
ost5         831G  35.4T      0      0    201      0
----------  -----  -----  -----  -----  -----  -----
mdt         36.6M  59.5G      0      0    201      0
ost0         965G  35.3T    235      0   235M      0
ost1         918G  35.4T    216      0   216M      0
ost2         829G  35.4T      0      0    203      0
ost3         849G  35.4T      0      0    203      0
ost4         831G  35.4T      0      0    203      0
ost5         831G  35.4T      0      0    203      0
Comment by Alex Zhuravlev [ 07/Feb/15 ]

Rick, the ARC is always used. dmu_read() shouldn't be used at all; instead we call dmu_buf_hold_array_by_bonus() to fill the buffers, with a size equal to the one supplied by the client (should be 1M).

Comment by Rick Wagner (Inactive) [ 07/Feb/15 ]

OK, I'll do another perf analysis and see where the time went. At some point, the data requested will surpass the ARC and will need to start streaming off disk efficiently. I do wonder (out of ignorance) if the DMU layer is more closely tied to the spl_kmem_cache than to the ARC.

Comment by Alex Zhuravlev [ 07/Feb/15 ]

in the meantime, I'll try to add more I/O stats to osd-zfs/.

Comment by Rick Wagner (Inactive) [ 07/Feb/15 ]

Any instructions on how to pull debugging information from the running tasks would be appreciated. I am willing to use gdb, but I don't know how much guidance that would require to introspect the correct function.

Comment by Alex Zhuravlev [ 07/Feb/15 ]

what kind of information? I'm planning to add I/O stats available in /proc/fs/lustre/..

Comment by Rick Wagner (Inactive) [ 07/Feb/15 ]

Like rpc_stats on the client, a report of the len and offset values in the calls to dmu_buf_hold_array_by_bonus seems good, since this is the interface to the ZFS layer. Basically, summary statistics of the calls into ZFS from Lustre.

Comment by Rick Wagner (Inactive) [ 07/Feb/15 ]

The patch for LU-4820 may have had an unintended consequence. Something about relying on dmu_buf_hold_array_by_bonus and the ARC has triggered the swapper task, as seen in this flame graph for 1 OST. I triggered this by zeroing out the ARC. This could also be done by requesting data over the size of the ARC.

Comment by Alex Zhuravlev [ 07/Feb/15 ]

Actually, all the patch does is remove the unnecessary dmu_read(). I'd think running swapper isn't a problem as long as it doesn't block reads; it should be releasing memory a bit ahead so that the ARC can always get memory for new buffers immediately.

Comment by Rick Wagner (Inactive) [ 07/Feb/15 ]

I see your point, Alex, and it shows that there must be another issue with the echo client. I went to 4 remote hosts, each pulling a single file from the same OST using dd, and the zpool is happily streaming over 1 GB/s. I will move on to multiple OSTs, and if I see the same sub-linear scaling with additional OSTs, I will do my perf analysis using remote clients.

Comment by Andreas Dilger [ 08/Feb/15 ]

"I'm planning to add I/O stats available in /proc/fs/lustre/.."

The /proc/fs/lustre/osd-zfs/*/brw_stats file was added to master for 2.7.0 via http://review.whamcloud.com/11467 so this may be useful for your testing. See also my comments in Alex's patch http://review.whamcloud.com/12991, which should be fixed up to move the brw_stats accounting around the dmu_buf_hold_array_by_bonus() loop so it accurately measures read times. Unfortunately, measuring write times is much more difficult for ZFS, since the write completion only happens down in ZFS.

Comment by Rick Wagner (Inactive) [ 08/Feb/15 ]

Thanks, Andreas. Since I'm working from the Git master with all the patches I've needed to get this far (including large block support in ZFS), it turns out that's already in place. I can see that all of the IO requests are at least 1M in size. Very useful. For both reads and writes, I do find

zpool iostat <seconds>

to be a good check of what's going in and out of the disks, since it ignores reads from cache.

$ cat brw_stats 
snapshot_time:         1423367149.544364 (secs.usecs)

                           read      |     write
pages per bulk r/w     rpcs  % cum % |  rpcs        % cum %
1:		      4408   0   0   |  160  58  58
2:		        10   0   0   |  114  41 100
4:		         0   0   0   |    0   0 100
8:		         0   0   0   |    0   0 100
16:		         0   0   0   |    0   0 100
32:		         0   0   0   |    0   0 100
64:		         0   0   0   |    0   0 100
128:		         0   0   0   |    0   0 100
256:		    456654  89  89   |    0   0 100
512:		      8732   1  91   |    0   0 100
1K:		     42704   8 100   |    0   0 100

                           read      |     write
discontiguous pages    rpcs  % cum % |  rpcs        % cum %
0:		    512508 100 100   |  274 100 100

                           read      |     write
disk I/Os in flight    ios   % cum % |  ios         % cum %
1:		    511312  99  99   |  273  99  99
2:		      1123   0  99   |    1   0 100
3:		        62   0  99   |    0   0 100
4:		         9   0  99   |    0   0 100
5:		         1   0  99   |    0   0 100
6:		         1   0 100   |    0   0 100

                           read      |     write
I/O time (1/1000s)     ios   % cum % |  ios         % cum %
1:		       494  98  98   |    0   0   0
2:		         0   0  98   |    0   0   0
4:		         0   0  98   |    0   0   0
8:		         1   0  99   |    0   0   0
16:		         2   0  99   |    0   0   0
32:		         0   0  99   |    0   0   0
64:		         1   0  99   |    0   0   0
128:		         0   0  99   |    0   0   0
256:		         2   0 100   |    0   0   0

                           read      |     write
disk I/O size          ios   % cum % |  ios         % cum %
8:		         4   0   0   |    9   3   3
16:		         0   0   0   |    0   0   3
32:		         1   0   0   |    0   0   3
64:		         1   0   0   |    2   0   4
128:		        17   0   0   |   83  30  34
256:		         1   0   0   |   64  23  57
512:		         3   0   0   |    2   0  58
1K:		         2   0   0   |    0   0  58
2K:		         0   0   0   |    0   0  58
4K:		      4379   0   0   |    0   0  58
8K:		        10   0   0   |  114  41 100
16K:		         0   0   0   |    0   0 100
32K:		         0   0   0   |    0   0 100
64K:		         0   0   0   |    0   0 100
128K:		         0   0   0   |    0   0 100
256K:		         0   0   0   |    0   0 100
512K:		         0   0   0   |    0   0 100
1M:		    456654  89  89   |    0   0 100
2M:		      8732   1  91   |    0   0 100
4M:		     42704   8 100   |    0   0 100

At the moment, Alex's patch to eliminate the use of dmu_read apparently was the last thing needed to get the osd-zfs layer working at speed. The final piece seems to be balancing client requests across the socknal_sd* tasks. If I understand what I've read correctly and compare it to what I see, portal_rotor using HASH_RT maps client NIDs to a particular socknal_sd task. I'd like to know how many clients each task is handling, so that I can tell how balanced the requests are. An LNET self-test showed 9 GB/s with 16 clients, but with data coming from disk that falls to between 5.5 and 6 GB/s.

A snippet from top

 12461 root      20   0     0    0    0 R 89.2  0.0  16:52.32 socknal_sd00_01                                      
 12471 root      20   0     0    0    0 R 80.6  0.0  12:09.99 socknal_sd03_02                                      
 12467 root      20   0     0    0    0 R 66.8  0.0   9:00.63 socknal_sd02_01                                      
 12464 root      20   0     0    0    0 R 43.4  0.0   8:09.87 socknal_sd01_01                                      
 12466 root      20   0     0    0    0 R 32.5  0.0   6:40.28 socknal_sd02_00                                      
 12465 root      20   0     0    0    0 R 31.4  0.0   5:18.51 socknal_sd01_02                                      
 12468 root      20   0     0    0    0 S 20.2  0.0   4:06.25 socknal_sd02_02                                      
 12463 root      20   0     0    0    0 R 18.4  0.0   3:43.80 socknal_sd01_00                                      
 12460 root      20   0     0    0    0 S 16.6  0.0   4:10.38 socknal_sd00_00                                      
 12470 root      20   0     0    0    0 S 10.1  0.0   1:17.31 socknal_sd03_01  

I'm also going to try setting the CPU pattern, since libcfs is breaking up a dual socket system (Intel E5-2640v2) into 4 partitions.

cat /proc/sys/lnet/cpu_partition_table 
0	: 0 1 2 3 
1	: 4 5 6 7 
2	: 8 9 10 11 
3	: 12 13 14 15 
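
As a rough sketch of what I plan to try (cpu_npartitions and cpu_pattern are libcfs module options; the explicit core layout below is only an assumed example for this 2-socket, 16-core box and should be checked against the libcfs documentation):

    # /etc/modprobe.d/libcfs.conf - pick one of the two
    options libcfs cpu_npartitions=2
    # or bind cores to partitions explicitly (assumed layout, verify the syntax)
    # options libcfs cpu_pattern="0[0,1,2,3,4,5,6,7] 1[8,9,10,11,12,13,14,15]"

The Lustre modules then need to be unloaded and reloaded for the new partitioning to take effect.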

Since this veers away from the original ticket by dealing with possible LNET and SMP tuning, I would be glad to open a new one.

Comment by Andreas Dilger [ 08/Feb/15 ]

I was misled by the version field of this issue - I thought you were running 2.6.0, but I now see this wasn't originally your ticket.

Good to hear that Alex's latest patch is helping out. It also would be a trivial change to move the IO stats as I'd suggested in my review, so that you would get an accurate value for the IO times in brw_stats. They are all currently reporting 1/1000s, which isn't very likely.

I would definitely ask that you file a separate ticket for the socklnd and cpu partition tunings.

Comment by Andreas Dilger [ 08/Feb/15 ]

Rick, have you tried increasing the maximum RPC size from the clients?

lctl set_param osc.*.max_pages_per_rpc=4M

That may increase your IO performance, it may not, depending on where the bottleneck is.

Have you tried iostat -x -k -z 1 to see what the utilization of the disks is? Presumably they are not hitting their peak, but I wonder if the I/O load generated by Lustre RPCs is different from that of the local ZFS mount, or if the I/O submission at the SCSI level is getting chopped up.
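
For completeness, the current client-side settings and the RPC size histogram can be checked with something like this (standard osc parameter names on 2.x clients):

    lctl get_param osc.*.max_pages_per_rpc
    lctl get_param osc.*.max_rpcs_in_flight
    lctl get_param osc.*.rpc_stats     # pages-per-RPC histogram on the client
    iostat -x -k -z 1                  # per-device utilization on the OSS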

Comment by Rick Wagner (Inactive) [ 08/Feb/15 ]

Andreas, after tracking rpc_stats for a while, I have max_pages_per_rpc = 1024 and max_rpcs_in_flight = 32. I believe this is why the disk I/O size reported by brw_stats started showing 4M reads, since that was changed after several tests.

I have been watching zpool iostat to see how much data is flowing from the pool. The number of operations agrees with a 1024K record size, and what I saw from collectl showed 128K reads from the individual drives (we're currently using raidz2 with 8+2 zpools).

Below is a capture from zpool iostat during the test, after I changed the CPU partitioning:

$ zpool iostat 10
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
...wait for test...
mdt         36.6M  59.5G      0      0      0      0
ost0         937G  35.3T  1.15K      0  1.15G      0
ost1         888G  35.4T  1.07K      0  1.07G      0
ost2         860G  35.4T  1.09K      0  1.08G      0
ost3         881G  35.4T  1.12K      0  1.11G      0
ost4         910G  35.4T  1.21K      0  1.21G      0
ost5         854G  35.4T  1.01K      0  1.01G      0
----------  -----  -----  -----  -----  -----  -----
mdt         36.6M  59.5G      0      0      0      0
ost0         937G  35.3T  1.05K      0  1.05G      0
ost1         888G  35.4T  1.14K      0  1.13G      0
ost2         860G  35.4T  1.18K      0  1.17G      0
ost3         881G  35.4T  1.16K      0  1.16G      0
ost4         910G  35.4T  1.15K      0  1.14G      0
ost5         854G  35.4T   1023      0  1017M      0
----------  -----  -----  -----  -----  -----  -----
mdt         36.6M  59.5G      0      0      0      0
ost0         937G  35.3T  1.16K      0  1.15G      0
ost1         888G  35.4T  1.06K      0  1.05G      0
ost2         860G  35.4T  1.11K      0  1.10G      0
ost3         881G  35.4T  1.06K      0  1.05G      0
ost4         910G  35.4T  1.16K      0  1.15G      0
ost5         854G  35.4T  1.02K      0  1.02G      0
----------  -----  -----  -----  -----  -----  -----
mdt         36.6M  59.5G      0      0      0      0
ost0         937G  35.3T  1.09K      0  1.08G      0
ost1         888G  35.4T  1.03K      0  1.02G      0
ost2         860G  35.4T  1.10K      0  1.09G      0
ost3         881G  35.4T  1.12K      0  1.11G      0
ost4         910G  35.4T  1.29K      0  1.29G      0
ost5         854G  35.4T  1.15K      0  1.14G      0

This still leaves about 2 GB/s of performance available when compared to native ZFS and the LNET self test. I expect that if I generate another perf analysis, it will show the time spent in handling the RPCs via socknal_sd*.

Comment by Gabriele Paciucci (Inactive) [ 08/Feb/15 ]

Hi Rick,
Take a look at /proc/sys/lnet/peers and see if your queues are big enough. If you find some negative values, please increase the peer_credits and credits values for LNET.
As a rule of thumb I can suggest:
peer_credits = max_rpcs_in_flight
credits = 4 x peer_credits

Remember to apply these values across the whole cluster (see the sketch below).
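
A sketch of what that could look like for socklnd (TCP); the numbers are only placeholders matching max_rpcs_in_flight=32, and the same values need to be set on clients and servers, with the LNET modules reloaded:

    # /etc/modprobe.d/lustre.conf (illustrative values)
    options ksocklnd peer_credits=32 credits=128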

Comment by Rick Wagner (Inactive) [ 08/Feb/15 ]

Gabriele, thanks. There are negative numbers in /proc/sys/lnet/peers, and even bumping up the credits on the server gave a 10% or so improvement. I'll have to shift to another set of clients to test both sides, since I'm using production system nodes as clients and can't reload their kernel modules. This would help explain the remaining bottleneck.

Comment by Gabriele Paciucci (Inactive) [ 09/Feb/15 ]

If you are using Ethernet, you should also tune sysctl.conf; please refer to your Ethernet vendor's documentation. This is a good starting point from Mellanox, but it applies to other vendors as well:
http://www.mellanox.com/related-docs/prod_software/Performance_Tuning_Guide_for_Mellanox_Network_Adapters.pdf
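
For example, the usual knobs such guides cover are the TCP socket buffer limits; the values below are only illustrative of that class of tuning guide, not recommendations for this particular hardware:

    # /etc/sysctl.conf (illustrative values; check the vendor guide)
    net.core.rmem_max = 16777216
    net.core.wmem_max = 16777216
    net.ipv4.tcp_rmem = 4096 87380 16777216
    net.ipv4.tcp_wmem = 4096 65536 16777216
    # apply with: sysctl -p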

Comment by Rick Wagner (Inactive) [ 10/Feb/15 ]

Andreas & Gabriele, I have moved my network tuning questions over to LU-6228.

Comment by Cliff White (Inactive) [ 16/Apr/15 ]

I am a bit confused by all the network tuning comments. Are there patches available that have not landed in 2.7.52?

Comment by Cliff White (Inactive) [ 16/Apr/15 ]

Watchdogs continue with prefetch disabled.


LNet: Service thread pid 64826 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Pid: 64826, comm: ll_ost03_025

Call Trace:
[<ffffffffa054f790>] ? vdev_mirror_child_done+0x0/0x30 [zfs]
[<ffffffff8152acee>] ? mutex_lock+0x1e/0x50
[<ffffffff8152acee>] ? mutex_lock+0x1e/0x50
[<ffffffff81529e83>] io_schedule+0x73/0xc0
[<ffffffffa044b596>] cv_wait_common+0xa6/0x120 [spl]
[<ffffffff8109afa0>] ? autoremove_wake_function+0x0/0x40
[<ffffffffa044b628>] __cv_wait_io+0x18/0x20 [spl]
[<ffffffffa058c81b>] zio_wait+0xfb/0x1c0 [zfs]
[<ffffffffa04f673a>] dbuf_read+0x47a/0x7f0 [zfs]
[<ffffffffa04fed98>] dmu_buf_hold+0x108/0x1d0 [zfs]
[<ffffffffa0555ab2>] zap_get_leaf_byblk+0x52/0x300 [zfs]
[<ffffffffa0554584>] ? zap_idx_to_blk+0xe4/0x150 [zfs]
[<ffffffffa0555dca>] zap_deref_leaf+0x6a/0x80 [zfs]
[<ffffffffa0556430>] fzap_lookup+0x60/0x120 [zfs]
[<ffffffffa05598f8>] ? zap_name_alloc+0x88/0xf0 [zfs]
[<ffffffffa055ba21>] zap_lookup_norm+0xe1/0x190 [zfs]
[<ffffffffa055bb63>] zap_lookup+0x33/0x40 [zfs]
[<ffffffffa108afa5>] osd_fid_lookup+0xb5/0x2f0 [osd_zfs]
[<ffffffffa1084a1c>] osd_object_init+0x19c/0x6c0 [osd_zfs]
[<ffffffffa03bb798>] ? libcfs_log_return+0x28/0x40 [libcfs]
[<ffffffffa0fac9d9>] ? ofd_object_init+0x99/0x180 [ofd]
[<ffffffffa07c6318>] lu_object_alloc+0xd8/0x320 [obdclass]
[<ffffffffa07c7821>] lu_object_find_try+0x151/0x260 [obdclass]
[<ffffffffa07c79e1>] lu_object_find_at+0xb1/0xe0 [obdclass]
[<ffffffffa03bf161>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
[<ffffffffa07c7a26>] lu_object_find+0x16/0x20 [obdclass]
[<ffffffffa0fc3215>] ofd_object_find+0x35/0xf0 [ofd]
[<ffffffffa0fc5b0b>] ofd_precreate_objects+0x1fb/0x19e0 [ofd]
[<ffffffffa03bf161>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
[<ffffffffa0fd2928>] ? ofd_grant_create+0x2b8/0x450 [ofd]
[<ffffffffa0fb6ca6>] ofd_create_hdl+0x566/0x25c0 [ofd]
[<ffffffffa09e78c0>] ? lustre_pack_reply_v2+0x220/0x280 [ptlrpc]
[<ffffffffa0a4946e>] tgt_request_handle+0x8be/0x1000 [ptlrpc]
[<ffffffffa09f8e61>] ptlrpc_main+0xe41/0x1960 [ptlrpc]
[<ffffffffa09f8020>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
[<ffffffff8109abf6>] kthread+0x96/0xa0
[<ffffffff8100c20a>] child_rip+0xa/0x20
[<ffffffff8109ab60>] ? kthread+0x0/0xa0
[<ffffffff8100c200>] ? child_rip+0x0/0x20

LustreError: dumping log to /tmp/lustre-log.1429199475.64826

Comment by Cliff White (Inactive) [ 16/Apr/15 ]

Lustre-log dumped by watchdog, all files under /proc/spl on OST

Comment by Cliff White (Inactive) [ 16/Apr/15 ]

/proc/spl from the MDS

Comment by Alex Zhuravlev [ 17/Apr/15 ]

the following isn't exactly the same, but looks very similar:

13:06:17:INFO: task txg_sync:16276 blocked for more than 120 seconds.
13:06:17: Tainted: P --------------- 2.6.32-504.12.2.el6_lustre.g036b949.x86_64 #1
13:06:17:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
13:06:17:txg_sync D 0000000000000001 0 16276 2 0x00000080
13:06:17: ffff88006de4b890 0000000000000046 ffff88006de4b820 ffffffff81041e98
13:06:17: 00000000ffffffff 000007051e05132a 0000000000000000 ffff88007918e980
13:06:17: 00000000002301ae ffffffffaad2f4da ffff88006fc5bab8 ffff88006de4bfd8
13:06:17:Call Trace:
13:06:17: [<ffffffff81041e98>] ? pvclock_clocksource_read+0x58/0xd0
13:06:17: [<ffffffff810aaa21>] ? ktime_get_ts+0xb1/0xf0
13:06:17: [<ffffffff8152aad3>] io_schedule+0x73/0xc0
13:06:17: [<ffffffffa0145596>] cv_wait_common+0xa6/0x120 [spl]
13:06:17: [<ffffffff8109eb00>] ? autoremove_wake_function+0x0/0x40
13:06:17: [<ffffffffa0145628>] __cv_wait_io+0x18/0x20 [spl]
13:08:18: [<ffffffffa028f81b>] zio_wait+0xfb/0x1c0 [zfs]
13:08:18: [<ffffffffa029134b>] zio_free+0xab/0xe0 [zfs]
13:08:18: [<ffffffffa02279a1>] dsl_free+0x11/0x20 [zfs]
13:08:18: [<ffffffffa021b102>] dsl_dataset_block_kill+0x352/0x380 [zfs]
13:08:18: [<ffffffffa0214bee>] free_blocks+0x6e/0xb0 [zfs]
13:08:18: [<ffffffffa0215838>] dnode_sync+0x4c8/0xac0 [zfs]
13:08:18: [<ffffffffa01fb3fb>] ? dbuf_sync_list+0x7b/0x80 [zfs]
13:08:18: [<ffffffffa01f034a>] ? arc_write+0xea/0x100 [zfs]
13:08:18: [<ffffffffa0204e49>] dmu_objset_sync_dnodes+0x89/0xb0 [zfs]
13:08:18: [<ffffffffa020503a>] dmu_objset_sync+0x1ca/0x2d0 [zfs]
13:08:18: [<ffffffffa02040c0>] ? dmu_objset_write_ready+0x0/0x50 [zfs]
13:08:18: [<ffffffffa0205140>] ? dmu_objset_write_done+0x0/0x70 [zfs]
13:08:18: [<ffffffffa0222b8b>] dsl_pool_sync+0x2ab/0x3f0 [zfs]
13:08:18: [<ffffffffa023b8bf>] spa_sync+0x40f/0xa70 [zfs]
13:08:18: [<ffffffffa0245771>] ? spa_txg_history_set+0xc1/0xf0 [zfs]
13:08:18: [<ffffffffa0248c7d>] txg_sync_thread+0x30d/0x520 [zfs]
13:08:18: [<ffffffff8105c2f9>] ? set_user_nice+0xc9/0x130
13:08:18: [<ffffffffa0248970>] ? txg_sync_thread+0x0/0x520 [zfs]

https://testing.hpdd.intel.com/test_logs/c128d706-e305-11e4-a348-5254006e85c2/show_text

Comment by Gerrit Updater [ 09/Jun/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13612/
Subject: LU-5278 echo: request pages in batches
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 89021de564c27f38a4146357e58dd80ddf68e246

Comment by Peter Jones [ 09/Jun/15 ]

Landed for 2.8
