[LU-2325] ZFS object creation is slow, causing many "Slow creates" messages Created: 14/Nov/12 Updated: 23/Jul/13 Resolved: 17/Jan/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Prakash Surya (Inactive) | Assignee: | Alex Zhuravlev |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | shh |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 5556 |
| Description |
|
What is the usefulness of these messages:

Nov 13 16:29:07 grove402 kernel: Lustre: lstest-OST0192: Slow creates, 128/512 objects created at a rate of 2/s
Nov 13 16:42:05 grove402 kernel: Lustre: lstest-OST0192: Slow creates, 128/256 objects created at a rate of 2/s
Nov 13 16:48:25 grove501 kernel: Lustre: lstest-OST01f5: Slow creates, 128/512 objects created at a rate of 2/s

We see them constantly, and my assumption is they're due to normal load on the system. If that's the case, they should be removed. |
| Comments |
| Comment by Alex Zhuravlev [ 14/Nov/12 ] |
|
I'd consider this a bad sign... 2 creates/second (even under load) doesn't sound very nice, but you're the customer. |
| Comment by Prakash Surya (Inactive) [ 14/Nov/12 ] |
|
Generally I would agree, as it does sound unreasonably low, but it's happening constantly on production and development machines (with different hardware and different applications). |
| Comment by Christopher Morrone [ 14/Nov/12 ] |
|
Alex, if you are not seeing that, then you guys really need to take a hard look at your testing strategy. |
| Comment by Christopher Morrone [ 14/Nov/12 ] |
|
And for all of these messages that we are opening tickets on, we do need to understand the problem, not just remove the message without investigating why they are happening. There are many, many messages that need attention. |
| Comment by Prakash Surya (Inactive) [ 14/Nov/12 ] |
|
Right, my intention is to only remove the messages once the root cause is understood and deemed not worthy of a console message. I can change the wording to be more inquisitive if you think that's better? |
| Comment by Andreas Dilger [ 14/Dec/12 ] |
|
It would be useful to find out where in the code it is taking so long. The main point of contention I can think of is the RW locks on the parent directories, which are unfortunately taken only inside the ZAP, so they cannot be held once at the start of a precreate batch while all of the objects within that parent are created at one time, as I had at times considered doing for ldiskfs. Another possible point of contention is the precreate batch creating such a large transaction that the declarations stall waiting for enough space to precreate all of the objects in the batch. It is possible to change the batch size via lctl set_param ofd.*.precreate_batch=32 or similar (temporarily only, on all OSSes) to see whether the problem goes away; a sketch of that experiment is below. If the batch is made larger (e.g. 256), it will hide the message (since it can only be hit on the second pass through the precreate loop), but it won't necessarily tell us why the creates are slow. |
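A minimal sketch of the tuning experiment described above, assuming the ofd.*.precreate_batch parameter path from the comment and hypothetical OSS hostnames (oss01, oss02); run on every OSS and revert to the recorded value once the observation is done:

    # Hypothetical OSS hostnames; adjust to the real servers.
    for oss in oss01 oss02; do
        # Record the current batch size so it can be restored later.
        ssh "$oss" "lctl get_param ofd.*.precreate_batch"
        # Shrink the batch to see whether the "Slow creates" messages change.
        ssh "$oss" "lctl set_param ofd.*.precreate_batch=32"
    done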
| Comment by Prakash Surya (Inactive) [ 14/Dec/12 ] |
|
Well, actually, I was able to correlate some of those messages on Grove to ZIO delays reported by zpool events. I dug a little deeper using blktrace and determined certain IO requests were getting starved in the IO scheduler. We were using the CFQ scheduler, but have just moved to using NOOP for our ZFS OSTs and MDS (which should have been the case all along). We also see this message on our production ldiskfs file systems, which probably suffer from the same CFQ scheduler starvation issue. More investigation is needed to determine if we see these messages on Grove using the NOOP scheduler. |
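A minimal sketch of the scheduler check and change described above, with sdb as a placeholder for one of the block devices backing an OST; the setting does not persist across a reboot:

    # The active elevator is shown in brackets, e.g. "noop deadline [cfq]".
    cat /sys/block/sdb/queue/scheduler
    # Switch this device to the noop elevator.
    echo noop > /sys/block/sdb/queue/scheduler
    # Confirm the change took effect.
    cat /sys/block/sdb/queue/scheduler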
| Comment by Alex Zhuravlev [ 17/Jan/13 ] |
|
Andreas, ZAP does use hashed locks internally, so I'd think locking should not be a problem. |
| Comment by Alex Zhuravlev [ 17/Jan/13 ] |
|
Prakash, any new observations since the system has been using NOOP? |
| Comment by Prakash Surya (Inactive) [ 17/Jan/13 ] |
|
Back when I was testing for this on Grove, using NOOP seemed to help considerably. With CFQ, I would sporadically see specific ZIOs take hundreds of seconds to complete, whereas none would take longer than a second with NOOP. Also, grep'ing the console logs on Grove, I see the "Slow creates" message only twice since changing to NOOP, whereas I saw it hundreds of times in the older logs. I don't believe we've switched to using NOOP on our ldiskfs systems yet. |
| Comment by Alex Zhuravlev [ 17/Jan/13 ] |
|
Thanks. So what's your opinion on this issue? Should we keep the ticket open? |
| Comment by Prakash Surya (Inactive) [ 17/Jan/13 ] |
|
Let's go ahead and close this. If we still see these on ldiskfs once we switch over to noop or deadline, we can re-evaluate what the problem might be. And just for future reference, I'm working on getting a patch landed to set the "correct" scheduler for ldiskfs automatically at mount time: http://review.whamcloud.com/4853 |
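A minimal sketch of doing the same thing by hand at mount time, assuming the deadline elevator for an ldiskfs OST and placeholder device/mount-point names; this is only an illustration of the manual step, not the approach taken by the patch linked above:

    # Placeholder device and mount point for an ldiskfs OST.
    ost_dev=/dev/sdb
    ost_mnt=/mnt/lustre/ost0
    # Pin the elevator before mounting the target.
    echo deadline > "/sys/block/$(basename "$ost_dev")/queue/scheduler"
    mount -t lustre "$ost_dev" "$ost_mnt"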
| Comment by Alex Zhuravlev [ 17/Jan/13 ] |
|
Given that the specific symptom described in the ticket seems to be fixed, I'm closing this one. To track and resolve the rather slow object creation with the ZFS backend, let's use a more generic ticket. |
| Comment by Andreas Dilger [ 18/Jan/13 ] |
|
Just to clarify in this ticket to close out the issue: the upstream ZFS Git repo has had a patch to automatically change the underlying block devices to use the "noop" scheduler since zfs-0.5.2-61-g6839eed. There was an additional fix in zfs-0.6.0-rc12-30-g84daadd to ensure that "noop" was also set as the scheduler for DM devices. I can't imagine that Grove was running zfs-0.5.2 as recently as two months ago, so either you were running ZFS on DM devices, or there is something wrong in the code from commit 6839eed. Presumably, since Prakash pushed 84daadd only 4h after narrowing it down to the CFQ elevator, you are running ZFS on DM devices (for multipath)? |
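A minimal sketch of verifying which elevator the DM multipath devices are actually using, assuming the standard /sys/block/dm-*/queue/scheduler sysfs layout:

    # Print the scheduler line (active elevator in brackets) for every DM device.
    for q in /sys/block/dm-*/queue/scheduler; do
        printf '%s: ' "$q"
        cat "$q"
    done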
| Comment by Christopher Morrone [ 18/Jan/13 ] |
|
Yes, you are correct, we're using DM for multipath. Each OSS has two IB links, one to each of the two controllers in the NetApp enclosure. |